I felt the article undersold an important technical detail about this work. Many randomized algorithms have a "with high probability" guarantee, which for linear systems might look like:
for any ε > 0 and 0 < δ < 1, using poly(dimension, #nonzeros, condition, 1/ε, 1/δ) floating-point operations, our algorithm finds a solution having less than ε error with probability 1 - δ.
However, that is not the kind of guarantee stated in Theorem 1.1 in the paper. Even though the algorithm uses randomness, the analysis yields a classic big-O complexity guarantee that does not involve probabilities.
This is a big difference, and it makes the result stronger.
Not really. This algorithm is only fast with high probability, and getting a linear equation solver to be always-correct is just a matter of checking the output at the end and trying again if you're (too) wrong. So always-correct and fast-with-high-probability is a feature of any randomized linear equation solver.
It's still amazing stuff; I'm just pointing out that this particular aspect of the behaviour is not surprising.
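For concreteness, the check-and-retry wrapper I have in mind fits on one screen. A minimal Python sketch, where randomized_solve is a stand-in for whatever fast-with-high-probability solver you like:

    import numpy as np

    def solve_always_correct(A, b, randomized_solve, tol=1e-8, max_attempts=100):
        """Turn a fast-with-high-probability solver into an always-correct one:
        verify the residual (one cheap matrix-vector product) and retry on failure."""
        for _ in range(max_attempts):
            x = randomized_solve(A, b)   # fast and correct with high probability
            if np.linalg.norm(A @ x - b) <= tol * np.linalg.norm(b):
                return x                 # certified correct (up to tol)
        raise RuntimeError("solver kept failing; vanishingly unlikely")

The residual check costs only one matrix-vector product, so it's negligible next to the solve itself.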
This is a pretty common omission to make, even though it's strictly speaking wrong. But suppose for a moment that the algorithm was always fast. In that case, it would not need randomness anymore, as you could just pick a fixed string as the "random" outcome, and you would be guaranteed success.
So, if an algorithm uses randomness, it will either fail with some probability or be slow with some probability. If neither is true, that algorithm doesn't need the randomness.
Randomized quicksort could pick the worst pivot at every step (or even at every other step; that's still quadratic), so it's only O(n log n) with high probability.
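For reference, the textbook randomized version, assuming nothing beyond the standard library:

    import random

    def quicksort(xs):
        # Expected O(n log n); a worst-case run needs an unlucky random pivot
        # at (nearly) every step, which happens with vanishing probability
        # for any fixed input.
        if len(xs) <= 1:
            return xs
        pivot = random.choice(xs)
        less    = [x for x in xs if x < pivot]
        equal   = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return quicksort(less) + equal + quicksort(greater)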
Quicksort is popular, and bad inputs for quicksort can occur in highly plausible specific cases (e.g. almost-sorted input with the pivot taken from the first array element) and can be easily engineered whenever an attacker controls a large input to be sorted and can guess which sorting algorithm is being used. The probability of bad performance is only low on implausibly random input instances.
Is this a theoretical difference, or a difference in terms of an actual implementation? I have no knowledge of linear systems algorithms, but as far as I was aware, many w.h.p algorithms resolve to correctness with probability of 1 - (1/n)^c (with c effectively being a hyperparameter), which would seem like quite a strong result in itself.
Stripping out the fluff: The authors proved that iterative methods for solving sparse linear systems can be guaranteed to be faster than the best methods of solving dense linear systems.
I don't think the article was bad enough to call it "fluff". Also the result is technically "guaranteed to be faster than the *currently best known* methods of solving dense linear systems", no? It might be the case that there's still some better way involving matrices that's faster than this iterative guessing method. Or did they prove that none could possibly exist?
This method converges unconditionally for any input with the required sparsity when using numerical representations with a fixed number of bits. No traditional sparse iterative solver has that property without further conditioning the matrix or increasing the run-time to handle the higher precision. As they lay out in the paper (third paragraph of introduction section on the preprint), for certain (rare in practice) matrices direct solvers still offered better runtime if convergence is demanded for all inputs with a certain sparsity.
This article (not the paper) definitely oversells the immediate practical applications; nobody is going to throw out their CG solver and pick this method up in practice to guarantee convergence in this manner.
Matrix multiplication is fundamental to basically any numerical method you can think of, with applications ranging from machine learning to numerical solutions of differential equations to graph algorithms used in network analysis.
With a sparse system, you might run into dependency chains that effectively make it dense.
Like in an adjacency matrix for a graph, you may have a lot of sparsity, but if you're forced to consider neighbors of neighbors, the adjacency matrix you need is A^2. This can be dense even if A is quite sparse.
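A quick scipy sketch of that effect (made-up sizes, just to show the densification):

    import scipy.sparse as sp

    n = 2000
    A = sp.random(n, n, density=0.01, format="csr")   # ~1% nonzeros
    A.data[:] = 1.0                                   # treat it as an adjacency matrix
    A2 = A @ A                                        # neighbors of neighbors
    print(A.nnz / n**2, A2.nnz / n**2)                # A^2 is typically far denser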
It's being downvoted because it comes across as someone arrogantly dismissing an important work because they have no idea what they are talking about. If they instead asked "why is this noteworthy? in my ignorance this seems obvious", they might have gotten a better reception.
It does, but the problem this algorithm handles is dealing with demanding convergence for all input matrices without adding an additional “n” factor to handle increased precision. Meanwhile deep learning is moving in the direction of 16-bit floating point and 8-bit integer scalars.
For some easy-to-grok examples: imagine MNIST, where all the pixels besides the writing are 0; or a recommendation system, where column i has the user's rating for movie i, or the number of times they listened to song i; or any one-hot encoding scheme.
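As a toy numpy illustration of the last one (one-hot encoding k categories gives exactly one nonzero per row, i.e. density 1/k; the numbers are made up):

    import numpy as np

    labels = np.random.randint(0, 10, size=1000)   # 1000 samples, 10 categories
    one_hot = np.eye(10)[labels]                   # shape (1000, 10), one 1 per row
    print(one_hot.mean())                          # density = 1/10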
I'm super 100% out of my depth on this on the math side of things, so bear with me.
But isn't matrix multiplication one of the core basis assumptions of GPU based computing architectures, in particular for AI? That doing math in parallel (which matrices fit into like a glove) gets the fastest result?
Does this imply that a new future architecture could, at least in theory, gain a natural advantage over what we have at the moment? Because if the answer is yes, or even a "possibly, too early to tell, requires more R&D", then the existing GPU manufacturers suddenly have a long-term problem and we're very likely about to see a decade of a new arms race in computing architecture.
You (and apparently a few of the other commenters here) have some misconceptions about what this article is about. The paper this article is based on is about solving a linear system (i.e. solve for x in the equation Ax = b), not matrix multiplication, which is completely different. From what I can tell, the authors of the paper just claim that their method for solving a certain class of linear systems is faster than the asymptotic cost of the fastest known matrix-matrix multiplication algorithm.
While solving a system of linear equations and multiplying matrices are different problems they are related.
The straightforward algorithms use N^3 operations for matrix multiplication and (1/3)*N^3 operations for solving dense systems of linear equations (1 operation being a multiply-add).
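If you want to see where those counts come from, a quick (purely illustrative) operation-counting sketch:

    def matmul_ops(n):
        # C[i][j] = sum over k of A[i][k] * B[k][j]: n*n*n multiply-adds
        return n ** 3

    def gauss_elim_ops(n):
        # Gaussian elimination: eliminating below pivot k updates an
        # (n-k-1) x (n-k-1) block, one multiply-add per entry
        return sum((n - k - 1) ** 2 for k in range(n))   # ~ n^3 / 3

    print(matmul_ops(1000), gauss_elim_ops(1000))   # 1_000_000_000 vs ~332_833_500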
For both problems, there are more complex algorithms that reduce the value of the exponent from 3 to some lower value, but above 2.
Methods that can accelerate matrix multiplication can also accelerate the solution of dense systems of linear equations (because the reduction of terms in the matrix of the coefficients of the system of equations can be done using matrix multiplications, of some sub-blocks of the complete matrix).
So the paper demonstrates, as another poster already said, that if the system of equations is sparse, then solving it can always be done faster than if it were dense.
In practice this was already usually true, so sparse systems are typically solved using a number of operations proportional to N^2, but there were cases where previous methods could fail.
I assume that their contribution is to show a foolproof method, which is guaranteed to be better.
I don't know if it's an appropriate / relevant question to ask, but how does this impact the majority of programmers / ML practitioners (if they are not into the mathematical details)? Don't get me wrong, solving linear systems with more efficient methods is very much appreciated, but in my experience the coder only needs to choose which solver to use from a wrapped-up library, unless you are implementing a very low-level algorithm yourself.
For a real example: if the solver isn't good at yielding a solution in a practical time frame (say, if I'm fitting models on a large genomic dataset of tens of thousands of genes without narrowing down the set of interest), then it might be wrong for me to use it at all. Isn't that right? Or am I wrong, and, as the article might suggest, could you do that in the future?
Perhaps it is possible to go backwards from solving to matrix multiplication. Your example Ax = b can be solved using a matrix multiplication with the inverse matrix: x = A^-1 b. So if one can solve a linear system faster than by matrix multiplication, one could try to use the same algorithm for matrix multiplication.
If you want A^-1 b, where A^-1 is a known n×n matrix and b an n-dimensional column vector, you can just use the naive method, which takes O(n^2) steps. If you use the new linear solver, you'll need O(n^2.332) steps just for that, and additionally you'll have to find A somehow. So matrix-vector multiplication is faster.
In the article, they're comparing it to multiplication of square matrices, which is slower, but only slower than solving a single system of equations. If you wanted to multiply square matrices with the new linear solver, you'd have to solve n different systems of equations with shared coefficients, which would end up slower than the naive O(n^3) method, unless you can share work across instances somehow.
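To spell out that reduction, a hedged sketch of multiplying via n solves (np.linalg.solve stands in for whichever solver you like; the point is only the cost accounting above):

    import numpy as np

    def inv_times_B(A, B, solve=np.linalg.solve):
        # Compute A^{-1} B by solving one linear system per column of B.
        # With an O(n^2.332) solver this costs O(n^3.332) over n columns,
        # i.e. worse than the naive O(n^3) product, unless work is shared.
        return np.column_stack([solve(A, B[:, j]) for j in range(B.shape[1])])

    A = np.random.rand(100, 100) + 100 * np.eye(100)   # well-conditioned test matrix
    B = np.random.rand(100, 100)
    print(np.allclose(inv_times_B(A, B), np.linalg.solve(A, B)))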
But even that would be limited, since not all matrices have inverses. They do have pseudo-inverses (also called the Moore–Penrose inverse), but the applicability there might be limited.
When solving a linear system, x and b only have one column. So unless the algorithm is general enough to not have that limitation, it won't be applicable to general matrix multiplication.
A matrix is a concatenation of columns yes, but how does that help us multiply matrices by using a sparse linear-system solver? Perhaps it's trivial but I don't see it.
As others mentioned, this algorithm isn’t matrix multiplication.
Even if it were, as the article mentions, we don’t even know what the optimal (in terms of basic operations on numbers) algorithm for multiplying matrices is. See https://en.wikipedia.org/wiki/Strassen_algorithm for a starting point.
(“Optimal in time” is a different problem. For fixed-size matrices, as often used in computer graphics, that may be known, but even then, there’s the effect of caching, time needed to move data to and from the GPU, etc)
Also, Strassen multiplication and its improvements have a “somewhat reduced numerical stability, and the algorithm also requires significantly more memory compared to the naive algorithm” (https://en.wikipedia.org/wiki/Numerical_stability)
This algorithm may have some of the same problems.
Being random, it likely also will have varying running times. That can be bad for GPUs if they are used to generate real-time graphics.
In total, I don’t see GPU manufacturers being worried about this. They already live with the worry that somebody will discover an O(n²) matrix multiplication algorithm that’s implementable in hardware and more efficient for small n, and I don’t think they lose sleep over that.
You are confusing the efficiency of an algorithm with its implementation. Worth repeating Dijkstra's famous dictum here: “Computer Science is no more about computers than astronomy is about telescopes.”
I don't believe I am confused, actually. I get that this is just an algorithmic hypothetical being answered right now. I just care about the telescopes way more than the astronomy, and am thinking steps ahead about the metagame of the telescope market. Just as I personally would care far more about the real-world implications of P=NP than about the act of solving it.
I understand that the answer is no here, because this method is only suitable for the class of linear systems that corresponds to sparse matrices. GPUs, on the other hand, are more optimized for general-purpose matrix multiplication. Unless it can be shown that there are economically high-usage scenarios for this class of problems (e.g. on the usage scale of bitcoin mining), the investment into this specific research does not seem warranted.
Firstly, faster solutions in fundamental problems can eventually lead to hardware that supports it.
Secondly, this is already happening for sparse matrix multiplication: the Nvidia A100 has some sparsity support, to allow making better use of pruned neural networks, for example.
Thirdly, sparse enough systems, even without the A100, can run faster on CPU than GPU. If you find yourself with one of these problems, you can just choose the correct piece of hardware for the job. Without a sparse algorithm, you are still stuck with the slower dense solution.
Fourthly, giant sparse systems do indeed arise constantly. Just to make one up, consider weather measurements. Each row is a set of measurements from a specific weather station, but there are thousands of stations: it's a sparse set of observations, with some nearby dependencies. Evolving the state in time will often involve solving a giant linear system. (See other comments on the thread about PDEs; a minimal sketch follows at the end of this comment.)
It is absolutely worthwhile research, regardless of how applicable it is to fscking bitcoin.
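To make the "Fourthly" point concrete, here's a minimal sketch of the kind of sparse system a PDE time-step produces (implicit Euler for 1D diffusion on a made-up grid: a tridiagonal matrix with ~3 nonzeros per row, solved with scipy's sparse direct solver; CG and friends would also work):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    n, dt, k = 100_000, 0.01, 1.0
    # Implicit Euler step for u_t = k * u_xx: (I - dt*k*L) u_new = u_old,
    # where L is the second-difference operator. Only ~3 nonzeros per row.
    L = sp.diags([1.0, -2.0, 1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
    A = sp.identity(n, format="csc") - dt * k * L
    u_old = np.random.rand(n)
    u_new = spsolve(A, u_old)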
If this paper whets your appetite (talking mostly to grandparent comment, but also HN at large) OR alternatively if you think that this paper is a bit too advanced for your knowledge level, I implore you to read Scott Aaronson's book "Quantum Computing Since Democritus". The book is one of the finest in the genre of "too casual to be a textbook, too informative to be an easy read for laymen" nonfiction, and it is primarily concerned with philosophy vis-à-vis computational complexity. It does a pretty good job at assuming you only have a bit of computer science background (pun unintended, noticed, left in) and builds your knowledge base up from introductory set theory and first order logic in the earliest pages (In fact, it's based on a course he taught, and his lecture notes are available online and frequently make for fantastic supplemental material [1]). One of my favorite reads of 2020.
Thank you for showing it is “to whet your appetite”. I always thought it was “to wet your appetite”, which made no sense at all. This is the first time I’d read this phrase instead of heard it spoken.
This part of CS is indeed more a Geisteswissenschaft ("science of the mind").
I think what is most fascinating when you compare it to more conventional philosophy is the fact that you can quantify your advances so easily and precisely.
One of my favorite little-known algorithmic facts is that a sparse system of n linear equations in n unknowns over a finite field can be solved in time O(mn) if the corresponding matrix has m nonzero entries[1]. I'm surprised at how hard it is to extend this to the reals - it seems like dealing with the precision issues is much tougher than I imagined.
I have no idea what happened in this case, but that doesn't follow - MIT is one of those schools that routinely promote to associate without necessarily giving tenure [0].
The “smart guessing” approach reminds me a lot of SAT solving. Certain search problems become easier when you introduce randomness and clever heuristics to optimize and guide your guesses.
Edit: And here’s a link to the paper discussed in the article for anyone who is curious:
That's really excellent writing. It explains the gist of it really well. I suspect the proof details are very complex though. It's also not clear if this method will practically be a speed up, or just asymptotically.
It's still a very cool result though: I love how my intuition would tell me that solving matrix multiplication is surely the fastest way, and then someone comes along and explains this method and my brain goes "oh yeah, that does make sense it could be faster". That's the beauty of certain results and explanations.
My feeling was just the opposite. I was constantly thinking "stop acting like I'm 12 and have 0 knowledge about linear algebra." As usual, it just depends what your context is, and the author will be happy to know there's an audience that finds it excellent.
I have a master's degree in math and a PhD in STEM so I know how linear algebra works. So when I say it's excellent writing, I don't mean perfectly calibrated for me or you. I mean it's excellent writing for a wide audience to get the gist of it. I think that's very difficult to do for math. An average 12 year old would not understand matrices without a very clear explanation.
Having taught math to both children and undergrads, I think it's very easy to underestimate how hard things are once you already understand them. A sentence like:
> for any ε > 0 and 0 < δ < 1, using poly(dimension, #nonzeros, condition, 1/ε, 1/δ) floating-point operations, our algorithm finds a solution having less than ε error with probability 1 - δ.
reads as routine once you know the material, and as near gibberish to most 12-year-olds.
Does anyone know how big are the constant terms here?
A few years ago I studied and partly implemented an earlier (related?) work by Peng (on graph laplacians and SDD systems: https://epubs.siam.org/doi/abs/10.1137/110845914 ) and the big-O improvements did not seem to compensate for the larger constant.
The constants are galactic: any 'fast matrix multiplication' algorithm outside of Strassen's algorithm has some incredible constants that are somewhat intrinsic to the recursive framework. The algorithm is primarily of theoretical importance (prior to this no one knew whether sparsity significantly helped methods of this type for solving linear systems), but it is not implementable.
However, the block Krylov algorithm itself presented in this paper has a bit more of a chance of being implementable than fast matrix multiplication (the matrix multiplication is only used to solve small linear systems to deal with small eigendirections in the Krylov subspace). I am still skeptical that this is a truly practical algorithm due to its complexity, but unlike the case of generic FMM there is no obvious bottleneck.
On the other hand, in recent years Dan Spielman and collaborators have been working on fast implementations of Laplacian solvers: https://github.com/danspielman/Laplacians.jl I believe a lot of the fixed constants and combinatorial routines are changed from what is theoretically provable, but from screwing around with the code in the past it seems very fast in practice.
There are no matrix-matrix multiplications in (iterative) linear solvers in general. There are matrix-vector multiplications, typically far fewer of them than the size of the matrix. This confusion hides that solving linear systems was already far faster in practice than matrix-matrix multiplication, and makes it seem like a stronger result than it is. I would like to see this compared to e.g. sparse CG, on some non-random data matrix.
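For reference, the sparse CG baseline I mean is a few lines of scipy (here on a made-up symmetric positive definite sparse matrix; CG only ever touches A through matrix-vector products):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    n = 10_000
    M = sp.random(n, n, density=1e-3, format="csr")
    A = M @ M.T + sp.identity(n, format="csr")   # symmetric positive definite, still sparse
    b = np.random.rand(n)
    x, info = cg(A, b)                           # each iteration: one A @ v product
    print(info, np.linalg.norm(A @ x - b))       # info == 0 means it converged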
This is a theoretical result, for now at least. Theoretically the fastest linear solvers run in time n^ω, where ω ≈ 2.37 is the matrix multiplication exponent. In practice the fast matrix multiplication algorithms are not that fast, so they aren't used for linear solvers.
In practice, the majority of the really big matrices I encounter are very sparse, because they're often constructed as a reduction from some other much smaller representation so we can use linear algebra methods.
What is the actual sparsity criterion to achieve this big-O bound? (Edit: found on arXiv - o(n^(ω−1)/log(κ(A))) non-zeros.) Note also the dependence on the condition number.
“In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any ω > 2. This speedup holds for any input matrix A with o(n^(ω−1)/log(κ(A))) non-zeros, where κ(A) is the condition number of A. For poly(n)-conditioned matrices with Õ(n) nonzeros, and the current value of ω, the bit complexity of our algorithm to solve to within any 1/poly(n) error is O(n^2.331645).
Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC '06, '07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices.”
Hmm, somehow reminds me of my controls classes way back where I learned that injecting some random noise in the measurements of a closed loop can actually help the estimator to converge better.
It's really simple: in rough terms, it breaks loops that your estimator might get itself into. Imagine your estimator is a dynamical system, and while it's converging it gets trapped in a cycle between two values. Adding a little noise can break this cycle.
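A concrete toy version of the parent's measurement-noise observation (dithering a coarse quantizer: without the injected noise the averaging estimator is stuck at the wrong value forever, with it the estimate converges to the true value; the numbers are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    true_value, n = 0.3, 100_000

    # Sensor quantizes to the nearest integer. Without dither every reading
    # is round(0.3) = 0, so the running-average estimator is stuck at 0.
    plain = np.round(np.full(n, true_value))

    # Inject uniform noise of one quantization step before the quantizer:
    # readings now flip between 0 and 1 with the right frequencies,
    # and their average converges to the true value.
    dithered = np.round(true_value + rng.uniform(-0.5, 0.5, size=n))

    print(plain.mean(), dithered.mean())   # 0.0 vs ~0.3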
For those that need a concise refresher on - or introduction to - linear equations there's an (as usual) excellent series of 15 short videos from 3blue1brown:
If you only know there are 10 heads and 30 feet, how many chickens are there, and how many pigs? As algebra students learn, there’s a set procedure for figuring it out.
If there are 30 feet, then there are 15 chickens. Because pigs don't have feet.
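(And for the standard version where chickens have 2 feet and pigs have 4, the "set procedure" is just a 2×2 linear solve:)

    import numpy as np

    # heads:  c +  p = 10
    # feet:  2c + 4p = 30
    A = np.array([[1, 1],
                  [2, 4]])
    b = np.array([10, 30])
    chickens, pigs = np.linalg.solve(A, b)
    print(chickens, pigs)   # 5 chickens, 5 pigs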
Question from someone who is not super knowledgable about this subject - is this conceptually similar to using ML to solve linear equations? I.e. randomly guess and update weights...
Is there a field of research in using ML approaches for this type of problem?
Ax=b could certainly be solved with gradient descent methods popular in ML, but this is a lot more direct than that. It’s a little more like solving random pieces until it all comes together.
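A hedged sketch of what that gradient-descent route looks like (minimizing ||Ax − b||^2; it works, but converges slowly when A is ill-conditioned, which is exactly why dedicated solvers exist):

    import numpy as np

    def gd_solve(A, b, lr=None, iters=5000):
        # Minimize f(x) = 0.5 * ||Ax - b||^2; gradient is A^T (Ax - b).
        # The step size must beat the largest singular value of A.
        if lr is None:
            lr = 1.0 / np.linalg.norm(A, 2) ** 2
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x -= lr * (A.T @ (A @ x - b))
        return x

    A = np.random.rand(50, 50) + 50 * np.eye(50)   # well-conditioned test case
    b = np.random.rand(50)
    print(np.linalg.norm(A @ gd_solve(A, b) - b))  # small residual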
1. Simplex is used to solve linear programs, which are inequalities (not equations). While it's true that your solution will typically be on one of the vertices of your polytope, that just means that some (not necessarily all) of your constraints are hard constraints.
2. The worst-case time complexity of Simplex is exponential.
AFAIR the first step of the simplex method is to convert the system of inequalities into a system of equalities by adding slack variables and building a tableau (matrix) from those. And then incrementally solve that.
And usually simplex is considered to have polynomial complexity in practice, close to those estimates.
(I'd be interested to see an average-case running-time analysis of Simplex that yields O(n^2.332) additions/multiplications, where n is the number of variables. Or even O(n^3). Because that's what's claimed here.)
That's mostly correct. You end up with a system of equalities for _a subset_ of your constraints, which yields a single vertex.
Ultimately with a linear program, you're trying to optimize an objective function,
max w'x s.t. Ax <= b, x >= 0 (in canonical form).
The x >= 0 constraint makes this nontrivial to translate from a linear equation Ax = b. If you magically knew the sign of each component of the solution, you'd be able to perform a simple variable substitution, but you generally don't, at least not without solving the system of equations.
(And yes, you need that constraint, or else your LP is unbounded, and your solution ends up with a negative infinity in it.)
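For the record, that canonical form maps straight onto scipy's linprog (which minimizes, so you pass −w; the default bounds already enforce x >= 0; the numbers are made up):

    import numpy as np
    from scipy.optimize import linprog

    # maximize  w'x  subject to  Ax <= b,  x >= 0
    w = np.array([3.0, 2.0])
    A = np.array([[1.0, 1.0],
                  [2.0, 1.0]])
    b = np.array([4.0, 5.0])

    res = linprog(c=-w, A_ub=A, b_ub=b)   # bounds default to x >= 0
    print(res.x, -res.fun)                # the optimum sits on a vertex of the polytope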
Simplex uses solving systems of linear equations as a building block.
Most variants of simplex have exponential runtime on worst case input. (But they do well in practice.)
Recently, some variants of simplex with polynomial runtime have been found.
Solving linear equations was always known to 'only' take polynomial amounts of time at most. But the question is whether that's n^3 or n^2 or something in between.