New algorithm breaks speed limit for solving linear equations (quantamagazine.org)
552 points by ot on March 9, 2021 | 112 comments



I felt the article undersold an important technical detail about this work. Many randomized algorithms have a "with high probability" guarantee, which for linear systems might look like:

    for any ε > 0 and 0 < δ < 1, using poly(dimension, #nonzeros, condition, 1/ε, 1/δ) floating-point operations, our algorithm finds a solution having less than ε error with probability 1 - δ.
However, that is not the kind of guarantee stated in Theorem 1.1 in the paper. Even though the algorithm uses randomness, the analysis yields a classic big-O complexity guarantee that does not involve probabilities.

This is a big difference, and it makes the result stronger.


Not really. This algorithm is only fast with high probability, and getting a linear equation solver to be always-correct is just a matter of checking the output at the end and trying again if you're (too) wrong. So always-correct and fast-with-high-probability is a feature of any randomized linear equation solver. It's still amazing stuff; I'm just pointing out that this particular aspect of the behaviour is not surprising.
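
For concreteness, a minimal sketch of that check-and-retry pattern (the names and tolerance here are made up; any randomized solver could be plugged in for the stand-in below):

    import numpy as np

    def solve_and_verify(solver, A, b, tol=1e-8, max_attempts=10):
        # Wrap a fast-with-high-probability solver: accept the output only if
        # the residual is small, otherwise re-run with fresh randomness.
        b_norm = np.linalg.norm(b)
        for _ in range(max_attempts):
            x = solver(A, b)
            if np.linalg.norm(A @ x - b) <= tol * max(b_norm, 1.0):
                return x  # certified correct up to the tolerance
        raise RuntimeError("solver kept failing (exponentially unlikely)")

    # Usage with a deterministic stand-in solver:
    A = np.random.rand(50, 50) + 50 * np.eye(50)
    b = np.random.rand(50)
    x = solve_and_verify(np.linalg.solve, A, b)

Each retry is independent, so the failure probability drops exponentially with the number of attempts while the expected running time stays essentially that of a single fast run.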


Thanks for your insight!

> This algorithm is only fast with high probability

Why is this not mentioned in Theorem 1.1?


This is a pretty common omission to make, even though it's strictly speaking wrong. But suppose for a moment that the algorithm was always fast. In that case, it would not need randomness anymore, as you could just pick a fixed string as the "random" outcome, and you would be guaranteed success. So, if an algorithm uses randomness, it will either fail with some probability or be slow with some probability. If neither is true, that algorithm doesn't need the randomness.


Makes sense. Thank you for the thoughtful response. I hope my original comment is not too misleading.


I think that is the definition of an approximate algorithm, not a randomized algorithm. Quicksort is a classic randomized algorithm, but it is not approximate.


Randomized quick sort could pick the wrong pivot at every step (or at every other step, that's still quadratic complexity), so it's only O(n log n) with high probability.
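
For reference, a bare-bones randomized quicksort (a generic sketch, not tied to the article): the O(n log n) bound holds in expectation because a uniformly random pivot is unlikely to land near the extremes at every level.

    import random

    def quicksort(a):
        # Expected O(n log n); degrades to O(n^2) only if the random pivot
        # lands near the minimum or maximum at almost every recursion level.
        if len(a) <= 1:
            return a
        pivot = random.choice(a)
        return (quicksort([x for x in a if x < pivot])
                + [x for x in a if x == pivot]
                + quicksort([x for x in a if x > pivot]))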


The probability is so low that this has probably never happened in all history (for reasonably sized arrays).

It’s not typical to analyse randomised algorithms this way for that reason.


Quicksort is popular, and bad inputs for Quicksort can happen in highly plausible specific cases (e.g. almost-sorted input with the pivot taken from the first array element) and can be easily engineered whenever an attacker controls a large input to be sorted and can guess which sorting algorithm is being used. The probability of bad performance is only low on implausibly random input instances.




Is this a theoretical difference, or a difference in terms of an actual implementation? I have no knowledge of linear systems algorithms, but as far as I was aware, many w.h.p algorithms resolve to correctness with probability of 1 - (1/n)^c (with c effectively being a hyperparameter), which would seem like quite a strong result in itself.


Stripping out the fluff: The authors proved that iterative methods for solving sparse linear systems can be guaranteed to be faster than the best methods of solving dense linear systems.


I don't think the article was bad enough to call it "fluff". Also the result is technically "guaranteed to be faster than the *currently best known* methods of solving dense linear systems", no? It might be the case that there's still some better way involving matrices that's faster than this iterative guessing method. Or did they prove that none could possibly exist?


They proved that any dense algorithm of n^w where w>2 can be made more efficient. Note that 2 is a trivial lower bound for w.


Thanks, this is much more interesting.


Why is this noteworthy?

(I assume this is common knowledge or at least should be)


This method converges unconditionally for any input with the required sparsity when using numerical representations with a fixed number of bits. No traditional sparse iterative solver has that property without further conditioning the matrix or increasing the run-time to handle the higher precision. As they lay out in the paper (third paragraph of introduction section on the preprint), for certain (rare in practice) matrices direct solvers still offered better runtime if convergence is demanded for all inputs with a certain sparsity.

This article (not the paper) definitely oversells the immediate practical applications; nobody is going to throw out their CG solver and pick this method up in practice to guarantee convergence in this manner.


In practice, are there maybe a few problems for which this guarantee unlocks a big performance speedup? How might this be used practically?


Matrix multiplication is fundamental to basically any numerical method you can think of, with applications ranging from machine learning to numerical solutions of differential equations to graph algorithms used in network analysis.


Not sure why you’re being downvoted here; I’d like to know why this is noteworthy as well!


With a sparse system, you might run into dependency chains that effectively make it dense.

Like in an adjacency matrix for a graph, you may have a lot of sparsity, but if you're forced to consider neighbors of neighbors, the adjacency matrix you need is A^2. This can be dense even if A is quite sparse.
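
A quick way to see the fill-in (a toy sketch with a random sparse matrix, not anything from the article):

    import scipy.sparse as sp

    n = 1000
    A = sp.random(n, n, density=0.01, format="csr")  # ~1% of entries non-zero
    A2 = A @ A                                       # paths of length 2
    print(A.nnz / n**2, A2.nnz / n**2)               # A^2 is typically far denser than A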


It's being downvoted because it comes across as someone arrogantly dismissing an important work because they have no idea what they are talking about. If they instead asked "why is this noteworthy? in my ignorance this seems obvious", they might have gotten a better reception.


Interesting, I just took it at face value. I wonder what the % split is.


It wasn't known to anyone before.


deep learning uses a lot of sparse linear systems


It does, but the problem this algorithm handles is dealing with demanding convergence for all input matrices without adding an additional “n” factor to handle increased precision. Meanwhile deep learning is moving in the direction of 16-bit floating point and 8-bit integer scalars.


For some easy-to-grok examples: imagine MNIST, where all the pixels besides the writing are 0; or a recommendation system, where column i has the user's rating for movie i, or the number of times they listened to song i; or any one-got encoding scheme


> or any one-got encoding scheme

Typo: should be one-hot encoding scheme.

Sparsity of input is often relatively easy to achieve, as you suggest.

Sparsity of intermediate layers requires more work. (But is often a good regularization technique.)


Humorously, “one-got” is almost an apt description :)


> “Philosophically we didn’t know before if you can go faster than matrix multiplication,” said Vempala.

> Now we do.

This is the part of computer science that I love. It's a branch of philosophy about what is possible with computing machines.

Now all that's left is the matter of making this practical.


I'm super 100% out of my depth on this on the math side of things, so bear with me.

But isn't matrix multiplication one of the core basis assumptions of GPU based computing architectures, in particular for AI? That doing math in parallel (which matrices fit into like a glove) gets the fastest result?

Does this imply that a new future architecture could, at least in theory, get a natural advantage over what we have at the moment? Because if the answer is yes, or even a "possibly, too early to tell, requires more R&D", then the existing GPU manufacturers suddenly have a long-term problem and we're very likely about to see a decade of a new arms race in computing architecture.


You (and apparently a few of the other commenters here) have some misconceptions about what this article is about. The paper this article is based on is about solving a linear system (i.e. solve for x in the equation Ax = b), not matrix multiplication, which is completely different. From what I can tell, the authors of the paper just claim that their method for solving a certain class of linear systems is faster than the asymptotic cost of the fastest known matrix-matrix multiplication algorithm.


While solving a system of linear equations and multiplying matrices are different problems, they are related.

The straightforward algorithms use N^3 operations for matrix multiplication and (1/3)*N^3 operations for solving dense systems of linear equations (1 operation being a multiply-add).
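
Those counts are easy to sanity-check (a rough sketch of the multiply-add counts for the naive algorithms; not from the paper):

    def matmul_ops(n):
        return n**3  # n^2 output entries, n multiply-adds each

    def gaussian_elimination_ops(n):
        # eliminating column k updates roughly an (n - k) x (n - k) block
        return sum((n - k)**2 for k in range(n))  # ~ n^3 / 3

    n = 500
    print(matmul_ops(n), gaussian_elimination_ops(n), round(n**3 / 3))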

For both problems, there are more complex algorithms that reduce the value of the exponent from 3 to some lower value, but above 2.

Methods that can accelerate matrix multiplication can also accelerate the solution of dense systems of linear equations (because the reduction of terms in the matrix of the coefficients of the system of equations can be done using matrix multiplications, of some sub-blocks of the complete matrix).

So the paper demonstrates, as another poster already said, that if the system of equations is sparse, then solving it can always be done faster than if it were dense.

In practice this was already usually true, so sparse systems are typically solved using a number of operations proportional to N^2, but there were cases when previous methods could fail.

I assume that their contribution is to show a foolproof method, which is guaranteed to be better.


I don't know if it's an appropriate / relevant question to ask, but how does it impact the majority of programmers / ML practitioners (if they are not into the mathematical details)? Don't get me wrong, solving linear systems with more efficient methods is very much appreciated, but in my experience the coder only needs to choose which solver to use from a wrapped-up library, unless you are implementing a very low-level algorithm yourself.

For a real example, if the solver isn't good at yielding a solution in a practical time frame (say if I'm fitting models on a large genomic dataset of tens of thousands of genes without narrowing down the set of interest), then I might be wrong to use it at all. Isn't that right? Or am I wrong, as the article might suggest you could do that in the future?


Perhaps it is possible to go backwards from solving to matrix multiplication. Your example Ax = b can be solved using a matrix multiplication with the inverse matrix: x = A^-1 b. So if one can solve a linear system faster than by matrix multiplication, one could try to use the same algorithm for matrix multiplication.


If you want A^-1 b, where A^-1 is a known n×n matrix and b an n-dimensional column vector, you can just use the naive method, which takes O(n^2) steps. If you use the new linear solver, you'll need O(n^2.332) steps just for that, and additionally you'll have to find A somehow. So matrix-vector multiplication is faster.

In the article, they're comparing it to multiplication of square matrices, which is slower, but only slower than solving a single system of equations. If you wanted to multiply square matrices with the new linear solver, you'd have to solve n different systems of equations with shared coefficients, which would end up slower than the naive O(n^3) method, unless you can share work across instances somehow.


But even that would be limited, since not all matrices have inverses. They do have pseudo-inverses (also called the Moore–Penrose inverse), but the applicability there might be limited.


When solving a linear system, x and b only have one column. So unless the algorithm is general enough to not have that limitation, it won't be applicable to general matrix multiplication.


> When solving a linear system, x and b only have one column

Isn't it linear for any number of columns?


The extension to matrix-matrix is trivial, since a matrix is just a concatenation of columns.
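
Concretely (a small sketch; numpy's solver is just a stand-in for whatever single-right-hand-side solver you have):

    import numpy as np

    def solve_columns(A, B, solve=np.linalg.solve):
        # Compute A^{-1} B by solving one linear system per column of B.
        return np.column_stack([solve(A, B[:, i]) for i in range(B.shape[1])])

    A = np.random.rand(4, 4) + 4 * np.eye(4)
    B = np.random.rand(4, 3)
    assert np.allclose(solve_columns(A, B), np.linalg.solve(A, B))

As another comment in this thread notes, recovering a full matrix product this way costs n separate solves, which is why it doesn't automatically give a faster matrix multiplication.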


A matrix is a concatenation of columns yes, but how does that help us multiply matrices by using a sparse linear-system solver? Perhaps it's trivial but I don't see it.


A Monte Carlo(?) method surely has no advantage for multiplication.

I assume they use it for solving an overdetermined linear system, e.g. a least-squares fit. The article was not very clear.


As others mentioned, this algorithm isn’t matrix multiplication.

Even if it were, as the article mentions, we don’t even know what the optimal (in terms of basic operations on numbers) algorithm for multiplying matrices is. See https://en.wikipedia.org/wiki/Strassen_algorithm for a starting point.

(“Optimal in time” is a different problem. For fixed-size matrices, as often used in computer graphics, that may be known, but even then, there’s the effect of caching, time needed to move data to and from the GPU, etc)

Also, Strassen multiplication and its improvements have a “somewhat reduced numerical stability, and the algorithm also requires significantly more memory compared to the naive algorithm” (https://en.wikipedia.org/wiki/Numerical_stability)

This algorithm may have some of the same problems.

Being random, it likely also will have varying running times. That can be bad for GPUs if they are used to generate real-time graphics.

In total, I don’t see GPU manufacturers being worried about this. They already live with the worry that somebody will discover an O(n²) matrix multiplication algorithm that’s implementable in hardware and more efficient for small n, and I don’t think they lose sleep over that.


You are confusing the efficiency of an algorithm with its implementation. Worth repeating Dijkstra's famous dictum here: “Computer Science is no more about computers than astronomy is about telescopes.”


I don't believe I am confused actually. I get this is just an algorithm hypothetical being answered right now. I just care about the telescopes way more than the astronomy and am thinking steps ahead about the meta game of the telescope market. Just as I personally would care far more about the real-world implications of P=NP than about the act of solving it.


I understand that the answer is no here, because this method is only suitable for the class of linear systems that's the equivalent of sparse matrices. GPUs, on the other hand, are more optimized for general-purpose matrix multiplication. Unless it can be shown that there are certain economically high-usage scenarios for this class of problems (e.g. at the usage magnitude of bitcoin mining), the investment into this specific research does not seem warranted.


Completely ridiculous.

Firstly, faster solutions in fundamental problems can eventually lead to hardware that supports it.

Secondly, this is already happening for sparse matrix multiplication: the nvidia A100 has some sparsity support, to allow making better use of pruned neural networks, for example.

Thirdly, sparse enough systems, even without the A100, can run faster on cpu than gpu. If you find yourself with one of these problems, you can just choose the correct piece of hardware for the job. Without a sparse algorithm, you are still stuck with the slower dense solution.

Fourthly, giant sparse systems do indeed arise constantly. Just to make one up, consider weather measurements. Each row is a set of measurements from a specific weather station, but there are thousands of stations: it's a sparse set of observations, with some nearby dependencies. Evolving the state in time will often involve solving a giant linear system. (See other comments on the thread about pdes.)

It is absolutely worthwhile research, regardless of how applicable it is to fscking bitcoin.


My question is why did GPGPU stuff take so long to take off then? Computer graphics is basically applied linear algebra aka matrix math


computer graphics is doing a lot of very small (4x4 matrices) linear algebra

GPGPU is often about doing a few (or just one) very big (10^9 x 10^9) matrix ops.

You can split big matrix op into smaller ones, but with a big matrix you need to combine the results; with independent smaller matrices you don't.


Ah interesting, that does make a lot of sense.


You might like Scott Aaronson's 'Why Philosophers Should Care About Computational Complexity' https://www.scottaaronson.com/papers/philos.pdf


If this paper whets your appetite (talking mostly to grandparent comment, but also HN at large) OR alternatively if you think that this paper is a bit too advanced for your knowledge level, I implore you to read Scott Aaronson's book "Quantum Computing Since Democritus". The book is one of the finest in the genre of "too casual to be a textbook, too informative to be an easy read for laymen" nonfiction, and it is primarily concerned with philosophy vis-à-vis computational complexity. It does a pretty good job at assuming you only have a bit of computer science background (pun unintended, noticed, left in) and builds your knowledge base up from introductory set theory and first order logic in the earliest pages (In fact, it's based on a course he taught, and his lecture notes are available online and frequently make for fantastic supplemental material [1]). One of my favorite reads of 2020.

[1] https://www.scottaaronson.com/democritus/


Thank you for showing it is “to whet your appetite”. I always thought it was “to wet your appetite”, which made no sense at all. This is the first time I’d read this phrase instead of heard it spoken.


First time I've seen it written like that. I did not know it should be "whet" rather than "wet".

> which made no sense at all

The way I made some kind of sense of it was as a variation of making your mouth water. But yeah "whet" makes a lot more sense now.


Same as ‘whetstone’, if that helps remember. You’re sharpening/increasing your appetite.


>> “Philosophically we didn’t know before if you can go faster than matrix multiplication,” said Vempala.

>> Now we do.

> This is the part of computer science that I love. It's a branch of philosophy about what is possible with computing machines.

Strictly speaking, if w=2 then their bound of n^{(5 w - 4)/(w + 1)} won't be better than n^w.

Since we can't rule out w=2 we still don't know for sure that solving sparse systems is strictly easier. However this is a good indication.
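
Plugging in numbers (the ~2.3316 figure roughly matches the bit-complexity exponent quoted in the paper's abstract elsewhere in this thread):

    def sparse_solve_exponent(w):
        return (5 * w - 4) / (w + 1)

    print(sparse_solve_exponent(2.372864))  # ~2.3316, beats n^w for the current w
    print(sparse_solve_exponent(2.0))       # exactly 2.0, i.e. no gain if w = 2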


This part of CS is indeed more a Geisteswissenschaft ("science of the mind").

I think what is most fascinating when you compare it to more conventional philosophy is the fact that you can quantify your advances so easily and precisely.


"This relationship can be checked experimentally..." Einstein, 1905


One of my favorite little-known algorithmic facts is that a sparse system of n linear equations in n unknowns over a finite field can be solved in time O(mn) if the corresponding matrix has m nonzero entries[1]. I'm surprised at how hard it is to extend this to the reals - it seems like dealing with the precision issues is much tougher than I imagined.

[1] http://www.enseignement.polytechnique.fr/profs/informatique/...


Quick scan: it seems that this result is a real extension of that result.


One of the co-authors, Richard Peng, is well known in the competitive programming community (he has helped train the US IOI team in years past).


Santosh Vempala is also considered to be a genius, at least in the statistical learning community.

I don't know how he didn't get tenure from MIT though


According to his CV, he did get tenure.


At Georgia Tech, not at MIT. Still a great achievement though (go Jackets!).

Edit: my bad, looks like he was an associate prof at MIT, which means he was indeed tenured before moving to Tech!


I have no idea what happened in this case, but that doesn't follow - MIT is one of those schools that routinely promote to associate without necessarily giving tenure [0].

[0] https://thetech.com/2010/06/11/tenure-v130-n28


Interesting, thanks!


There is a position at MIT called associate prof without tenure. So just because he was an associate prof doesn't necessarily mean he had tenure.


Where did you hear that? It doesn't mention it in the article...


The “smart guessing” approach reminds me a lot of SAT solving. Certain search problems become easier when you introduce randomness and clever heuristics to optimize and guide your guesses.

Edit: And here’s a link to the paper discussed in the article for anyone who is curious:

https://arxiv.org/abs/2007.10254


That's really excellent writing. It explains the gist of it really well. I suspect the proof details are very complex though. It's also not clear if this method will practically be a speed up, or just asymptotically.

It's still a very cool result though: I love how my intuition would tell me that solving matrix multiplication is surely the fastest way, and then someone comes along and explains this method and my brain goes "oh yeah, that does make sense it could be faster". That's the beauty of certain results and explanations.


My feeling was just the opposite. I was constantly thinking "stop acting like I'm 12 and have 0 knowledge about linear algebra." As usual, it just depends what your context is, and the author will be happy to know there's an audience that finds it excellent.


I have a master's degree in math and a PhD in STEM so I know how linear algebra works. So when I say it's excellent writing, I don't mean perfectly calibrated for me or you. I mean it's excellent writing for a wide audience to get the gist of it. I think that's very difficult to do for math. An average 12 year old would not understand matrices without a very clear explanation.

Having taught math to both children and undergrads, I think it's very easy to overestimate how easy something is once you already understand it. A sentence like:

> for any ε > 0 and 0 < δ < 1, using poly(dimension, #nonzeros, condition, 1/ε, 1/δ) floating-point operations, our algorithm finds a solution having less than ε error with probability 1 - δ.

Would confuse the shit out of most people.

This video explains really beautifully what I mean: https://www.youtube.com/watch?v=M64HUIJFTZM


It seems like there's a lot of potential for domain specific optimization if we can improve the guessing algorithms.


Does anyone know how big are the constant terms here?

A few years ago I studied and partly implemented an earlier (related?) work by Peng (on graph laplacians and SDD systems: https://epubs.siam.org/doi/abs/10.1137/110845914 ) and the big-O improvements did not seem to compensate for the larger constant.


The constants are galactic: any 'fast matrix multiplication' algorithm outside of Strassen's algorithm has some incredible constants that are somewhat intrinsic to the recursive framework. The algorithm is primarily of theoretical importance (prior to this no one knew whether sparsity significantly helped methods of this type for solving linear systems), but it is not implementable.

However, the block Krylov algorithm itself presented in this paper has a little bit more of a chance of being implementable than fast matrix multiplication (the matrix multiplication is only used to solve small linear systems to deal with small eigendirections in the Krylov subspace). I am still skeptical that this is a truly practical algorithm due to its complexity, but unlike the case of generic FMM there is no obvious bottleneck.

On the other hand, in recent years Dan Spielman and collaborators have been working on fast implementations of Laplacian solvers: https://github.com/danspielman/Laplacians.jl I believe a lot of the fixed constants and combinatorial routines are changed from what is theoretically provable, but from screwing around with the code in the past it seems very fast in practice.


There are no matrix multiplications in linear solvers in general. There are matrix-vector multiplications, typically far fewer than the size of the matrix. This confusion hides that solving linear systems was already far faster than matrix-matrix multiplication, and makes it seem like a stronger result than it is. I would like to see this compared to e.g. sparse CG, for any non-random data matrix.


This is a theoretical result - for now at least. Theoretically the fastest linear solvers run in time n^w where w ~ 2.3 is the matrix multiplication constant. In practice the fast matrix multiplication algorithms are not that fast, so they aren't used for linear solvers.


I may be misunderstanding but they quote

>“It only works when your matrix is sparse enough,” said Williams.

This is still important but only for this subset of problems really.


It’s a big subset though. Lots of linear systems come from PDE discretization and are structurally sparse with a fixed number of non zeros per row.


That's very true. I especially would trust someone with your username!


Yes, also the matrices that come out from Gaussian Processes


In practice, the majority of the really big matrices I encounter are very sparse, because they're often constructed as a reduction from some other much smaller representation so we can use linear algebra methods.


What is the actual sparsity criterion to achieve this big-O bound? (Edit: found on arXiv - o(n^(ω−1)/log(κ(A))) non-zeros). Note also dependence on condition number.


This was a very clear exposition and easy to follow, not knowing much about the topic. A very nice piece of writing.


“In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any ω>2. This speedup holds for any input matrix A with o(n^(ω−1)/log(κ(A))) non-zeros, where κ(A) is the condition number of A. For poly(n)-conditioned matrices with Õ(n) nonzeros, and the current value of ω, the bit complexity of our algorithm to solve to within any 1/poly(n) error is O(n^2.331645). Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC `06 `07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices.”


Hmm, somehow reminds me of my controls classes way back where I learned that injecting some random noise in the measurements of a closed loop can actually help the estimator to converge better.


Interesting, is there a layman's explanation of why this might help?


It's really simple... in rough terms, it breaks loops that your estimator might get itself into! Imagine your estimator is a dynamic system, and while it's converging it gets trapped in a cycle between two values. Adding a little noise can break this cycle.
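
A toy illustration of that dithering effect (entirely made up for this comment, not from any specific controls result): a crude fixed-step estimator gets stuck bouncing between two values, but measurement noise plus time-averaging pulls the estimate toward the true value.

    import random

    def track(target, noise, steps=4000, step=1.0):
        # Move the estimate one fixed step toward each (possibly noisy)
        # measurement, then time-average the second half of the trajectory.
        x, history = 0.0, []
        for _ in range(steps):
            measurement = target + random.gauss(0.0, noise)
            x += step if measurement > x else -step
            history.append(x)
        half = history[len(history) // 2:]
        return sum(half) / len(half)

    random.seed(1)
    print(track(0.3, noise=0.0))  # stuck in a 0 <-> 1 limit cycle, averages to 0.5
    print(track(0.3, noise=1.0))  # noise breaks the cycle; the average lands near 0.3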


For those that need a concise refresher on - or introduction to - linear equations there's an (as usual) excellent series of 15 short videos from 3blue1brown:

https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFit...


If you only know there are 10 heads and 30 feet, how many chickens are there, and how many pigs? As algebra students learn, there’s a set procedure for figuring it out

If there are 30 feet, then there are 15 chickens. Because pigs don't have feet.

Where do I pick up my algebra prize?


It’s sad that 5 of those chickens don’t have heads


It’s less of a limitation than you’d think:

https://en.wikipedia.org/wiki/Mike_the_Headless_Chicken


Go with the 3-legged chickens instead.


Question from someone who is not super knowledgable about this subject - is this conceptually similar to using ML to solve linear equations? I.e. randomly guess and update weights...

Is there a field of research in using ML approaches for this type of problem?


If the same class of problems need to get solved on a regular basis, could the guessing algorithm get smarter with each iteration using ML techniques?

Seems like if the algorithm is applied to a particular domain, there could be efficiencies gained from optimizing what guesses to rule out.

Not very knowledgeable about the subject though.


Ax=b could certainly be solved with gradient descent methods popular in ML, but this is a lot more direct than that. It’s a little more like solving random pieces until it all comes together.
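
A bare-bones version of that gradient-descent route (minimizing ||Ax - b||^2; this is just the generic ML-style approach the comment mentions, not the paper's algorithm):

    import numpy as np

    def gd_solve(A, b, iters=5000):
        # Gradient descent on the least-squares loss 0.5 * ||Ax - b||^2.
        lr = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size from the spectral norm
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x -= lr * (A.T @ (A @ x - b))      # gradient step
        return x

    A = np.random.rand(20, 20) + 20 * np.eye(20)
    b = np.random.rand(20)
    print(np.linalg.norm(A @ gd_solve(A, b) - b))  # residual shrinks toward 0

The number of iterations this needs grows with the condition number of A, which is one reason direct and Krylov-type methods are usually preferred for generic systems.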


Isn't going faster a probabilistic property then? If it's random numbers, how could it be always faster?

What about a genetic algorithm? That's the same principle of guess and check.


This reads a little bit like simulated annealing or basin hopping algorithms at the random initial guess part.


Wake me up when it's in LAPACK. Then replace DGBSV, and don't tell me.


What is the memory usage like on this compared to existing dense solvers?


I don't totally understand this, but it looks similar to the quadratic sieve. Still, this is amazing.


is this gonna be on leetcode so we can prep? (jokes)


Give it a few decades


Nice work out of Georgia Tech. Go jackets!


Looks like a description of the simplex (and/or Monte Carlo) method for solving linear equations. Or did I get it wrong?


1. Simplex is used to solve linear programs, which are inequalities (not equations). While it's true that your solution will typically be on one of the vertices of your polytope, that just means that some (not necessarily all) of your constraints are hard constraints.

2. The worst-case time complexity of Simplex is exponential.


AFAIR the first step of the simplex method is to convert the system of inequalities into a system of equalities by adding slack variables and making a tableau (matrix) of those. And then incrementally solve that.

And in practice simplex is usually considered to have polynomial complexity, close to their estimates.
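
Concretely, the conversion step looks like this (a tiny sketch of adding slack variables only, not of a full simplex implementation):

    import numpy as np

    # Turn Ax <= b into [A | I] [x; s] = b with slack variables s >= 0.
    A = np.array([[1.0, 2.0],
                  [3.0, 1.0]])
    b = np.array([4.0, 5.0])
    tableau = np.hstack([A, np.eye(A.shape[0])])  # columns for x, then for the slacks
    print(tableau)  # any (x, s) with s >= 0 and tableau @ [x, s] = b satisfies Ax <= b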


(I'd be interested to see an average-case running-time analysis of Simplex that yields O(n^2.332) additions/multiplications, where n is the number of variables. Or even O(n^3). Because that's what's claimed here.)

That's mostly correct. You end up with a system of equalities for _a subset_ of your constraints, which yields a single vertex.

Ultimately with a linear program, you're trying to optimize an objective function,

max w'x s.t. Ax <= b, x >= 0 (in canonical form).

The x >= 0 constraint makes this nontrivial to translate from a linear equation Ax = b. If you magically knew the sign of each component of the solution, you'd be able to perform a simple variable substitution, but you generally don't, at least not without solving the system of equations.

(And yes, you need that constraint, or else your LP is unbounded, and your solution ends up with a negative infinity in it.)


Simplex uses solving systems of linear equations as a building block.

Most variants of simplex have exponential runtime on worst case input. (But they do well in practice.)

Recently, some variants of simplex with polynomial runtime have been found.

Solving linear equations was always known to 'only' take polynomial amounts of time at most. But the question is whether that's n^3 or n^2 or something in between.


I believe the simplex method is for solving optimization problems, not linear systems?


Krylov subspace methods. Use a random seed to find an approximate diagonalization.



