Algorithm efficiency comes from problem information (stochasticlifestyle.com)
109 points by ChrisRackauckas on Jan 4, 2018 | 28 comments



Yes, but... that somewhat misses the point of machine "learning"! Maybe the author is being a bit facetious in making an overly broad interpretation of an important point.

When one wants to use the best ML model "in production" (after all the exploration is done and dusted), or when there is a "domain expert" who can specify the right biases to encode into the algorithm (i.e. problem information), the point made in the post is very valid.

Often, however, the point of using ML is to tease out some property that is not apparent previously. In that case, the hope is to use some class of algorithms (eg: neural networks) with the right general biases (eg: convolutional layers and RELU activations, or share the first few layers of VGG) so that it can pull out the appropriate aspects from the problem specification that you did not know to ask for.

From this perspective, the challenge lies in the 'No Free Lunch' theorem, which implies that unless you know enough about the problem to indirectly encode the correct biases (eg: in the case of image classification -- choosing neural networks, convolutions, etc. rather than a decision tree) you can never win the game.


I'm going to quibble with your last point. A decision tree absolutely can classify images. Depending on exactly how you specify the problem, it can do so exactly as well as a neural network. It would just be a lot less space efficient.

No free lunch isn't about choosing the right biases; it says that a single model cannot solve every problem, and in fact any solver will have accuracy equivalent to chance when averaged across all problems. Here's an example:

There are four possible one bit to one bit functions, I and ~ (the identity and inverse functions), as well as T and F (True and False). If you create a truth table for all four of those functions:

    func  input | output
    I     0     | 0
    I     1     | 1
    ~     0     | 1
    ~     1     | 0
    T     0     | 1
    T     1     | 1
    F     0     | 0
    F     1     | 0
You'll notice that the output column contains an equal number of one bits and zero bits. This is true for any input size on binary functions, and in general if you extend the definition from "one bits and zero bits" to "equal number of each possible output class" it is true for functions of any output size.

That's what the No Free Lunch theorem states: that across all possible functions in your space, you can do no better than chance, because for any function F that you approximate exactly, I can define a function "!F" (or in a non-binary space, a family of functions) which you are exactly 0% accurate on. This is true no matter the algorithm or assumptions you choose.
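For anyone who wants to see this concretely, here is a small Python sketch (my own illustration, not from the thread) that enumerates every Boolean function on n input bits and checks that a fixed predictor's accuracy, averaged over all possible target functions, is exactly chance:

    # Enumerate every Boolean function on n input bits and average a fixed
    # predictor's accuracy over all of them. The predictor can be any fixed rule.
    from itertools import product

    n = 2
    inputs = list(product([0, 1], repeat=n))              # all 2^n possible inputs

    def predictor(x):
        return sum(x) % 2                                 # arbitrary fixed rule

    accuracies = []
    for outputs in product([0, 1], repeat=len(inputs)):   # 2^(2^n) target functions
        target = dict(zip(inputs, outputs))
        correct = sum(predictor(x) == target[x] for x in inputs)
        accuracies.append(correct / len(inputs))

    print(sum(accuracies) / len(accuracies))              # 0.5, no matter the predictor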

Similarly, I can define a classification algorithm that has no implicit assumptions about what I am classifying, other than that it can be classified, but still achieves 100% accuracy on any specific problem I apply it to:

For every possible input in your function space, get the ground truth label from some external function, throw those all into a mapping (a very, very large mapping) and then just look up the input in the mapping when requested.

For image classification, this means a mapping of all possible NxN images to a boolean: "Contains dog". You might even, as an optimization, make this more space efficient by picking certain pixels that are more impactful and looking at them first, so that you don't have to store every image. That sounds an awful lot like a decision tree ;)
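Here's a toy version of that lookup-table classifier in Python (my own sketch; the "images" are tiny binary grids and the ground-truth oracle is made up for illustration):

    # Build the giant mapping from every possible image to its label, then
    # "classify" by lookup. 100% accurate, zero assumptions -- at the cost
    # of 2^(N*N) table entries.
    from itertools import product

    N = 2  # 2x2 binary "images", so the table is small enough to enumerate

    def oracle(image):
        # Stand-in for the external ground-truth labeller ("contains dog").
        return sum(image) >= 2

    lookup = {image: oracle(image) for image in product([0, 1], repeat=N * N)}

    def classify(image):
        return lookup[tuple(image)]

    print(classify((1, 0, 1, 0)))  # True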

Tl;dr: 'No Free Lunch' has nothing to do with the efficiency of the encoding, but instead with the ability to have a perfect agent that can solve every problem.


> Often, however, the point of using ML is to tease out some property that is not apparent previously. In that case, the hope is to use some class of algorithms (eg: neural networks) with the right general biases (eg: convolutional layers and RELU activations, or share the first few layers of VGG) so that it can pull out the appropriate aspects from the problem specification that you did not know to ask for.

Sure, but I think the failure of neural nets to solve PDAs was a significant negative. You'd like a machine learning algorithm to find information you didn't think of, but at this point it seems like they succeed only in certain circumstances. Describing those instances would help characterize the limits of present ML and how it can go further.


Technically, the NFL theorem does not apply to the real world: https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...


The ideas behind the post are gold. I have blogged about group-by strategies and one of the key themes is that the more you know about the inputs the more you can optimize your algorithms (see https://www.codementor.io/zhuojiadai/an-empirical-study-of-g...)


>What does my algorithm know about my problem?

I’ve phrased the same idea as “what regularity of the problem space is my algorithm exploiting?” A good first question to demystify any purported fast algorithm.


I've been thinking of this prior knowledge as "domain reduction", or even "dimensionality reduction". The more I know about the problem, the more of the space I can just remove from the search before I start. In high dimensional spaces, each dimension removed can be loosely/conceptually like an order of magnitude.

I like your phrase too, as there are classes of knowledge about a problem that don't necessarily reduce the domain, depending on how you look at it, but can still be exploited.


My pet phrasing is more like: each bit of domain knowledge I have lets me cut the problem space in half.
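A rough way to see it (my own toy example, not from the thread): every yes/no question about the problem is one bit, and each answer halves the remaining search space.

    # Find an unknown integer in [lo, hi) using yes/no questions; each answer
    # is one bit of information and cuts the remaining interval in half.
    def locate(lo, hi, is_below):
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_below(mid):      # "is the unknown value below mid?"
                hi = mid
            else:
                lo = mid
        return lo

    secret = 37
    print(locate(0, 1024, lambda m: secret < m))  # 37, after about 10 questions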


I made essentially this argument in the thread for the Google announcement about beating Stockfish and was downvoted mercilessly.

I think neural nets and deep learning systems are misunderstood so profoundly because humans are bad at many kinds of reasoning and the explanation of the neural net being "smart" is so psychologically appealing.


> In fact, you can show that solving a linear system has to have O(n^3) as a lower bound

Where did the author get this idea? It can be done more efficiently than that: https://mathoverflow.net/a/52557/40272


I think you are misreading the answer; it states the well-known fact that matrix multiplication and inversion can be done in less than O(n³), but it does not say this about solving linear systems. If you read through the material referenced in the answers, it seems that O(n³) is indeed a lower bound and actually hard to obtain in practice.


You can compute the inverse of a matrix A in asymptotically the same amount of time as you can multiply two matrices. You can also Cholesky factorise in that amount of time, and compute T^(-1) A where T is (upper or lower) triangular in that amount of time. You invert by A^(-1) = (A^T A)^(-1) A^T, the latter computed by Cholesky and two triangular solves. You Cholesky and do triangular solves by divide-and-conquer.
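For concreteness, here is a small NumPy/SciPy check of that reduction (illustration only: the library calls below just verify the algebra, while the sub-cubic bound comes from doing the Cholesky and the triangular solves by divide-and-conquer on top of fast matrix multiplication):

    # A^{-1} = (A^T A)^{-1} A^T, with the SPD inverse done via Cholesky
    # and two triangular solves.
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))                # assumed invertible

    G = A.T @ A                                    # symmetric positive definite
    L = cholesky(G, lower=True)                    # G = L L^T
    Y = solve_triangular(L, A.T, lower=True)       # solve L Y = A^T
    A_inv = solve_triangular(L.T, Y, lower=False)  # solve L^T X = Y, so X = G^{-1} A^T

    print(np.allclose(A_inv @ A, np.eye(5)))       # True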

So no, the article is wrong.


Yes, I made a mistake. Gaussian elimination is O(n^3), but there are general algorithms which are as fast (or slow, depending on how you look at it) as the current complexity bound for matrix multiplication which is O(n^{2.737}). Not the end of the world, I corrected the article. Reference:

http://theory.epfl.ch/vishnoi/CargeseLectures.pdf

Still, the point stands that you'd never want to actually solve a large system "the generic way" unless you really have to (i.e. you cannot identify a structure for which you have a faster algorithm).


It looks like you also turned the rest of your post into a link.

The best currently known exponent for square matrix multiplication is around 2.3728; Wikipedia's page on matrix multiplication has a reference.


>However, the constant coefficient hidden by the Big O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.

No wonder I never heard about these. I updated it for correctness (and fixed the link, ugh) but yeah, these are details I'd hope to not dwell on in the post since for all intents and purposes you'd still never want to do the worst fallback unless you have to.


Yep. (Largely) irrelevant in practice. But the gap between what we can do and what lower bounds we can prove is very large indeed for many matrix operations.


I am not an expert but I think you are right and I misread it myself. After reading it again Gaussian elimination seems to be somewhat worse than O(n³) but other algorithms are as fast as fast matrix multiplication.


The details people are talking about in that MathOverflow thread concern the precision needed to represent the numbers that crop up during Gaussian elimination. (They get huge, and if you are counting bit or word operations, then Gaussian elimination is indeed terribly slow.) The post author is using a less detailed model in which he counts arithmetic operations. In this model, matrix multiplication and matrix inversion have the same asymptotic complexity.


Alternately, you could say that more information allows you to recognize that you're faced with a different problem, one which is more-specific and may have a more-efficient solution.

The outermost, most-general, and least-tractable problem being "Do what's needed for whatever."


Yes. And it is interesting that general neural networks can't really solve that one. The closest thing to a solution we have for it is the human animal -- and it is still a pretty specific beast.


> However, while these features matter, most decent implementations of the same algorithm tend to get quite close in efficiency to each other (probably <3x if everyone is using a performant enough programming language or constructs).

No, I don't think this is even close for most problems. Neural networks, maybe, but they're an unusually simple case where everyone uses similar, largely GPU accelerated code.


At least in scientific computing this applies. And note I say most decent implementations, so I assume you took it to a language without performance issues (C/Fortran/Rust/Julia/Go/etc.), got rid of temporary allocations, made sure you looped in the correct order, etc. After you do all of that, for mathematical and scientific problems you tend to hit a wall. BLAS libraries tend to be within that range of each other (https://github.com/barche/julia-blas-benchmarks/blob/master/...), which pretty much directly means that regression, linear classification, etc. libraries are all going to be within this amount. Different differential equation solvers which are sufficiently optimized with the same algorithm tend to be within that amount of each other. Neural network libraries tend to be within that amount of each other. General optimization algorithms over a large class of problems tend to be within it as well. That doesn't mean it's not a worthy enterprise to keep making the "general fallbacks" better; indeed, this helps everyone.

If you're looking to make your machine learning software faster, writing a new piece of fundamental GPU code to beat cuDNN could possibly help, but it probably won't get you very far. Changing your model to better match the data, on the other hand, could make the problem orders of magnitude easier.


You can't prove that the effects of "temporary arrays, choice of language, and parallelism" and "pointer indirection, cache efficiency, and SIMD vectorization" don't matter unless you compare one thing that handles those issues to another thing that doesn't.

BLAS libraries are all within the range of each other because they all store memory in basically the same way, do indirection in basically the same way, care for the cache in basically the same way and do SIMD in basically the same way.

As soon as you step out of those prebuilt blocks and build your own function over squares of numbers, you're going to be losing far more than a factor of 3 until you understand how to handle these issues.
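As a quick illustration (my own sketch, not from the thread; timings are machine-dependent): the same O(n^3) matrix multiply, once through a tuned BLAS call and once through a generic compiled loop that does no cache blocking and uses no tuned SIMD kernels.

    import time
    import numpy as np

    n = 600
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    t0 = time.perf_counter()
    C1 = A @ B                                            # dispatches to BLAS
    t1 = time.perf_counter()
    C2 = np.einsum('ij,jk->ik', A, B, optimize=False)     # generic loop, no BLAS
    t2 = time.perf_counter()

    print(f"BLAS:    {t1 - t0:.3f} s")
    print(f"generic: {t2 - t1:.3f} s")                    # typically many times slower
    print(np.allclose(C1, C2))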


>You can't prove that the effects of "temporary arrays, choice of language, and parallelism" and "pointer indirection, cache efficiency, and SIMD vectorization" don't matter unless you compare one thing that handles those issues to another thing that doesn't.

I think he said the exact opposite. He's basically saying algorithmic time complexity matters (and therefore identifying your problem class matters, since you might get a more efficient algorithm out of it) since you can only get so far with the generic algorithm, however good your implementation is.


The argument is symmetric; once you've tuned the algorithm the only way to get >10x speedups is by working with those low-level things.

Both the broad-strokes algorithm and the fine details are responsible for huge performance differences.


Actually scientific code has huge variation in speed.

The same high-level algorithm might be implemented inside an old, highly optimised FORTRAN program and then get transliterated into a use-once MATLAB script which runs 100 times slower because the vectorisation voodoo wasn't quite right.

In the latter case, people might not bother fixing it because the hour it would take is longer than the run-time. Or maybe the run-time cost is weeks, but the people doing it are scientists and understanding computers isn't their job.


I made the caveat that it's sufficiently optimized. I'm not talking about full loops on native Python arrays or something silly like that. If two people who know what they're doing try to optimize some scientific algorithm in a language without extra penalties (C, C++, Fortran, Julia, Go, Rust, etc.) the runtimes end up quite similar, at least within a few x. Of course someone can make it worse (there exist many bad coders in science...), but I'm not talking about that. I'm just saying that spending 2 years to play with some assembler code is not likely to give you some amazing speedup, while specializing algorithms to your problem is a clear and tested way to get something that is much more efficient.


Another example of this is found in comparing generic MCMC methods to HMC. The former work for discrete or continuous parameters while the latter requires a gradient. Regardless of how fast a generic MCMC implementation is, the HMC method is asymptotically faster, so for some problem sizes the language or low level optimizations are irrelevant.
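A minimal sketch of the contrast (my own, illustrative only): a random-walk Metropolis step needs nothing but log-density evaluations, while an HMC step also consumes the gradient -- exactly the extra piece of problem information the generic method can't exploit.

    import numpy as np

    def logp(x):             # standard Gaussian target
        return -0.5 * np.dot(x, x)

    def grad_logp(x):        # the extra problem information HMC requires
        return -x

    rng = np.random.default_rng(0)

    def rw_metropolis_step(x, step=0.1):
        # Gradient-free: propose a blind random perturbation, accept/reject.
        prop = x + step * rng.standard_normal(x.shape)
        return prop if np.log(rng.random()) < logp(prop) - logp(x) else x

    def hmc_step(x, eps=0.1, n_leapfrog=20):
        p = rng.standard_normal(x.shape)                    # fresh momentum
        x_new, p_new = x.copy(), p + 0.5 * eps * grad_logp(x)
        for _ in range(n_leapfrog):                         # leapfrog integration
            x_new = x_new + eps * p_new
            p_new = p_new + eps * grad_logp(x_new)
        p_new = p_new - 0.5 * eps * grad_logp(x_new)        # undo the extra half step
        # Metropolis correction on the joint (position, momentum) energy.
        log_accept = (logp(x_new) - 0.5 * np.dot(p_new, p_new)) \
                   - (logp(x) - 0.5 * np.dot(p, p))
        return x_new if np.log(rng.random()) < log_accept else x

    x = rng.standard_normal(50)
    for _ in range(100):
        x = hmc_step(x)      # gradient-guided moves across all 50 dimensions at once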



