Hacker News new | past | comments | ask | show | jobs | submit login
Learning Theory from First Principles [pdf] (ens.fr)
279 points by magnio 10 months ago | hide | past | favorite | 63 comments



> 2.5 No free lunch theorem > > Although it may be tempting to define the optimal learning algorithm that works optimally for all distributions, this is impossible. In other words, learning is only possible with assumptions.

A mention of no free lunch theorem should come with a disclaimer that the theorem is not relevant in practice. An assumption that your data originates from the real world, is sufficient that the no free lunch theorem is not a hindrance.

This book doesn't discuss this at all. Maybe mention that "all distributions" means a generalization to higher dimensional spaces of discontinuous functions (including the tiny subset of continuous functions) of something similar to all possible bit sequences generated by tossing a coin. So basically if you data is generated from an even random distribution of "all possibilities", you cannot learn to predict the outcome of the next coin tosses, or similar.


Yes, most free lunch theorems and results of this kind, which make overly general assumptions, tend to be too pessimistic.

For example, many people naively think that static program analysis is unfeasible due to the halting problem, Rice's theorem, etc.


Just curious, how does program analysis on real-world programs exactly circumvent the problems of the halting problem or Rice’s theorem? In the real world, do we only ever have statically analyze a special subset of all programs?


The main strategy is to build sound but imperfect analyzers, i.e. analyzers that never raise false negatives but that may raise some false positives. See §1.6 in [1], a simplified version of the classic Principles of Program Analysis. Good analyzers are practical, rarely raising false positives for well-written programs. The Astreé analyzer reported zero false positives for the 1 MLOC fly-by-wire A380 code.

Another complementary strategy is to avoid Turing-complete constructs as much as possible, i.e. use DSLs with restricted semantics. This way, advanced semantic properties such as termination are provable.

[1] Program Analysis, An Appetizer. https://arxiv.org/pdf/2012.10086.pdf


Halting problem only applies in the general case. I can trivially tell you that while(1) will never halt.

There are many examples of programs whose halting behavior is not known (collatz conjecture for example) but many others where program analysis works just fine.


If you write a program whose behavior is Collatz-like (say, in some states it queues up more work for itself, and in other states it completes work from the queue, and you believe that in general it should always ultimately complete the queue) it is actually useful to have a static analyzer tell you ‘it’s not entirely clear that this code will ever terminate’.

You can make the analyzer happy by adding a max iteration limit or a recursion depth limit or something to make sure it fails out rather than looping forever.

Which is probably a good idea anyway, if you’re running code that you can’t mathematically prove will always complete.


In the real world, if program analysis hits a blocker like this, you tweak stuff until it don't. Top-level post is correct in that, while theory applies to the general case, data and programs we actually use are not completely random/general - there's lots of properties baked in as a consequence of being real-world, physical entities.


The commercial static analyzers I've seen generate false positives (bogus issues that just aren't there) and false negatives (e.g., unconditional out of bounds pointer writes if that particular piece of C code is ever executed). Some of that comes with the territory because commonly used languages and their libraries are underspecified and commonly used at the boundaries of what is specified. And these tools must always produce some result even if they cannot even parse (or see) the entire code base.

Usually, when people say “static analysis“ they accept unsoundness and use of heuristics. Otherwise, they call the tool a type checker or a verifier. Such tools may run into the theoretical issues you mentioned. For them, the solution is to change the program until it compiles in a reasonable amount of time.


In the real World Turing machines don't exist only finite state Machines. (No infinite tape or infinite time)

I guess something related to this one way or another.


You don't need an infinite tape to make a finite state machine that never halts. As Legend2440 pointed out upthread, while(1) is a simple finite state machine that never halts.


Sure but halting problem is solvable for finite state machines.


Could you expand or provide a link to a good resource for me to understand this?

If the judge program should say terminates yes/no and the program given is `while True: continue`, I guess the argument is that in the finite case, you could in principle just enumerate all programs that don't terminate and identify them as such?


In principle, you can enumerate all possible memory states of the system and determine what the next memory state would be from each one (including multiple possible next states if you account for things like interrupts)

Then you treeshake the unreachable parts of that directed graph from the start state, and look for closed loops in what remains.


Makes sense! Thanks!


The theoretical halting problem is required to return a yes/no answer, whereas in the real world, it's actually really valuable to get back a "maybe doesn't halt" answer, so you can then more specifically iterate on those "here be dragons" areas of your system.


It's analogous to the problem of induction (Hume). Without assumptions, it's impossible to connect past observations to predictions about the future. Observing that the sun rises a thousand mornings does not automatically make it any more or less likely that it will rise tomorrow, unless we make some assumptions, for example that events tend to be similar over time. And those assumptions can't be extracted from the data. Maybe things tended to be similar over time in the past, but that says nothing about the future. The pattern might break tomorrow and to say that it likely won't, is simply an assumption on a higher level.

But it seems like science is "doing fine" despite this problem. Similarly machine learning chugs along fine, because people use their experience and prior knowledge when designing the algorithms. These assumptions are also called inductive biases. They are biasing the learning towards certain patterns (like "things tend to be similar locally").


One just assumes Occam's razor and you're good to go in a vast majority of cases


If you formalize Occam's "simplicity" as description length, then that depends on encoding. By assuming an encoding, you implicitly assume a distribution (according to entropy coding, eg Huffman coding) and hence inject an inductive bias.

I'm not saying that it's bad to assume one. The point is that it is an assumption. My bigger point is that the no free lunch theorem should only bother you as much as the induction problem bothers you. Which in practice means not at all.


I don't see how your disclaimer applies. My interpretation of no free lunch theorems is that no single algorithm works well for all classes of problems, not that some problems are unlearnable. The example in its proof might be contrived, but in actuality, additional assumptions can and do lead to different algorithms being picked, no?


Transformers go brrr


Work pretty well for all classes of problems?


Learning theory is the attempt to formalize natural science up to decision. Natural science's unstated assumption is that a sufficiently sophisticated algorithmic world model can be used to predict future observations from past observations. Since this is the same assumption as Solomonoff's assumption in his proof of inductive inference, you have to start there: with Turing complete coding rather than Rissanen's so-called "universal" coding.

It's ok* to depart from that starting point in creating subtheories but if you don't start there you'll end up with garbage like the last 50 years of confusion over what "The Minimum Description Length Principle" really means.

*It is, however, _not_ "ok" if what you are trying to do is come up with causal models. You can't get away from Turing complete codes if you're trying to model dynamical systems even though dynamical systems can be thought of as finite state machines with very large numbers of states. In order to make optimally compact codes you need Turing complete semantics that execute on a finite state machine that just so happens to have a really large but finite number of flipflops or other directed cyclic graph of universal (eg NOR, NAND, etc.) gates.


Have they figured out what causes double descent yet?


I don't know if it's a generalized result, but the Circuits team at Anthropic has a very compelling thesis: the first phase of descent corresponds to the model memorizing data points, the second phase corresponds to it shifting geometrically toward learning "features".

Here a "feature" might be seen as an abstract, very, very high dimensional vector space. The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts. They experiment with a toy model and toy data set where the latent features are represented explicitly and then compressed into a small set of data dimensions. This forces superposition. Then they show how that superposition looks under varying sizes of training data.

It's obviously a toy model, but it's a compelling idea. At least for any model which might suffer from superposition.

https://transformer-circuits.pub/2023/toy-double-descent/ind...


> The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts.

Wonder if it's a matter of perspective - that is, of transform. Consider an image. Most real-world images have pixels with high locality - distant pixels are less correlated than immediate neighbours.

Now take an FFT of that. You get an equivalent 2D image containing the same information, but suddenly each pixel contains information about every pixel of the original image! You can do some interesting things there, like erasing the centre of the picture (higher frequencies), which will give you blurred original image when you run FFT on the frequency-image to get proper pixels again.


I think that’s basically correct, the FFT representation is a better feature representation.


There was actually a very recent blog post claiming that statistical mechanics can explain double descent https://calculatedcontent.com/2024/03/01/describing-double-d...

Some more detail here: https://calculatedcontent.com/2019/12/03/towards-a-new-theor...


Not an expert, but this paper explores double descent with simple models. The interpretation there: when you extend into the overparameterized regime, that permits optimization towards small-norm weights, which generalize well again. Does that explain DD generally? Does it apply to other models (e.g. DNNs)?

https://arxiv.org/pdf/2303.14151.pdf


No. We don't know. My favorite hypothesis: SGD is...well, stochastic. Meaning you're not optimizing w.r.t the training corpus, but a tiny subset, so your gradient isn't quite right. Over-training allows you to bulldoze over local optima and recurse toward the true distribution rather than drive around a local over-fitting basin.


You can get it with full gradient descent though... https://www.nature.com/articles/s41467-020-14663-9

Honestly the fact that there doesn't seem to be a good explanation for this makes me think that we just fundamentally don't understand learning.


There are so many great mathematical PDFs available for free on the Internet, written by academics/educators/engineers. A problem is that there is a huge amount of overlap. I wonder if an AI model could be developed that would do a really good job of synthesizing an overlapping collection into a coherent single PDF without duplication.


One could also just pick the books used in the corresponding university courses.


In more advanced undergraduate math, PDF lecture notes written by professors but not published in book form often contain excellent explanations and proofs that are not available in books. Also, making undergraduate buy expensive text books is more of a thing in first year American classes. Look at the lecture notes at Oxford for example.

https://courses.maths.ox.ac.uk/


No need for an AI model. Probabilistic Machine Learning by Murphy is an excellent reference and resource.


I'm not talking about this specific subject.


Minor nitpick: The title is confusing. The first word should be substituted with "Machine-Learning" or "Statistical-Learning". Ideally, the author of the piece should do that at some point.


This is pretty hard to read. For example, on the first page of chapter 1, it talks about "minimization of quadratic forms" and shows what looks like the formula for linear least squares. Is that right? It doesn't say anything about this. Some more exposition would help.

I do like that there are lots of exercises.


I think the text is geared towards people with some mathematical background who want to understand learning theory. Besides it is clearly stated that this chapter is a review (so its assumed that you learned or will learn these things elsewhere).


Well I have some math background but that section is brisk and slow at the same time, as it were. Such as how it explains how to find inverses of 2x2 matrices.

This is older but is supposed to be good: https://www.deeplearningbook.org/


First principles doesn't mean easy to read unfortunately


The sibling comment is right in that this is clearly not intended for first timers.

But your instincts are correct here. When you write out the objective function for ordinary least squares, it turns out to be a quadratic form. The choice of the word "quadratic" here is not a coincidence: it is the generalization of quadratic functions to matrices. That section covers the vector equivalent of minimizing quadratic functions.


Certainly doesn't seem like first principles...


"First principles" doesn't mean "introduction". It is to contrast with anecdotal experience / tacit knowledge / empirical best practice approaches.


Least squares is quadratic.

Quadratic means square terms.


Fascinating, does anyone have any good books on this?


Interesting! I’ll have to look over it when I have more time.

From a quick glance, it looks like it covers much of the same material as this text [1]. I wonder how they compare.

[1]: https://www.cambridge.org/core/books/understanding-machine-l...


A 2014 book on machine learning sounds quaint and historical.


Depends on what you want from a (text)book. In my mind books should be authoritative, they should include things that have had some thought put into them and are fairly well studied/verified. Modern advances in deep learning/ML are exciting but are very often not this. I would not a read a book which is just some recent hype papers from NeurIPS/ICML stapled together.


It depends on the particular subtopics it covers. Machine Learning: A Probabilistic Perspective is from 2012 and it's still a great resource, although Murphy's newer book will certainly cover more up-to-date material.


I can’t wait until I tell GPT-5 “I have this idea I want to try, read this book and tell me there’s anything relevant there to make it work better”.


I've never heard of an LLM making up a new idea. Shouldn't this only work if your thing has already been tried before?


How people define "new" varies a lot in this context. I've spent a lot of time talking with ChatGPT exploring interdisciplinary ideas and while I think it frequently says things that qualify as new, I've only run into one situation where I was trying to find some way of doing something technical and it just invented something non-trivially original to handle it: https://x.com/Westoncb/status/1763733064478335326?s=20


In my example I already have an idea, but I’m not sure how good or novel it is, or perhaps I tried it and it didn’t work but I don’t understand why.


Have you heard of a person making up a new idea? Can you definitely state that it's not just a combination of a few previous ideas?


The brain is able to make better associations than an LLM. From what I’ve seen, LLMs always stay on topic, even when that means outputting only platitudes. A human can go past that and try different approaches to a question.

It can also take part of a problem, choose a lead at random, think through the results, and step back if that doesn’t work. A forward pass in an LLM doesn’t do that, yet.

Basically it seems to me that LLMs may have part of what makes us good at inventing ideas, but they’re missing part of that process.


I don't think this is a fundamental limitation. If the LLM is trained (through RLHF or something else) to go on a chain-of-internal-monologue (which may take arbitrarily long) to figure out what it will answer, then this kind of adaptive amount of compute can be achieved.


Sure, but then how good are Transformers at making connections between distant concepts? The brain seems way better at this, and it doesn't feel like conversation.


One of the big hopes of this "train on all text in existence" (including academic works and everything else) is that this enables connections that no human could make before, since there probably a total of zero people who are specialists in medieval history and molecular biology and also music theory and quantum computational complexity and what have you. However current models seem not to do much of this in my experience. Typically they remain in the specialist "persona" that rarely brings up connections that would be known by other human specialists (whom it would otherwise have no issue impersonating). But I'd imagine this can also be improved somehow.

Think about how often narrowly specialized academics reinvent a bad version of something that already has an elaborately understood theory in another field. (a meme example is when a medical researcher working on insulin metabolism reinvented the trapezoid rule for integration - but of course many more mundane cases happen every day). It would be a great opportunity to avoid this by checking with an LLM. Arguably if that researcher fed his idea to an LLM as it exists today, the LLM would have recognized that this is just the trapezoid rule. In fact I also use it for such "sounding board" where it gives me good googleable terms like, "what you're looking for is called an XYZ".


I am realising that passing context for $this is the tricky part as-

1. It is very difficult for me to tell you about my context as a user within low dimension variables.

2. I do not understand my situation in the universe to be able to tell AI.

3. I dont have a vocabulary with AI. Internet i feel aced this with shared HTTP protocol to consistently share agreed upon state. For ex within Uber I am a very narrow request response universe with.. POST phone, car, gps(a,b,c,d), now, payment.

But as a student wanting to learn algorithms how do I pass that I'm $age $internet-type from $place and prefer graphical explanations of algorithms, have tried but gotten scared of that thick book and these $milestones-cs50, know $python upto $proficiency(which again is a fractal variable with research papers on how to define for learning).

Similarly how do I help you understand what stage my startup idea is beyond low traction, but want to know have $networks/(VC, devs, sales) APIs, have $these successful partnerships with such evidence $attendance, $sales. Who should I speak to? Could you pls write the needful in mails and engage in partnership with other bots under $budget.

Even in the real world this vocabulary is in smaller pockets as our contexts are too different.

4. Learning assumes knowledge exists as a global forever variable in a wider than we understand universe. $meteor being a non maskable interrupt to the power supply at unicorn temperatures in a decade. Similarly one time trends in disposable $companies that $ecosystem uses to learn. I'm in a desert village with with absent electricity might mean those machines never reach me and perhaps most people don't have a basic phone in the world to be able to share state. Their local power mafia politics and absent governance might mean the pdf AI recommends i read might or might not help.

I don't know how this will evolve but to think of the possibilities has been so interesting. It's like computers can talk to us easily and they're such smart babies on day 1 and "folks we aren't able to put right, enough, cheap data in" is perhaps the real bottleneck to how much usefulness we are being able to uncover.


What you described is what some of the recent startups are working on: https://www.rewind.ai (I’m not associated with them). To me it seems like a rather trivial problem to solve, compared to creating an LLM in the first place.


Ypur idea would become even better so if you read the book yourself... You might even learn something new along the way.


we are optimizing for time here, not learning. I'm gonna die anyways and anything I learn will be dust. If I just need the info to make something work, spending months (of which I have limited number of) to see if something has something useful in it for me vs. spending and afternoon to probe it to get most of the benefits is a no-brainer.


Maybe, but dont expect otherwise from a book explicitly named "Learning Theory from First principles", not "learn large language models in 21 days".




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: