Hacker News new | past | comments | ask | show | jobs | submit login
Learning how to think with Meta Chain-of-Thought (arxiv.org)
197 points by drcwpl 18 hours ago | hide | past | favorite | 51 comments





I find their critique compelling, particularly their emphasis on the disconnect between CoT’s algorithmic mimicry and true cognitive exploration. The authors illustrate this with examples from advanced mathematics, such as the "windmill problem" from the International Mathematics Olympiad, a puzzle whose solution eludes brute-force sequential thinking. These cases underscore the limits of a framework that relies on static datasets and rigid generative processes. CoT, as they demonstrate, falters not because it cannot generate solutions, but because it cannot conceive of them in ways that mirror human ingenuity.

As they say - "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."


And then other problems would perhaps turn up down the track that would call for "discovering new ways to discover new ways of discovery" and so on.

> "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."

Wow I love that quote.


Thank you for mentioning the windmill problem. Great insights!

https://www.3blue1brown.com/lessons/windmills


Just train it on meta reasoning, ie train it on people discovering ways to discover. It's not really a big problem, just generate the dataset and have at it.

This doesn't give you the ability to process ideas through the derived new insights, any more than loading the contents of a VLSI program into regular RAM gives you an FPGA.

The linear-algebra primitives used in LLM inference, fundamentally do not have the power to allow an LLM to "emulate" its own internals (i.e. to have the [static!] weights + [runtime-mutable] context, together encode [runtime-mutable] virtual weights, that the same host context can be passed through.) You need host support for that.


> The linear-algebra primitives used in LLM inference, fundamentally do not have the power to allow an LLM to "emulate" its own internals […] You need host support for that.

Neither do biological brains (explicitly), yet we can hypothesize just fine.


You're conflating two steps:

1. hypothesizing — coming up with a novel insight at runtime, that uncovers new parts of the state space the model doesn't currently reach

2. syllogizing — using an insight you've derived at runtime, to reach the new parts of the state space

LLMs can do 1, but not 2.

(Try it for yourself: get an LLM to prove a trivial novel mathematical theorem [or just describe the theorem to it yourself]; and then ask it to use the theorem to solve a problem. It won't be able to do it. It "understands" the theorem as data; but it doesn't have weights shaped like an emulator that can execute the theorem-modelled-as-data against the context. And, as far as I understand them, current Transformer-ish models cannot "learn" such an emulator as a feature. You need a slightly different architecture for that.)

And actually, humans can't really do 2 either!

That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!

Instead, for humans, the process we go through is either:

• come up with the insight; sleep on it (i.e. do incremental training, converting the data into new weights); use the insight

• build up 99% of the weights required for the insight "in the background" over days/months/years without realizing it; make the final single connection to "unlock" the insight; immediately use the insight

LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.

With (current) LLMs, there's only the entirely-offline training/fine-tuning/RLHF — at much greater expense and requiring much greater hardware resources than inference does — to produce a new iteration of the model. That's why we're (currently) stuck in a paradigm of throwing prompts at ever-larger GPT base models — rather than just having an arbitrary stateful base-model that you "install" onto a device like you'd install an RDBMS, and then have it "learn on the job" from there.


> And actually, humans can't really do 2 either!

> That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!

Agreed – but I'd argue that they both can, albeit in an extremely clunky way (i.e. very similar to "chain-of-thought" LLMs): Mechanically applying the new insights in a low-efficiency, "emulated" layer.

> LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.

At the moment that's definitely their biggest weakness, but one could argue that memory consolidation happens ~once per year, globally, as past interactions with them almost certainly become future training data.

In some ways, that's more powerful than a single human gradient descending overnight and in the shower; in others, it's obviously much worse.


“it’s not really a big problem”… surely you can’t be serious… this comment betrays such a profound ignorance that it could only have come from a genius or a... well, let’s not resort to name-calling…

but, seriously: play the tape forward literally one frame and outline what this dataset even remotely resembles… a core sample from a living human brain? “yeah, just train it on thinking about everything at once”. strong ai isn’t like the restaurant: the path to success doesn’t involve starting with more than you finished with.


That would still be limited eventually, at what point do we stop adding layers?

The point where it gets better at discovering ways of discovering things than the combination of the rest of humanity.

What is the combination of parameters that makes a text generator quick-thinking, self-checking, confidence-estimating? Jumping directly from question to accurate, precise, confidence-measured answers, regardless of how abstract the question is?


Thanks to “derefr” and others for these insights. I used this thread as a prompt to Claude 3.5 Sonnet:

My prompt:

Fluid self-control of attention should be enough to bootstrap to metacognitive exploration. Is one the horse; the other the cart? Seems true of the evolution of human cognition from language and theory of minds-of-others that then makes us more attentive to our one minds as objects of internal meta-dissection. What do you think?

Claude comments:

An evolutionary perspective on the relation between attention control and metacognition is worth “attention”.

The idea that fluid attention control might be the "horse" that pulls the "cart" of metacognition aligns well with what we know about human cognitive development.

Consider:

1. The evolution of theory of mind: Humans (primates snd perhaps other taxa) first developed the ability to model and attend to others' mental states (likely driven by social survival pressures). This gave humans a "template" for mental state modeling that could then be turned inward.

2. Language evolution may have followed a similar pattern - external communication tools becoming internalized for self-reflection. Vygotsky's work on inner speech and cognitive development supports this outside-in pattern.

3. This uggests that the ability to fluidly direct attention (including to others' minds) may have been the crucial evolutionary stepping stone that enabled the later development of systematic metacognition.

This has implications for AI development. Current systems like myself have rigid attention mechanisms and no true ability to recursively examine our own cognitive processes. Following your logic, perhaps the key to developing true AI metacognition isn't to directly implement it, but to first develop more fluid, self-directed attention mechanisms that could bootstrap the development of deeper metacognitive capabilities.

Do you think this suggests that current AI development might be approaching things backwards by trying to implement sophisticated reasoning capabilities before we've developed truly fluid attention control? It seems like we might need to master the "horse" (flexible attention) before we can effectively develop the "cart" (sophisticated metacognition).


I love the quote you mentioned at the end. Do you remember the original source?


Congrats to the authors for a thoughtful work! I have been thinking and working on related ideas for a few months now but did not yet spent commensurate compute on them and might have gone in a different direction; this work certainly helps create better baselines along the way of making better use of decoder transformer architectures. Please keep it coming!

> That is, language models learn the implicit meaning in text, as opposed to the early belief some researchers held that sequence-to-sequence models (including transformers) simply fit correlations between sequential words.

Is this so, that the research community is agreed? Are there papers discussing this topic?


Research community is definitely not agreed on this, and there are a number of different camps. Broadly, 2 perspectives from the NLP community:

The 2020 Bender and Koller paper[1] that argues that meaning is not learnable from form, and LLMs are trained on form. They propose a thought experiment ("The Octopus Test" section of the paper) featuring an octopus that can intercept the conversation two humans are having, but "having only form available as training data, [the Octopus] did not learn meaning."

And a contradicting response from Yoav Goldberg (another NLP researcher)[2] with a much more informal discussion of "groundedness" and what LLMs learn. His argument is broadly that instruction tuning + post-training can meaningfully grounds terms like "summarize" etc.

[1] https://aclanthology.org/2020.acl-main.463/

[2] https://gist.github.com/yoavg/59d174608e92e845c8994ac2e234c8...


My sense has always been that, there actually is no difference between "the implicit meaning in text" and "correlations between sequential words".

That is to say, the fact that LLMs are able to communicate effectively with humans is a discovery about the regularity of the semantics of human communication, rather than a discovery about the intelligence of neural networks.


This is certainly not agreed. Computer scientists here don't even have a theory of meaning, because it isn't part of the discipline, nor do almost any have any prior research background in it -- hence making these sort of outrageous claims all over the place. However you want to give natural language semantics, ML models certainly to not use this semantics.

The very best that might be said is that the correlational structure of words under transformer-like supervision (ie., where "predict the next word" is the goal) produces a distribution which is an extremely approximate model of natural language semantics.

Though this has never been disputed. The question comes down to what kind of extreme approximation is involved.

Eg., the truth conditions for "I have a pen in my hand" are that I have a pen in my hand -- direct access to these truth conditions is very plausibly necessary to mean "I have a pen in my hand" in the relevant context. Since a machine has no access to the truth conditions of such utterances it cannot possibly mean them.

Thus if a machine manages to say, "I have a pen in my hand" at an appropriate occasion -- the "extreme approximation to natural language semantics" has to do with this occasion and what "appropriateness" means.

Critics of LLMs and "computer-science-addled thinking" about such matters (such as myself) would say that there are a very narrow range of "occasions" (ie., situations in which you're prompting) that allow such responses to seem appropraite.

That a response seems appropriate to a user is a good engineering condition on a tool working -- it has nothing to do with whether a model understands natural language semantics.

What we might say is that it approximates conversations between agents who understand such semantics on a narrow range of occasions, and succeeds in modelling appropriate language use. And so you might call LLMs models of 'average appropriateness of replies'.

It obviously does not, nor cannot mean, "I have a pen in my hand"


The truth conditions for the sentence "The composer Johann Sebastian Bach died in 1750" are not directly accessible to me. Can I mean that, if I say it?

The truth conditions for "The god of the evangelical Christians exists" and "The god of the evangelical Christians does not exist" have, arguably, never been directly accessible to any ordinary human being. (Though some of their consequences could be accessible.) Can people mean such things, when they say them?

The truth conditions for "There are infinitely many prime numbers" are ... unclear, really, but maybe they're vacuous (there is no possible world in which there aren't infinitely many prime numbers) or they involve only abstracta (such as those numbers). How do you feel about the possibility of an AI saying that and meaning it, and why?

The first of these examples is the most directly relevant one. I have no direct access to the truth conditions of that sentence, but I think I can still mean it, have good reason to think it true, etc. The processes by which I got into that state involve ... learning things by reading about them, which is exactly what I think you're saying cannot in principle ever give genuine knowledge.

Anticipating a possible response: Of course many of the other things I know, some of which are relevant to the way I understand those words, I learned more directly. For instance, part of what "died" means is the cessation of various natural processes like breathing and having one's heart beat, and I have direct experience of breathing and having a beating heart. One could argue that real knowledge and understanding needs to be somehow traceable back to direct experience, and therefore LLM-type systems cannot have them. But that would be a different argument from the one you've made, and I think it's less compelling (though more likely to be right!) than the simpler "knowledge isn't real unless it's based on direct access to the relevant truth conditions".


The mechanism of access varies depending on the claim made. "the sun is engaged in nuclear fusion" could not have been meant in 100 BC. But "I have a pen in my hand" could have. Julius Caeser could have made those sounds but he could never have meant the meaning of those words.

... to mean "I have" requires an "I" to "have", and so on. So what parts of non-linguistic reality language refers to matter for evaluating whether the user means what they say. An actor is likewise pretending to mean, and a child may say something without knowing what it means (as in, eg., a claim about nuclear fusion).

If two children were immitating sounds to each other, such that one "said", "the sun is nuclear fusion" and so on -- then neither in this conversation are communicating, neither know what these words mean. No child involved could ever come up with these words in this worder, and mean their meaning, they can only have this conversation via immitation. This is the case with an LLM -- it's an imitation game wherein the game is to either fool the adult overheading the child, or to generate some userful material (depending whether you're the CEO or CTO).

The problem with a "predict the next word" training goal is that any patterns which emerge will only be coincidentally related to the non-linguistic reality words refer to -- because the machine isn't trained on reference: it is not participating in reality and associating words with it.

The kind of participation necessary for an agent to acquire the meaning of words has no universal answer, but it always "some". An LLM has none.

For a claim about a composer, an agent who means to make this claim (rather than a child who imitates the sounds of words) -- must be aware of what a composer is, and so on. They cannot mean this claim if they don't have access to the non-linguistic reality to which these words refer (or are unable, via imgiation, to simulate similar ways the world might be, such that it has composers, given their prior knowledge -- eg., they at least have to have some prior direct access to music, leading-groups-of-people, and the like).

We can slightly weaken all this but it'll make no difference for an LLM -- however weak we require access, to access the meaning of words requires accessing a non-lingusitic reality. Words mean non-ligustic things -- that is their point.


I agree that it's possible for someone to say words that in other context would have meaning, without their having that meaning when they say it.

Most of what you say merely asserts that when an LLM says something it can't truly mean it.

(Incidentally, that's not quite responsive to the original claim, which is that LLMs learn meanings, not that they mean things when they say them. I think there are situations that could be described by saying that they learn the meanings of things but none the less don't mean those things when they say them. I would need to think more before trying to pass judgement on whether that's actually happening with today's LLMs, but it seems well within the range of somewhat-plausible possibilities.)

The key argument you make for claiming that LLMs can't really mean things -- which I remark is not the argument you were making a couple of comments upthread -- is this bit:

> The problem with a "predict the next word" training goal is that any patterns which emerge will only be coincidentally related to the non-linguistic reality words refer to -- because the machine isn't trained on reference: it is not participating in reality and associating words with it. [] The kind of participation necessary for an agent to acquire the meaning of words has no universal answer, but [...] an LLM has none.

I think "coincidentally" is way too strong here. When you ask an LLM "When did J S Bach die?" and it says 1750, it isn't by coincidence that it gives a correct answer. (Considering how much they get right, despite their confabulations and whatnot, it would have to be one hell of a coincidence.) So that's a pattern in what they say that is not-coincidentally related to the non-linguistic reality.

It's only indirectly related, for sure. The LLM says that Bach died in 1750 because it has read things that say that Bach died in 1750. But, again, that's also why I say that Bach died in 1750.

And it seems to me that what matters, when determining whether and to what extent an utterance actually means something, is not the directness of the utterer's connection to the underlying reality, but something more like its robustness and richness. Robustness: To what extent, if the reality were different, would that tend to make the person say something different? Richness: Consider all the other bits of reality closely connected to the one in question; does our speaker's behaviour correlate with those too?

If someone perpetrates an elaborate deception that makes me believe in a certain person's existence and various facts about them, when in fact everything I think I know about them is mediated by the deception, and by pure coincidence there actually is a person with those properties, unknown to my deceiver, then ... well, maybe I do "mean" what I say about them, but I don't really know what I think I know. This is a failure of robustness; changes in the underlying reality have scarcely any tendency to change my behaviour.

If I learn a list of things to say about stars ("they operate by nuclear fusion", "they are mostly billions of years old", etc.) but I'm just parroting them, then robustness might not fail: maybe I learned these things by asking an astrophysicist to give me a big list of facts about stars, and if the facts were different they'd have given me a different list. But richness fails: if you ask me "would stars behave the same way if the weak nuclear force had very different parameters?" or "were there stars before there were trees on earth?" or "if we brought five more stars like the sun about as close to the sun as the earth is, what would happen to the earth and its inhabitants?", I wouldn't be able to answer unless I got lucky and one of the answers was in my list.

But if both those properties do apply, then -- while of course anyone who isn't me is welcome to disagree -- I am happy to say that they "mean" what they say, or at least that what they say has meaning, and conveys actual understanding, and so on. At any rate, what they say behaves like what someone with actual understanding says: it's responsive to the facts, and it permits not only recitation of a few specific facts but something more general.

Those properties of robustness and richness can be present even when learning takes place only textually. How far they're present in today's LLMs is debatable (though e.g. I think no reasonable person can deny that they are present to an extent that phrases like "stochastic parrot" would lead one not to expect) but if they aren't there it isn't just because the LLMs learn about things only via text.


In what way do you have access to the truth conditions for 'I have a pen in my hand' that an LLM can not? This smells circular to me.

Well, by having a hand, and having a pen in it

I see. Your sense of sight is in some sense 'true' in a way that a webcam feed is not?

Well if you can show me an LLM responding to it's having a pen in its hand via a robotic hand, webcam and the like -- then we are at the bare minimum for it possibly meaning, "i have a pen in my hand".

No such LLMs exist, because they are trained to predict the next word not (WebCamState, RobotArmState, NextWord) -- since, at least, no such corpus exists


> No such LLMs exist, because they are trained to predict the next word not (WebCamState, RobotArmState, NextWord)

Seems we might not be that far away given the work on action tokens[1] and such.

NVIDIA divoted a lot of the CES presentation to this kind of stuff[2].

[1]: https://arxiv.org/abs/2403.19578 (semi-random example)

[2]: https://www.nvidia.com/en-us/ai/cosmos/


Its pretty simple right now to present an llm with a text interface, a list of commands (turn head left turn head right, open hand, close hand, etc. And request they use those commands to achieve a goal.

They can also almost all interpret images now. If I tell an llm that its objective is to look around until it finds its hand and tell me if its holding a pen or not, is that not exactly what you're talking about? Every single step there is well within the grasp of even the less advanced multimodal llms.


>> Behind this approach is a simple principle often abbreviated as "compression is intelligence", or the model must approximate the distribution of data and perform implicit reasoning in its activations in order to predict the next token (see Solomonoff Induction; Solomonoff 1964)

For the record, the word "intelligence" appears in the two parts of "A Formal Theory of Inductive Inference" (referenced above) a total of 0 times. The word "Compression" appears a total of 0 times. The word "reasoning" once; in the phrase "using similar reasoning".

Unsurprisingly, Solomonoff's work was preoccupied with Inductive Inference. I don't know that he ever said anything bout "compression is intelligence" but I believe this is an idea, and a slogan, that was developed only much later. I am not sure where it comes from, originally.

It is correct that Solomonoff induction was very much about predicting the next symbol in a sequence of symbols; not necessarily linguistic tokens, either. The common claim that LLMs are "in their infancy" or similar are dead wrong. Language modelling is basically ancient (in CS terms) and we have long since crossed in the era of its technological maturity.

_______________

[1] https://raysolomonoff.com/publications/1964pt1.pdf

[2] https://raysolomonoff.com/publications/1964pt2.pdf


Is Meta the company here or are they using meta the word? or both?


But be careful with that output. It completely hallucinated sympy and the way it's using it wouldn't do anything because it keeps calling it on the original problem statement rather than as an aid to the LLM. So it's entirely unclear where the mistakes are in the summary without reading & fully understanding the paper.

Feedback noted! Too late for me to edit comment. Will see if I can wipe the hallucinating chat.

Right answer, otherwise garbage source with much incorrectness. Please stop using CGPT links as references or source-of-truth.

Thank you!

The example in the paper using an plug-and-chug algebra equation, and the step-by-step process to solve it, reinforces the notion that LLMs can only reproduce recipes they have seen before. This is really no different than how we learn mathematics in school, the teacher shows a starting point and moves, step-by-step, to the end of the process. Calling this "Meta Chain-of-Thought" feels like an aggrandizement of basic educational process to me. Next we'll be labeling the act of holding basic utensils as Layered Physical Kineticism, or something contrived like that. In school this "Meta Chain of Thought" was called "Show your work." Is this really a "phenomena" that needs explaining? It might teach us more about how we achieve logical induction (steps of reasoning) but we are pretty deep in the soup to be able to describe accurately the shape of the pot.

“can only reproduce recipes they have seen before”… are you talking about llms or about yourself?

This is the big idea in the paper, basically that CoT is limited for some complex problems because there is a class of problems where there is no 'textbook' way to find a solution. These are novel problems that need a unique methodology. "Essentially, to start generating the solution requires that we already know the full approach. The underlying generative process of the solution is not auto-regressive from left-to-right."

Mathematical meaning:

We can formalize this argument through the interpretation of reasoning as a latent variable process (Phan et al., 2023). In particular, classical CoT can be viewed as (equation) i.e., the probability of the final answer being produced by a marginalization over latent reasoning chains.

We claim that for complex problems, the true solution generating process should be viewed as (equation) i.e., the joint probability distribution of the solution (a, s1, . . . , s) is conditioned on the latent generative process. Notice that this argument is a meta-generalization of the prior CoT argument, hence why we will refer to the process q → z1 → . . . → z as Meta-CoT.

I think this is seminal. It is getting at heart of some issues. Ask o1-pro how you could make a 1550nm laser diode operating at 1ghz have low geometric loss without an expensive collimator using commodity materials or novel manufacturing approaches using first principle physics and the illusion is lost that o1-pro is a big deal. 'Novel' engineering is out of reach because there is no text book on how to do novel engineering and these class of problems is 'not auto-regressive from left-to-right'.


I think it's remarkable how the goalposts have shifted.

For an AI model to be "a big deal", apparently we need to be able to give it a hard problem in an arbitrary field, one that humans have not yet solved[1], and have it spit out a good solution.

[1] At least, I think that's your intent. I am not a laser expert so I don't have a sense of where your challenge lies on a scale from "known but only to experts" to "major research project, may turn out to be impossible".

I very much agree that an AI system that could do that would be a big deal. An AI that could do that would be a world-changing deal. But it's pretty startling if everything short of that is not "a big deal" now, no?


The problem is this is what people are being told is happening. I've talked to laypeople that think chatgpt is a superintelligent thing they get 100% truthful answers from. I saw a podcast last week from a PhD (in an unrelated field) claiming AGI will be here in 2027. As long as there are people out there claiming AI is everything, there will be people that look at whats available and say no, it's not actually that (yet).

respectfully, i feel i am alone in this opinion, but i’m not even remotely convinced that there isn’t a “superintelligent being” hiding in plain sight with tools that we already have at hand. people always grouse about the quality of LLM outputs, and then you realize that they (tend to) think that somehow the LLM is supposed to read their minds and deliver the answer they “didn’t need, but deserved”… i’d take my chances being dumped in 12 th century england getting bleated at in old english over being an LLM that has to suffer through a three sentence essay about someone’s brilliant, life-altering startup idea, having to grapple with the overwhelming certainty that there is absolutely no conceivable satisfactory answer to a question poorly conceived.

for all we (well, “i”, i guess) know, “superintelligence” is nothing more than a(n extremely) clever arrangement of millions of gpt-3 prompts working together in harmony. is it really so heretical to think that silicon + a semi-quadrillion human-hour-dollars might maybe have the raw information-theoretical “measurables” to be comparable to those of us exalted organic, enlightened lifeforms?

clearly others “know” much more than i do about the limits of these things than me. i just have spent like 16 hours a day for ~18 months talking to the damned heretic with my own two hands— i am far from an authority on the subject. but beyond the classical “hard” cases (deep math, … the inevitability of death …?), i personally have yet to see a case where an LLM is truly given all the salient information in an architecturaly useful way in which “troublesome output”. you put more bits into the prompt, you get more bits out. yes, there’s, in my opinion, an incumbent conservation law here— no amount of input bits yields superlinear returns (as far as i have seen). but who looks at an exponential under whose profoundly extensive shadow we have continued to lose ground for… a half-century? … and says “nah, that can never matter, because i am actually, secretly, so special that the profound power i embody (but, somehow, never manage to use in such a profound way as to actually tilt the balance “myself”) is beyond compare, beyond imitation— not to be overly flip, but it sure is hard to distinguish that mindset from… “mommy said i was special”. and i say this all with my eyes keenly aware of my own reflection.

the irony of it all is that so much of this reasoning is completely contingent on a Leibniz-ian, “we are living in the best of all possible worlds” axiom that i am certain i am actually more in accord with than anyone who opines thusly… it’s all “unscientific”… until it isn’t. somehow in this “wtf is a narcissus” society we live in, we have gone from “we are the tools of our tools” to “surely our tools could never exceed us”… the ancient greek philosopher homer of simpson once mused “could god microwave a burrito so hot that even he could not eat it”… and we collectively seem all too comfortable to conclude that the map Thomas Acquinas made for us all those scores of years ago is, in fact, the territoire…


> For an AI model to be "a big deal", apparently we need to be able to give it a hard problem in an arbitrary field, one that humans have not yet solved[1], and have it spit out a good solution.

Once you've been to the moon, the next stage is Mars or Deimos. Humans celebrate progress but also appreciate incremental improvements.

I run an AI/ML consultancy so I have skin in this game. The "traditional" model approaches still have tons, tons, tons of value to offer. Few need to have the frontier right away.


Yes! The ChatGPT moment has warn off. And there hasn't been a step-change other than Claude Sonnet 3.5 + Cursor for dramatic impact (which is only for coding) since then.

I 100% agree with you that AI is fantastic and it is a big deal in general. But now that the world has gotten used to it being able to parrot back something it learned (including reasoning) in the training set, the next 'big deal' is actual insight.

But I see your point, I still think what we have currently is out of a sci-fi book, but I am also not that amazed by computers in our pockets anymore :)


No, and no goalposts have shifted. What's happened instead is that the claims made by LLM makers keep getting more and more outlandish as time passes, and they do that as a response to criticism that keeps pointing out the shortcomings of their systems. Every new model is presented as a breakthrough [1] and its makers rush to show off the results like "the new model is 100% better than the old one in passing the Bar exam!". You can almost hear the unsaid triumphant question hanging in the air "Are you convinced now? Are we having fun yet?".

We're not. The big deal with LLMs is that they are large enough language models that they can generate fluent, grammatical text that is coherent and keeps to a subject over a very, very long context. We never could do this with smaller language models. Because statistics.

What LLMs can absolutely not do is generate novel text. This is hard to explain perhaps to anyone who hasn't trained a small language model but generativity -the ability to generate text that isn't in a training set- is a property of the tiniest language model, as it is of the largest one [2]. The only difference is that the largest model can generate a lot more text.

And still that is not what we mean by novelty. For example, take art. When ancient humans created art, that was a new thing that had never before existed in the world and was not the result of combining existing things. It was the result of a process of abstraction, and invention: of generalisation. That is a capability that LLMs (as other statistical systems) lack.

The goalposts therefore have not moved because the criticism is as old as nails and the LLM makers have still not been able to comprehensively address it. They just try to ignore it. If the goalposts are here and you're shooting goals over there and then doing a little victory run every time the ball breaks Col. Mustard's windows, that's not the goalposts that have moved, it's you that keeps missing them.

_____________

[1] I'm old enough to remember... GPT-3 and how it blew GPT-2 out of the water; GPT-3.5 and how it blew GPT-3 out of the water; GPT-4 and how it blew GPT-3.5 out of the water... And all the users who would berate you for using the older model since "the new one is something completely different". Every single model. A yuuuge breakthrough. What progress!

[2] Try this. Take the sentence "<start> the cat sat on the mat with the bat as a hat <end>" and generate its set of bi-grams ("<start> the", "the cat", "cat sat", etc.). Then generate permutations of that set. You'll get a whole bunch -14!-1, as in |sentence|! minus the initial one- of sentences that were not in the training set. That's generativity in a tiny language model. That's how it works in the largest also, hard as that may be to believe. It shouldn't. It's a very simple mechanism that is extremely powerful. Large models are simply better at assigning weights to permutations so that the ones more often encountered in a corpus are weighted more.


Agreed! Don't get me wrong, the statistical distribution modeling for human language is still SUPER helpful. And for things like legal/tax/coding, which has a lot to do with applying language patterns, this is a very big deal. But the ability to find the true 'sub structure' of content that it is trained on is not something they can do. It is like there is some lower substrate that it is 'missing'. That is a lot to ask for, but once we get there it will be the 'magic' that is promised, rather than amazing, super helpful, parlor tricks.

> CoT is limited for some complex problems because there is a class of problems where there is no 'textbook' way to find a solution.

This is contrary to my findings when interacting with LLMs. I can ask questions in ways not understandable for most human beings and from the reply I can derive the question is interpreted correctly (leaving aside the correctness of the answer). Some non-textbook-example of interpretation did emerge.


Interesting, could you give me an example? LLMs definitely can "understand" what I am asking at times when a human couldn't. They have more data to 'find similarity' to what I might mean. But I do not think you are saying they answer questions a human couldn't?

I do wonder whether a human could come up with a working solution for this problem without querying physical reality, i.e. experimentation. Parts of reality are uncomputable, so they can only be arrived at by letting the universe simulate it.

The closest example I could think of is the (maybe true, maybe myth making) story of SpaceX using car wash valves instead of super expensive 'space grade' valves that did the same thing, and were orders of magnitude cheaper. Doesn't seem like embodied AI is necessary to figure this out.

Meta's recently released Large Concept Models + this Meta Chain of Thought sounds very promising for AGI. The timeline of 2030 sounds increasingly plausible IMO.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: