I think we need to start moving away from this explanation, because the truth is more complex. Anthropic's own research showed that Claude does actually "plan ahead", beyond the next token.
> Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.
I'm not sure this really shows the truth is more complex? It is still doing next-token prediction, but its prediction method is sufficiently sophisticated, in terms of conditional probabilities, that it recognizes that if you need to rhyme, you need to reach some future state, which in turn shifts the probabilities of the intermediate states.
At least in my view it's still inherently a next-token predictor, just with a really good grasp of conditional probabilities.
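To make that concrete, here's a toy sketch (made-up numbers, nothing measured from Claude or any real model) of how conditioning next-word probabilities on a future constraint like the rhyme already produces "planning"-looking behavior, with nothing but next-token prediction underneath:

```python
# Toy sketch with made-up numbers: a pure next-token predictor whose
# conditionals account for how the line must end will shift probability
# toward intermediate words that keep that ending reachable.

p_next = {"rabbit": 0.05, "hunger": 0.45, "sunset": 0.50}             # P(word | prefix)
p_rhyme_reachable = {"rabbit": 0.95, "hunger": 0.80, "sunset": 0.10}  # P(line can still end on a "grab it" rhyme | word)

# Condition on the rhyme constraint:
# P(word | prefix, must rhyme) ∝ P(word | prefix) * P(rhyme reachable | word)
joint = {w: p_next[w] * p_rhyme_reachable[w] for w in p_next}
total = sum(joint.values())
p_conditioned = {w: round(joint[w] / total, 3) for w in joint}

print(p_conditioned)  # mass moves away from "sunset", which makes the rhyme hard to reach
```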
That's entirely an implementation limitation on the human side. There's no reason to believe a reasoning model could NOT be trained to stream multimodal input and perform a burst of reasoning on each step, interjecting when it feels appropriate.
Not sure training on language data will teach a model how to experiment with the social system the way being a toddler will, but maybe. Where does the glance of assertive independence as the spoon turns get in there? Will the robot try to make its eyes gleam mischievously, as is so often written?
But then so are we? We are just predicting the next word we are saying, are we not? Even when you add thoughts behind it (sure, some people think differently, be it without an inner monologue, or just in colors and sounds and shapes, etc.), that "reasoning" still goes into the act of coming up with the next word we are speaking or writing.
We're not predicting the next word we're most likely to say; we're actively choosing the word that we believe most successfully conveys what we want to communicate. This relies on a theory of mind of those around us and an intentionality of speech that aren't even remotely the same as "guessing what we would say if only we said it".
When you talk at full speed, are you really picking the next word?
I feel that we pick the next thought to convey. I don't feel like we actively think about the words we're going to use to get there.
Though we are capable of doing that when we stop to slowly explain an idea.
I feel that LLMs are the thought-to-text without the free-flowing thought.
As in, an LLM won't just start talking; it doesn't have that always-on conscious element.
But this is all philosophical, me trying to explain my own existence.
I've always marveled at how the brain picks the next word without me actively thinking about each word.
It just appears.
For example, there are times when a word I never use, and couldn't even give you the explicit definition of, pops into my head and it is the right word for that sentence, even though I have no active understanding of that word. It's exactly as if my brain knows, from some probability analysis, that the thought I'm trying to convey requires this word.
It's why I feel we learn so much from reading.
We are learning the words that we will later re-utter and how they relate to each other.
I also agree with most who feel there's still something missing for LLMs, like the character from The Wizard of Oz who is talking away while saying if he only had a brain...
There is some of that going on with llms.
But it feels like a major piece of what makes our minds work.
Or, at least what makes communication from mind-to-mind work.
It's like computers can now share thoughts with humans while still lacking some form of thought themselves.
But the set of puzzle pieces missing from full-blown human intelligence seems to be a lot smaller today.
Humans and LLMs are built differently; it seems disingenuous to think we both use the same methods to arrive at the same general conclusion. I can inherently understand some proofs of the Pythagorean theorem, but an LLM might apply different ones for various reasons. But the output/result is still the same. If a next-token generator run in parallel can generate a performant relational database, that doesn't directly imply that I am also a next-token generator.
At this point you have to start entertaining the question of what is the difference between general intelligence and a "sufficiently complicated" next token prediction algorithm.
A sufficiently large lookup table in a database is mathematically indistinguishable from a sufficiently complicated next-token prediction algorithm, which is mathematically indistinguishable from general intelligence.
All that means is that treating something as a black box doesn't tell you anything about what's inside the box.
Of course it can. Reasoning is algorithmic in nature, and algorithms can be encoded as sufficiently large state transition tables. I don't buy into Searle's "it can't reason because of course it can't" nonsense.
We were talking about a "sufficiently large" table, which means it can be larger than realistic hardware allows for. Any algorithm operating on bounded memory can ultimately be encoded as a finite state automaton, with the table defining all valid state transitions.
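As a minimal illustration (my own example, not from the thread), here's a tiny algorithm, even-parity checking, collapsed into nothing but lookups in a state-transition table:

```python
# Even-parity checker as a finite state automaton: the whole "algorithm"
# is reduced to a lookup table of (state, input symbol) -> next state.

TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def has_even_parity(bits: str) -> bool:
    state = "even"
    for b in bits:
        state = TRANSITIONS[(state, b)]  # pure table lookup, no other logic
    return state == "even"

print(has_even_parity("10110"))  # False: three 1s
print(has_even_parity("1001"))   # True: two 1s
```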
If we're at the point where planning what I'm going to write, reasoning it out in language, or preparing a draft and editing it is insufficient to make me not a stochastic parrot, I think it's important to specify what massive differences could exist between appearing like one and being one. I don't see a distinction between this process and how I write everything, other than "I do it better". I guess I can technically use visual reasoning, but mine is underdeveloped and goes unused. Is it just a dichotomy of stochastic parrot vs. conscious entity?
Then I'll just say you are a stochastic parrot. Again, solipsism is not a new premise. The philosophical zombie argument has been around over 50 years now.
It reads to me like they compare the output of different prompts and somehow reach the conclusion that Claude is generating more than one token and "planning" ahead. They leave out how this works.
My guess is that they have Claude generate a set of candidate outputs and then Claude chooses the "best" candidate and returns that. I agree this improves the usefulness of the output, but I don't think this is a fundamentally different thing from "guessing the next token".
UPDATE: I read the paper and I was being overly generous. It's still just guessing the next token as it always has. This "multi-hop reasoning" is really just another way of talking about the relationships between tokens.
That's not the methodology they used. They're actually inspecting Claude's internal state and suppressing certain concepts, or replacing them with others. The paper goes into more detail. The "planning" happens further in advance than "the next token".
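For intuition only, and not Anthropic's actual technique, here's a toy sketch of what "suppressing a concept, or replacing it with another" inside the internal state could look like if you treat a concept as a direction in activation space (all vectors below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = rng.normal(size=768)   # stand-in for an internal activation vector
rabbit = rng.normal(size=768)   # hypothetical "rabbit" concept direction
rabbit /= np.linalg.norm(rabbit)
green = rng.normal(size=768)    # hypothetical replacement concept direction
green /= np.linalg.norm(green)

strength = hidden @ rabbit      # how strongly the concept is present

# Suppress: remove the component along the concept direction.
suppressed = hidden - strength * rabbit
print(round(suppressed @ rabbit, 6))  # ~0: the concept component is gone

# Replace: inject a different concept at the same strength.
replaced = suppressed + strength * green
```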
Okay, I read the paper. I see what they are saying, but I strongly disagree that the model is "thinking". They have highlighted that the relationships between words are complicated, which we already knew. They also point out that some words are related to other words, which are related to other words, which, again, we already knew. Lastly, they used their model (not Claude) to change the weights associated with some words, thus changing the output to meet their predictions, which I agree is very interesting.
Interpreting the relationship between words as "multi-hop reasoning" is more about changing the words we use to talk about things and less about fundamental changes in the way LLMs work. It's still doing the same thing it did two years ago (although much faster and better). It's guessing the next token.
It’s a next letter guesser. Put in a different set of letters to start, and it’ll guess the next letters differently.