The "next token" thing is literally true, but it might turn out to be a red herring, because emergence is a real phenomenon. Like how with enough NAND-gates daisy-chained together you can build any logic function you like.
Gradually, as these LLM next-token predictors are set up recursively, constructively, dynamically, and with the right inputs and feedback loops, the limitations of the fundamental building blocks become less important. Might take a long time, though.
> Like how with enough NAND-gates daisy-chained together you can build any logic function you like.
The version of emergence that AI hypists cling to isn't real, though, in the same way that adding more NAND gates won't magically make the logic function you're thinking about. How you add the NAND gates matters, to such a degree that people who know what they're doing don't even think about the NAND gates.
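To make the wiring point concrete, here is a minimal sketch in Python. The function names (`nand`, `xor_from_nand`, `chained_nand`) are made up for illustration. Both circuits use exactly four NAND gates, but only the deliberately wired one computes XOR.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    """A single NAND gate."""
    return not (a and b)


def xor_from_nand(a: bool, b: bool) -> bool:
    """XOR built from four NAND gates, wired in the standard way."""
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))


def chained_nand(a: bool, b: bool) -> bool:
    """Also four NAND gates, but naively daisy-chained."""
    out = nand(a, b)
    for _ in range(3):
        out = nand(out, b)
    return out


def truth_table(f):
    """Outputs for inputs (F,F), (F,T), (T,F), (T,T), in that order."""
    return [f(a, b) for a, b in product([False, True], repeat=2)]


if __name__ == "__main__":
    print("wired:  ", truth_table(xor_from_nand))
    print("chained:", truth_table(chained_nand))
```

Same gate count, different function: the chained version is not XOR, and adding more gates to the chain won't turn it into XOR.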
But isn't that what the training algorithm does? (Genuinely asking since I'm not very familiar with this.) I thought it tries anything, including wrong things, as it gradually finds better results from the right things.
Better results, yes, but that doesn't mean good results. It can only find local optima in a predetermined state space. Training a neural network involves (1) finding the right state space, and (2) choosing a suitable gradient function. If the Correct Solution isn't in the state space, or isn't reachable via gradual improvement, the neural network will never find it.
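A toy illustration of the local-optimum point, assuming plain gradient descent on a hand-picked one-dimensional function (the function, learning rate, and step count are arbitrary choices for this sketch, not anything from a real training setup):

```python
def f(x: float) -> float:
    """A 1-D 'loss' with two valleys: a deep one near x = -1.3
    and a shallower one near x = 1.13."""
    return x**4 - 3 * x**2 + x


def grad_f(x: float) -> float:
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(x: float, lr: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent: keep stepping downhill from x."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


if __name__ == "__main__":
    for start in (-2.0, 2.0):
        end = descend(start)
        print(f"start={start:+.1f} -> x={end:+.4f}, f(x)={f(end):+.4f}")
```

Starting at `x = 2.0`, descent settles in the shallower valley and never finds the deeper one, because every step away from it looks like a step in the wrong direction. Starting at `x = -2.0` finds the better minimum. Nothing errors out in either case.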
An algorithm that can reason about the meaning of text probably isn't in the state space of GPT. Thanks to the universal approximation theorem (https://en.wikipedia.org/wiki/Universal_approximation_theore...), we can get something that looks pretty close when interpolating, but that doesn't mean it can extrapolate sensibly. (See https://xkcd.com/2048/, bottom right.) As they say, neural networks "want" to work, but that doesn't mean they can.
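A rough sketch of the interpolate-versus-extrapolate gap. This is not a trained network: it uses a piecewise-linear interpolant as a stand-in, on the assumption that it is a fair proxy for what a small ReLU network can represent. The helper name `make_piecewise_linear` and the sample counts are invented for this example.

```python
import math
from bisect import bisect_right


def make_piecewise_linear(xs, ys):
    """Return a model that linearly interpolates the samples and,
    outside the sampled range, extends the first or last segment."""
    def model(x: float) -> float:
        # Index of the segment to use, clamped to a valid segment.
        i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (x - xs[i])
    return model


# "Training data": 21 samples of sin(x) on [0, 2*pi].
xs = [2 * math.pi * k / 20 for k in range(21)]
ys = [math.sin(x) for x in xs]
model = make_piecewise_linear(xs, ys)

# Worst error inside the sampled range, on a fine grid.
grid = [2 * math.pi * k / 1000 for k in range(1001)]
inside_error = max(abs(model(x) - math.sin(x)) for x in grid)

# Error at a point outside the sampled range.
outside_error = abs(model(3 * math.pi) - math.sin(3 * math.pi))

if __name__ == "__main__":
    print(f"max error inside [0, 2*pi]: {inside_error:.4f}")
    print(f"error at 3*pi:              {outside_error:.4f}")
```

Inside the sampled range the fit looks nearly perfect. Outside it, the model just continues the last straight segment while the real function turns back down, so the error grows without bound the further out you go.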
That's the hard part of machine learning. Your average algorithm will fail obviously, if you've implemented it wrong. A neural network will just not perform as well as you expect it to (a problem that usually goes away if you stir it enough https://xkcd.com/1838/), without a nice failure that points you at the problem. For example, Evan Miller reckons that there's an off-by-one error in everyone's transformers. https://www.evanmiller.org/attention-is-off-by-one.html
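For anyone who hasn't read the linked post: as I understand it, the proposed fix is to add 1 to the softmax denominator so an attention head can put (almost) no weight anywhere. A minimal sketch of the difference, in plain Python rather than any real framework; `softmax_one` is a name made up here:

```python
import math


def softmax(scores):
    """Standard softmax: outputs always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_one(scores):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed with a shift for
    numerical stability. Outputs may sum to less than 1."""
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    # exp(-m) is the shifted form of the extra "1" in the denominator.
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    weak = [-10.0, -10.0, -10.0]  # a head with nothing useful to attend to
    print("softmax sum:    ", sum(softmax(weak)))
    print("softmax_one sum:", sum(softmax_one(weak)))
```

With uniformly low scores, standard softmax still hands out a full unit of attention, while the +1 variant lets the total shrink towards zero. Whether that matters in practice is the article's claim, not something this snippet shows.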
If you add enough redundant dimensions, the global optimum of a real-world gradient function seems to become the local optimum (most of the time), so it's often useful to train a larger model than you theoretically need, then produce a smaller model from that.
> But isn't that what the training algorithm does?
It's true that training and other methods can iteratively trend towards a particular function/result. But in this case the training is on next token prediction, which is not the same as training on, for example, non-verbal abstract problem solving.
There are many things humans do that are very different from next token prediction, and those things we do all combine together to produce human level intelligence.
> There are many things humans do that are very different from next token prediction, and those things we do all combine together to produce human level intelligence.
Exactly
LLMs didn't solve the knowledge representation problem. We still don't know how it works in our brains, but we at least know that we can do internal symbolic knowledge representation and reasoning. LLMs don't. We need a different kind of math for ANNs: a new convolution, but for text, where layers extract features through lexical analysis and the use of ontologies, and then the network is trained on that.
This presupposes that conscious, self-directed intelligence is at all what you're thinking it is, which it might not be (probably isn't). Given that, perhaps no amount of predictors in any arrangement or with any amount of dynamism will ever create an emergent phenomenon of real intelligence.
You say emergence is a real thing, and it is, but we have not one single example of it taking the form of sentience in any human-created thing of complexity.
Bill Gates was famous for this, about thirty years ago; Joel Spolsky wrote about it in https://www.joelonsoftware.com/2006/06/16/my-first-billg-rev.... Maybe it was over the top, and I'm sure it didn't work all the time, but I feel like it would have contributed to Microsoft's success.