> ‘One token at a time’ is how a model generates its output, not how it comes up with that output.
I do not believe you are correct.
Now, yes, when we write `printf("Hello, world\n")`, of course the characters `'H'`, `'e'`, ... are output one at a time into the stream. But the program has the string all at once. It was prepared before the program was even run.
This is not what LLMs are doing with tokens; they have not prepared a batch of tokens which they are shifting out left-to-right from a dumb buffer. They output a token once they have calculated it and are sure it will not have to be backtracked over. In doing so they might have speculatively calculated additional tokens and backtracked over those, sure, and may well carry state from that work into the next token prediction.
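To make the shape of that loop concrete, here's a minimal sketch in Python. Everything in it is a stand-in: `next_token_logits` plays the part of the transformer forward pass and isn't any real library's API. The structural point is what matters: the context holds committed tokens only, each token is streamed out the moment it's decided, and it immediately becomes input for deciding the next one.

```python
import random

VOCAB = ["Hello", ",", " world", "!", "<eos>"]

def next_token_logits(context):
    # Stand-in for a transformer forward pass: any function from
    # (everything committed so far) to one score per vocabulary entry.
    # Here: deterministic pseudo-random scores keyed on context length.
    rng = random.Random(len(context))
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]

def generate(prompt, max_tokens=20):
    context = list(prompt)  # committed tokens only; there is no lookahead buffer
    for _ in range(max_tokens):
        logits = next_token_logits(context)
        tok = VOCAB[logits.index(max(logits))]  # greedy pick of the next token
        if tok == "<eos>":
            break
        print(tok, end="", flush=True)  # streamed the moment it is decided
        context.append(tok)             # and fed back as input, never revised
    print()

generate(["Say", " hi", ":"])
```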
But the fact is that they reach a point where they commit to a particular output token while not yet having committed to what the next one will be. Maybe the next token is already narrowed down to a few candidates, but that doesn't change the fact that there is a sharp horizon between committed and unknown, and that it moves from left to right.
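That speculate-and-backtrack behavior has a real named counterpart, by the way: speculative decoding, where a cheap draft model guesses a few tokens past the horizon and the full model verifies them. Rejected guesses are discarded; accepted ones are committed for good. Here's a simplified greedy-verification sketch with the same kind of stand-in models (real systems verify against the full model's probability distribution, not just its argmax):

```python
import random

VOCAB_SIZE = 5

def stand_in_model(salt):
    # Factory for a deterministic stand-in: context -> one score per token id.
    def logits(context):
        rng = random.Random(len(context) * 1000 + salt)
        return [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
    return logits

def greedy(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def speculative_step(context, draft_model, target_model, k=4):
    # Draft phase: a cheap model guesses k tokens past the horizon.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = greedy(draft_model(ctx))
        draft.append(tok)
        ctx.append(tok)
    # Verify phase: the full model checks the guesses in order. Guesses
    # after the first mismatch are thrown away (the "backtracking");
    # everything accepted is committed for good and never revisited.
    committed, ctx = [], list(context)
    for tok in draft:
        choice = greedy(target_model(ctx))
        if choice != tok:
            committed.append(choice)  # full model's own pick replaces the miss
            break
        committed.append(tok)
        ctx.append(tok)
    return committed  # only these tokens ever cross the horizon

print(speculative_step([0, 1, 2], stand_in_model(1), stand_in_model(2)))
```

Note that the committed prefix is exactly the part the full model agrees with; the horizon never moves backward.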
Responses can be large. Think about how mind-boggling it is that the machine can be sure that the first 10 words of a 10,000-word response are the right ones (having already put them out beyond any possibility of backtracking) at a point where it has no idea what the last 10 will be. Maybe there are some activations already narrowing down what the second batch of 10 words will be, but surely the last ones are distant.