Just because GPT exhibits a behavior does not mean it performs that behavior. You are using those weasel words for a very good reason!
Language is a symbolic representation of behavior.
GPT takes a corpus of example text, tokenizes it, and models the tokens. The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
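As a rough sketch of what "modeling the tokens" means here (a toy illustration, not how GPT is actually built: real models use subword tokenizers and learned neural weights rather than raw co-occurrence counts), a bigram model assembled purely from counts picks up patterns in the text without encoding a single explicit rule:

```python
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat . the dog sat on the rug ."
tokens = corpus.split()  # stand-in for a real subword tokenizer

# The entire "model" is nothing but counts of which token follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def generate(start, n=8):
    out = [start]
    for _ in range(n):
        counts = follows[out[-1]]
        if not counts:
            break
        # Pick the next token in proportion to how often it followed the last one.
        choices, weights = zip(*counts.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the dog sat on the mat . the cat"
```

The output can look locally sensible, yet nothing in the model knows why "sat" follows "cat"; it only knows that it did in the corpus.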
Any "understanding" that GPT exhibits was present in the text itself, not GPT's model of that text. The reason GPT can find text that "makes sense", instead of text that "didn't make sense", is that GPT's model is a close match for grammar. When people wrote the text in GPT's corpus, they correctly organized "stuff that makes sense" into a string of letters.
The person used grammar, symbols, and familiar phrases to model ideas into text. GPT used nothing but the text itself to model the text. GPT organized all the patterns that were present in the corpus text, without ever knowing why those patterns were used.
In what sense is your "experience" (mediated through your senses) more valid than a language model's "experience" of being fed tokens? Token input is just a type of sense, surely?
It's not that I think multimodal input is important. It's that I think goals and experimentation are important. GPT does not try to do things, observe what happened, and draw inferences about how the world works.
I would say it's not a question of validity, but of the additional immediate, unambiguous, and visceral (multi-sensory) feedback mechanisms to draw from.
If someone is starving and hunting for food, they will quickly learn to associate the cause and effect of certain actions/situations.
A language model that works only with text may still have an unambiguous overall loss function to minimize, but because that loss is a single scalar, the model may minimize it in a way that works for the large majority of the training corpus yet falls apart in ambiguous/tricky scenarios.
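To make the "simple scalar" point concrete (a minimal sketch with hand-picked hypothetical probabilities, not measurements of any real model), average cross-entropy collapses the whole error profile into one number, so a model that is uniformly mediocre and a model that aces most tokens but is badly wrong on a tricky one can end up with nearly the same loss:

```python
import math

# Hypothetical probabilities each model assigns to the correct next token.
uniformly_ok = [0.60, 0.60, 0.60, 0.60]  # decent everywhere
mostly_great = [0.95, 0.95, 0.95, 0.15]  # great, except on the tricky token

def mean_cross_entropy(probs):
    # The entire training signal collapses to this single scalar.
    return sum(-math.log(p) for p in probs) / len(probs)

print(round(mean_cross_entropy(uniformly_ok), 3))  # ~0.511
print(round(mean_cross_entropy(mostly_great), 3))  # ~0.513
```

The scalar barely distinguishes the two, even though only one of them "falls apart" on the hard case.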
This may be why LLMs have difficulty with spatial reasoning/navigation, for example.
Whatever "reasoning ability" that emerged may have learned _some_ aspects to physicality that it can understand some of these puzzles, but the fact it still makes obvious mistakes sometimes is a curious failure condition.
So it may be that having "more" senses would allow an LLM to build better models of reality.
For instance, perhaps the LLM has reached a local minimum with its probabilistic modelling of text, which is why it still fails probabilistically when answering these sorts of questions.
Introducing unambiguous physical feedback into its "world model" might provide what it needs to anchor its reasoning abilities and stop failing in the probabilistic way LLMs currently tend to.
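As a concrete picture of "failing in a probabilistic way" (a toy sketch with made-up numbers, not the output of any actual model), even when the correct answer gets most of the probability mass, sampling from that distribution still returns a wrong answer a noticeable fraction of the time:

```python
import random
from collections import Counter

# Hypothetical next-token distribution for "The key is ___ the box".
next_token_probs = {"inside": 0.70, "outside": 0.15, "under": 0.10, "above": 0.05}

random.seed(0)
tokens, weights = zip(*next_token_probs.items())
samples = Counter(random.choices(tokens, weights=weights, k=1000))

# Roughly 30% of the sampled answers are wrong, even though "inside" dominates.
print(samples)
```

One way to read the suggestion above is that grounding the model in physical feedback would, in effect, sharpen that distribution toward the answers that actually match reality.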
You used evolution, too. The structure your brain grows into is the result of complex DNA instructions that have been mutated, with those mutations filtered over billions of iterations of competition.
There are some patterns of thought that are inherent to that structure, and not the result of your own lived experience.
For example, you would probably dislike pain, with responses similar to your original pain experiences, and similar to my lived pain experiences as well. Surely, there are some foundational patterns that define our interactions with language.
> The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
In theory an LLM could learn any model at all, including models and combinations of models that use logical reasoning. How much logical reasoning (if any) GPT-4 has encoded is debatable, but don't mistake GPT's practical limitations for theoretical limitations.
> In theory an LLM could learn any model at all, including models and combinations of models that use logical reasoning.
Yes.
But that is not the same as GPT having its own logical reasoning.
An LLM that creates its own behavior would be a fundamentally different thing than what "LLM" is defined to be here in this conversation.
This is not a theoretical limitation: it is a literal description. An LLM "exhibits" whatever behavior it can find in the content it modeled. That is fundamentally the only behavior an LLM performs.