That's a very sloppy post. He runs a single example, without even running the model locally or changing the sampling parameters, and then concludes that GPT-2 is doing nothing but pattern-matching? A lot of people underestimate NNs because the sampling from them (top-k! how much dumber and cruder can you get? nucleus sampling works better, but is still obviously suboptimal) destroys a lot of dark knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would try once, without changing any sampling settings, and conclude that it wasn't doing anything, but if you retried with better settings you would get different results. I'm not the only one to notice that: https://www.quantamagazine.org/common-sense-comes-to-compute... Such tests can prove the presence of knowledge, but not the absence... And of course, GPT-3 does extensive arithmetic tricks: https://arxiv.org/pdf/2005.14165.pdf#page=22
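To make the top-k vs. nucleus point concrete, here is a rough sketch of what those truncation schemes do to a single next-token distribution (a toy numpy example; the vocabulary size and the distribution are made up). Both throw away the low-probability tail where the dark knowledge lives, top-k with a fixed cutoff, nucleus adaptively per step:
    import numpy as np

    def top_k_filter(probs, k=40):
        """Keep only the k most likely tokens and renormalize; everything else is zeroed."""
        cutoff = np.sort(probs)[-k]
        kept = np.where(probs >= cutoff, probs, 0.0)
        return kept / kept.sum()

    def nucleus_filter(probs, p=0.9):
        """Keep the smallest set of most-likely tokens whose cumulative probability reaches p."""
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        last = np.searchsorted(cumulative, p) + 1   # index of the last token kept
        kept = np.zeros_like(probs)
        kept[order[:last]] = probs[order[:last]]
        return kept / kept.sum()

    # Toy next-token distribution over a 1000-token vocabulary: a fat head plus a long tail.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.full(1000, 0.02))

    print("tokens surviving top-k (k=40): ", np.count_nonzero(top_k_filter(probs)))
    print("tokens surviving nucleus (p=0.9):", np.count_nonzero(nucleus_filter(probs)))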
The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves. This is more than I can say about most chatter about ML.
>Such tests can prove the presence of knowledge, but not the absence...
This sounds like a setup for non-falsifiable beliefs.
> The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves.
And I did (using my own local GPT-2-1.5b install, which let me set the sampling hyperparameters myself rather than being restricted to the inappropriate hardwired ones of an online service), I linked to another person demonstrating the same thing, I pointed out the extensive GPT-3 evaluation OA did, and here, have another link about how bad querying of language models leads to highly misleading results about how much they know: https://arxiv.org/abs/1911.12543 Measurement error, in general, biases estimates towards zero.
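For anyone who wants to reproduce that locally rather than through a hardwired web demo, this is roughly what the re-test looks like with the Hugging Face transformers library ("gpt2-xl" is the 1.5B checkpoint; the prompt and the particular settings here are illustrative, and sampled outputs will vary from run to run):
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")   # the 1.5B-parameter model
    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

    prompt = "Q: What is 4 + 2?\nA:"                       # illustrative prompt, not the article's
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Try several sampling settings instead of a single hardwired top-k run.
    for settings in ({"top_k": 40}, {"top_p": 0.9}, {"top_p": 0.95, "temperature": 0.7}):
        output = model.generate(input_ids, do_sample=True, max_length=30,
                                pad_token_id=tokenizer.eos_token_id, **settings)
        print(settings, "->", tokenizer.decode(output[0][input_ids.shape[1]:]))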
> This sounds like a setup for non-falsifiable beliefs.
It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.
The paper you link to claims that the hand-crafted queries used to evaluate the knowledge and understanding of language models are "sub-optimal" because they do not take into account the context in which an LM was trained. For example:
These manually created prompts (e.g. “Barack Obama was born in _”) might be sub-optimal because LMs might have learned target knowledge from substantially different contexts (e.g. “The birth place of Barack Obama is Honolulu, Hawaii.”) during their training.
In other words, the paper considers hand-crafted prompts like the one in the example to be "sub-optimal" because they are not in the right format. To paraphrase the authors a bit, such prompts are like making a malformed query to a database.
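To see concretely what "the right format" means here, this is roughly how such a probe is run in practice (a sketch using the Hugging Face fill-mask pipeline with an off-the-shelf BERT checkpoint; the model and its outputs are illustrative, not the paper's exact setup). Whether the right city comes back can hinge on which phrasing you feed it:
    from transformers import pipeline

    probe = pipeline("fill-mask", model="bert-base-cased")

    # Two phrasings of the same query; the stored "fact" only comes back if the
    # prompt happens to match how the model encountered it during training.
    for prompt in ("Barack Obama was born in [MASK].",
                   "The birth place of Barack Obama is [MASK]."):
        best = probe(prompt)[0]                 # highest-scoring completion
        print(prompt, "->", best["token_str"], round(best["score"], 3))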
It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.
To be fair, the ability to return a correct answer given a question in the right format is not without use. That, indeed, is how databases work. But it shows none of the "understanding" or "knowledge" the paper claims is acquired by language models.
> It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.
To use your database analogy: in what sense should we claim that a database doesn't know a record when you are querying it with a malformed SQL query? If we fixed the query and it then emitted the right answer, then obviously it did store the information. The query does not encode the answer, and it is vanishingly unlikely that the database would ever return the right answer by accident if it did not store the information in some way. Since LMs can get much better results just by tailoring the prompts (increased by a third in that paper! and there's no reason to think that is the very best possible performance either!), that shows that existing practices drastically underestimate what knowledge the model has been able to learn. Learning about the real world or text is very different from learning your particular dumb broken query method.
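To spell the analogy out (a minimal sqlite sketch; the table and the stored fact are made up for illustration): the broken query fails, the corrected query returns the stored row, and at no point did the query itself contain the answer.
    import sqlite3  # standard library

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (name TEXT, birthplace TEXT)")
    con.execute("INSERT INTO people VALUES ('Barack Obama', 'Honolulu, Hawaii')")

    # Malformed query: wrong table name. The lookup fails, but the fact is still stored.
    try:
        con.execute("SELECT birthplace FROM persons WHERE name = 'Barack Obama'")
    except sqlite3.OperationalError as e:
        print("query failed:", e)               # no such table: persons

    # Fix the query and the stored fact comes back out.
    row = con.execute("SELECT birthplace FROM people WHERE name = 'Barack Obama'").fetchone()
    print(row[0])                               # Honolulu, Hawaii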
The problem is that nobody claims that databases "know" anything. They store data. Data can be retrieved from storage. That's all they do.
>> The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way.
Oh, but it does. A query absolutely encodes the answer. Queries are patterns that are matched against the data stored in the database. If a query fails, it's because it does not correctly represent the information it is trying to retrieve. For example, if I SELECT * FROM PEOPLE and there is no table "PEOPLE", then I don't get an answer, because the query does not correctly represent the structure of the database. You cannot retrieve any data from a database unless you have some idea about the structure of that data.
But that's not the point here. I don't disagree that a language model can learn (i.e. it can represent some elements of its training dataset). I disagree that it "understands" anything and I find the fact that it needs specific queries to retrieve the data it is representing to be evidence that it does not.
And so it's not more useful than a traditional database at this kind of task. Except it's much less precise than a traditional database and costs considerably more to create.
>> Learning about the real world or text is very different from learning your particular dumb broken query method.
I'm sorry, I don't understand what you mean here. What is my "particular dumb broken query method"? Is that meant as a personal attack?