That's a very sloppy post. He runs a single example, without even running the model locally or changing the sampling parameters, and then concludes that GPT-2 is doing nothing but pattern-matching? A lot of people underestimate NNs because the sampling from them (top-k! how much dumber and cruder can you get? nucleus sampling works better, but is still obviously suboptimal) destroys a lot of dark knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would try once, without changing any sampling settings, and conclude that it wasn't doing anything, but if you retried with better settings you would get different results. I'm not the only one to notice that: https://www.quantamagazine.org/common-sense-comes-to-compute... Such tests can prove the presence of knowledge, but not the absence... And of course, GPT-3 does extensive arithmetic tricks: https://arxiv.org/pdf/2005.14165.pdf#page=22
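To make the top-k vs. nucleus point concrete, here is a rough sketch of what those truncation schemes do to a single next-token distribution (a toy numpy example; the vocabulary size and the distribution are made up). Both throw away the low-probability tail where the dark knowledge lives, top-k with a fixed cutoff, nucleus adaptively per step:
    import numpy as np

    def top_k_filter(probs, k=40):
        """Keep only the k most likely tokens and renormalize; everything else is zeroed."""
        cutoff = np.sort(probs)[-k]
        kept = np.where(probs >= cutoff, probs, 0.0)
        return kept / kept.sum()

    def nucleus_filter(probs, p=0.9):
        """Keep the smallest set of most-likely tokens whose cumulative probability reaches p."""
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        last = np.searchsorted(cumulative, p) + 1   # index of the last token kept
        kept = np.zeros_like(probs)
        kept[order[:last]] = probs[order[:last]]
        return kept / kept.sum()

    # Toy next-token distribution over a 1000-token vocabulary: a fat head plus a long tail.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.full(1000, 0.02))

    print("tokens surviving top-k (k=40): ", np.count_nonzero(top_k_filter(probs)))
    print("tokens surviving nucleus (p=0.9):", np.count_nonzero(nucleus_filter(probs)))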
The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves. This is more than I can say about most chatter about ML.
>Such tests can prove the presence of knowledge, but not the absence...
This sounds like a setup for non-falsifiable beliefs.
> The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves.
And I did (using my own local GPT-2-1.5b install, which let me set the sampling hyperparameters myself rather than being restricted to the inappropriate hardwired ones of an online service), I linked to another person demonstrating the same thing, I pointed out the extensive GPT-3 evaluation OA did, and here, have another link about how bad querying of language models leads to highly misleading results about how much they know: https://arxiv.org/abs/1911.12543 Measurement error, in general, biases estimates towards zero.
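For anyone who wants to reproduce that locally rather than through a hardwired web demo, this is roughly what the re-test looks like with the Hugging Face transformers library ("gpt2-xl" is the 1.5B checkpoint; the prompt and the particular settings here are illustrative, and sampled outputs will vary from run to run):
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")   # the 1.5B-parameter model
    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

    prompt = "Q: What is 4 + 2?\nA:"                       # illustrative prompt, not the article's
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Try several sampling settings instead of a single hardwired top-k run.
    for settings in ({"top_k": 40}, {"top_p": 0.9}, {"top_p": 0.95, "temperature": 0.7}):
        output = model.generate(input_ids, do_sample=True, max_length=30,
                                pad_token_id=tokenizer.eos_token_id, **settings)
        print(settings, "->", tokenizer.decode(output[0][input_ids.shape[1]:]))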
> This sounds like a setup for non-falsifiable beliefs.
It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.
The paper you link to claims that the hand-crafted queries used to evaluate the knowledge and understanding of language models are "sub-optimal" because they do not take into account the context in which an LM was trained. For example:
These manually created prompts (e.g. “Barack Obama was born in _”) might be sub-optimal because LMs might have learned target knowledge from substantially different contexts (e.g. “The birth place of Barack Obama is Honolulu, Hawaii.”) during their training.
In other words, the paper considers hand-crafted prompts like the one in the example to be "sub-optimal" because they are not in the right format. To paraphrase the authors a bit, such prompts are like making a malformed query to a database.
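To see concretely what "the right format" means here, this is roughly how such a probe is run in practice (a sketch using the Hugging Face fill-mask pipeline with an off-the-shelf BERT checkpoint; the model and its outputs are illustrative, not the paper's exact setup). Whether the right city comes back can hinge on which phrasing you feed it:
    from transformers import pipeline

    probe = pipeline("fill-mask", model="bert-base-cased")

    # Two phrasings of the same query; the stored "fact" only comes back if the
    # prompt happens to match how the model encountered it during training.
    for prompt in ("Barack Obama was born in [MASK].",
                   "The birth place of Barack Obama is [MASK]."):
        best = probe(prompt)[0]                 # highest-scoring completion
        print(prompt, "->", best["token_str"], round(best["score"], 3))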
It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.
To be fair, the ability to return a correct answer given a question in the right format is not without use. That, indeed, is how databases work. But it shows none of the "understanding" or "knowledge" the paper claims is acquired by language models.
> It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.
To use your database analogy: in what sense should we claim that a database doesn't know a record when you are querying it with a malformed SQL query? If we fixed the query and it then emitted the right answer, then obviously it did store the information. The query does not encode the answer, and it is vanishingly unlikely that the database would ever return the right answer by accident if it did not store the information in some way. Since LMs can get much better results just by tailoring the prompts (increased by a third in that paper! and there's no reason to think that is the very best possible performance either!), that shows that existing practices drastically underestimate what knowledge the model has been able to learn. Learning about the real world or text is very different from learning your particular dumb broken query method.
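To spell the analogy out (a minimal sqlite sketch; the table and the stored fact are made up for illustration): the broken query fails, the corrected query returns the stored row, and at no point did the query itself contain the answer.
    import sqlite3  # standard library

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (name TEXT, birthplace TEXT)")
    con.execute("INSERT INTO people VALUES ('Barack Obama', 'Honolulu, Hawaii')")

    # Malformed query: wrong table name. The lookup fails, but the fact is still stored.
    try:
        con.execute("SELECT birthplace FROM persons WHERE name = 'Barack Obama'")
    except sqlite3.OperationalError as e:
        print("query failed:", e)               # no such table: persons

    # Fix the query and the stored fact comes back out.
    row = con.execute("SELECT birthplace FROM people WHERE name = 'Barack Obama'").fetchone()
    print(row[0])                               # Honolulu, Hawaii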
The problem is that nobody claims that databases "know" anything. They store data. Data can be retrieved from storage. That's all they do.
>> The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way.
Oh, but it does. A query absolutely encodes the answer. Queries are patterns that are matched against the data stored in the database. If a query fails, it's because it does not correctly represent the information it is trying to retrieve. For example, if I SELECT * FROM PEOPLE and there is no table "PEOPLE", then I don't get an answer, because the query does not correctly represent the structure of the database. You cannot retrieve any data from a database unless you have some idea about the structure of that data.
But that's not the point here. I don't disagree that a language model can learn (i.e. it can represent some elements of its training dataset). I disagree that it "understands" anything and I find the fact that it needs specific queries to retrieve the data it is representing to be evidence that it does not.
And so it's not more useful than a traditional database at this kind of task. Except it's much less precise than a traditional database and costs considerably more to create.
>> Learning about the real world or text is very different from learning your particular dumb broken query method.
I'm sorry, I don't understand what you mean here. What is my "particular dumb broken query method"? Is that meant as a personal attack?