> People who say this nonsense need to start properly defining human level intelligence because nearly anything you throw at GPT-4 it performs at at least average human level, often well above.
"Average human level" is pretty boring though. Computers have been doing arithmetic at well above "average human level" since they were first invented. The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well. Which is clearly still not the case.
Lol ok. Still human level. And GPT-4 is way above average in most tasks.
>Computers have been doing arithmetic at well above "average human level" since they were first invented.
Cool. That's what the "general" in AGI is about. GPT-4 is very general.
>The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well.
As well as what kind of people? Experts?
That was not the premise of AGI when the term was coined, or for a long time afterwards. The goalposts have shifted (as they often do in this field) so that's what the term seems to mean now, but AGI meant artificial and generally intelligent, a bar which has been passed.
There's no difference between your definition of AGI, which is supposed to surpass experts in every field, and super intelligence.
> Lol ok. Still human level. and GPT-4 is way above average in most tasks.
It has access to a lot of information that most humans don't have memorized. It's a better search engine than most humans. And it can format that information into natural language.
But can it drive a car? If given an incentive to not confabulate and the knowledge that its statements are being verified, can it achieve that as consistently as the median human?
If you start by giving it a simple instruction with stark consequences for not following it, can it continue to register the importance of that instruction even after you give it a lot more text to read?
> As well as what kind of people? Experts?
Experts are just ordinary people with specific information. You're giving the specific information to the AI, aren't you? It's in the training data.
> There's no difference between your definition of AGI, which is supposed to surpass experts in every field, and super intelligence.
That's because there is no difference between them. Super intelligence is achievable just by making general intelligence faster. If you have AGI and can make it go faster by throwing more compute hardware at it then you have super intelligence.
>It has access to a lot of information that most humans don't have memorized.
It's not just about knowledge.
Lots of papers showing strong reasoning across various reasoning types. Couple papers demonstrating the development of world models too.
>It's a better search engine than most humans. And it can format that information into natural language.
Not how this works. They aren't search engines, and their performance parity with people isn't limited to knowledge tasks alone.
>But can it drive a car? If given an incentive to not confabulate and the knowledge that its statements are being verified, can it achieve that as consistently as the median human?
Can a blind man drive a car? A man with no hands?
>If you start by giving it a simple instruction with stark consequences for not following it, can it continue to register the importance of that instruction even after you give it a lot more text to read?
Lol yes
>Experts are just ordinary people with specific information. You're giving the specific information to the AI, aren't you? It's in the training data.
No. Experts are people with above-average aptitude for any given domain. It's not just about knowledge. Many people try and fail to become experts in any given domain.
>That's because there is no difference between them. Super intelligence is achievable just by making general intelligence faster.
That's not how intelligence works. Dumb thinking sped up is just more dumb thinking but faster.
> Lots of papers showing strong reasoning across various reasoning types. Couple papers demonstrating the development of world models too
Actual reasoning, or reconstruction of existing texts containing similar reasoning?
> Not how this works. They aren't search engines, and their performance parity with people isn't limited to knowledge tasks alone.
It kind of is how this works, and most of the source of its ability to beat average humans at things is on knowledge tasks.
> Can a blind man drive a car? A man with no hands?
Lack of access to cameras or vehicle controls isn't why it can't drive a car.
> Lol yes
The existence of numerous ChatGPT jailbreaks is evidence to the contrary.
> No. Experts are people with above-average aptitude for any given domain. It's not just about knowledge. Many people try and fail to become experts in any given domain.
Many people are of below average intelligence, or give up when something is hard but not impossible.
> That's not how intelligence works. Dumb thinking sped up is just more dumb thinking but faster.
If you have one machine that will make one attempt to solve a problem a day and succeeds 90% of the time and another that will make a billion attempts to solve a problem a second and succeeds 10% of the time, which one has solved more problems by the end of the week?
>Lack of access to cameras or vehicle controls isn't why it can't drive a car.
It would be best to wait till what you say can be evaluated. That is your hunch, not fact.
>The existence of numerous ChatGPT jailbreaks is evidence to the contrary.
No it's not. People fall for social engineering and do what you ask. If you think people can't be easily derailed, boy do I have a bridge for you.
>Many people are of below average intelligence, or give up when something is hard but not impossible.
Ok. Doesn't help your point. And many above-average people don't reach expert level either. If you want to rationalize all that as "gave up when it wasn't impossible", go ahead lol, but reality paints a very different picture.
>If you have one machine that will make one attempt to solve a problem a day and succeeds 90% of the time and another that will make a billion attempts to solve a problem a second and succeeds 10% of the time, which one has solved more problems by the end of the week?
"Problems" aren't made equal. Practically speaking, it's very unlikely the billion per second thinker is solving any of the caliber of problems the one attempt per day is solving. Solving more "problems" does not make you a super intelligence.
For anyone following along, they are in my sibling comment. Linked papers here[0]. The exact same conversation is happening there, but sourced.
> 3 of them don't even have anything to do with a existing dataset testing
Specifically, I address this claim and give strong evidence for why you should doubt it, especially with this specific wording. The short version is that when you scrape the entire internet for your training data, you get a lot of overlap and can't confidently call these evaluations "zero-shot." All experiments performed in the linked works use datasets that are not significantly different from data found in the training set. For those that are "hand written," see my complaints (linked) about HumanEval.
> It would be best to wait till what you say can be evaluated. That is your hunch, not fact.
LLMs aren't even the right kind of thing to drive a car. We have AIs that attempt to drive cars and have access to cameras and vehicle controls and they still crash into stationary objects.
> No it's not. People fall for social engineering and do what you ask. If you think people can't be easily derailed, boy do I have a bridge for you.
Social engineering works because most human interactions aren't malicious and the default expectation is that any given one won't be.
That's a different thing than if you explicitly point out that this text in particular is confirmed malicious and you must not heed it, and then it immediately proceeds to do it anyway.
And yes, you can always find that one guy, but that's this:
> Many people are of below average intelligence
It has to beat the median because if you go much below it, there are people with brain damage. Scoring equal to someone impaired or disinclined to make a minimal effort isn't a passing grade.
> "Problems" aren't made equal. Practically speaking, it's very unlikely the billion per second thinker is solving any of the caliber of problems the one attempt per day is solving.
The speed is unrelated to the difficulty. You get from one a day to a billion a second by running it on a thousand supercomputers instead of a single dated laptop.
So the percentages are for problems of equal difficulty.
This is infinite monkeys on infinite typewriters. Except that we don't actually have infinite monkeys or infinite typewriters, so an AI which is sufficiently terrible can't be made great by any feasible amount of compute resources. Whereas one which is kind of mediocre and fails 90% of the time, or even 99.9% of the time, can be made up for in practice with brute force.
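To put rough numbers on that (a back-of-the-envelope sketch in Python, with illustrative figures and the simplifying assumption that each attempt is an independent try at a fresh problem of equal difficulty):

    # Expected problems solved in one week; illustrative numbers only.
    SECONDS_PER_WEEK = 7 * 24 * 60 * 60        # 604,800

    # Machine A: one attempt per day, 90% success rate per attempt.
    expected_a = 7 * 0.9                       # ~6.3 problems

    # Machine B: a billion attempts per second, 10% success rate per attempt.
    expected_b = 1e9 * SECONDS_PER_WEEK * 0.1  # ~6e13 problems

    print(f"A: ~{expected_a:.1f} solved, B: ~{expected_b:.1e} solved")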
But there are still problems that ChatGPT can't even solve 0.1% of the time.
> The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well. Which is clearly still not the case.
I imagine an important concern is the learning & improvement velocity. Humans get old, tired, etc. GPUs do not. It isn't the case now, but it is fuzzy how fast we could collectively get there. Break out problem domains into modules, off to the silicon dojos until your models exceed human capabilities, and then roll them up. You can pick from OpenGPT plugins; why wouldn't an LLM hypervisor/orchestrator do the same?
It seems you have the wrong idea of what is being conveyed, or what average human intelligence is. It isn't about being able to do math. It is being able to invent, mimic quickly, abstract, memorize, specialize, and generalize. There's a reason humans have occupied every continent of the earth and even areas outside. It's far more than being able to do arithmetic or playing chess. This just all seems unimpressive to us because it is normal, to us. But this certainly isn't normal if we look outside ourselves. Yes, there's intelligence in many lifeforms, even ants, but there is some ineffable or difficult to express uniqueness to human intelligence (specifically in its generality) that is being referenced here.
To put it one way, a group of machines that could think at the level of an average teenager (or even lower) but able to do so 100x faster would probably outmatch a group of human scientists in being able to solve complex and novel math problems. This isn't "average human level" but below. "Average human level" is just a shortcut term for this ineffable description of the _capacity_ to generalize and adapt so well. Because we don't even have a fucking definition of intelligence.
> It isn't about being able to do math. It is being able to invent, mimic quickly, abstract, memorize, specialize, and generalize.
But this is exactly why average is boring.
If you ask ChatGPT what it's like to be in the US Navy, it will have texts written by Navy sailors in its training data and produce something based on those texts in response to related questions.
If you ask the average person what it's like to be in the US Navy, they haven't been in the Navy, may not know anyone who is, haven't taken any time to research it, so their answers will be poor. ChatGPT could plausibly give a better response.
But if you ask the questions of someone who has, they'll answer related questions better than ChatGPT. Even if the average person who has been in the Navy has no greater intelligence than the average person who hasn't.
It's not better at reasoning. It's barely even capable of it, but has access to training data that the average person lacks.
Honestly, I think it is the lens. Personally I find it absolutely amazing. It's this incredibly complex thing that we've been trying to describe for thousands of years but have completely failed to (we've gotten better, of course). It's this thing right in front of us that looks simple, but only because not many try to peek behind the curtain. Looking behind there is like trying to describe a Lovecraftian monster. But this is all in plain sight. That's pretty crazy imo. But hey, dig down the rabbit hole of any subject and you'll find this complex world. Most things are like a collage. From far away their shape looks clear and precise, but on close inspection you find that each tile is itself another beautiful piece. This is true even for seemingly simple things, and honestly I think that's even more beautiful. This complex and chaotic world is all around us but we take it for granted. Being boring comes down to a choice.
> If you ask the average person what it's like to be in the US Navy, ChatGPT could plausibly give a better response.
There's also a bias. Does a human know the instructions are a creative exercise? It is hard to measure, because what you'd need to prompt a human with is "Supposing you were a conman trying to convince me you were in the Navy, how would you describe what it was like?", since the average human response is going to default to not lying and fabricating things. You also need to remember that your interpretation (assuming you aren't/weren't in the Navy) is as someone hearing a story rather than aligning that story to lived experiences. You'd need to compare the average human making up a story to GPT, not an average human's response.
> It's barely even capable of it, but has access to training data that the average person lacks.
I do agree that GPT is great as a pseudo, noisy library. I find it a wonderful and very useful tool. I often forget specific words used to describe certain concepts. This is hard to google. GPT finds them pretty well or returns something close enough that I can do a quick iterative prompt and find the desired term. Much faster than when I used to do this by googling. But yeah, I think we both agree that GPT is by no means sentient and likely not intelligent (ill-defined, and defined differently by different people). But we can find many things, and different things, interesting. My main point is I wanted to explain why I find intelligence so fascinating. Hell, it is a major part of why I got into ML research in the first place (Asimov probably helped a lot too).
I definitely don't buy these papers at face value. I say this as an ML researcher btw.
You'll often see these works discussing zero-shot performance. But many of these tasks are either not zero-shot or not even a known n-shot. Let's take a good example: Imagen[0] claims zero-shot MS-COCO performance but trains on LAION. COCO classes exist in LAION, and there are similar texts. Explore COCO[1] and explore CLIP retrieval[2] for LAION. The example given is the first sample from COCO aircraft, and you'll find almost identical images and captions with many of the same keywords. This isn't zero-shot.
Why does this matter? Because of dataset contamination[3] in the evaluation process. You can't conclude that a model has learned something if it has access to the evaluation data. Test sets have always been a proxy for generalization and MUST be recognized as proxies.
This gets really difficult with LLMs, where all we know is that they've scraped a large swath of the internet, and that includes GitHub and Reddit. I show some explicit examples and explanation with code generation here[4]. From there you might even see how difficult it is to generate novel test sets that aren't actually contaminated, which is my complaint about HumanEval. I show that we can find dupes or near dupes on GitHub despite these being "hand written."
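To make "dupes or near dupes" concrete, here is a minimal sketch of the kind of overlap check I mean (a toy n-gram Jaccard comparison between a test item and candidate training documents; the function names and the 8-gram/0.2 threshold are purely illustrative, and real contamination audits work at corpus scale with suffix arrays, MinHash, or embedding similarity):

    # Toy contamination check: flag a test item if any training document
    # shares a large fraction of its word n-grams. Illustrative only.
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a and b else 0.0

    def possibly_contaminated(test_item, train_docs, threshold=0.2):
        test_grams = ngrams(test_item)
        return any(jaccard(test_grams, ngrams(doc)) >= threshold for doc in train_docs)

A test set can be "hand written" and still fail a check like this against the crawl, which is exactly the HumanEval situation.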
As for your sources: they all use GPT models, and we don't know what data those models do and don't have. But we do know they were trained on Reddit and GitHub. That should be enough to tell you that certain things like physics and coding problems[5] are spoiled. If you look at all the datasets used for evaluation in the works you listed, I think you'll find reason to believe that there's a good chance these too are spoiled. (Other datasets are spoiled, and there's lots of experimentation demonstrating that the causal reasoning isn't as good as the performance suggests.)
Now mind you, this doesn't mean that LMs can't do causal reasoning. They definitely can, including causal discovery[6]. But this all tells us that it is fucking hard to evaluate models, and even harder when we don't know what they were trained on. Maybe we need to be a bit more nuanced and stop claiming things so confidently. There are a lot of people trying to sell snake oil right now. These are very powerful tools that are going to change the world, but they are complex and people don't know much about them. We saw many snake oil salesmen at the birth of the internet too. That didn't mean the internet wasn't important or wasn't going to change the course of humanity. It just meant that people were profiting off of the confusion and complexity.
I don't think you took more than a passing glance, if any at those papers.
What you describe is impossible with these 3.
https://arxiv.org/abs/2212.09196 - new evaluation set introduced with the paper, modelled after tests that previously only had visual equivalents. Contamination literally impossible.
https://arxiv.org/abs/2204.02329 - effect of explanations on questions introduced with the paper. Dataset concerns make no sense.
https://arxiv.org/abs/2211.09066 - new prompting method introduced to improve algorithmic calculations. Dataset concerns make no sense.
The causal paper is the only one where worries about dataset contamination make any sense at all.
> I don't think you took more than a passing glance, if any at those papers.
I'll assume in good faith but let's try to keep this in mind both ways.
> What you describe is impossible with these 3.
Definitely possible. I did not write my comment as a paper but I did provide plenty of evidence. I specifically ask that you pay close attention to my HumanEval comment and click that link. I am much more specific about how a "novel" dataset may not actually be novel. This is a complicated topic and we must connect many dots. So care is needed. You have no reason to trust my claim that I am an ML researcher, but I assure you that this is what I do. I have a special place in my heart for evaluation metrics too and understanding their limitations. This is actually key. If you don't understand the limits to a metric then you don't understand your work. If you don't understand the limits of your datasets and how they could be hacked you don't understand your work.
=== Webb et al ===
Let's see what they are using to evaluate.
> To answer this question, we evaluated the language model GPT-3 on a range of zero-shot analogy tasks, and performed direct comparisons with human behavior. These tasks included a novel text-based matrix reasoning task based on Raven's Progressive Matrices, a visual analogy problem set commonly viewed as one of the best measures of fluid intelligence.
Okay, so they created a new dataset. Great, but do we have the HumanEval issues? You can see that Raven's Progressive Matrices were introduced in 1938 (referenced paper), and you'll also find many existing code sets on GitHub that are almost a decade old, even ML ones that are >7 years old. We can also find them on Blogspot, WordPress, and Wikipedia, which are the top three domains for Common Crawl (used for GPT-3)[0]. This automatically disqualifies this claim from the paper:
> Strikingly, we found that GPT-3 performed as well or better than college students in most conditions, __despite receiving no direct training on this task.__
It may be technically correct, since there is no "direct" training, but it is clear that the model was trained on these types of problems. But that's not the only work they did:
> GPT-3 also displayed strong zero-shot performance on letter string analogies, four-term verbal analogies, and identification of analogies between stories.
I think we can see that these are also obviously going to be in the training data; that GPT-3 had access to examples, similar questions, and even in-depth breakdowns of why the answers are correct.
Contamination isn't "literally impossible" but trivially proven. This seems to exactly match my complaint about HumanEval.
=== Lampinen et al ===
We need only look at the example on the second page.
Task instruction:
> Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence.
Answer explanation:
> Explanation: David’s eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.
You just have to ask yourself whether this prompt and answer are plausibly anywhere in Common Crawl. I think we know there are many Blogspot posts with questions similar to SAT and IQ tests, which is what this experiment resembles.
=== Conclusion ===
You have strong critiques of my response but little to back them up. I'll reiterate, because it was in my initial response: you are not performing zero-shot testing when your test set includes data similar to the training data. That's not what zero-shot is. I wrote more about this a few months back[1], which may be worth reading. What would change my opinion is not a claim that the dataset did not exist prior to the crawl, but evidence that the model was not trained on data significantly similar to that in the test set. This is, again, my original complaint about HumanEval, and these papers do nothing to address it.
I'll go even further. I'd encourage you to look at this paper[2] where data isn't just exactly de-duplicated, but near de-duplicated. There is an increase in performance for these results. But I'm not going to explain everything to you. I will tell you that you need to look at Figures 4, 6, 7, A3, ESPECIALLY A4, A5, and A6 VERY carefully. Think about how these results can be explained and the relationship to random pruning. I'll also say that their ImageNet results ARE NOT zero-shot (for reasons given previously).
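To make "near de-duplicated" concrete: it means dropping items whose content is almost identical, not just byte-for-byte copies. A minimal sketch of one embedding-similarity approach, assuming you already have embeddings from some encoder (the 0.95 threshold and the brute-force pairwise loop are illustrative, and this is the general idea rather than the linked paper's exact procedure; real pipelines cluster first to avoid the quadratic comparison):

    import numpy as np

    def near_dedup(embeddings, threshold=0.95):
        # Return indices to keep, dropping any item whose cosine similarity
        # to an already-kept item meets or exceeds the threshold.
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        keep = []
        for i, vec in enumerate(normed):
            if all(float(vec @ normed[j]) < threshold for j in keep):
                keep.append(i)
        return keep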
But we're coming back to the same TLDR: evaluating models is a hard and already noisy process, and models that have scraped a significant portion of the internet are substantially harder to evaluate. If you can provide strong evidence that there isn't contamination, then I'll take these works more seriously. This is a point you are not addressing. You have to back up the claims, not just state them. In the meantime, I have strong evidence that these, and many other, datasets are contaminated. This even includes many causal datasets that you have not listed but that were used in other works. Essentially: if the test set is on GitHub, it is contaminated. Again, see HumanEval and my specific response that I linked. You can't just say "wrong," drop some sources, and leave it at that. That's not how academic conversations happen.
"Average human level" is pretty boring though. Computers have been doing arithmetic at well above "average human level" since they were first invented. The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well. Which is clearly still not the case.