> I'd heartily recommend maybe taking down the marketing vibrance down a notch a...

tbalsam · 2024-12-06T22:22:13 1733523733

I feel rather consternated that this response effectively boils down to "yes, we know we overhyped this to get people's attention, and now that we have it we can be more honest about it". Fighting for place in the attention economy is understandable, being deceptive about it is not.

This is part of the ethical morass of why some more serious researchers aren't touching the benchmark. People are not going to take it seriously if it continues like this!

mikeknoop · 2024-12-06T22:55:26 1733525726

I think we agree; to clarify, sharp messaging isn't inaccurate messaging. And I believe the story is not overhyped given the evidence: the benchmark resisted a $1M prize pool for ~6 months. But I concede we did obsess about the story to give it the best chance of survival in the marketplace of ideas against the incumbent AI research meme (LLM scaling). Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.

mrandish · 2024-12-07T00:12:48 1733530368

Mike - please know that not everyone who appreciates ARC feels the same way as the GP. I'm not an academic researcher but I am quite sensitive to hype and excessive marketing. I've never felt the ARC site was anything other than appropriately professional.

Even revisiting it now, I don't see anything wrong with being concisely clear and even a little provocative in stating your case on your own site. Especially since a key value of ARC is getting more objectively grounded regarding progress toward AGI. On top of that ARC is "A non-profit for the public advancement of open artificial general intelligence" that you guys are personally donating serious money and time to that's helping a field where a lot of entrepreneurs are going to make money and academics are going to advance their careers.

My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it. "Sharpening" the message this year has clearly paid off in bringing attention that's shifted the conversation and is helping advance progress toward AGI in ways nothing else has. I also greatly appreciate the time and care you and Francois have put into making the ARC proposition clear enough for non-technical people to understand. That's hard to do and doesn't happen by accident.

Personally, I've found ARC valuable in the real world outside of academia and domain experts because it provides a conceptually simple starting place to discuss with non-technical people what the term AGI might even mean. My high school-aged daughter asked me about vague AGI impending doom scenarios she heard on TikTok. I had her solve a couple ARC samples and then pointed out that today's best AIs aren't yet close to doing the same. This counter-intuitive revelation got her pondering the "Why?" which led to a deep discussion about the multi-dimensional breadth of human creativity and an appreciation of the many ways artificial intelligences might differ from human intelligence.

YeGoblynQueenne · 2024-12-07T12:53:59 1733576039

>> My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it.

Your perception is very wrong and the likely reason is that as you say you're not an academic researcher. ARC made a huge splash with the original Kaggle competition a few years ago and it drew in exactly the kind of "academic researcher" you seem to be pointing to: those in university research groups who do not have access to the data and compute that the big tech companies have, and who can consequently not compete in the usual big data benchmarks that are dominated by Google, OpenAI, Meta, and friends. ARC, with its (unfair) few-shot tasks and constantly changing private test set, is exactly the kind of dataset that that kind of researcher are looking for, something that is relatively safe from big tech deep neural nets. Even the $1 million prize seems specially designed to be just enough to draw in that crowd of not super-rich academics while leaving corporate research groups insufficiently motivated.

Besides which, I won't name names but one of the principal researchers in the winning system is just one of those academics. I don't know which is the period you mean ARC was ignored by the academic community but that particular researcher was in a certain meeting of like-minded academics two years ago where one of the main areas of discussion was in short "how to beat ARC and show that our stuff works".

YeGoblynQueenne · 2024-12-07T12:30:05 1733574605

>> Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.

How so? All the three top systems are deep neural net systems. The first place went to a system that, quoting from the "contributions" section of the paper, employed:

>> An automated data generation methodology that starts with 100-160 program solutions for ARC training tasks, and expands them to make 400k new problems paired with Python solutions

As I pointed out in another comment the top results in ARC have been achieved by ordinary, deep-learning, big-data, memorisation based approaches. You and fchollet (in these comments) try to claim otherwise but I don't understand why.

In fact, no, I understand why. I think fchollet wanted to place ARC as "not just a benchmark", the opposite of what tbalsam is asking for above. The motivation is solid: if we've learned anything in the last twenty-thirty years is that deep neural nets are very capable at beating benchmarks. For any deep neural net model that beats a benchmark though the question remains whether it can do anything else besides. Unfortunately, that is not a question that can be answered by beating yet another benchmark.

And here we are now, and the first place in the current ARC challenge goes to a deep neural net system trained on a synthetically augmented dataset. The right thing to do now would be to scale back the claims about the magickal AGI-IQ test with unicorns, and accept that your benchmark is just not any different than any other previous AI benchmark, that it is not any more informative than any other benchmark, and that a completely different kind of test of artificial intelligence is needed.

There is after all such a thing as scientific integrity. You make a big conjecture, you look at the data, realise that you're wrong, accept it, and move on. For example the authors of GLUE did that (as in SUPERGLUE). The authors of the Winograd Schema Challenge did that. You should follow their examples.

trott · 2024-12-07T18:35:31 1733596531

> realise that you're wrong, accept it, and move on

What do you think about limiting the submission size? Kaggle does this sometimes.

With a limit like 0.1-1MB (compressed), you are basically saying: "Give me sample-efficient learning algorithms, not pretrained models."

YeGoblynQueenne · 2024-12-08T02:13:35 1733624015

That's fine if you want to measure sample efficiency, but ARC-AGI is supposed to measure progress towards AGI.

trott · 2024-12-08T03:04:32 1733627072

> That's fine if you want to measure sample efficiency, but ARC-AGI is supposed to measure progress towards AGI.

On the Measure of Intelligence defines intelligence as skill-acquisition efficiency, I believe, where efficiency is with respect to whatever is the limiting factor. For each ARC task, the primary limiting factor is the number of samples in it. And the skill here is your ability to convert inputs into the correct outputs. In other words, in this context, intelligence is sample-efficiency, as I see it.

YeGoblynQueenne · 2024-12-08T12:06:28 1733659588

Is that what fchollet is claiming?

trott · 2024-12-08T16:38:20 1733675900

Not sure. But I think this follows logically from the definition of intelligence he is using. Also, see II.2.2 in the paper.

tbalsam · 2024-12-06T23:14:40 1733526880

> Now that the AI research field is coming around to the idea that something beyond deep learning is needed,

I have not heard this from anyone that I work with! It would be a curious violation of info theory were this to be the case.

Certainly, some things cannot efficiently be learned from data. This is a case where some other kind of inductive bias or prior is needed (again, from info theory) -- but replacing deep learning entirely would be rather silly.

Part of the reason that a number of researchers don't take the benchmark more seriously is because it's meant to cripple the results. For example, in the name of reducing brute force search, the compute was severely limited! This turned many off to begin with. The general contention as I understand was to let compute be a reasonable amount, but this would not play well with the numbers game. Because if you restrict compute beyond a reasonable point, it makes the numbers artificially low for people who don't know what's going on behind the scenes. And this ends up biasing the results unreasonably to favor the original messaging, (i.e., "We need something other than deep learning.")

If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

Unfortunately due to that, plus the consistent goal-post moving of the benchmark is why it's generally not really held with staying power in the research community -- the messaging changes based upon what is convenient for publicity, and there's unfortunately been a history of similar things in the past in the pedigree leading up to the ARC prize itself.

It is not entirely unsalvageable, but there really needs to be a turnaround of how the competition and prize is managed in order to win back people's trust. Placing a thumb on the scales to confirm a prior bias/previous messaging may work for a little while, but over time it robs the metric of its usability over time as the greater research community loses trust.

WhitneyLand · 2024-12-07T00:26:16 1733531176

I think you’re overly fixated on some minor points relative to the overall utility on offer here. And also skewing the facts a bit. For example at one point you quote the OP on words that were never said as far as I can see. At another point, you characterize their position as “replacing deep learning entirely” which, as far as I can tell, has never been advocated for in this comment thread or on behalf of ARC.

tbalsam · 2024-12-07T00:52:08 1733532728

That is an understandable statement, and probably fair as well I feel.

Much of this comes in reference to statements from fchollet w.r.t. replacing deep learning -- around the time of the initial prize, with a lot of the much more hype marketing, this was essentially the thru-line that was used, and it left a bitter taste in a number of peoples' mouths. W.r.t. misquoting, they did say that we needed something "beyond" deep learning, not "other than" here, and that is on me.

The utility is certainly still present, if I feel diminished, and it probably is a case of my own frustrations due to previous similar issues leading up to the ARC prize.

That being said, I do agree in retrospect that my response skewed from being objective -- it is a benchmark with a mixed history, but that doesn't mean that I should get personally caught up in it.

YeGoblynQueenne · 2024-12-07T12:47:33 1733575653

>> If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

The entire benchmark is set up so as to try and make it _artificially_ hard for deep learning: there are only three examples for each task; AND the private test set has a different distribution than the public training and validation sets (from what I can tell; a violation of PAC-Learning assumptions and then why should anyone be surprised if machine learning approaches in general can't deal with that?).

Even I (long story) find ARC to be unfair in the simplest sense of the word: it does not make for a level playing field that would allow for disparate approaches to machine learning to be compared fairly. Strangely and uniquely, the unfairness is aimed at the dominant approach, deep learning, where every other benchmark tends to skew towards deep learning (e.g. huge feature-based, labelled data).

But why's that? If ARC-AGI is a true test of AGI, or intelligence, or whatever it is supposed to be (an IQ test for AIs) then why does it have to jump through hoops just to defend itself from the dominant approach to AI? If it's a good test for AI, and the dominant approach to AI can't really do AI, then the dominant approach should not be capable of passing the test, without any shenanigans with reduced compute or few examples.

Is the purpose to demonstrate that deep neural nets can't generalise from few examples? That's machine learning 101 (although I guess there's still those who missed the lecture). Is it to encourage deep neural nets to get better at generalising from few examples? Well, first place just went to a big, deep, bad neural net with data augmentation so that doesn't even work.

iwsk · 2024-12-06T22:56:46 1733525806

we live in a society