Hacker News

Am I missing something, or is this "ARC-AGI" thing so ludicrously terrible that it's completely irrelevant?

It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to produce the output for a given input.

The problem, of course, is that the transformation is not specified, so any answer is actually acceptable: one can always come up with a justification for it. Thus there is no reasonable way to evaluate the model, other than only accepting the arbitrary answer that the authors pulled out of who knows where.

It's like those stupid tests that say "1 2 3 ..." and expect you to complete with 4. That's absurd, since any continuation is valid: e.g. you can find a polynomial that passes through any four given numbers, and the test maker didn't provide any objective criterion for deciding which algorithm among multiple candidates is to be preferred.
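The polynomial argument can be made concrete. A short sketch (my own illustration, not anything from ARC): Lagrange interpolation produces, for any chosen "next" value k, a cubic that passes exactly through (1,1), (2,2), (3,3), (4,k), so any continuation of "1 2 3" is consistent with some polynomial rule.

```python
def lagrange(points, x):
    """Evaluate the unique polynomial passing through `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = float(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Any k whatsoever "continues" the sequence 1, 2, 3 under some cubic.
for k in (4, 42, -7):
    pts = [(1, 1), (2, 2), (3, 3), (4, k)]
    # The interpolant reproduces every given point, including the chosen k.
    assert all(abs(lagrange(pts, x) - y) < 1e-9 for x, y in pts)
```

So "4" is only the preferred answer if you privilege the simplest rule, a criterion the test itself never states.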

Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).

And if, instead of AGI, one is just trying to evaluate how well the model predicts how the average human thinks, then it makes no sense to evaluate language models by their performance on predicting colored grid transformations.

For instance, since normal LLMs are not trained on colored grids, any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" to the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite not being a better model in general.



No no, that's not right. They're not asking for specific solutions. Any transformation of one grid to another will do.


What?

They say: "ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.

A successful submission is a pixel-perfect description (color and position) of the final task's output."

As far as I can tell, they are asking to reproduce exactly the final task's output.


What they mean by "specific learned skill" is that each task illustrates the use of certain "core knowledge priors" that François Chollet has claimed are necessary to solve said tasks. You can find this claim in Chollet's white paper that introduced ARC, linked below:

On the Measure of Intelligence

https://arxiv.org/abs/1911.01547

"Core knowledge priors" are a concept from psychology and cognitive science as far as I can tell.

To be clear: other than Chollet's claim that "core knowledge priors" are necessary to solve ARC tasks, there is, as far as I can tell, no other reason to assume so. Moreover, none of the systems that have posted above-0% results so far makes any attempt to use that concept, so at the very least we know that the tasks solved so far do not need any core knowledge priors to be solved.

But, just to be perfectly clear: when results are posted, they are measured by a simple comparison of the target output grids with the output grids generated by a system, not by comparing the method used to solve a task.
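A minimal sketch of that grading step (my own illustration, not the official harness; the function name is made up): since tasks are represented as JSON lists of integers, a submission is correct exactly when the predicted grid equals the target grid cell for cell.

```python
import json

def is_correct(predicted_json: str, target_json: str) -> bool:
    """Pixel-perfect grading: same dimensions, same integer at every cell."""
    predicted = json.loads(predicted_json)
    target = json.loads(target_json)
    return predicted == target

# A single wrong cell fails the whole task.
target = "[[0, 1], [1, 0]]"
print(is_correct("[[0, 1], [1, 0]]", target))  # True
print(is_correct("[[0, 1], [1, 1]]", target))  # False
```

No credit is given for the reasoning that produced the grid; only the output is compared.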

Also, if I may be critical: you can find this information all over the place online. It takes a bit of reading, I suppose, but it's public information.



