Okay I admit I'm confused and think I probably missed a crucial thing here. You'...

rfoo · on June 18, 2024

> You're saying the publicly available problem set isn't indicative of the distribution of the test set?

    Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.
    The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.

advael · on June 18, 2024

Well, in this paragraph they seem to explain that their public evaluation set is meant to be indicative of the kind of jump in difficulty you can expect from the private test set. This to me implies that my guess is close: they're looking for models that can learn simple concepts and apply them to complex problems. Keeping the test set private seems to be an attempt at making it difficult to "cheat" at this by simply memorizing superficial details of the more complex problem set, which makes sense given that the whole point of this seems to be testing for systems that can use learned abstractions to tackle novel, out-of-distribution problems.

Like with our toy "algebra" examples: sure, there's a lot of emphasis on repetition and rote in primary education on these subjects, and that's one way to get people more consistent at getting the calculations right, but to be frank I don't think it's the best way, or as crucial as it's made out to be. What someone really needs to understand about algebra is how the notation works and what the symbols mean. Like, I can't unsee the concept of "+" as a function that takes two operands and, starting at the value of the left operand, counts up by as many steps as the right operand says. When looking at algebra, the process I go through relies on a bunch of conceptual frameworks, like "Anything in the set of all Arabic numerals can be considered a literal value", "Anything in the Roman alphabet is likely a variable", "Any symbol is likely an infix operator, that is, a function whose operands are on either side of it". Some of the concepts I'm using are just notational convention. At some point I memorized the set of Arabic numerals, what they look like, what each of them means, and how they're generally written in relation to each other to express quantities combinatorially. Some of the concepts are logical relations about quantities, or definitions of functions. But crucially, the form of these distillations makes them composable. If I didn't really understand what "+" does, then maybe someone could give me some really bad homework that goes:

1 + 30 = 31

20 + 7 = 27

3 + 10 = 13

And then present me with the problem

20 + 10 + 3 = ?

And I'd think the answer is

20 + 10 + 3 = 213

That demonstrates some model of how to do these calculations, but it doesn't really capture all the important relationships the symbols represent
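To make that gap concrete, here's a rough Python sketch. Both function names are made up for illustration: `count_add` is the "start at the left operand and count" reading of "+", and `shallow_add` is just one guess at a surface rule someone could pull out of the three homework lines (put the bigger number first, drop the zeros, glue the digits together). It assumes non-negative integers and at least one non-zero digit.

```python
from functools import reduce


def count_add(left: int, right: int) -> int:
    """'+' as counting: start at `left` and take `right` steps of one.

    Only meaningful for a non-negative `right`.
    """
    total = left
    for _ in range(right):
        total += 1
    return total


def shallow_add(*operands: int) -> int:
    """Hypothetical surface rule: order the operands from largest to
    smallest, strip their zeros, and concatenate the digits that remain."""
    ordered = sorted(operands, reverse=True)
    digits = "".join(str(n).replace("0", "") for n in ordered)
    return int(digits)


# The three homework lines: both rules reproduce every answer.
homework = [((1, 30), 31), ((20, 7), 27), ((3, 10), 13)]
for operands, answer in homework:
    assert shallow_add(*operands) == answer
    assert reduce(count_add, operands) == answer

# The new problem is where they come apart.
problem = (20, 10, 3)
print(shallow_add(*problem))        # 213
print(reduce(count_add, problem))   # 33
```

The two rules are indistinguishable on the homework and only diverge on the new problem, which is the whole issue: fitting the examples doesn't tell you which rule was learned.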

We can have any number of objections to this training set. Like, I wasn't presented with any examples of adding two two-digit numbers together! Or even any examples where I needed to combine numbers in the same rank!
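That objection is easy to check mechanically. A small sketch, with hypothetical helper names, that asks whether any two operands in a problem have a non-zero digit in the same rank (ones, tens, and so on), assuming non-negative integers:

```python
def ranks_used(n: int) -> set:
    """Ranks (0 = ones, 1 = tens, ...) where n has a non-zero digit."""
    return {i for i, d in enumerate(reversed(str(n))) if d != "0"}


def shares_rank(*operands: int) -> bool:
    """True if two or more operands have a non-zero digit in the same rank."""
    seen = set()
    for n in operands:
        used = ranks_used(n)
        if seen & used:
            return True
        seen |= used
    return False


homework = [(1, 30), (20, 7), (3, 10)]
print([shares_rank(*ops) for ops in homework])  # [False, False, False]
print(shares_rank(20, 10, 3))                   # True
```

None of the homework lines ever puts two non-zero digits in the same rank, while the test problem does (the 2 and the 1 in the tens place), so the test really is out of distribution for this tiny training set.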

Definitely all true. Probably mistakes we could make in educating a kid on algebraic notation too. It's really hard to do these things in a way that both accomplishes the goal and is testable and quantifiable. But many humans demonstrate the ability to distill an understanding of concepts without exhaustive examples of their properties, so that's one of the things ARC seems to want to test. It's hard to get this perfectly right, but it's a reasonable thing to want.

rfoo · on June 19, 2024

I agree. However, it is not clear-cut what's fair and what is "gaming the benchmark" in this setup. For example:

- can I train on my own private training set (which is harder)?

- can I pretrain on The Pile or something similar, a dataset full of text crawled from the web?

- can I pretrain on elementary school textbooks?

It seems like the latter two are acceptable, given the use of GPT-4o here. But then, are the latter two that different from the first one? GPT-4o has the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with a different distribution in this case, other than making participation harder? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously wants to win would take them.