Hacker News

Author here. Which aspects are misleading? How can it be improved?


I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.

It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.
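For readers unfamiliar with the distinction: a "Python-driven agent" here just means a scripted loop around the model API, as opposed to a human iterating in the chat UI. A minimal sketch of such a loop (the `call_model` stub and the task format are illustrative placeholders, not the blog post's actual harness):

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for an API call (e.g. to claude-3.5-sonnet or o1-preview).
    Stubbed here so the sketch runs without network access."""
    return json.dumps([[0, 0], [0, 0]])

def solve_arc_task(task: dict, max_attempts: int = 3):
    """Programmatic prompt loop: show the train pairs, ask for the test
    output, and retry with feedback when the reply doesn't parse as a grid."""
    prompt = "Infer the transformation from these examples:\n"
    for pair in task["train"]:
        prompt += f"input={pair['input']} output={pair['output']}\n"
    prompt += f"Now give the output grid for input={task['test_input']} as JSON."

    for _ in range(max_attempts):
        reply = call_model(prompt)
        try:
            grid = json.loads(reply)
            if isinstance(grid, list) and all(isinstance(r, list) for r in grid):
                return grid
        except json.JSONDecodeError:
            pass
        # Unlike a human in the UI, the script can only give canned feedback.
        prompt += "\nThat was not a valid JSON grid. Try again."
    return None

task = {
    "train": [{"input": [[1]], "output": [[0]]}],
    "test_input": [[1, 1], [1, 1]],
}
print(solve_arc_task(task))
```

The canned retry feedback in the loop is exactly what's missing relative to human-in-the-loop prompting, which is the gap being described.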

So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here, o1-preview is more of the same," but it seems to me that it's very clearly qualitatively different from previous models.

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.


I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true, though; it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

I.e., it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You can’t run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

(Side note: readers may be getting confused about what “test-time scaling” is and why it’s important. TLDR: spending more compute at inference time is getting better results. That’s a big deal, because previously, pouring more compute into inference didn’t seem to make much real difference. But overall, I don’t see how anything you’ve said is either inaccurate or misleading.)


I personally am slightly surprised at o1's modest performance on ARC-AGI given the large leaps in performance on other objectively hard benchmarks like IOI and AIME.

Curiosity is the first step towards new ideas.

ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.


What do minutes and hours even mean? Software comparisons using absolute time duration are meaningless without a description of the system they were executed on; e.g., SHA256 hashes per second on a Win10 OS with an i7-14100 processor. For a product as complex as a multiuser TB-sized LLM, compute time depends on everything from the VM software stack to the physical networking and memory caching architecture.

CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.
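To illustrate the kind of hardware-anchored metric being suggested, a throughput number like hashes per second is trivially reproducible on any described machine (a generic sketch, not a benchmark of anything in the post):

```python
import hashlib
import time

def sha256_hashes_per_second(duration: float = 0.5) -> float:
    """Count how many SHA-256 digests of a small payload this machine
    completes in `duration` seconds of wall-clock time."""
    payload = b"benchmark-payload"
    count = 0
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        hashlib.sha256(payload).digest()
        count += 1
    return count / duration

# Meaningful only alongside a description of the machine it ran on.
print(f"{sha256_hashes_per_second():,.0f} hashes/sec")
```

The point of the analogy: this number pins performance to a specific, described system, whereas "70 hours against a remote API" pins it to an opaque cluster.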


These are API calls to a remote server. We don't have the option of scaling up or even measuring the compute they use to run them, so for better or worse the server cluster has to be measured as part of their model service offering.


I understand that, but that metric is only useful if you're looking at it from a shallow business perspective.


You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall-clock time to achieve a similar task is a fair metric. The only caveat is if the service offers some kind of tiered performance pricing, as if we were comparing a task performed on an AWS EC2 instance vs. an Azure VM instance, but that is not the case with these LLMs.

So yes, the wall-clock time may not reflect the performance of the model itself, but it does reflect the performance of the SaaS offerings.
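In that spirit, comparing two hosted model APIs reduces to timing the same task end to end. A minimal sketch (the `service_a`/`service_b` functions are stand-ins for real SDK calls):

```python
import time

def time_service(call, *args):
    """Wall-clock the end-to-end latency of one service call:
    returns (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = call(*args)
    return time.perf_counter() - start, result

# Stand-ins for two hosted LLM endpoints (real code would use their SDKs
# and the timing would be dominated by network plus inference latency).
def service_a(prompt): return "answer-a"
def service_b(prompt): return "answer-b"

elapsed_a, _ = time_service(service_a, "solve this ARC task")
elapsed_b, _ = time_service(service_b, "solve this ARC task")
print(f"A: {elapsed_a:.4f}s, B: {elapsed_b:.4f}s")
```

Since neither vendor sells tiered inference speed, whatever this measures is the product as offered, which is the parent's point.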


I mean, “scaling compute at inference” actually means “using an LLM agent system.” Haven’t we known that chains of agents can be useful for a while?


I agree with basically everything you said but I think you've misunderstood my point. I'll reply to the other comment with more.




