
For what it's worth, the Qwen team misreported an ARC-AGI benchmark score for the non-thinking model by roughly a factor of 4, which has not been explained yet. They claimed a score of 41.8% on ARC-AGI-1 [0], which is much higher than what non-chain-of-thought models have been able to achieve (GPT-4.5 got 10%). The ARC team later benchmarked it at 11% [1], which is still a high score, but not the same as 41.8%. It's probably still a significant update on the model, though.

[0] https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507

[1] https://x.com/arcprize/status/1948453132184494471

Maybe 41.8% is the score of Qwen3-235B-A22B-Thinking-2507, lol. 11% for the non-thinking model is pretty high.

Makes sense; in that case it's in line with Gemini 2.5 Pro, and it aligns with their other results in the post.

They made it very clear that they were reporting that score for the non-thinking model [0]. I still don't have any guesses as to what happened here; maybe it's something format-related. I can't see a motivation to blatantly lie on a benchmark when the result would so obviously be publicly corrected.

[0] https://x.com/JustinLin610/status/1947836526853034403


They have provided a repro for the 41.8% result here: https://github.com/QwenLM/Qwen3/tree/main/eval

Could it be the public eval set vs. the private eval set the ARC team uses? The public eval set is slightly easier and may have had some unintentional data leakage, since it was released before their training data cutoff.
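
For context: ARC-AGI-1 tasks are distributed as JSON files with "train" demonstration pairs and "test" pairs, and scoring is exact match on the predicted output grid, with no partial credit. Here's a minimal sketch of that scoring loop, assuming the standard public task format; the file path and the identity "predictor" are just illustrative:

    import json

    def score_task(task_path, predict):
        # One ARC task: {"train": [...], "test": [...]}, where each
        # pair is {"input": grid, "output": grid} and a grid is a
        # list of lists of ints in 0-9.
        with open(task_path) as f:
            task = json.load(f)
        hits = 0
        for pair in task["test"]:
            # Exact match on the whole grid; ARC gives no partial credit.
            if predict(task["train"], pair["input"]) == pair["output"]:
                hits += 1
        return hits / len(task["test"])

    # Illustrative baseline that just echoes the input grid
    # (the task file name below is hypothetical).
    identity = lambda train_pairs, grid: grid
    print(score_task("evaluation/0a1d4ef5.json", identity))

Because scoring is all-or-nothing per grid, anything that changes how model output gets parsed back into a grid (a lenient vs. strict parser, different delimiters, etc.) can swing the reported number a lot, which would also be consistent with a format-related discrepancy.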


