
For what it's worth, the Qwen team misreported an ARC-AGI benchmark score for the non-thinking model by roughly a factor of 4, which has not been explained yet. They claimed a score of 41.8% on ARC-AGI-1 [0], which is much higher than what non-chain-of-thought models have been able to achieve (GPT-4.5 got 10%). The ARC team later benchmarked it at 11% [1], which is still a high score, but not the same as 41.8%. It's probably still a significant update on the model, though.

[0] https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507

[1] https://x.com/arcprize/status/1948453132184494471

Maybe 41.8% is the score of Qwen3-235B-A22B-Thinking-2507, lol. 11% for the non-thinking model is pretty high.

Makes sense; in that case it's in line with Gemini 2.5 Pro, and it aligns with their other results in the post.

They made it very clear that they were reporting that score for the non-thinking model [0]. I still don't have any guesses as to what happened here; maybe it's something format-related. I can't see a motivation to blatantly lie on a benchmark when the result would so obviously be publicly corrected.

[0] https://x.com/JustinLin610/status/1947836526853034403


They have provided a repro for the 41.8% result here: https://github.com/QwenLM/Qwen3/tree/main/eval

Could it be the public eval set vs. the private eval set the ARC team uses? The public eval set is slightly easier and may have had some unintentional data leakage, since it was released before their training data cutoff.
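
For context: ARC-AGI-1 tasks are distributed as JSON files with "train" demonstration pairs and "test" pairs, and scoring is exact match on the predicted output grid, with no partial credit. Here's a minimal sketch of that scoring loop, assuming the standard public task format; the file path and the identity "predictor" are just illustrative:

    import json

    def score_task(task_path, predict):
        # One ARC task: {"train": [...], "test": [...]}, where each
        # pair is {"input": grid, "output": grid} and a grid is a
        # list of lists of ints in 0-9.
        with open(task_path) as f:
            task = json.load(f)
        hits = 0
        for pair in task["test"]:
            # Exact match on the whole grid; ARC gives no partial credit.
            if predict(task["train"], pair["input"]) == pair["output"]:
                hits += 1
        return hits / len(task["test"])

    # Illustrative baseline that just echoes the input grid
    # (the task file name below is hypothetical).
    identity = lambda train_pairs, grid: grid
    print(score_task("evaluation/0a1d4ef5.json", identity))

Because scoring is all-or-nothing per grid, anything that changes how model output gets parsed back into a grid (a lenient vs. strict parser, different delimiters, etc.) can swing the reported number a lot, which would also be consistent with a format-related discrepancy.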


