For what it's worth, the Qwen team misreported an ARC-AGI benchmark score for the non-thinking model by roughly a factor of 4, which has not been explained yet. They claimed a score of 41.8% on ARC-AGI-1 [0], which is much higher than what non-chain-of-thought models have been able to achieve (GPT-4.5 got 10%). The ARC team later benchmarked it at 11% [1], which is still a high score, but nowhere near 41.8%. Even so, it's probably still a meaningful result for the model.
They made it very clear that they were reporting that score for the non-thinking model [0]. I still don't have a good guess as to what happened here; maybe something format-related. I can't see any motivation to blatantly lie about a benchmark result that would so obviously be publicly corrected.
Could it be the public eval set vs. the private eval set the ARC team uses? The public eval set is slightly easier and may have had some unintentional data leakage, since it was released before Qwen's training data cutoff.
[0] https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
[1] https://x.com/arcprize/status/1948453132184494471