
They made it very clear that they were reporting that score for the non-thinking model[0]. I still don't have a good guess as to what happened here; maybe something format-related. I can't see a motivation to blatantly lie about a benchmark result when it would very obviously be publicly corrected.

[0] https://x.com/JustinLin610/status/1947836526853034403




