
We’re definitely going to need better benchmarks for agentic tasks, and not just code reasoning: the kind of needlessly painful things that humans go through all the time.


It's insane on lmarena for its size; livebench should have it soon too, I guess.


The size isn't stated, so it's not a given that it's as small as 1.5-Flash.



