
> I don't put a lot of stock on evals.

Same, although I do find them helpful for setting expectations. I have some use cases (I'm hesitant to call them evals) based on how we use GPT in our product, and they make good "real world" test cases. In the past, Claude models have been the only ones I've found to be on par with GPT.



