How do you measure code generation accuracy? Are there some base tests and if so... | Hacker News


prmoustache 7 months ago | on: We fine-tuned Llama and got 4.2x Sonnet 3.5 accura...

How do you measure code generation accuracy? Are there some base tests, and if so, how can I ensure the models aren't tuned only for those tests, the same way VW cheated the emissions tests on their diesels?

samatdav 7 months ago

We run a set of change requests on the Discourse repo. Good point; we plan to publish more detailed testing benchmarks and metrics on the website.
