
One approach we've been working on is having multiple LLMs score each other. Here is the design with an example of how that works: https://github.com/HashemAlsaket/prompttools/pull/1

In short: the LLMs score each other's responses, we keep the top 50%, and the process repeats until a single top response remains



What does 'top 50%' of responses mean here, though? You'd need a ground truth for how 'good' each score was in order to calculate that, and if you had ground truth, there would be no need to use an LLM evaluator to begin with.

If you mean trusting the LLM scores to pick the 'top' 50% of responses they grade, that doesn't get around the issue of overly trusting the LLM's scores.


For now, the design is basic:

User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"

The LLM rates the responses of all the other LLMs

All other LLMs do the same

Then we take the average score of each response. The LLMs that produced the top 50% of responses respond again, and the rounds repeat until the single response with the highest score remains.
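To make that concrete, here is a minimal sketch of the loop. ask_llm is a hypothetical stand-in for whatever client call each model uses (it is not the actual prompttools API), and the float() on the judge's reply assumes the rating comes back as a bare number:

    # Sketch of the cross-scoring tournament described above.
    from statistics import mean

    RATING_PROMPT = (
        "Rate this response to the following prompt on a scale of 1-10, "
        "where 1 is a poor response and 10 is a great response:\n"
        "Prompt: {prompt}\nResponse: {response}"
    )

    def ask_llm(model, prompt):
        """Placeholder: send `prompt` to `model` and return its text reply."""
        raise NotImplementedError

    def tournament(models, prompt):
        # Each model answers the original prompt once.
        candidates = {m: ask_llm(m, prompt) for m in models}

        while len(candidates) > 1:
            scores = {}
            for author, response in candidates.items():
                # Every *other* model rates this response on a 1-10 scale.
                ratings = [
                    float(ask_llm(judge, RATING_PROMPT.format(
                        prompt=prompt, response=response)))
                    for judge in candidates if judge != author
                ]
                scores[author] = mean(ratings)

            # Keep the models behind the top 50% of responses; they answer
            # again and the next round of cross-scoring runs on the fresh
            # responses.
            ranked = sorted(scores, key=scores.get, reverse=True)
            survivors = ranked[: max(1, len(ranked) // 2)]
            if len(survivors) == 1:
                # One response with the highest score remains.
                return survivors[0], candidates[survivors[0]]
            candidates = {m: ask_llm(m, prompt) for m in survivors}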



