What does 'top 50%' of responses mean here, though? You'd need ground truth for how 'good' each response actually was to calculate that; and if you had ground truth, there'd be no need for an LLM evaluator to begin with.
If you mean trusting the LLM scores themselves to pick the 'top' 50% of responses they grade, that doesn't get around the issue of over-trusting the LLMs' scores.
User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"
Each LLM rates the responses of all the other LLMs
All other LLMs do the same
Then we take the average score of each response. The LLMs that produced the top 50% of responses respond again, and the process repeats, halving the field each round, until one response with the highest score remains.
In short: LLMs score each other's responses, keep the top 50%, repeat until one top response remains
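To make the loop concrete, here's a rough sketch of that elimination tournament. The `generate` and `score` callables are placeholders I'm introducing for illustration; they'd be actual LLM calls in practice, and nothing about their shape comes from the scheme as described:

```python
def elimination_round(models, generate, score):
    # Each surviving model produces a response for this round.
    responses = {m: generate(m) for m in models}
    # Every model scores every *other* model's response (e.g. 1-10).
    avg = {}
    for m, resp in responses.items():
        peer_scores = [score(judge, resp) for judge in models if judge != m]
        avg[m] = sum(peer_scores) / len(peer_scores)
    # Keep the top half (at least one), ranked by average peer score.
    ranked = sorted(models, key=lambda m: avg[m], reverse=True)
    return ranked[:max(1, len(models) // 2)]

def run_tournament(models, generate, score):
    # Halve the field each round until a single model remains.
    while len(models) > 1:
        models = elimination_round(models, generate, score)
    return models[0]
```

Note the sketch makes the circularity obvious: the only signal deciding who survives is `score`, i.e. the LLMs' own opinions of each other, which is exactly the trust problem raised above.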