What does 'top 50%' of responses mean here, though? You'd need ground truth for how 'good' each response actually was to calculate that; and if you had ground truth, there'd be no need for an LLM evaluator to begin with.
If you mean trusting the LLM scores themselves to pick the 'top' 50% of responses they grade, that doesn't get around the issue of over-trusting the LLMs' scores.
User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"
Each LLM rates the responses of all the other LLMs
All other LLMs do the same
Then we take the average score of each response. The LLMs that produced the top 50% of responses respond again, and the process repeats, halving the field each round, until one response with the highest score remains.
In short: LLMs score each other's responses, keep the top 50%, repeat until one top response remains
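To make the loop concrete, here's a rough sketch of that elimination tournament. The `generate` and `score` callables are placeholders I'm introducing for illustration; they'd be actual LLM calls in practice, and nothing about their shape comes from the scheme as described:

```python
def elimination_round(models, generate, score):
    # Each surviving model produces a response for this round.
    responses = {m: generate(m) for m in models}
    # Every model scores every *other* model's response (e.g. 1-10).
    avg = {}
    for m, resp in responses.items():
        peer_scores = [score(judge, resp) for judge in models if judge != m]
        avg[m] = sum(peer_scores) / len(peer_scores)
    # Keep the top half (at least one), ranked by average peer score.
    ranked = sorted(models, key=lambda m: avg[m], reverse=True)
    return ranked[:max(1, len(models) // 2)]

def run_tournament(models, generate, score):
    # Halve the field each round until a single model remains.
    while len(models) > 1:
        models = elimination_round(models, generate, score)
    return models[0]
```

Note the sketch makes the circularity obvious: the only signal deciding who survives is `score`, i.e. the LLMs' own opinions of each other, which is exactly the trust problem raised above.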