
No, they aren't. Most benchmarks use ground truth, not evaluation by another LLM. Using another LLM as verifier, aside from the obvious "quis custodiet ipsos custodes" problem, opens an entire can of worms, such as systematic biases in the evaluation. That is not disqualifying in and of itself, but it should be addressed, and the article doesn't address it at all.
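To be concrete about one such bias: position bias in pairwise judging can be estimated by showing the judge the same pair in both orders. A rough sketch below; judge_prefers_first_argument() is a stand-in for the actual judge call, not a real API:

    def judge_prefers_first_argument(review_a: str, review_b: str) -> bool:
        """Stand-in for the LLM judge: True if it prefers review_a over review_b."""
        raise NotImplementedError  # would prompt the judge model here

    def order_inconsistency_rate(pairs) -> float:
        # Show each pair in both orders. A judge with no position bias should
        # give opposite answers for (a, b) and (b, a); counting agreements
        # estimates how often the slot, not the content, decides the verdict.
        inconsistent = 0
        for a, b in pairs:
            if judge_prefers_first_argument(a, b) == judge_prefers_first_argument(b, a):
                inconsistent += 1
        return inconsistent / len(pairs)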


Even the benchmarks for maths only check the final numerical answer against ground truth, which means the LLM can output a lot of nonsense and still pass by guessing the correct answer.
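To illustrate, a final-answer grader often amounts to something like this (a hypothetical sketch; the extraction regex and function names are made up):

    import re

    def extract_final_answer(model_output: str):
        # Take the last number in the output, however the model arrived at it.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
        return numbers[-1] if numbers else None

    def grade(model_output: str, ground_truth: str) -> bool:
        # Only the final number is compared; the reasoning is never checked.
        answer = extract_final_answer(model_output)
        return answer is not None and float(answer) == float(ground_truth)

    # Pages of flawed reasoning followed by a lucky "42" still pass:
    grade("...nonsense... therefore the answer is 42", "42")  # True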


Ground truth evaluation is not that simple unless you are doing multiple-choice-style tests or something similar where the correctness of an answer can be determined by a mechanical check. Open-ended natural language tasks like this one are incredibly difficult to evaluate, and using an LLM as judge is not just the current standard; it is basically the only way to do it economically at scale.
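Concretely, LLM-as-judge evaluation usually boils down to a loop like this (a minimal sketch; the prompt, the 1-5 scale, and call_judge_model() are assumptions, not anyone's actual harness):

    JUDGE_PROMPT = """You are grading a code review of the diff below.
    Rate its correctness and usefulness from 1 to 5. Reply with only the number.

    Diff:
    {diff}

    Review:
    {review}
    """

    def call_judge_model(prompt: str) -> str:
        """Placeholder: send the prompt to the judge LLM, return its reply."""
        raise NotImplementedError

    def judge_score(diff: str, review: str) -> int:
        reply = call_judge_model(JUDGE_PROMPT.format(diff=diff, review=review))
        return int(reply.strip())

    def run_benchmark(examples) -> float:
        # Mean judge score over the dataset; note there is no ground truth here.
        scores = [judge_score(ex["diff"], ex["review"]) for ex in examples]
        return sum(scores) / len(scores)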


The original comment was this:

> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.

The comment I replied to was:

> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.

And that's just false. SWE-Bench Verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks used by OpenAI in GPT-5's model card aren't like this.

Maybe this benchmark is a special snowflake and genuinely needs LLM-as-a-judge, but that doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to showing performance differences that might not be there with a different setup. Benchmarks are already hard to trust; I'm not sure why this one should be any more indicative than the rest.


Benchmarks that execute code are more or less the only kind where you can automate evaluation at scale without humans in the loop, and even those have their caveats [1]. Regardless, when the output is natural language text (as it is in this case), there is simply no viable way to measure accuracy economically. There is frankly no argument to be had here, because this is simply not achievable with current technology.

[1] https://openai.com/index/introducing-swe-bench-verified/
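For contrast, the execution-based pattern behind SWE-bench-style benchmarks is roughly this (a sketch only; the repo path, patch file, and test command are illustrative):

    import subprocess

    def evaluate_patch(repo_dir: str, patch_file: str, test_cmd) -> bool:
        # Apply the model-generated patch, then run the project's own tests.
        # Pass/fail comes from the test suite, not from another model's opinion.
        applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file],
                                 capture_output=True)
        if applied.returncode != 0:
            return False
        tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
        return tests.returncode == 0

    # e.g. evaluate_patch("/tmp/some-repo", "model.patch", ["pytest", "-x"])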



