MMLU questions have four options, so two coin flips (picking one of four equally likely choices) give a 25% baseline. HumanEval checks code against a test suite, so a 100-byte program generated by coin flips has an O(2^-800) baseline (perhaps not quite that bad, since infinitely many programs produce the same output). GSM-8K has numerical answers, so guessing an average 3-digit answer by coin flips has roughly a 2^-10 (about 1-in-1000) chance of being correct.
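A minimal sketch of that back-of-the-envelope arithmetic, assuming the answer-space sizes stated above (four choices, a 100-byte program, a 3-digit integer), which are illustrative assumptions rather than properties of the official benchmarks:

```python
import math

# MMLU: 4 answer choices, uniform random guess
mmlu_baseline = 1 / 4                      # 0.25

# HumanEval: guessing a 100-byte (800-bit) program bit by bit
humaneval_bits = 100 * 8
humaneval_log2_baseline = -humaneval_bits  # log2 of 2^-800

# GSM-8K: guessing a 3-digit integer (900 values, 100..999)
gsm8k_baseline = 1 / 900

print(f"MMLU:      {mmlu_baseline:.0%}")
print(f"HumanEval: 2^{humaneval_log2_baseline} (astronomically small upper bound)")
print(f"GSM-8K:    {gsm8k_baseline:.4%} (~2^{math.log2(gsm8k_baseline):.1f})")
```

The HumanEval figure is only an upper bound on difficulty, per the caveat above: many distinct byte strings compile to programs with identical behavior.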
Moreover, using the same axis and scale across unrelated evals makes no sense. 0-100 is the only meaningful scale, because 0 and 100 being the min and max is the only property all the evals share. The axis starts at 30 only because that happens to be the minimum across all (model, eval) pairs, which is a completely arbitrary choice. A good rule of thumb for testing this: ask whether the graph would still make sense five years later.
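To make the contrast concrete, here is a small matplotlib sketch with hypothetical scores (the numbers are made up for illustration, not real benchmark results) showing the truncated axis next to the full 0-100 axis:

```python
import matplotlib.pyplot as plt

# Hypothetical scores, for illustration only.
evals = ["MMLU", "HumanEval", "GSM-8K"]
scores = [70, 48, 35]

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: starting at the observed minimum (here 30)
# exaggerates differences and breaks as soon as a new score dips below it.
ax_truncated.bar(evals, scores)
ax_truncated.set_ylim(30, 100)
ax_truncated.set_title("Axis starts at 30 (arbitrary)")

# Full 0-100 axis: 0 and 100 are the only bounds shared by every eval,
# so this chart stays readable no matter which models or scores appear.
ax_full.bar(evals, scores)
ax_full.set_ylim(0, 100)
ax_full.set_title("Axis spans 0-100")

plt.tight_layout()
plt.show()
```

The left panel fails the five-years-later test: the moment any (model, eval) score falls below 30, the axis bound has to move and every older chart becomes incomparable.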