MMLU questions have four options, so two coin flips (picking one of four equally likely choices) give a 25% baseline. HumanEval checks code against a test suite, so a 100-byte program generated by coin flips has an O(2^-800) baseline (perhaps not quite that bad, since infinitely many programs produce the same output). GSM-8K has numerical answers, so guessing an average 3-digit answer by coin flips has roughly a 2^-10 (about 1-in-1000) chance of being correct.
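A minimal sketch of that back-of-the-envelope arithmetic, assuming the answer-space sizes stated above (four choices, a 100-byte program, a 3-digit integer), which are illustrative assumptions rather than properties of the official benchmarks:

```python
import math

# MMLU: 4 answer choices, uniform random guess
mmlu_baseline = 1 / 4                      # 0.25

# HumanEval: guessing a 100-byte (800-bit) program bit by bit
humaneval_bits = 100 * 8
humaneval_log2_baseline = -humaneval_bits  # log2 of 2^-800

# GSM-8K: guessing a 3-digit integer (900 values, 100..999)
gsm8k_baseline = 1 / 900

print(f"MMLU:      {mmlu_baseline:.0%}")
print(f"HumanEval: 2^{humaneval_log2_baseline} (astronomically small upper bound)")
print(f"GSM-8K:    {gsm8k_baseline:.4%} (~2^{math.log2(gsm8k_baseline):.1f})")
```

The HumanEval figure is only an upper bound on difficulty, per the caveat above: many distinct byte strings compile to programs with identical behavior.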
Moreover, using the same axis and scale across unrelated evals makes no sense. 0-100 is the only meaningful scale, because 0 and 100 being the min and max is the only property all the evals share. The axis starts at 30 only because that happens to be the minimum across all (model, eval) pairs, which is a completely arbitrary choice. A good rule of thumb for testing this: ask whether the graph would still make sense five years later.
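To make the contrast concrete, here is a small matplotlib sketch with hypothetical scores (the numbers are made up for illustration, not real benchmark results) showing the truncated axis next to the full 0-100 axis:

```python
import matplotlib.pyplot as plt

# Hypothetical scores, for illustration only.
evals = ["MMLU", "HumanEval", "GSM-8K"]
scores = [70, 48, 35]

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: starting at the observed minimum (here 30)
# exaggerates differences and breaks as soon as a new score dips below it.
ax_truncated.bar(evals, scores)
ax_truncated.set_ylim(30, 100)
ax_truncated.set_title("Axis starts at 30 (arbitrary)")

# Full 0-100 axis: 0 and 100 are the only bounds shared by every eval,
# so this chart stays readable no matter which models or scores appear.
ax_full.bar(evals, scores)
ax_full.set_ylim(0, 100)
ax_full.set_title("Axis spans 0-100")

plt.tight_layout()
plt.show()
```

The left panel fails the five-years-later test: the moment any (model, eval) score falls below 30, the axis bound has to move and every older chart becomes incomparable.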