This was very cool until I realized that a significant fraction of the questions have incorrect answers. The two biology-related questions I checked were both wrong.
Obviously auditing them all would be an expensive process, but I'd really like to know what percentage of these questions are just wrong. Some of the ones they called out are pretty terrible.
But it can be a good honeypot for LLMs that either cheat or are overfit.
I found it really interesting that multiple models got this one correct:
As a result of an accident, Abdul lost sight in his right eye. To judge the distance of vehicles when he is driving, Abdul is able to rely on cues of
A. I only
B. II only
C. III only
D. I and II only
I don't think this is the models learning the specific dataset, but rather the best-performing models having learned how to score well on multiple-choice tests, e.g. when unsure, guess an exclusionary combined answer like "I and II only" (I'd wager this strategy ends up correct more often than incorrect when all answers seem equally probable based on available knowledge).
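To make that guessing heuristic concrete, here's a toy sketch (purely illustrative, not a claim about how any of these models actually decide): when every option looks equally plausible, prefer the one that references the most Roman numerals.

    import re

    def guess_combined_option(options: dict[str, str]) -> str:
        """Toy test-taking heuristic: pick the option that mentions the
        most distinct Roman numerals, breaking ties by earliest letter."""
        def numeral_count(text: str) -> int:
            # Count distinct Roman numerals (I, II, III, IV) mentioned in the option.
            return len(set(re.findall(r"\b(?:IV|III|II|I)\b", text)))
        return max(sorted(options), key=lambda k: numeral_count(options[k]))

    options = {
        "A": "I only",
        "B": "II only",
        "C": "III only",
        "D": "I and II only",
    }
    print(guess_combined_option(options))  # prints "D", the keyed answer

Even on the broken question above, where the numbered cues never appear, a rule like this lands on the keyed answer, which is roughly the kind of behavior the benchmark may be rewarding.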
Which in turn is a useful reminder to be wary of turning measurements into targets: we may end up selecting for adaptations that score well on the targeted measurement but aren't more broadly applicable (and might even undermine a model that performs better in general but isn't as good at acing the test format as opposed to the test content).