The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives in themselves, they're convenient proxies.
But they aren't meaningful for anything other than humans, because the correlations between abilities that make them reasonable proxies don't hold for anything else.
The idea that these kinds of test results prove anything (other than the utility of the tested LLM to humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is psychometrically indistinguishable from a human.
(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)
One thing I've noticed LLMs seem to have trouble with is going "off task".
Evaluation scenarios are often very structured, with a defined set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often OK to excellent, but when the test scenario changes, the LLM seems not to recognize the change, or fails miserably.
The Obama pictures were a good example of that: humans could recognize what was going on when the task frame changed, but the AI fell apart.
My friends and I, similarly, often trick LLMs in interactive tasks by going "off script", where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is a human would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.
In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.
If I'm being honest with myself, my guess is a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model but by the framework the model exists within.