The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives in themselves, they're convenient proxies.
But they aren't meaningful for anything other than humans, because the correlations between abilities that make them reasonable proxies don't hold for anything else.
The idea that these kinds of test results prove anything (other than the utility of the tested LLM to humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is psychometrically indistinguishable from a human.
(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)
One thing I've noticed LLMs seem to have trouble with is going "off task".
Evaluation scenarios are often very structured, with a defined set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often OK to excellent, but when the test scenario changes, the LLM seems not to recognize the change, or fails miserably.
The Obama pictures were a good example of that: humans could recognize what was going on when the task frame changed, but the AI fell apart.
My friends and I, similarly, often trick LLMs in interactive tasks by going "off script", where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is a human would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.
In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.
If I'm being honest with myself, my guess is a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model but by the framework the model exists within.