These tiny “state of the art” performance increases are really indicative that the current architecture for LLMs (Transformers + Mixture of Experts) is maxed out, even if you train it more or differently. The writing is all over the walls.
It would not surprise me if this is what has delayed OpenAI in releasing a new model. More than a year after GPT-4, they may well have produced some mega-trained mega-model by now, but running it is so expensive, and its eval improvement over GPT-4 so marginal, that releasing it to the public simply makes no commercial sense just yet.
They may be working on optimizing it to reduce cost, or re-engineering it to improve evals.
These “state of the art” LLMs barely eking out a win aren’t a threat to OpenAI, and they can take their sweet time sharpening the sword that will come down hard on these models.