Interestingly, when comparing the benchmarks of Experimental 03-25 [1] and Experimental 05-06 [2], the new version seems to score slightly lower on everything except LiveCodeBench.
They likely knew that continued training on code would cause some amount of catastrophic forgetting elsewhere. They didn't throw away the old weights, so this probably isn't sunk cost fallacy. But the model is relatively new, and they presumably found that X% of API token spend was going to coding agents (where X is huge), compared to the token-spend distribution on prior Geminis that couldn't code well. They probably didn't want the complexity and worse batching of running a separate model for coding if the regressions elsewhere weren't too large, and decided they hadn't weighted coding enough initially and that the tradeoff is worth it.
That's clearly not a PR angle they could possibly take when it's replacing the overall SotA model. This is a business decision, potentially inference cost related.
According to the article, "[t]he previous iteration (03-25) now points to the most recent version (05-06)." I assume this applies both to the free-tier gemini-2.5-pro-exp-03-25 in the API (where data is used for training) and to the paid-tier gemini-2.5-pro-preview-03-25.
Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.
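If you want to check which model IDs your API key still resolves, here's a minimal sketch assuming the google-generativeai Python SDK (the "2.5-pro" filter string is just illustrative):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # List the Gemini model IDs visible to this key and see whether the
    # 03-25 identifiers are still exposed or have been aliased away.
    for m in genai.list_models():
        if "2.5-pro" in m.name:
            print(m.name)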
Sounds like they were losing so much money on 2.5 Pro that they pushed a forced update to make it cheaper to run. They can't come out and say "we've made it worse across the board", nor do they want to be the first to actually raise prices, so instead they shipped a bit of a distill that's slightly better at coding, which they can still spin positively.
I'd be surprised if this was a new base model. It sounds like they just did some post-training RL tuning to make this version specifically stronger for coding, at the expense of other priorities.
Every frontier model now is a distill of a larger unpublished model. This could be a slightly smaller distill, potentially with the extra tuning you're mentioning.
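For anyone unfamiliar with the term: distillation here means training a smaller student model to match a larger teacher's output distribution instead of only hard labels. A minimal, illustrative sketch (PyTorch; the tensors are placeholders, nothing Gemini-specific):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the
        # student toward the teacher via KL divergence.
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

    # Placeholder usage: batch of 4 examples over a 32k-token vocabulary.
    student_logits = torch.randn(4, 32000, requires_grad=True)
    teacher_logits = torch.randn(4, 32000)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()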
That's an unsubstantiated claim. I doubt this is true, since people are disproportionately more willing to pay for the best of the best, rather than for something worse.
“Every” is unsubstantiated but probably accurate. Meta has published theirs (Behemoth), and it's clear this is largely how frontier models are being trained and used right now: too slow and expensive for daily-driver inference, but distillable at various levels for different tradeoffs.
DeepSeek-V3 is not a distilled model, which already disproves the "every" claim. And if you happen to have a model that is better than any other available model, it makes no sense not to use it just because it is allegedly "too slow and expensive". Inference speed is far less important than absolute model performance. If inference speed were so important, everyone would use small models. But most people use huge models, the best of the best: GPT-4o, o3, Claude 3.7 Sonnet, Gemini 2.5 Pro. People don't prefer Gemini 2.5 Flash to Gemini 2.5 Pro. And people don't pay for ChatGPT Plus to get more access to faster models; they pay to get access to better, slower models. People want quality from their LLM, not quantity.
Google doesn't pay the Nvidia tax. Their TPUs are designed for Gemini, and Gemini is designed for their TPUs. Google is no doubt paying far less per token than every other AI house.
Yes, it does worse by a fair margin. It requires more instructions and is way too eager to start coding without proper guidance, unlike the 03-25 version. I want that version back.
[1] https://storage.googleapis.com/model-cards/documents/gemini-...
[2] https://deepmind.google/technologies/gemini/