Interestingly, when comparing the benchmarks of Experimental 03-25 [1] and Experimental 05-06 [2], the new version seems to score slightly lower on everything except LiveCodeBench.
They likely knew that continued training on code would cause some amount of catastrophic forgetting elsewhere. They didn't throw away the old weights, so this probably isn't sunk cost fallacy. But the model is relatively new, and they presumably found that X% of API token spend was going to coding agents (where X is huge), compared to the token-spend distribution on prior Geminis that couldn't code well. They probably didn't want the complexity and worse batching of running a separate model for coding if the regressions elsewhere weren't too large, and decided they hadn't weighted coding enough initially and that the tradeoff is worth it.
That's clearly not a PR angle they could possibly take when it's replacing the overall SotA model. This is a business decision, potentially inference cost related.
According to the article, "[t]he previous iteration (03-25) now points to the most recent version (05-06)." I assume this applies both to the free-tier gemini-2.5-pro-exp-03-25 in the API (where data is used for training) and to the paid-tier gemini-2.5-pro-preview-03-25.
Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.
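If you want to check which model IDs your API key still resolves, here's a minimal sketch assuming the google-generativeai Python SDK (the "2.5-pro" filter string is just illustrative):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # List the Gemini model IDs visible to this key and see whether the
    # 03-25 identifiers are still exposed or have been aliased away.
    for m in genai.list_models():
        if "2.5-pro" in m.name:
            print(m.name)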
Sounds like they were losing so much money on 2.5 Pro that they pushed a forced update to make it cheaper to run. They can't come out and say "we've made it worse across the board", nor do they want to be the first to actually raise prices, so instead they shipped a bit of a distill that's slightly better at coding, which they can still spin positively.
I'd be surprised if this was a new base model. It sounds like they just did some post-training RL tuning to make this version specifically stronger for coding, at the expense of other priorities.
Every frontier model now is a distill of a larger unpublished model. This could be a slightly smaller distill, potentially with the extra tuning you're mentioning.
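For anyone unfamiliar with the term: distillation here means training a smaller student model to match a larger teacher's output distribution instead of only hard labels. A minimal, illustrative sketch (PyTorch; the tensors are placeholders, nothing Gemini-specific):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the
        # student toward the teacher via KL divergence.
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

    # Placeholder usage: batch of 4 examples over a 32k-token vocabulary.
    student_logits = torch.randn(4, 32000, requires_grad=True)
    teacher_logits = torch.randn(4, 32000)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()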
That's an unsubstantiated claim. I doubt this is true, since people are disproportionately more willing to pay for the best of the best, rather than for something worse.
“Every” is unsubstantiated but probably accurate. Meta has published theirs (Behemoth), and it's clear this is largely how frontier models are being trained and used right now: too slow and expensive for daily-driver inference, but distillable at various levels for different tradeoffs.
DeepSeek-V3 is not a distilled model, which already disproves the "every" claim. And if you happen to have a model that is better than any other available model, it makes no sense not to use it just because it is allegedly "too slow and expensive". Inference speed is far less important than absolute model performance. If inference speed were so important, everyone would use small models. But most people use huge models, the best of the best: GPT-4o, o3, Claude 3.7 Sonnet, Gemini 2.5 Pro. People don't prefer Gemini 2.5 Flash to Gemini 2.5 Pro. And people don't pay for ChatGPT Plus to get more access to faster models; they pay to get access to better, slower models. People want quality from their LLM, not quantity.
Google doesn't pay the Nvidia tax. Their TPUs are designed for Gemini, and Gemini is designed for their TPUs. Google is no doubt paying far less per token than every other AI house.
Yes, it does worse by a fair margin. It requires more instructions and is way too eager to start coding without proper guidance, unlike the 03-25 version. I want that version back.
[1] https://storage.googleapis.com/model-cards/documents/gemini-...
[2] https://deepmind.google/technologies/gemini/