The current trend of continual improvement in LLM coding ability: solving previously unseen problems, handling larger codebases, operating for longer stretches of time, and better agentic scaffolding.
The reward-verifier compatibility of programming and RL.
Do you have a stronger precedent for that not being the case?
Accelerating. Below is a list of the SOTAs over time (with some slight wiggle room between models of a similar era):
gpt4 | 3/2023
gpt4-turbo | 11/2023
opus3 | 3/2024
gpt4o | 5/2024
sonnet3.5 | 6/2024
o1-preview | 9/2024
o1 | 12/2024
o3-minihigh | 1/2025
gemini2pro | 2/2025
o3 | 4/2025
gemini2.5pro | 4/2025
opus4 | 5/2025
??? | 8/2025
And that's not even counting the miniaturization and democratization of intelligence happening with the smaller models, which has also been impressive.
I'd say this shows that improvements are becoming more frequent.
---
Each wave of models was a significant step above what came previously. One needs only to step back a generation to be reminded of the intelligence differential.
Some notable differences have been o3-mini-high's and gemini2.5's ability to spit out 1-3k LOC (lines of code) with accurate alterations (most of the time).
Though better prompting should generally keep you from needing that, the ability itself is impressive.
The context length combined with gemini 2.5 pro's intelligence is incredible. Loading 20k+ LOC of a project and receiving a targeted code change that implements a perfect update is ridiculous.
The number of dropped imports and instances of improper syntax has dropped dramatically.
I'd say this shows improvements are becoming more impressive.
---
Also note the timespan.
We are only 25 months into the explosion kicked off by GPT-4.
We are only 12 months into the reasoning paradigm.
We have barely scratched the surface of agentic tooling and scaffolding.
There are countless architectural improvements and alternatives in development and research.
Infrastructure buildouts and compute scaling are also chugging along, allowing faster training, faster inference, faster testing, etc. etc.
---
All of this paints a picture of acceleration in both the cadence and the depth of capability gains.
And then there's the reward-verifier compatibility of programming and RL.
Do you have a stronger precedent for that not being the case?
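To make that reward-verifier point concrete, here is a minimal sketch (the function and test names are hypothetical, not any lab's actual training setup) of why programming pairs so naturally with RL: the reward can come from an objective verifier that simply runs the candidate code against tests, with no human judge in the loop.

```python
# Minimal sketch of a verifiable reward for code generation (hypothetical names).
# The point: pass/fail on tests is a cheap, objective signal, which is exactly
# what RL needs from its reward function.

def verifier_reward(candidate_src: str, tests: list[tuple[tuple, object]],
                    func_name: str = "solve") -> float:
    """Return 1.0 if the generated code passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # load the model-generated code
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:         # objective check against hidden tests
                return 0.0
        return 1.0
    except Exception:
        return 0.0                              # crashes count as failure


# Example: scoring one model-proposed completion.
candidate = "def solve(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(verifier_reward(candidate, tests))        # 1.0 -> usable directly as an RL reward
```

Most other domains don't have anything this clean to train against, which is the asymmetry the question is pointing at.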