The current trend of continual improvement in LLM coding ability: solving previously unseen problems, handling larger codebases, operating for longer stretches of time, and better agentic scaffolding.
The reward-verifier compatibility of programming and RL.
Do you have a stronger precedent for that not being the case?
Accelerating. Below is a list of the SOTAs over time (with some slight wiggle room between models of a similar era):
gpt4 | 3/2023
gpt4-turbo | 11/2023
opus3 | 3/2024
gpt4o | 5/2024
sonnet3.5 | 6/2024
o1-preview | 9/2024
o1 | 12/2024
o3-minihigh | 1/2025
gemini2pro | 2/2025
o3 | 4/2025
gemini2.5pro | 4/2025
opus4 | 5/2025
??? | 8/2025
And that's not even counting the miniaturization and democratization of intelligence happening with the smaller models, which has also been impressive.
I'd say this shows that improvements are becoming more frequent.
---
Each wave of models was a significant step above what came previously. One needs only to step back a generation to be reminded of the intelligence differential.
Some notable differences have been o3-mini-high's and gemini2.5's ability to spit out 1-3k LOC (lines of code) with accurate alterations (most of the time).
Though better prompting should generally keep you from needing that, the ability itself is impressive.
The context length combined with gemini 2.5 pro's intelligence is incredible. Loading 20k+ LOC of a project and receiving a targeted code change that implements a perfect update is ridiculous.
The number of dropped imports and instances of improper syntax has dropped dramatically.
I'd say this shows improvements are becoming more impressive.
---
Also note the timespan.
We are only 25 months into the explosion kicked off by GPT-4.
We are only 12 months into the reasoning paradigm.
We have barely scratched the surface of agentic tooling and scaffolding.
There are countless architectural improvements and alternatives in development and research.
Infrastructure buildouts and compute scaling are also chugging along, allowing faster training, faster inference, faster testing, etc. etc.
---
All of this paints a picture of acceleration in both the cadence and the depth of capability gains.
And then there's the reward-verifier compatibility of programming and RL.
Do you have a stronger precedent for that not being the case?
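To make that reward-verifier point concrete, here is a minimal sketch (the function and test names are hypothetical, not any lab's actual training setup) of why programming pairs so naturally with RL: the reward can come from an objective verifier that simply runs the candidate code against tests, with no human judge in the loop.

```python
# Minimal sketch of a verifiable reward for code generation (hypothetical names).
# The point: pass/fail on tests is a cheap, objective signal, which is exactly
# what RL needs from its reward function.

def verifier_reward(candidate_src: str, tests: list[tuple[tuple, object]],
                    func_name: str = "solve") -> float:
    """Return 1.0 if the generated code passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # load the model-generated code
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:         # objective check against hidden tests
                return 0.0
        return 1.0
    except Exception:
        return 0.0                              # crashes count as failure


# Example: scoring one model-proposed completion.
candidate = "def solve(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(verifier_reward(candidate, tests))        # 1.0 -> usable directly as an RL reward
```

Most other domains don't have anything this clean to train against, which is the asymmetry the question is pointing at.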