This is an incredibly vague essay. Let me be more explicit: I think this is a clear sign of a bubble. LLMs are very cool technology, but they are not the second coming. An LLM can't do experiments; it doesn't have an imagination; it doesn't have an ethical framework; it's not an agent in any human sense.
LLMs are awesome but I haven't felt significant improvement since the original GPT-4 (only in speed).
The reasoning models (o1 pro) don't show good reasoning capability when I ask them things, so I don't expect o3 to be significantly better in practice even if it looks good on the benchmarks.
Still, I think the ARC-AGI benchmark is awesome, and the fact that they are targeting reasoning is a good direction (I just think they need to research more techniques / theories).
Sonnet 3.6 (the 2024-10-22 release of Sonnet 3.5) is head and shoulders above GPT-4 and anyone who has been using both regularly can attest to this fact.
Reasoning models do reason quite well but you need the right problems to ask them. Don't throw open-ended problems at them. They perform well on problems with one (or many) correct solution(s). Code is a great example - o1 has fixed tricky code bugs for me where Sonnet and other GPT-4 class models have failed.
LLMs are still leaky abstractions - as the user, you need to know when and how to use them. This, I think, will get fixed in the next 1-2 years. For now, there's no substitute for hands-on time using these weird tools. But the effort is well worth it.
I’d argue that most coding problems have one truly correct solution and many many many half correct solutions.
I personally have not found AI coding assistance very helpful, but from blog posts by people who do, much of the code I see from Claude is very barebones HTML templates and small scripts which call out to existing npm packages. Not really reasoning or problem solving per se.
I’m honestly curious to hear what tricky code bugs Sonnet has helped you solve.
It’s led me down several incorrect paths, one of which actually burned me at work.
> LLMs are awesome but I haven't felt significant improvement since the original GPT-4 (only in speed).
Taking the outside view here - maybe you don't "feel" like it's getting better. But benchmarks aside, there are now plenty of anecdotal stories of scientists and mathematicians using them for actual work. Sometimes for simple labor-saving, but some stories of actually creative work that is partially/wholly based on interactions with LLMs. This is on top of many, many people using this for things like software development, and claiming that they get significant benefits out of these models.
> LLMs are awesome but I haven't felt significant improvement since the original GPT-4 (only in speed).
Absolutely disagree. Are you using LLMs for coding? There has been a 10x (or whatever) improvement since GPT-4.
I casually tracked the ability of LLMs to create a processor design in an HDL since 2023. I stopped in June of 2024, because Sonnet would basically one-shot the CPU, testbench and emulator. There was another substantial update of Sonnet in October 2024.