Interesting tidbit at the very end that's worth noting for anyone using the API today:
> By switching to the new gpt-4o-2024-08-06, developers save 50% on inputs ($2.50/1M input tokens) and 33% on outputs ($10.00/1M output tokens) compared to gpt-4o-2024-05-13.
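For back-of-the-envelope purposes, here's what those savings look like at a given token volume. The older gpt-4o-2024-05-13 rates ($5.00/1M input, $15.00/1M output) are inferred from the quoted percentages rather than stated here, and the monthly volume is made up:

```ts
// Rough cost comparison implied by the quoted savings.
const perMillion = (tokens: number, ratePerMillion: number) =>
  (tokens / 1_000_000) * ratePerMillion;

const usage = { input: 2_000_000, output: 500_000 }; // example monthly volume

const oldCost = perMillion(usage.input, 5.0) + perMillion(usage.output, 15.0); // 2024-05-13 rates
const newCost = perMillion(usage.input, 2.5) + perMillion(usage.output, 10.0); // 2024-08-06 rates

console.log(`old: $${oldCost.toFixed(2)}, new: $${newCost.toFixed(2)}`);
// old: $17.50, new: $10.00  -> roughly 43% cheaper at this input/output mix
```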
I don't think it's been acknowledged widely enough that all of the shortcuts LLMs have been taking to compress/refine/index the attention mechanism seem to result in dumber models.
GPT 4 Turbo was more like GPT 3.9, and GPT 4o is more like GPT 3.7.
I'm building a product that requires complex LLM flows, and out of OpenAI's "cheap" tier models, the old versions of GPT-3.5 Turbo are far better than its latest versions and 4o-mini. I have a number of tasks that the former consistently succeed at and the latter consistently fail at, regardless of prompting.
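A minimal sketch of the kind of per-task regression check I mean, using the openai Node SDK; the task list, the pass() check, and the model snapshot names here are placeholders, not my real suite:

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical tasks: each one is a prompt plus a cheap pass/fail check.
const tasks = [
  { prompt: "Extract the invoice total from: ...", pass: (out: string) => out.includes("42.00") },
  // ...more tasks your product actually depends on
];

const models = ["gpt-3.5-turbo-0613", "gpt-4o-mini"]; // illustrative snapshot names

for (const model of models) {
  let passed = 0;
  for (const task of tasks) {
    const res = await client.chat.completions.create({
      model,
      messages: [{ role: "user", content: task.prompt }],
    });
    if (task.pass(res.choices[0].message.content ?? "")) passed++;
  }
  console.log(`${model}: ${passed}/${tasks.length} tasks passed`);
}
```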
Leaderboards and benchmarks are very misleading as OpenAI is optimizing for them, like in the past when certain CPU manufacturers would optimize for synthetic benchmarks.
FWIW, these aren't chat use cases, for which the newer models may well be better.
They try to gaslight us and tell us this isn't true because of benchmarks, as though anyone has done anything but the latent-space-exploration equivalent of throwing darts at the ocean from space.
It's taken years to get even preliminary reliable decision boundary examples from LLMs because doing so is expensive.
That's OK from the perspective that it makes room for a more capable and expensive GPT-5 model to compete with Opus 3.5 when that arrives this year. A significant price drop for a small loss in quality is a reasonable tradeoff. Then GPT-4o becomes the mid tier and GPT-4o-mini the low tier.
There were 100 days between the releases of Claude 3 Opus and Claude 3.5 Sonnet, which gave us similar capability at an 80% price reduction. When I was using Opus I kept thinking: this is nice, but the cost does add up. Having Sonnet 3.5 so soon after was a nice surprise.
One more round of 80% price cuts after that, combined with building out multi-step agentic workflows, should provide some decent capabilities!
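A quick sanity check of that 80% figure, using Anthropic's published per-million-token prices at the time (Claude 3 Opus: $15 in / $75 out; Claude 3.5 Sonnet: $3 in / $15 out), plus what one more round of the same cut would look like:

```ts
// Published per-1M-token prices (USD) at the time of the two releases.
const opus = { input: 15, output: 75 };
const sonnet35 = { input: 3, output: 15 };

const reduction = 1 - sonnet35.input / opus.input; // same ratio holds for output
console.log(`price reduction: ${(reduction * 100).toFixed(0)}%`); // 80%

// Hypothetical: one more round of the same cut.
const nextRound = {
  input: sonnet35.input * (1 - reduction),
  output: sonnet35.output * (1 - reduction),
};
console.log(nextRound); // { input: 0.6, output: 3 } per 1M tokens
```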
Am I the only one who wants to know 1,000% *WHY* this happens?
Is it a natural function of how models evolve?
Is it engineered as such? Why? Marketing/money/resources/what?
WHO makes these decisions and why?
---
I have been building a thing with a Claude 3.5 Pro account and it's an *utter fn garbage* experience.
It lies, hallucinates, malevolently changes code it was already told was correct, removes features, and explicitly ignores project files. It has no search and no line numbers, and so much screen real estate is consumed by useless empty space. It ignores stated style guides. It gets CAUGHT forgetting about a premise we were actively working on, then condescendingly apologizes: "oh you're correct - I should have been using XYZ knowledge".
It makes things FN harder to learn.
If only I had some Claude engineers sitting in the room watching what a POS service it is from a project-continuity point of view...
It's evil. It actively f's things up.
One should have the ability to CHARGE the model token credits when it f's up this badly.
NO FN SEARCH??? And when asked for line numbers in its output - it's just plain text...
Seriously, I don't just want a refund; I want Claude to pay me for my time correcting its mistakes.
ChatGPT does the same thing. It forgets things committed to memory, refactors successful code back out of files, etc.
It's been a really eye-opening and frustrating experience, and my squinty looks suggest it's specifically intentional:
They don't want people using a $20/month AI plan to actually be able to do any meaningful work and build a product.
It is difficult to get AI models to get everything right every time. I noticed too that they would sometimes remove comments etc. when rewriting code.
The way to get better results is with agentic workflows that break down the task into smaller steps, so the models can iteratively converge on a correct result.
One important step I added to mine is a review step (in the reviewChanges.ts file) in my workflow at https://github.com/TrafficGuard/nous/blob/main/src/swe/codeE...
This gets the diff and asks questions like:
- Are there any redundant changes in the diff?
- Was any code removed in the changes which should not have been?
- Review the style of the code changes in the diff carefully against the original code.
Maybe try using that, or Aider (https://aider.chat/), the package I use that does the actual code edits.
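For illustration, here's a standalone sketch of that kind of review step - not the actual reviewChanges.ts from the repo above - using the openai Node SDK and a plain git diff as stand-ins for whatever your workflow already has:

```ts
import { execSync } from "node:child_process";
import OpenAI from "openai";

const client = new OpenAI();

const reviewQuestions = [
  "Are there any redundant changes in the diff?",
  "Was any code removed in the changes which should not have been?",
  "Review the style of the code changes in the diff carefully against the original code.",
];

export async function reviewChanges(baseRef = "HEAD"): Promise<string> {
  // Collect the working-tree diff that the coding step just produced.
  const diff = execSync(`git diff ${baseRef}`, { encoding: "utf8" });

  const prompt = [
    "You are reviewing a proposed code change. The diff follows.",
    "<diff>",
    diff,
    "</diff>",
    "Answer each question and list any concrete fixes required:",
    ...reviewQuestions.map((q, i) => `${i + 1}. ${q}`),
  ].join("\n");

  const res = await client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    messages: [{ role: "user", content: prompt }],
  });

  // The answer gets fed back into the edit loop so the model can fix its own diff.
  return res.choices[0].message.content ?? "";
}
```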
Also, is it a coincidence that a cheaper (potentially faster?) model has been released (just) before they roll out the "new" voice mode (which boasts very low latency)?
For the record, you should never use that in an application. Always explicitly pin the full versioned model name. This prevents bad surprises, because not every new version is an improvement; sometimes they get worse, especially at specific tasks.
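For example, with the openai Node SDK that just means passing the dated snapshot rather than the floating alias (the snapshot name here is the one from the announcement above):

```ts
import OpenAI from "openai";

const client = new OpenAI();

const res = await client.chat.completions.create({
  model: "gpt-4o-2024-08-06", // pinned snapshot, not the floating "gpt-4o" alias
  messages: [{ role: "user", content: "ping" }],
});

console.log(res.choices[0].message.content);
```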