We do the same automatically for our research (all requests go to o1, Sonnet, and Gemini, and we store the results to compare later): Claude always wins, even with model-specific prompting on both platforms. For frontend work especially, o1 seems genuinely terrible.
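For anyone curious what that kind of setup looks like: it can be as simple as fanning the same prompt out to each provider's SDK and appending the replies to a log for later comparison. A minimal sketch, assuming current Python SDKs for each provider; the model names and the output filename are placeholders, not what the commenter actually runs:

```python
import json
import time

from openai import OpenAI              # pip install openai
import anthropic                       # pip install anthropic
import google.generativeai as genai    # pip install google-generativeai

# Assumed model identifiers; substitute whichever versions you test.
openai_client = OpenAI()               # reads OPENAI_API_KEY from the env
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
genai.configure(api_key="YOUR_GOOGLE_API_KEY")
gemini = genai.GenerativeModel("gemini-2.0-flash-exp")

def ask_all(prompt: str) -> dict:
    """Send the same prompt to o1, Sonnet, and Gemini; log all replies."""
    results = {"prompt": prompt, "ts": time.time()}

    resp = openai_client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": prompt}],
    )
    results["o1"] = resp.choices[0].message.content

    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    results["sonnet"] = msg.content[0].text

    results["gemini"] = gemini.generate_content(prompt).text

    # Append to a JSONL file so the answers can be diffed/rated later.
    with open("model_comparison.jsonl", "a") as f:
        f.write(json.dumps(results) + "\n")
    return results
```

In practice you'd add error handling and probably run the three calls concurrently, but the "store the results for later" part really is just an append-only JSONL log.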
Every time I try Gemini, it's really subpar. I found that qwen2.5-coder-32b-instruct can be better.
Also, for me it's about 50/50 between Sonnet and o1, though I'm not certain: from my brief testing, o1 seems better with longer, more complicated (C++) code and with debugging. OpenAI models also tend to be more verbose, which is sometimes helpful (e.g., when I want extra explanation of the fields chosen in a SQL schema) and sometimes too much.
EDIT: Just asked both o1 and Sonnet 3.5 the same QML coding question; Sonnet 3.5 succeeded where o1 failed.
Very anecdotal, but I’ve found that for tasks that are well spec’d out with a good prompt, Sonnet 3.5 is far better. For problems where I might have introduced a subtle logical error, o1 seems to catch it extremely well. So better reasoning may be occurring, but reasoning is only a small part of what we’d consider intelligence.
Exactly. The previous version of o1 actually did worse on the coding benchmarks, so I would expect it to be worse in real-life scenarios too.
The new version released a few days ago, on the other hand, scores better on the benchmarks, so it seems strange that someone would use it and say it's worse than Claude.
Wins? What does that mean? Do you have any results? I see claims that Claude is better for coding a lot, but having used it alongside Gemini 2.0 Flash and o1, it sure doesn't seem that way to me.