Anthropic has attributed Claude 3.5 Sonnet's improvements to better training data.
"Which data specifically? Gerstenhaber wouldn’t disclose, but he implied that Claude 3.5 Sonnet draws much of its strength from these training sets."[0]
My guess, which could be completely wrong, is that Anthropic spent more resources on interpretability and it's paying off.
I remember when I first started using activation maps while building image classification models. It was like, what on earth was I doing before this? Just blindly trusting the loss. (Rough sketch of what I mean below.)
How do you discover biases and issues with training data without interpretability?
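For anyone who hasn't tried this, here's a minimal sketch of the kind of activation-map inspection I mean. It assumes PyTorch/torchvision; the pretrained ResNet and the image path are just placeholders, not anything specific to what I was building:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained CNN (placeholder choice; any conv net works).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Capture the output of the last conv stage with a forward hook.
activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.layer4.register_forward_hook(save_activation("layer4"))

# Standard ImageNet preprocessing; "example.jpg" is a placeholder path.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    model(img)

# Average across channels for a coarse spatial activation map:
# bright regions show where the network responded most strongly,
# which quickly surfaces cases where it's keying on the wrong thing.
fmap = activations["layer4"][0]                 # (C, H, W), e.g. (512, 7, 7)
heatmap = fmap.mean(dim=0)                      # (H, W)
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
print(heatmap)  # upsample and overlay on the input image to visualize
```

Even this crude channel-average version (as opposed to something gradient-weighted like Grad-CAM) was enough to catch classifiers latching onto backgrounds instead of subjects.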
Is it really that much better? I'm happy with GPT-4o's coding capabilities and very seldom run into hallucinations or incorrect responses, so I'm curious how much better it can actually be.
Does Anthropic do something like this as well, or is there another reason Claude 3.5 Sonnet is so much better at coding than GPT-4o?