Hi HN! We worked at OpenAI and Anthropic, and we believe that fine-tuning an LLM on your codebase can give much higher quality code generation than Sonnet-3.5 or o1 without fine-tuning. Let me know if you are interested and we can fine-tune a model for you for free so you can test it.
I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.3.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.
I agree, we need to post more data. Since we are very early (<1 month), we have only shared the initial results. The Discourse repo was just a good option since it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.
Could be done in the future. Our current focus is highest accuracy. But there are no limitations on the models - just would depend on user preference of size/performance tradeoff.
What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.
Good point, we should provide more detailed metrics. Since we are very early, we are focusing on what we see as the main metric: higher accuracy of changes, so that they are more practically usable. We will do more testing on overfitting and on how the model performs on different types of tasks. At a high level, we believe in the idea that "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.
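To make the accuracy-versus-recall concern above concrete, here is a minimal sketch in Python. The labels are made-up toy data (not results from any model in this thread), and the helper name `classification_metrics` is hypothetical. It only shows how a predictor can score high accuracy on imbalanced data while having zero recall.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when there are no predicted/actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy, hypothetical data: 5 positives out of 100 examples,
# and a predictor that always answers "negative".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

result = classification_metrics(y_true, y_pred)
print(result)
```

On this toy data the always-negative predictor is right 95 times out of 100 yet finds none of the positives, which is why a single accuracy number can be misleading without the other metrics next to it.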
I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try it themselves against SOTA models.
It seems like an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it’s effective to 10x or 100x the fine-tuning dataset by having a larger reasoning model create documentation and reasoning CoTs about the codebase’s current state, plus speculation about future state updates. Maybe have it output some verbose execution flow analysis.
Thank you for the idea! We are also considering upsampling and distillation. But at a high level, correctly setting up the data for simple fine-tuning can already produce great results.
Yes, we fine-tune for each codebase. For now we are focusing on larger enterprise codebases that would: 1. benefit from fine-tuning the most. 2. have the budget to pay us for the service.
For smaller projects that are price-sensitive we are probably not a good fit at this point.
DeepSeek is an amazing product but has a few issues:
1. Data is used for training
2. The context window is rather small and doesn't fit large codebases well
I keep saying this over and over in all the content I create: the value of coding with AI will come from working on big, complex, legacy codebases. Not from flashy demos where you create a to-do app.
For that you need solid models with big context and private inference.
Shouldn't we be comparing with other open-source models? In particular, since this is about llama3.3, they have the exact same context limit, which is 128k [1]. Also
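For anyone wanting a quick sanity check on whether a codebase even comes close to a 128k window, here is a rough sketch in Python. Everything in it is an assumption for illustration: the 4-characters-per-token ratio is a crude heuristic (real tokenizers vary by model and language), 128k is taken as 128,000 tokens, and the function names and file extensions are hypothetical.

```python
import os

# Assumption: roughly 4 characters per token. Real tokenizers differ.
CHARS_PER_TOKEN = 4
# Assumption: the "128k" limit quoted in the thread, taken as 128,000 tokens.
CONTEXT_LIMIT_TOKENS = 128_000


def estimate_tokens(text):
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def estimate_repo_tokens(root, extensions=(".py", ".rb", ".js")):
    """Sum rough token estimates over source files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total += estimate_tokens(f.read())
            except OSError:
                # Skip unreadable files rather than abort the whole estimate.
                continue
    return total


def fits_in_context(token_count, limit=CONTEXT_LIMIT_TOKENS):
    """True if the estimated token count fits within the context limit."""
    return token_count <= limit


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens} tokens, fits in context: {fits_in_context(tokens)}")
```

This is only an order-of-magnitude check. For a real comparison you would count tokens with the tokenizer of the specific model you are evaluating.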
Talk is cheap, benchmarks please. Also, why did you decide on Llama? AFAIK DeepSeek has always had a slight edge over Llama when it comes to coding performance, or is this no longer the case?
Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option, but as we progress we will test other open-source models too, like DeepSeek.
I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?