
Hi HN! We worked at OpenAI and Anthropic and believe we can provide much higher quality code generation by fine-tuning an LLM on your codebase than you would get from Sonnet-3.5 or o1 without fine-tuning. Let me know if you are interested and we can fine-tune on your codebase for free so you can test it.
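(For context: this is not our production pipeline, just a rough sketch of what per-repo LoRA fine-tuning can look like with off-the-shelf Hugging Face tooling. The model name, file glob, and hyperparameters below are placeholders, and exact arguments vary by trl version.)

    # Rough sketch only: LoRA fine-tuning a base model on one repo's files.
    # Model name, glob, and hyperparameters are illustrative placeholders.
    from pathlib import Path
    from datasets import Dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    repo_files = [p.read_text(errors="ignore") for p in Path("path/to/repo").rglob("*.py")]
    dataset = Dataset.from_dict({"text": repo_files})

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B",      # placeholder base checkpoint
        train_dataset=dataset,
        args=SFTConfig(output_dir="repo-lora"),
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()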





I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.3.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.

I agree, we need to post more data. Since we are very early (<1 month in) we have only shared the initial results. The Discourse repo was just a good option since it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.

Any plans on distilling it down to an 8b model to enable it for pure local usage on most consumer hardware?

It could be done in the future. Our current focus is the highest accuracy. But there are no limitations on the models; it would just depend on the user's preferred size/performance tradeoff.

I would be interested in a fine-tune on OpenZFS:

https://github.com/openzfs/zfs


Thank you for the suggestion, we will take a look!

Interested! Our large Rust code base at https://zed.dev is open-source at https://github.com/zed-industries/zed and I'd be curious to try this out on it.

My email is richard at our website's domain if you'd like to get in touch!


Looks like a great repo to try the fine-tuning! I will email you, thanks!

What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.

Good point, we should provide more detailed metrics. Since we are very early, we focus on what we see as the main metric: higher accuracy of changes, which makes the model more practically usable. We will do more testing on overfitting and on how the model performs on different types of tasks. At a high level we believe in the idea that "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.

Some more details that programmers can inspect would be very useful.

I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try it themselves against SOTA models.

It seems like an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it's effective to 10x or 100x the fine-tune dataset by having a larger reasoning model create documentation and reasoning CoTs about the codebase's current state and speculation about future state updates. Maybe have it output some verbose execution flow analysis.

Thank you for the idea! We are also considering upsampling and distillation. But at a high level, correctly setting up the data for simple fine-tuning can already produce great results.
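To make the suggestion above concrete, here is a minimal sketch of that kind of augmentation (the model name and prompt are placeholders, not something we've committed to):

    # Sketch: ask a larger reasoning model to write documentation / execution-flow
    # analysis for each file, and keep the output as extra fine-tuning text.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    augmented = []

    for path in Path("path/to/repo").rglob("*.py"):
        source = path.read_text(errors="ignore")[:20_000]  # naive truncation
        resp = client.chat.completions.create(
            model="o1",  # stand-in for "a larger reasoning model"
            messages=[{
                "role": "user",
                "content": f"Explain the design and execution flow of {path.name}, "
                           f"step by step:\n\n{source}",
            }],
        )
        augmented.append({"file": str(path), "text": resp.choices[0].message.content})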

I am getting quite deep into coding with AI, and the cost of tokens is indeed a bit of an issue.

It's a trivial issue for me because it saves me A LOT of time, but it could be an issue for new people testing it.

I would love to test this approach. Are you guys fine-tuning for each codebase?


Yes, we fine-tune for each codebase. Right now we are focusing on larger enterprise codebases that would (1) benefit from fine-tuning the most and (2) have the budget to pay us for the service. For smaller, price-sensitive projects we are probably not a good fit at this point.

>>cost of tokens is a bit of an issue indeed

Their cost is $0.70 per 1M tokens.

DeepSeek is $0.14 per 1M tokens (cache miss).
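Back of the envelope, assuming ~20M tokens a month of heavy assisted coding (the volume is a made-up number):

    # Made-up monthly volume, just to put the per-token prices in perspective.
    monthly_tokens = 20_000_000
    print(f"fine-tuned service:    ${monthly_tokens / 1e6 * 0.70:.2f}")  # $14.00
    print(f"DeepSeek (cache miss): ${monthly_tokens / 1e6 * 0.14:.2f}")  # $2.80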


DeepSeek is an amazing product but has a few issues:

1. Data is used for training

2. The context window is rather small and doesn't fit large codebases as well

I keep saying this over and over in all the content I create: the value of coding with AI will come from working on big, complex, legacy codebases, not from flashy demos where you create a to-do app.

For that you need solid models with big context and private inference.


DeepSeek is open source and has a context length of 128k tokens.

The commercial service has a context of 64k tokens, which I find quite limiting.

https://api-docs.deepseek.com/quick_start/pricing

Running it locally is quite a bit beyond the scope of being productive while coding with AI.

Besides that, 128k is still significantly less than Claude's.
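If you want a quick check on whether a given repo even fits, you can estimate its token count (cl100k_base is just a rough proxy; each model's own tokenizer will count differently):

    # Rough token count for a repo, to compare against 64k / 128k / 200k windows.
    from pathlib import Path
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    total = sum(
        len(enc.encode(p.read_text(errors="ignore"), disallowed_special=()))
        for p in Path("path/to/repo").rglob("*.py")
    )
    print(f"{total:,} tokens vs 64k / 128k / 200k context windows")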


Shouldn't we be comparing with other open-source models? In particular, since this is about Llama 3.3, it has the exact same context limit of 128k [1].

[1] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
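You can also read that limit straight from the model config (assuming you have access to the gated repo):

    # Reads the advertised context window from the Hub config; requires having
    # accepted the license for the gated meta-llama repo.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
    print(cfg.max_position_embeddings)  # 131072, i.e. 128k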


Why?

When using a model to be more effective as a developer, I don't particularly care if the model is open source or closed source.

I would love to use open source models as well, but the convenience of just plugging into an API endpoint is unbeatable.


I'm interested. I submitted my email to your landing page form.

Thank you! Will email you within a couple of days :)

Talk is cheap, benchmarks please. Also, why did you decide on Llama? AFAIK DeepSeek has always had a slight edge over Llama when it comes to coding performance, or is this no longer the case?

Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option, but as we progress we will test other open-source models too, like DeepSeek.

Training and running fine-tunes of DeepSeek could get expensive.

I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?


