
Hi HN! We worked at OpenAI and Anthropic and believe we can provide much higher quality code generation by fine-tuning an LLM on your codebase than you would get from Sonnet-3.5 or o1 without fine-tuning. Let me know if you are interested and we can fine-tune on your codebase for free so you can test it.
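(For context: this is not our production pipeline, just a rough sketch of what per-repo LoRA fine-tuning can look like with off-the-shelf Hugging Face tooling. The model name, file glob, and hyperparameters below are placeholders, and exact arguments vary by trl version.)

    # Rough sketch only: LoRA fine-tuning a base model on one repo's files.
    # Model name, glob, and hyperparameters are illustrative placeholders.
    from pathlib import Path
    from datasets import Dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    repo_files = [p.read_text(errors="ignore") for p in Path("path/to/repo").rglob("*.py")]
    dataset = Dataset.from_dict({"text": repo_files})

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B",      # placeholder base checkpoint
        train_dataset=dataset,
        args=SFTConfig(output_dir="repo-lora"),
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()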





I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.3.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.

I agree, we need to post more data. Since we are very early (<1 month in) we have only shared the initial results. The Discourse repo was just a good option since it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.

Any plans on distilling it down to an 8b model to enable it for pure local usage on most consumer hardware?

It could be done in the future. Our current focus is the highest accuracy. But there are no limitations on the models; it would just depend on the user's preferred size/performance tradeoff.

I would be interested in a fine-tune on OpenZFS:

https://github.com/openzfs/zfs


Thank you for the suggestion, we will take a look!

Interested! Our large Rust code base at https://zed.dev is open-source at https://github.com/zed-industries/zed and I'd be curious to try this out on it.

My email is richard at our website's domain if you'd like to get in touch!


Looks like a great repo to try the fine-tuning! I will email you, thanks!

What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.

Good point, we should provide more detailed metrics. Since we are very early, we focus on what we see as the main metric: higher accuracy of changes, which makes the model more practically usable. We will do more testing on overfitting and on how the model performs on different types of tasks. At a high level we believe in the idea that "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.

Some more details that programmers can inspect would be very useful.

I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try it themselves against SOTA models.

It seems like an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it's effective to 10x or 100x the fine-tune dataset by having a larger reasoning model create documentation and reasoning CoTs about the codebase's current state and speculation about future state updates. Maybe have it output some verbose execution flow analysis.

Thank you for the idea! We are also considering upsampling and distillation. But at a high level, correctly setting up the data for simple fine-tuning can already produce great results.
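To make the suggestion above concrete, here is a minimal sketch of that kind of augmentation (the model name and prompt are placeholders, not something we've committed to):

    # Sketch: ask a larger reasoning model to write documentation / execution-flow
    # analysis for each file, and keep the output as extra fine-tuning text.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    augmented = []

    for path in Path("path/to/repo").rglob("*.py"):
        source = path.read_text(errors="ignore")[:20_000]  # naive truncation
        resp = client.chat.completions.create(
            model="o1",  # stand-in for "a larger reasoning model"
            messages=[{
                "role": "user",
                "content": f"Explain the design and execution flow of {path.name}, "
                           f"step by step:\n\n{source}",
            }],
        )
        augmented.append({"file": str(path), "text": resp.choices[0].message.content})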

I am getting quite deep into coding with AI, and the cost of tokens is indeed a bit of an issue.

It's a trivial issue for me because it saves me A LOT of time, but it could be an issue for new people testing it.

I would love to test this approach. Are you guys fine-tuning for each codebase?


Yes, we fine-tune for each codebase. Right now we are focusing on larger enterprise codebases that would (1) benefit from fine-tuning the most and (2) have the budget to pay us for the service. For smaller, price-sensitive projects we are probably not a good fit at this point.

>>cost of tokens is a bit of an issue indeed

Their cost is $0.70 per 1M tokens.

DeepSeek is $0.14 per 1M tokens (cache miss).
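Back of the envelope, assuming ~20M tokens a month of heavy assisted coding (the volume is a made-up number):

    # Made-up monthly volume, just to put the per-token prices in perspective.
    monthly_tokens = 20_000_000
    print(f"fine-tuned service:    ${monthly_tokens / 1e6 * 0.70:.2f}")  # $14.00
    print(f"DeepSeek (cache miss): ${monthly_tokens / 1e6 * 0.14:.2f}")  # $2.80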


DeepSeek is an amazing product but has a few issues:

1. Data is used for training

2. The context window is rather small and doesn't fit large codebases as well

I keep saying this over and over in all the content I create: the value of coding with AI will come from working on big, complex, legacy codebases, not from flashy demos where you create a to-do app.

For that you need solid models with big context and private inference.


DeepSeek is open source and has a context length of 128k tokens.

The commercial service has a context of 64k tokens, which I find quite limiting.

https://api-docs.deepseek.com/quick_start/pricing

Running it locally is quite a bit beyond the scope of being productive while coding with AI.

Besides that, 128k is still significantly less than Claude's.
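If you want a quick check on whether a given repo even fits, you can estimate its token count (cl100k_base is just a rough proxy; each model's own tokenizer will count differently):

    # Rough token count for a repo, to compare against 64k / 128k / 200k windows.
    from pathlib import Path
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    total = sum(
        len(enc.encode(p.read_text(errors="ignore"), disallowed_special=()))
        for p in Path("path/to/repo").rglob("*.py")
    )
    print(f"{total:,} tokens vs 64k / 128k / 200k context windows")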


Shouldn't we be comparing with other open-source models? In particular, since this is about Llama 3.3, it has the exact same context limit of 128k [1].

[1] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
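You can also read that limit straight from the model config (assuming you have access to the gated repo):

    # Reads the advertised context window from the Hub config; requires having
    # accepted the license for the gated meta-llama repo.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
    print(cfg.max_position_embeddings)  # 131072, i.e. 128k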


Why?

When using a model to be more effective as a developer, I don't particularly care if the model is open source or closed source.

I would love to use open source models as well, but the convenience of just plugging into an API endpoint is unbeatable.


I'm interested. I submitted my email to your landing page form.

Thank you! Will email you within a couple of days :)

Talk is cheap, benchmarks please. Also, why did you decide on Llama? AFAIK DeepSeek has always had a slight edge over Llama when it comes to coding performance, or is this no longer the case?

Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option, but as we progress we will test other open-source models too, like DeepSeek.

Training and running fine-tunes of DeepSeek could get expensive.

I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?


