We fine-tuned Llama and got 4.2x Sonnet 3.5 accuracy for code generation

tkgally · 2025-01-02T13:56:23 1735826183

> Our team is ex-OpenAI, Anthropic, and Asana research scientists and AI engineers

The page includes the logos of those companies. Is it normal to do that for companies one used to work for?

TrackerFF · 2025-01-02T15:00:23 1735830023

I've seen plenty of startups who's single-page includes the pitch, and the founders section which has big logos of their alma mater (almost always Ivy leaguers + Stanford) and whatever FAANG or consulting job gig they had previously.

Easier to land investors and customers if you have the "correct" pedigree.

Aurornis · 2025-01-02T15:24:22 1735831462

Private pitch decks aren’t the same as public product websites.

It seems this team is using the old scammy marketing trick of using logos of every company you can claim any possible relationship with as a way of building trust. Plastering giant logos on a product website implies some endorsement or affiliation to most casual readers. It’s not until you read all of the text that you realize this is just a list of companies they worked at.

This is the kind of behavior that earns a sternly worded letter from corporate council. You shouldn’t expect to be able to leave a company and then put their logo on your product page.

samatdav · 2025-01-03T04:23:10 1735878190

Good point! We are just very early and our experience is our main selling point. We plan to remove it.

tucnak · 2025-01-02T14:09:15 1735826955

No it's not! These are probably imposters anyway. Apparently, you can buy HN upvotes... I find it hard to believe that honest researchers from frontier labs would behave like crypto scammers.

uxhacker · 2025-01-02T14:28:02 1735828082

It’s also weird that there is no about page naming the founders.

Lerc · 2025-01-02T15:19:11 1735831151

https://x.com/karpenoid/status/1670723794544263170

https://x.com/karpenoid/status/1873281722613400002

This might be linked.

zwaps · 2025-01-02T16:01:42 1735833702

It says they WILL fine tune a model.

Sounds fishy

samatdav · 2025-01-03T04:25:12 1735878312

At Asana we did not do any fine-tuning because it was too complicated even for our AI org of 40 engineers. We believe we can do it by setting up and cleaning data correctly.

tucnak · 2025-01-02T18:54:23 1735844063

There's a reason why you don't see frontier-grade AI researchers throwing around meaningless numbers to go with the most layman idea of a product in the field imaginable. The whole thing stinks. I reckon this is some kind of extortion scam intended to trick people into compromising IP.

samatdav · 2025-01-03T04:28:24 1735878504

I understand the concern but we don't need anyone's IP. Unfortunately, it is hard to provide fine-tuning solution without access to the codebase. We just think that using a large general-purpose model for a highly specific codebase with a lot of internal frameworks is not the best solution and want to try to improve it.

banddk · 2024-12-29T13:07:04 1735477624

Hi HN! We worked at OpenAI and Anthropic and believe we can provide much higher quality code generation by fine-tuning an LLM on your codebase compared to Sonnet-3.5 or o1 but not fine-tuned. Let me know if you are interested and we can fine-tune for you for free to test.

dr_kiszonka · 2024-12-30T17:13:01 1735578781

I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.3.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.

samatdav · 2025-01-03T04:30:47 1735878647

I agree, we need to post more data. Since we are very early (<1 month) we just shared the initial results. Discourse repo was just a good option since it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.

doubtfuluser · 2025-01-02T12:40:29 1735821629

Any plans on distilling it down to an 8b model to enable it for pure local usage on most consumer hardware?

samatdav · 2025-01-03T04:32:53 1735878773

Could be done in the future. Our current focus is highest accuracy. But there are no limitations on the models - just would depend on user preference of size/performance tradeoff.

ryao · 2025-01-02T14:07:08 1735826828

I would be interested in a fine tune on OpenZFS:

https://github.com/openzfs/zfs

samatdav · 2025-01-03T04:35:20 1735878920

Thank you for the suggestion, we will take a look!

rtfeldman · 2025-01-02T17:47:23 1735840043

Interested! Our large Rust code base at https://zed.dev is open-source at https://github.com/zed-industries/zed and I'd be curious to try this out on it.

My email is richard at our website's domain if you'd like to get in touch!

samatdav · 2025-01-03T04:34:10 1735878850

Looks like a great repo to try the fine-tuning! I will email you, thanks!

redman25 · 2025-01-02T14:38:37 1735828717

What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.

samatdav · 2025-01-03T04:39:49 1735879189

Good point, we should provide more detailed metrics. Since we are very early, we focus on the main metric in our view: higher accuracy of changes to be more practically usable. We will do more testing on overfitting and how the model performance on different types of tasks. On high level we believe in the idea of "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.

pdimitar · 2025-01-02T12:06:04 1735819564

Some more details that programmers can inspect would be very useful.

samatdav · 2025-01-03T04:41:57 1735879317

I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try themselves agains SOTA models.

futureshock · 2025-01-02T14:30:27 1735828227

It seems an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it’s effective to 10x or 100x the fine-tune dataset by having a larger reasoning model create documentation and reasoning COTs about the code base’s current state and speculation about future state updates. Maybe have it output some verbose execution flow analysis.

samatdav · 2025-01-03T04:43:38 1735879418

Thank you for the idea! We are also considering upsampling and distillation. But on high level, correctly setting up the data for simple fine-tuning can already produce great results.

siscia · 2025-01-02T12:21:41 1735820501

I am getting quite deep into coding with AI and cost of tokens is a bit of an issue indeed.

Trivial issue because it saves me A LOT of time, but it could be an issue for new people testing it.

I would love to test this approach. Are you guys fine tuning for each codebase?

samatdav · 2025-01-03T04:46:05 1735879565

Yes, we fine-tune for each codebase. Now we are focusing on larger enterprise codebases that would: 1. benefit from the fine-tuning the most. 2. have the budget to pay us for the service. For smaller projects that are price-sensitive we are probably not a good fit at this point.

manishsharan · 2025-01-02T12:57:33 1735822653

>>cost of tokens is a bit of an issue indeed

Their cost is $0.7 per 1M token.

DeepSeek is $0.14 / 1M tokens ( cache miss)

siscia · 2025-01-02T13:15:09 1735823709

DeepSeek is an amazing product but has few issues:

1. Data is used for training

2. Context window is rather small and doesn't fit as well large codebase

I keep saying this over and over in all the content I create, the valu of coding with AI will come from working on big, complex, legacy codebases. Not from flashy demo where you create a to-do app.

For that you need solid models with big context and private inference.

MacsHeadroom · 2025-01-02T13:20:33 1735824033

DeepSeek is open source and has a context length of 128k tokens.

siscia · 2025-01-02T13:30:07 1735824607

Commercial service have a context of 64k tokens, which I find quite limiting.

https://api-docs.deepseek.com/quick_start/pricing

Running it locally is quite a bit beyond the scope of being productive while coding with AI.

Beside that 128k is still significantly less than Claude

elashri · 2025-01-02T13:52:39 1735825959

Shouldn't we be comparing with other open source model? In particular since this is about llama3.3 then they have the exact context limit which is 128k [1]. Also

[1] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

siscia · 2025-01-02T14:18:37 1735827517

Why?

Whenever using a model to be more effective as a developer I don't particularly care if the model is open source or closed source.

I would love to use open source models as well, but the convenience to just plug an API against some endpoints in unbeatable.

jbellis · 2025-01-02T15:37:40 1735832260

I'm interested. I submitted my email to your landing page form.

samatdav · 2025-01-03T04:46:33 1735879593

Thank you! Will email you within a couple of days:)

WhatsName · 2025-01-02T11:51:29 1735818689

Talk is cheap, benchmarks please. Also why did you decide for LLama? AFAIK deepseek always had a slight edge over llama when it comes to coding performance, or is this no longer the case?

samatdav · 2025-01-03T04:48:19 1735879699

Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option but as we progress we will test other open source models too like deepseek.

KTibow · 2025-01-02T12:41:37 1735821697

Training and running fine tunes of Deepseek could get expensive.

tucnak · 2025-01-02T14:05:27 1735826727

I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?

aussieguy1234 · 2025-01-02T12:28:22 1735820902

You make a bold marketing claim, 4.2x Sonnet, but viewing your website, I can see no data or test results to back this up.

mathgeek · 2025-01-02T12:42:41 1735821761

Thanks for calling this out, even if it just gets OP to comment with some details/data. Was hoping this would be a shallow or deep dive into the results, but looks like it’s just a marketing post to a marketing page to support a PH launch.

samatdav · 2025-01-03T04:58:54 1735880334

Good point, I agree, we haven't shared enough details. Since we are very early, we only got high level results and want to get feedback on what direction would be most applicable and useful. We plan to add more metrics and data to the website in the future and also want to publicly host a fine-tuned model for anyone to try and see.

eurekin · 2025-01-02T11:44:32 1735818272

That's exactly what everybody advised me against doing - finetuning on own projects. Got really discouraged and stopped. So glad someone has done it!

redeux · 2025-01-02T13:36:16 1735824976

Almost no one knows if a project/business idea will be successful or not, so it's not much use asking. It's more productive to ask smart, experienced people how to best validate and execute an idea. People generally give useful and actionable feedback based on their experiences. Just make sure you understand who you're talking to when evaluating someone's advice.

mentalgear · 2025-01-02T13:52:28 1735825948

"understand who you're talking to when evaluating someone's advice." Good you mentioned this, found out to this is a crucial part as well: Always perceive the advice you get depending on that person's background and interests (e.g. your target group, or domain-foreign expert).

menaerus · 2025-01-02T13:29:45 1735824585

> That's exactly what everybody advised me against doing - finetuning on own projects

Why would someone advise against it? IMHO that sounds as the end game to me. If it weren't so darn expensive, I'd try this for myself for sure.

freehorse · 2025-01-02T13:35:54 1735824954

I think that people suggest RAG, also because the models develop so fast that very probably the base model you finetune on will be obsolete in a year or so.

If we are approaching diminishing returns it makes more sense to finetune. As the recent advances seem to happen by throwing more compute to CoT etc maybe the time is close or has already come.

eru · 2025-01-02T13:45:46 1735825546

What's CoT?

vlabakje90 · 2025-01-02T13:49:40 1735825780

Chain of Thought. When I see people using abbreviations like this I sometimes jokingly wonder what they do with all this time they're saving.

zitterbewegung · 2025-01-02T15:08:46 1735830526

There are so many chain types it is easier to do the abbreviations. Basically extend a RAG to have a graph to influence how to either critisize itself or perform different actions. It has gotten to the point where there are libraries for define them. https://langchain-ai.github.io/langgraph/tutorials/introduct...

plagiarist · 2025-01-02T13:57:44 1735826264

Perhaps they're preemptively reducing several tokens into one, for the machines' benefit.

freehorse · 2025-01-02T18:11:04 1735841464

I post in twitter and invest in crypto

scosman · 2025-01-02T16:31:03 1735835463

Fine tuning to a specific codebase is a bit strange. It's going to learn some style/tool guidance which is good (but there are other ways of getting), at the risk of unlearning some generalization it learned from looking at 1,000,000x more code samples of varied styles.

In general I'd suggest trying this first:

- Large context: use large context models to load relevant files. It can pickup your style/tool choices fine this way without fine tuning. I'm usually manually inserting files into context, but a great RAG solution would be ideal.

- Project specific instructions (like .cursorrules): tell it specific things you want. I tell it preferred test tools/strategies/styles.

I am curious to see more detailed evals here, but the claims are too high level to really dive into.

In generally: I love fine tuning for more specific/repeatable tasks. I even have my own fine-tuning platform (https://github.com/Kiln-AI/Kiln). However coding is very broad. Good use case for foundation large models with smart use of context.

QuesnayJr · 2025-01-02T13:36:32 1735824992

Other people have spent a lot of time on it and gotten nowhere, so I suspect there is some art to it.

eurekin · 2025-01-02T22:24:46 1735856686

They have? Is there a write up about that?

prmoustache · 2025-01-02T15:37:59 1735832279

How do you measure code generation accuracy? Are there some base tests and if so how can I ensure the models aren't tuned for those tests only the same way vw cheated the emissions tests on their diesels?

samatdav · 2025-01-03T04:50:55 1735879855

We run a set of change requests on the discourse repo. Good point, we plan to publish more detailed testing benchmarks and metrics on the website.

Nelkins · 2025-01-02T16:07:43 1735834063

Here's a Hugging Face blog post where they walk through how to fine tune a model on your code base: https://huggingface.co/blog/personal-copilot .

Kudos to the founders for shipping. I do think this kind of functionality will become very rapidly commoditized though. But then, I suppose people said the same thing about Dropbox.

Imnimo · 2025-01-02T16:10:57 1735834257

4.2x doesn't mean anything if you don't tell me what "accuracy" Sonnet 3.5 had.

samatdav · 2025-01-03T04:56:27 1735880187

I agree. Our local early results were promising were a higher percentage of code change requests produced a functionally correct output. We will post more metrics and data in the future.

smcleod · 2025-01-02T12:03:57 1735819437

Is the source code available for inspection somewhere? It's not really clear from the landing page.

samatdav · 2025-01-03T04:55:23 1735880123

Not yet, but we plan to publicly host a fine-tuned model so anyone can try.

samatdav · 2025-01-03T04:19:06 1735877946

Hi HN! I'm Samat, the co-founder from the video. Thank you for the critical feedback, great points.

0. Is this a scam? No. We're very early (started <1 month ago) so our landing page is to validate our concept, gather initial feedback and start conversation on what we can build that would be most applicable. We'll add more details and benchmarks to the website.

1. Company logos. You're right. We're using our work experience as a credibility signal because at this stage that is our main selling point. We'll replace logos with concrete results as we develop.

2. Team. We're 2 software engineers and 1 AI researcher: - I was an AI product engineer at Asana. https://linkedin.com/in/samatd - Denis was a tech lead at a unicorn startup. https://x.com/karpenoid - Our third co-founder works at Anthropic and was previously at OpenAI. Since he is still at Anthropic and planning to leave soon for the startup, I can share his details privately.

3. Claims and transparency. Our "4.2x Sonnet-3.5 accuracy" is an initial estimate from a locally fine-tuned model. Actual results may vary - a small app might not see big improvements, but we believe larger, private enterprise projects could see significant gains. We plan to publish our fine-tuned model so others can verify the results.

4. Competition from LLM providers. Fine-tuning requires complex data cleanup and setup. Enterprise projects have fragmented data, making automation challenging for big providers like OpenAI.

Appreciate the feedback! If you want to chat more 1-1, happy to discuss at hi@finecodex.com Samat.

DataDaemon · 2025-01-02T13:06:07 1735823167

Maybe ask this model to create a better landing page?

samatdav · 2025-01-03T05:06:09 1735880769

Thank you, we will!:) This was a quick landing page for us to start the conversation and gather feedback. We are trying to make sure we are not building something that nobody needs.

pcwelder · 2025-01-02T13:01:43 1735822903

Was the comparison done with or without code context (as obtained using RAG or letting Sonnet ask for files)?

jeswin · 2025-01-02T13:05:24 1735823124

In the absense of other information, looks like a cherry-picked example to me.

samatdav · 2025-01-03T05:05:15 1735880715

We used a single file for the context. It is a cherry-picked example, you are right. I wanted to demonstrate a simple visual change that our model did correctly unlike Sonnet-3.5. Since we are just getting started, we don't have many features like making changes across multiple files in the code editor so it would be harder to demo. Our premise is that a smaller fine-tuned works better than a large, general-purpose SOTA model. We plan to share more metrics and data in the future.

mrfinn · 2025-01-02T13:06:51 1735823211

2023: Our tiny model blah blah blah beats GPT4!

2024: Our tiny model blah blah blah beats Claude!

2025: Our tiny model blah blah blah beats ???

samatdav · 2025-01-03T05:10:26 1735881026

Haha, yes it is a pattern. However, the claim here is that "our tiny model beats best model" is applicable for highly specific tasks.

Bluestein · 2025-01-02T13:21:52 1735824112

... DeepSeek?

mentalgear · 2025-01-02T13:55:53 1735826153

I like it and it makes sense, but from a business perspective I wonder what keeps the upstream LLM providers (all trying to generate profits) from offering the same fine-tuning service quickly ?

Edit: OK, right it's olama, so I assume you can download your own model. (Assuming it's downloadable?)

I think openAI already offers fine-tuning with custom data for some of their models, but maybe not specific to coding tasks.

samatdav · 2025-01-03T05:08:32 1735880912

Yes, you can download and host the fine-tuned open-source model like Llama. The fine-tuning is easy once you have the data, but gathering and cleaning data is challenging. There are also optimizations like upsampling and distillation that could improve the quality of the resulting model. We had 40 engineers at the Asana AI org and never did the fine-tuning because it is not easy.

fovc · 2024-12-29T22:07:28 1735510048

Makes a ton of sense!

Is this for completions, patches, or new files?

samatdav · 2024-12-30T01:04:31 1735520671

Hi! Currently we generate a whole diff (like cmd+shift+k in Cursor). But plan to add there rest soon!:)

flakiness · 2025-01-02T14:24:20 1735827860

The page lacks details. The details are only in the YouTube video apparently? Please.