[flagged] We fine-tuned Llama and got 4.2x Sonnet 3.5 accuracy for code generation (finecodex.com)
137 points by banddk 7 months ago | 77 comments


> Our team is ex-OpenAI, Anthropic, and Asana research scientists and AI engineers

The page includes the logos of those companies. Is it normal to do that for companies one used to work for?


I've seen plenty of startups whose single page includes the pitch and a founders section with big logos of their alma maters (almost always Ivy League + Stanford) and whatever FAANG or consulting gig they had previously.

Easier to land investors and customers if you have the "correct" pedigree.


Private pitch decks aren’t the same as public product websites.

It seems this team is using the old scammy marketing trick of using logos of every company you can claim any possible relationship with as a way of building trust. Plastering giant logos on a product website implies some endorsement or affiliation to most casual readers. It’s not until you read all of the text that you realize this is just a list of companies they worked at.

This is the kind of behavior that earns a sternly worded letter from corporate counsel. You shouldn't expect to be able to leave a company and then put their logo on your product page.


Good point! We are just very early and our experience is our main selling point. We plan to remove it.


No it's not! These are probably imposters anyway. Apparently, you can buy HN upvotes... I find it hard to believe that honest researchers from frontier labs would behave like crypto scammers.


It’s also weird that there is no about page naming the founders.



It says they WILL fine tune a model.

Sounds fishy


At Asana we did not do any fine-tuning because it was too complicated even for our AI org of 40 engineers. We believe we can do it by setting up and cleaning data correctly.


There's a reason why you don't see frontier-grade AI researchers throwing around meaningless numbers to go with the most layman idea of a product in the field imaginable. The whole thing stinks. I reckon this is some kind of extortion scam intended to trick people into compromising IP.


I understand the concern, but we don't need anyone's IP. Unfortunately, it is hard to provide a fine-tuning solution without access to the codebase. We just think that using a large general-purpose model for a highly specific codebase with a lot of internal frameworks is not the best solution, and we want to try to improve on it.


Hi HN! We worked at OpenAI and Anthropic and believe that by fine-tuning an LLM on your codebase we can provide much higher quality code generation than non-fine-tuned Sonnet-3.5 or o1. Let me know if you are interested and we can fine-tune for you for free to test.


I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.2.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.


I agree, we need to post more data. Since we are very early (<1 month) we just shared the initial results. The Discourse repo was simply a good option since it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.


Any plans on distilling it down to an 8b model to enable it for pure local usage on most consumer hardware?


It could be done in the future. Our current focus is the highest accuracy. But there are no limitations on the models - it would just depend on the user's preferred size/performance tradeoff.


I would be interested in a fine tune on OpenZFS:

https://github.com/openzfs/zfs


Thank you for the suggestion, we will take a look!


Interested! Our large Rust code base at https://zed.dev is open-source at https://github.com/zed-industries/zed and I'd be curious to try this out on it.

My email is richard at our website's domain if you'd like to get in touch!


Looks like a great repo to try the fine-tuning! I will email you, thanks!


What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.


Good point, we should provide more detailed metrics. Since we are very early, we focus on the main metric in our view: higher accuracy of changes, to be more practically usable. We will do more testing on overfitting and on how the model performs on different types of tasks. At a high level we believe in the idea that "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.


Some more details that programmers can inspect would be very useful.


I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try it themselves against SOTA models.


It seems like an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it's effective to 10x or 100x the fine-tune dataset by having a larger reasoning model create documentation and reasoning CoTs about the code base's current state and speculation about future state updates. Maybe have it output some verbose execution flow analysis.
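
Very roughly, I'm imagining something like the sketch below; the client, model name, repo path, and JSONL schema are all placeholders I'm making up, not anything from your page:

    # Sketch: have a larger model write reasoning/documentation about each source file,
    # then store the pairs as extra fine-tuning examples. Everything here (client,
    # model name, repo path, output schema) is hypothetical.
    import json
    from pathlib import Path

    from openai import OpenAI  # any OpenAI-compatible client would do

    client = OpenAI()

    def annotate(path: Path) -> dict:
        code = path.read_text(errors="ignore")
        prompt = ("Explain step by step what this file does, how it fits into the "
                  "codebase, and how it might change as the surrounding APIs evolve:\n\n" + code)
        resp = client.chat.completions.create(
            model="o1",  # placeholder reasoning model
            messages=[{"role": "user", "content": prompt}],
        )
        return {"code": code, "reasoning": resp.choices[0].message.content}

    with open("augmented_dataset.jsonl", "w") as out:
        for path in Path("my_repo").rglob("*.py"):  # repo path/glob are placeholders
            out.write(json.dumps(annotate(path)) + "\n")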


Thank you for the idea! We are also considering upsampling and distillation. But at a high level, correctly setting up the data for simple fine-tuning can already produce great results.


I am getting quite deep into coding with AI and cost of tokens is a bit of an issue indeed.

Trivial issue because it saves me A LOT of time, but it could be an issue for new people testing it.

I would love to test this approach. Are you guys fine tuning for each codebase?


Yes, we fine-tune for each codebase. Right now we are focusing on larger enterprise codebases that would (1) benefit from fine-tuning the most and (2) have the budget to pay us for the service. For smaller projects that are price-sensitive we are probably not a good fit at this point.


>>cost of tokens is a bit of an issue indeed

Their cost is $0.70 per 1M tokens.

DeepSeek is $0.14 per 1M tokens (cache miss).


DeepSeek is an amazing product but has a few issues:

1. Data is used for training

2. The context window is rather small and doesn't fit a large codebase as well

I keep saying this over and over in all the content I create: the value of coding with AI will come from working on big, complex, legacy codebases, not from flashy demos where you create a to-do app.

For that you need solid models with big context and private inference.


DeepSeek is open source and has a context length of 128k tokens.


The commercial service has a context of 64k tokens, which I find quite limiting.

https://api-docs.deepseek.com/quick_start/pricing

Running it locally is quite a bit beyond the scope of being productive while coding with AI.

Besides that, 128k is still significantly less than Claude's.


Shouldn't we be comparing with other open-source models? In particular, since this is about Llama 3.3, it has the exact same context limit of 128k [1].

[1] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct


Why?

Whenever using a model to be more effective as a developer I don't particularly care if the model is open source or closed source.

I would love to use open-source models as well, but the convenience of just plugging into an API endpoint is unbeatable.


I'm interested. I submitted my email to your landing page form.


Thank you! Will email you within a couple of days:)


Talk is cheap, benchmarks please. Also, why did you decide on Llama? AFAIK DeepSeek has always had a slight edge over Llama when it comes to coding performance, or is this no longer the case?


Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option, but as we progress we will test other open-source models like DeepSeek too.


Training and running fine-tunes of DeepSeek could get expensive.


I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?


You make a bold marketing claim, 4.2x Sonnet, but viewing your website, I can see no data or test results to back this up.


Thanks for calling this out, even if it just gets OP to comment with some details/data. Was hoping this would be a shallow or deep dive into the results, but looks like it’s just a marketing post to a marketing page to support a PH launch.


Good point, I agree, we haven't shared enough details. Since we are very early, we only have high-level results and want to get feedback on what direction would be most applicable and useful. We plan to add more metrics and data to the website in the future, and we also want to publicly host a fine-tuned model for anyone to try and see.


That's exactly what everybody advised me against doing - finetuning on own projects. Got really discouraged and stopped. So glad someone has done it!


Almost no one knows if a project/business idea will be successful or not, so it's not much use asking. It's more productive to ask smart, experienced people how to best validate and execute an idea. People generally give useful and actionable feedback based on their experiences. Just make sure you understand who you're talking to when evaluating someone's advice.


"understand who you're talking to when evaluating someone's advice." Good you mentioned this, found out to this is a crucial part as well: Always perceive the advice you get depending on that person's background and interests (e.g. your target group, or domain-foreign expert).


> That's exactly what everybody advised me against doing - finetuning on own projects

Why would someone advise against it? IMHO that sounds like the end game to me. If it weren't so darn expensive, I'd try this for myself for sure.


I think people suggest RAG also because the models develop so fast that the base model you fine-tune on will very probably be obsolete in a year or so.

If we are approaching diminishing returns, it makes more sense to fine-tune. As the recent advances seem to happen by throwing more compute at CoT etc., maybe the time is close or has already come.


What's CoT?


Chain of Thought. When I see people using abbreviations like this I sometimes jokingly wonder what they do with all this time they're saving.


There are so many chain types that it is easier to use the abbreviations. Basically you extend a RAG setup with a graph that influences how the model either criticizes itself or performs different actions. It has gotten to the point where there are libraries for defining them. https://langchain-ai.github.io/langgraph/tutorials/introduct...
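
Stripped of the framework, the core loop is just generate, critique, maybe revise. Here's a bare-bones sketch where the llm() helper, prompts, and stopping rule are placeholders (graph libraries like LangGraph basically turn this into an explicit graph of nodes and conditional edges):

    # Bare-bones generate -> critique -> revise loop. The llm() helper, prompts, and
    # round limit are placeholders; graph libraries just make this flow explicit.
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in any chat-model call here")

    def solve(task: str, max_rounds: int = 3) -> str:
        draft = llm(f"Write code for this task:\n{task}")
        for _ in range(max_rounds):
            critique = llm(f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
                           "List concrete problems, or reply OK if it looks correct.")
            if critique.strip().upper().startswith("OK"):
                break  # the critic is satisfied
            draft = llm(f"Task:\n{task}\n\nDraft:\n{draft}\n\nFix these problems:\n{critique}")
        return draft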


Perhaps they're preemptively reducing several tokens into one, for the machines' benefit.


I post in twitter and invest in crypto


Fine-tuning to a specific codebase is a bit strange. It's going to learn some style/tool guidance, which is good (but there are other ways of getting that), at the risk of unlearning some generalization it learned from looking at 1,000,000x more code samples of varied styles.

In general I'd suggest trying this first:

- Large context: use large context models to load relevant files. It can pick up your style/tool choices fine this way without fine-tuning. I'm usually manually inserting files into context (see the sketch after this list for the basic idea), but a great RAG solution would be ideal.

- Project specific instructions (like .cursorrules): tell it specific things you want. I tell it preferred test tools/strategies/styles.
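
To be concrete about the first point, "manually inserting files into context" can be as simple as this; the file list, character budget, and instruction wording are just examples, not a recommendation:

    # Rough sketch of manual context loading: concatenate hand-picked files into the
    # prompt so the model sees the project's style and APIs. File list, character
    # budget, and instruction text are examples.
    from pathlib import Path

    FILES = ["src/editor.py", "src/buffer.py", "docs/STYLE.md"]  # hand-picked, hypothetical
    MAX_CHARS = 100_000  # crude stand-in for a real token budget

    def build_prompt(task: str) -> str:
        parts = [f"--- {name} ---\n{Path(name).read_text(errors='ignore')}" for name in FILES]
        context = "\n\n".join(parts)[:MAX_CHARS]
        return (f"Project context:\n{context}\n\nTask:\n{task}\n"
                "Follow the existing style and test conventions.")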

I am curious to see more detailed evals here, but the claims are too high level to really dive into.

In general: I love fine-tuning for more specific/repeatable tasks. I even have my own fine-tuning platform (https://github.com/Kiln-AI/Kiln). However, coding is very broad. It's a good use case for large foundation models with smart use of context.


Other people have spent a lot of time on it and gotten nowhere, so I suspect there is some art to it.


They have? Is there a write up about that?


How do you measure code generation accuracy? Are there some base tests, and if so, how can I ensure the models aren't tuned for those tests only, the same way VW cheated the emissions tests on their diesels?


We run a set of change requests on the Discourse repo. Good point, we plan to publish more detailed testing benchmarks and metrics on the website.
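
Conceptually the loop is: apply each generated diff to a clean checkout, run the test suite, and count how many changes leave the tests green. A very simplified sketch (repo path, test command, and diff handling are placeholders, not our exact harness):

    # Simplified functional-correctness loop: apply each model-generated diff to a
    # clean checkout and count how many keep the test suite passing. Repo path, test
    # command, and patch format are placeholders.
    import subprocess

    def run(cmd, cwd, **kw):
        return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, **kw)

    def accuracy(repo: str, patches: list[str], test_cmd=("bin/rake", "test")) -> float:
        passed = 0
        for diff in patches:
            run(["git", "checkout", "--", "."], repo)        # reset the working tree
            if run(["git", "apply", "-"], repo, input=diff).returncode != 0:
                continue                                      # a patch that doesn't apply counts as a failure
            if run(list(test_cmd), repo).returncode == 0:     # did the tests still pass?
                passed += 1
        return passed / len(patches)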


Here's a Hugging Face blog post where they walk through how to fine-tune a model on your code base: https://huggingface.co/blog/personal-copilot
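
The gist (there and in most codebase fine-tunes) is LoRA on a causal LM over chunks of your repo. A rough sketch with peft/transformers, where the model name, repo path, chunking, and hyperparameters are illustrative rather than the post's exact setup:

    # Rough LoRA fine-tuning sketch in the spirit of the linked post. Model name,
    # repo path, chunking, and hyperparameters are illustrative only.
    from pathlib import Path
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "meta-llama/Llama-3.1-8B"           # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                                             target_modules=["q_proj", "v_proj"]))

    texts = [p.read_text(errors="ignore") for p in Path("my_repo").rglob("*.py")]  # placeholder corpus
    dataset = Dataset.from_dict({"text": texts}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=2048),
        batched=True, remove_columns=["text"])

    Trainer(model=model,
            args=TrainingArguments(output_dir="codebase-lora", per_device_train_batch_size=1,
                                   gradient_accumulation_steps=16, num_train_epochs=1,
                                   learning_rate=2e-4),
            train_dataset=dataset,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()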

Kudos to the founders for shipping. I do think this kind of functionality will become very rapidly commoditized though. But then, I suppose people said the same thing about Dropbox.


4.2x doesn't mean anything if you don't tell me what "accuracy" Sonnet 3.5 had.


I agree. Our early local results were promising: a higher percentage of code change requests produced a functionally correct output. We will post more metrics and data in the future.


Is the source code available for inspection somewhere? It's not really clear from the landing page.


Not yet, but we plan to publicly host a fine-tuned model so anyone can try it.


Hi HN! I'm Samat, the co-founder from the video. Thank you for the critical feedback, great points.

0. Is this a scam? No. We're very early (started <1 month ago), so our landing page is there to validate our concept, gather initial feedback, and start a conversation about what we can build that would be most applicable. We'll add more details and benchmarks to the website.

1. Company logos. You're right. We're using our work experience as a credibility signal because at this stage that is our main selling point. We'll replace logos with concrete results as we develop.

2. Team. We're 2 software engineers and 1 AI researcher:

- I was an AI product engineer at Asana. https://linkedin.com/in/samatd

- Denis was a tech lead at a unicorn startup. https://x.com/karpenoid

- Our third co-founder works at Anthropic and was previously at OpenAI. Since he is still at Anthropic and planning to leave soon for the startup, I can share his details privately.

3. Claims and transparency. Our "4.2x Sonnet-3.5 accuracy" is an initial estimate from a locally fine-tuned model. Actual results may vary - a small app might not see big improvements, but we believe larger, private enterprise projects could see significant gains. We plan to publish our fine-tuned model so others can verify the results.

4. Competition from LLM providers. Fine-tuning requires complex data cleanup and setup. Enterprise projects have fragmented data, making automation challenging for big providers like OpenAI.

Appreciate the feedback! If you want to chat more 1-1, I'm happy to discuss at hi@finecodex.com. Samat


Maybe ask this model to create a better landing page?


Thank you, we will!:) This was a quick landing page for us to start the conversation and gather feedback. We are trying to make sure we are not building something that nobody needs.


Was the comparison done with or without code context (as obtained using RAG or letting Sonnet ask for files)?


In the absence of other information, it looks like a cherry-picked example to me.


We used a single file for the context. It is a cherry-picked example, you are right. I wanted to demonstrate a simple visual change that our model made correctly, unlike Sonnet-3.5. Since we are just getting started, we don't have many features like making changes across multiple files in the code editor, so it would be harder to demo. Our premise is that a smaller fine-tuned model works better than a large, general-purpose SOTA model. We plan to share more metrics and data in the future.


2023: Our tiny model blah blah blah beats GPT4!

2024: Our tiny model blah blah blah beats Claude!

2025: Our tiny model blah blah blah beats ???


Haha, yes it is a pattern. However, the claim here is that "our tiny model beats the best model" applies to highly specific tasks.


... DeepSeek?


I like it and it makes sense, but from a business perspective I wonder what keeps the upstream LLM providers (all trying to generate profits) from offering the same fine-tuning service quickly?

Edit: OK, right, it's Llama, so I assume you can download your own model. (Assuming it's downloadable?)

I think OpenAI already offers fine-tuning with custom data for some of their models, but maybe not specific to coding tasks.


Yes, you can download and host the fine-tuned open-source model like Llama. The fine-tuning is easy once you have the data, but gathering and cleaning the data is challenging. There are also optimizations like upsampling and distillation that could improve the quality of the resulting model. We had 40 engineers in the Asana AI org and never did fine-tuning because it is not easy.
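
For the download/hosting part, running a LoRA-style fine-tune locally looks roughly like this; the model name, adapter path, and prompt are placeholders, and the actual deliverable format may differ:

    # Sketch of running a downloaded fine-tune locally: load the base model, attach the
    # adapter weights, and generate. Model name, adapter path, and prompt are placeholders.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "meta-llama/Llama-3.3-70B-Instruct"   # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
    model = PeftModel.from_pretrained(model, "./my-codebase-adapter")  # hypothetical adapter path

    inputs = tokenizer("Add a retry wrapper around the HTTP client in lib/net.py",
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=True))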


Makes a ton of sense!

Is this for completions, patches, or new files?


Hi! Currently we generate a whole diff (like cmd+shift+k in Cursor). But we plan to add the rest soon! :)


The page lacks details. The details are only in the YouTube video apparently? Please.



