I've seen plenty of startups whose single-page site includes the pitch and a founders section with big logos of their alma maters (almost always Ivy League + Stanford) and whatever FAANG or consulting gig they had previously.
Easier to land investors and customers if you have the "correct" pedigree.
Private pitch decks aren’t the same as public product websites.
It seems this team is using the old scammy marketing trick of using logos of every company you can claim any possible relationship with as a way of building trust. Plastering giant logos on a product website implies some endorsement or affiliation to most casual readers. It’s not until you read all of the text that you realize this is just a list of companies they worked at.
This is the kind of behavior that earns a sternly worded letter from corporate counsel. You shouldn't expect to be able to leave a company and then put their logo on your product page.
No it's not! These are probably imposters anyway. Apparently, you can buy HN upvotes... I find it hard to believe that honest researchers from frontier labs would behave like crypto scammers.
At Asana we did not do any fine-tuning because it was too complicated even for our AI org of 40 engineers. We believe we can do it by setting up and cleaning data correctly.
There's a reason why you don't see frontier-grade AI researchers throwing around meaningless numbers to go with the most layman idea of a product in the field imaginable. The whole thing stinks. I reckon this is some kind of extortion scam intended to trick people into compromising IP.
I understand the concern, but we don't need anyone's IP. Unfortunately, it is hard to provide a fine-tuning solution without access to the codebase. We just think that using a large general-purpose model for a highly specific codebase with a lot of internal frameworks is not the best solution, and we want to try to improve on it.
Hi HN! We worked at OpenAI and Anthropic and believe we can provide much higher quality code generation by fine-tuning an LLM on your codebase than a non-fine-tuned Sonnet-3.5 or o1 can. Let me know if you are interested and we can fine-tune for you for free to test.
I wish you posted more evaluation details on your page as text. What exactly was your accuracy vs. Sonnet? (Right now, we can only tell that Sonnet's was ≤ 1/4.3.) Why the Discourse repo? Providing more detailed information would help folks trust your claims more.
I agree, we need to post more data. Since we are very early (<1 month), we just shared the initial results. The Discourse repo was simply a good option: it is a big public repo that could benefit from fine-tuning. We plan to add more benchmarks to the website as we progress.
Could be done in the future. Our current focus is the highest accuracy, but there are no limitations on the models; it would just depend on the user's preferred size/performance tradeoff.
What is the metric for LLMs? Shouldn't more than just accuracy be measured? If something has high accuracy but low recall, won't it be overfit and fail to generalize? Your metrics would give you false confidence in how effective your model is. Just wondering because the announcement only seems to mention accuracy.
Good point, we should provide more detailed metrics. Since we are very early, we focus on the main metric in our view: higher accuracy of changes, to be more practically usable. We will do more testing on overfitting and on how the model performs on different types of tasks. At a high level, we believe in the idea that "a well fine-tuned model should be much better than a large general model". But we need more metrics, I agree.
I agree, we plan to publish more benchmarks and metrics. We also want to publicly host our fine-tuned model for one of the open-source repos so that people can try it themselves against SOTA models.
It seems like an interesting fine-tuning idea. Drawing from reasoning models, I wonder if it's effective to 10x or 100x the fine-tune dataset by having a larger reasoning model create documentation and reasoning CoTs about the codebase's current state, plus speculation about future state updates. Maybe have it output some verbose execution-flow analysis.
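Roughly something like this, as a sketch (the model name, prompt, and output format are all made up for illustration):

    # Sketch: grow the fine-tuning set by having a larger reasoning model
    # write documentation and execution-flow traces for each source file.
    # Model name, prompt, and output format are all illustrative.
    import json
    import pathlib
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    PROMPT = ("Write (1) concise documentation for this file and "
              "(2) a step-by-step trace of its execution flow:\n\n{code}")

    with open("augmented.jsonl", "w") as out:
        for path in pathlib.Path("src").rglob("*.py"):
            resp = client.chat.completions.create(
                model="o1",  # hypothetical choice of reasoning model
                messages=[{"role": "user",
                           "content": PROMPT.format(code=path.read_text())}],
            )
            out.write(json.dumps({
                "prompt": f"Explain {path}",
                "completion": resp.choices[0].message.content,
            }) + "\n")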
Thank you for the idea! We are also considering upsampling and distillation. But at a high level, correctly setting up the data for simple fine-tuning can already produce great results.
Yes, we fine-tune for each codebase. Right now we are focusing on larger enterprise codebases that would: 1. benefit from fine-tuning the most, and 2. have the budget to pay us for the service.
For smaller projects that are price-sensitive we are probably not a good fit at this point.
DeepSeek is an amazing product but has a few issues:
1. Data is used for training
2. Context window is rather small and doesn't fit a large codebase as well
I keep saying this over and over in all the content I create: the value of coding with AI will come from working on big, complex, legacy codebases, not from flashy demos where you create a to-do app.
For that you need solid models with big context and private inference.
Shouldn't we be comparing with other open-source models? In particular, since this is about Llama 3.3, it has the exact same context limit, which is 128k [1].
Talk is cheap, benchmarks please. Also, why did you decide on Llama? AFAIK DeepSeek always had a slight edge over Llama when it comes to coding performance, or is this no longer the case?
Good point, we plan to publish more benchmarks and also publicly host a model for anyone to try. We think Llama is a good option, but as we progress we will test other open-source models too, like DeepSeek.
I'm not saying you're an imposter... but you're making it really easy to assume that; it doesn't seem you have learnt much while you guys were there. Are you sure you weren't hired by mistake?
Thanks for calling this out, even if it just gets OP to comment with some details/data. Was hoping this would be a shallow or deep dive into the results, but looks like it’s just a marketing post to a marketing page to support a PH launch.
Good point, I agree, we haven't shared enough details. Since we are very early, we only have high-level results and want feedback on what direction would be most applicable and useful. We plan to add more metrics and data to the website and also want to publicly host a fine-tuned model for anyone to try.
Almost no one knows if a project/business idea will be successful or not, so it's not much use asking. It's more productive to ask smart, experienced people how to best validate and execute an idea. People generally give useful and actionable feedback based on their experiences. Just make sure you understand who you're talking to when evaluating someone's advice.
"understand who you're talking to when evaluating someone's advice."
Good you mentioned this; I've found this to be a crucial part as well: always weigh the advice you get against that person's background and interests (e.g. whether they're in your target group, or a domain-foreign expert).
I think people also suggest RAG because models develop so fast that the base model you fine-tune on will very probably be obsolete in a year or so.
If we are approaching diminishing returns, it makes more sense to fine-tune. Since recent advances seem to come from throwing more compute at CoT etc., maybe that time is close or has already come.
There are so many chain types that it's easier to use the abbreviations. Basically, extend RAG with a graph that controls whether the model critiques itself or performs different actions. It has gotten to the point where there are libraries for defining them. https://langchain-ai.github.io/langgraph/tutorials/introduct...
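For anyone unfamiliar, a minimal generate/critique loop in LangGraph looks roughly like this (the node bodies are stubs; real versions would call an LLM):

    # Minimal LangGraph sketch of a generate -> critique -> revise loop.
    # Node bodies are stubs; real ones would call a model.
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class State(TypedDict):
        draft: str
        critique: str
        rounds: int

    def generate(state: State) -> dict:
        return {"draft": "candidate answer", "rounds": state["rounds"] + 1}

    def critique(state: State) -> dict:
        return {"critique": "looks fine" if state["rounds"] > 1 else "revise"}

    def decide(state: State) -> str:
        return "done" if state["critique"] == "looks fine" else "revise"

    builder = StateGraph(State)
    builder.add_node("generate", generate)
    builder.add_node("critique", critique)
    builder.set_entry_point("generate")
    builder.add_edge("generate", "critique")
    builder.add_conditional_edges("critique", decide,
                                  {"revise": "generate", "done": END})
    graph = builder.compile()

    print(graph.invoke({"draft": "", "critique": "", "rounds": 0}))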
Fine-tuning to a specific codebase is a bit strange. It's going to learn some style/tool guidance, which is good (but there are other ways of getting that), at the risk of unlearning some generalization it picked up from looking at 1,000,000x more code samples of varied styles.
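One common mitigation for that unlearning risk is parameter-efficient tuning like LoRA, which freezes the base weights and only trains small adapter matrices. To be clear, nobody in this thread has said that's what's being used; this is just the standard peft recipe, with an illustrative base model:

    # Standard LoRA setup with Hugging Face peft: base weights stay frozen,
    # only the small adapter matrices are trained.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct")  # illustrative; pick any base
    config = LoraConfig(
        r=16,                                # adapter rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"], # attention projections only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the base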
In general I'd suggest trying this first:
- Large context: use large-context models to load relevant files. It can pick up your style/tool choices fine this way without fine-tuning. I'm usually manually inserting files into context, but a great RAG solution would be ideal.
- Project-specific instructions (like .cursorrules): tell it specific things you want. I tell it preferred test tools/strategies/styles (toy example below).
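A .cursorrules file is just plain-text instructions prepended to every request. A toy example (contents are obviously project-specific):

    Use pytest, not unittest; prefer fixtures over setUp methods.
    Type hints everywhere; no print debugging.
    New modules mirror the layout of src/<feature>/ with a matching test file.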
I am curious to see more detailed evals here, but the claims are too high level to really dive into.
In general: I love fine-tuning for more specific/repeatable tasks. I even have my own fine-tuning platform (https://github.com/Kiln-AI/Kiln). However, coding is very broad; it's a good use case for large foundation models with smart use of context.
How do you measure code generation accuracy? Are there some base tests, and if so, how can I ensure the models aren't tuned for those tests only, the same way VW cheated the emissions tests on their diesels?
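For context, the standard metric here is functional correctness on held-out tasks, e.g. the pass@k estimator from the HumanEval paper (Chen et al., 2021):

    # Unbiased pass@k estimator from the HumanEval paper:
    # n = samples generated per task, c = samples that pass the tests.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=20, c=5, k=1))  # 0.25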
Kudos to the founders for shipping. I do think this kind of functionality will become very rapidly commoditized though. But then, I suppose people said the same thing about Dropbox.
I agree. Our early local results were promising: a higher percentage of code change requests produced functionally correct output. We will post more metrics and data in the future.
Hi HN! I'm Samat, the co-founder from the video. Thank you for the critical feedback, great points.
0. Is this a scam?
No. We're very early (started <1 month ago), so our landing page is there to validate our concept, gather initial feedback, and start a conversation about what we can build that would be most applicable. We'll add more details and benchmarks to the website.
1. Company logos.
You're right. We're using our work experience as a credibility signal because at this stage that is our main selling point. We'll replace logos with concrete results as we develop.
2. Team.
We're 2 software engineers and 1 AI researcher:
- I was an AI product engineer at Asana. https://linkedin.com/in/samatd
- Denis was a tech lead at a unicorn startup. https://x.com/karpenoid
- Our third co-founder works at Anthropic and was previously at OpenAI. Since he is still at Anthropic and planning to leave soon for the startup, I can share his details privately.
3. Claims and transparency.
Our "4.2x Sonnet-3.5 accuracy" is an initial estimate from a locally fine-tuned model. Actual results may vary - a small app might not see big improvements, but we believe larger, private enterprise projects could see significant gains. We plan to publish our fine-tuned model so others can verify the results.
4. Competition from LLM providers.
Fine-tuning requires complex data cleanup and setup. Enterprise projects have fragmented data, making automation challenging for big providers like OpenAI.
Appreciate the feedback! If you want to chat more 1-1, happy to discuss at hi@finecodex.com
Samat.
Thank you, we will! :) This was a quick landing page for us to start the conversation and gather feedback. We are trying to make sure we are not building something that nobody needs.
We used a single file for the context. It is a cherry-picked example, you are right. I wanted to demonstrate a simple visual change that our model made correctly, unlike Sonnet-3.5. Since we are just getting started, we don't have many features like making changes across multiple files in the code editor, so it would be harder to demo. Our premise is that a smaller fine-tuned model works better than a large, general-purpose SOTA model. We plan to share more metrics and data in the future.
I like it and it makes sense, but from a business perspective I wonder what keeps the upstream LLM providers (all trying to generate profits) from quickly offering the same fine-tuning service?
Edit: OK, right, it's Llama, so I assume you can download your own model. (Assuming it's downloadable?)
I think OpenAI already offers fine-tuning with custom data for some of their models, but maybe not specific to coding tasks.
Yes, you can download and host the fine-tuned open-source model like Llama. The fine-tuning is easy once you have the data, but gathering and cleaning the data is challenging. There are also optimizations like upsampling and distillation that could improve the quality of the resulting model. We had 40 engineers in the Asana AI org and never did fine-tuning because it is not easy.
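As a toy illustration of what the data setup involves (deliberately simplified, not our actual pipeline), a first cut is mining (commit message, diff) pairs from git history:

    # Toy sketch: mine (commit message, diff) fine-tuning pairs from a repo.
    # Filters and field names are illustrative only.
    import json
    import subprocess

    def git(*args, repo="."):
        return subprocess.run(["git", "-C", repo, *args],
                              capture_output=True, text=True, check=True).stdout

    def commit_pairs(repo=".", limit=100):
        for h in git("log", f"-{limit}", "--pretty=format:%H", repo=repo).splitlines():
            message = git("show", "-s", "--format=%s", h, repo=repo).strip()
            diff = git("show", "--format=", h, repo=repo)
            if 0 < len(diff) < 8000:  # skip empty or huge changes
                yield {"prompt": message, "completion": diff}

    with open("train.jsonl", "w") as f:
        for pair in commit_pairs():
            f.write(json.dumps(pair) + "\n")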
The page includes the logos of those companies. Is it normal to do that for companies one used to work for?