> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

MosaicML claims they trained a 7-billion-parameter model on 1 trillion tokens with a budget of $200k.

https://www.mosaicml.com/blog/mpt-7b

Does training cost scale linearly with model size and token count? If so, scaling $200k by (13/7) × (1.4/1) ≈ 2.6 suggests a lower bound of roughly $520k to train the 13-billion-parameter model. (Still roughly the same order of magnitude.)
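A quick back-of-the-envelope, as a minimal sketch assuming cost scales linearly with parameters × tokens (the ~6·N·D FLOPs rule of thumb) and using only the figures quoted above:

    # Scale MosaicML's quoted MPT-7B budget to a 13B-parameter / 1.4T-token run,
    # assuming cost is proportional to params * tokens.
    mpt_cost_usd = 200_000                       # MosaicML's quoted MPT-7B budget
    mpt_params, mpt_tokens = 7e9, 1e12
    target_params, target_tokens = 13e9, 1.4e12

    scale = (target_params * target_tokens) / (mpt_params * mpt_tokens)
    print(f"scale factor: {scale:.2f}x")                    # ~2.6x
    print(f"estimated cost: ${mpt_cost_usd * scale:,.0f}")  # ~$520,000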




[Author] Mosaic must be getting some kind of sweetheart deals on A100 80GB and A100 40GB. The prices they quote are not what, say, the AWS on-demand prices are: $2 per GPU-hour for an A100 40GB and $2.50 for an A100 80GB. That's literally half the AWS on-demand rate for A100s here: https://aws.amazon.com/ec2/instance-types/p4/

And these are impossible to get. We tried to get some for Anyscale, and we were told there were none available on-demand, and that the lead time for reserved instances (ouchie on the price! You're talking a quarter of a million dollars a year for one machine at list) was measured in weeks.

Once you take the model size and hefty sweetheart deals into account, you're within 10%. Mosaic does have some nice whitebox optimizations, but nothing that radically changes the equation.
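For reference, a rough sketch of the per-GPU-hour comparison behind the "literally half" claim; the p4d.24xlarge on-demand rate used here (~$32.77/hr for 8x A100 40GB) is an assumed, approximate list price, so check the linked AWS page for current numbers:

    # Per-GPU-hour comparison: AWS p4d on-demand vs. Mosaic's quoted rate.
    # The p4d.24xlarge hourly price below is an assumed/approximate figure.
    aws_p4d_hourly = 32.77                      # assumed $/hr, p4d.24xlarge on-demand
    gpus_per_p4d = 8

    aws_per_gpu_hour = aws_p4d_hourly / gpus_per_p4d        # ~$4.10/GPU-hr
    mosaic_per_gpu_hour = 2.00                              # Mosaic's quoted A100 40GB rate

    print(f"AWS on-demand: ~${aws_per_gpu_hour:.2f}/GPU-hr")
    print(f"Mosaic quoted:  ${mosaic_per_gpu_hour:.2f}/GPU-hr")
    print(f"ratio: {aws_per_gpu_hour / mosaic_per_gpu_hour:.2f}x")  # ~2x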


An A100-40GB is like $1.10/hr on LambdaLabs, on demand. Their availability is horrific for single GPUs, but I've seen 8x instances pop up more often than not. And you can rent A100s for a buck a pop, interruptible, from other clouds, with plenty of availability. $2 doesn't seem like much of a sweetheart deal.


There is no possible way for anyone buying $1M worth of compute to be paying list pricing.


Thanks for putting this together.

I have a suggested modification: you are mixing up references to two different LLaMA models in your document.

Re: '~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs.'

The LLaMA-13B model took 2.75 days on 2048 A100s (135,168 GPU-hours) with 1 trillion tokens. The 21 days for 1.4 trillion tokens was for LLaMA-65B.

I would suggest using the LLaMa-13B numbers since those are the most relevant for this section, or at least modify "21 days to train LLaMa" to "21 days to train LLaMa-65B" for clarity.
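A minimal sanity check using the LLaMA-13B figure above; the $/GPU-hour rates below are illustrative assumptions, not quotes:

    # LLaMA-13B: 2.75 days on 2048 A100s = 135,168 GPU-hours (LLaMA paper, 1T tokens).
    gpu_hours_13b = 2.75 * 24 * 2048
    assert round(gpu_hours_13b) == 135_168

    for rate in (2.00, 2.50, 4.10):             # hypothetical $/GPU-hour rates
        print(f"${rate:.2f}/GPU-hr -> ~${gpu_hours_13b * rate:,.0f}")
    # ~$270k at $2, ~$338k at $2.50, ~$554k at $4.10 -- all for 1T tokens;
    # a 1.4T-token run would scale these by roughly 1.4x.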



