[Author] Mosaic must be getting some kind of sweetheart deal on A100 80GB and A100 40GB. The prices they are quoting are not what the AWS on-demand prices are. They quote $2 per GPU-hour for A100 40GB and $2.50 for A100 80GB. That's literally half the AWS on-demand rate for A100s here: https://aws.amazon.com/ec2/instance-types/p4/
And these are impossible to get. We tried to get some for Anyscale, and we were told there were no on-demand available and lead time for reserved (ouchie on the price! You're talking a quarter of a million dollars a year for one machine at list) was in weeks.
Once you take the model size and hefty sweetheart deals into account, you're within 10%. Mosaic does have some nice whitebox optimizations, but nothing that radically changes the equation.
A100-40GB is like $1.10 on LambdaLabs, on demand. Their availability is horrific on singles, but I've seen 8x instances pop up more often than not. And you can rent A100s for a buck a pop interruptible from other clouds, plenty of availability. $2 doesn't seem like much of a sweetheart deal.
I have a suggested modification. You are mixing references in your document.
Re: '~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens
The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs.'
The LLaMA-13B model took 2.75 days of 2048xA100 (135,168 GPU-hours) with 1 trillion tokens. The 21 days for 1.4 trillion was for LLaMA-65B.
I would suggest using the LLaMA-13B numbers, since those are the most relevant for this section, or at least changing "21 days to train LLaMa" to "21 days to train LLaMa-65B" for clarity.
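For a rough sense of what this does to the headline number, here is a back-of-the-envelope sketch in Python. It only uses figures quoted in this thread: the 135,168 GPU-hours for LLaMA-13B and the per-GPU-hour rates mentioned above. The function and variable names are my own, the rates are quotes from comments rather than authoritative pricing, and the LambdaLabs rate is for the 40GB card while LLaMA was trained on 80GB cards, so treat that row as a loose comparison.

```python
# Back-of-the-envelope training cost: GPU-hours times price per GPU-hour.
# All figures come from the comments in this thread and are assumptions,
# not verified pricing.

GPU_HOURS_13B = 135_168   # LLaMA-13B, 1 trillion tokens
NUM_GPUS = 2048           # A100 80GB GPUs used for the run

# Quoted hourly rates in USD per GPU-hour (hypothetical labels).
RATES_USD_PER_GPU_HOUR = {
    "Mosaic A100 40GB": 2.00,
    "Mosaic A100 80GB": 2.50,
    "LambdaLabs A100 40GB on-demand": 1.10,
}


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Convert total GPU-hours into wall-clock days on a cluster of num_gpus."""
    return gpu_hours / num_gpus / 24


def training_cost(gpu_hours: float, rate_per_gpu_hour: float) -> float:
    """Total cost in USD, ignoring storage, networking, and failed runs."""
    return gpu_hours * rate_per_gpu_hour


if __name__ == "__main__":
    days = wall_clock_days(GPU_HOURS_13B, NUM_GPUS)
    print(f"Wall-clock time on {NUM_GPUS} GPUs: {days:.2f} days")
    for label, rate in RATES_USD_PER_GPU_HOUR.items():
        cost = training_cost(GPU_HOURS_13B, rate)
        print(f"{label}: ${cost:,.0f}")
```

At the quoted $2.50 per GPU-hour this comes to roughly $338k for the 1 trillion token LLaMA-13B run, which is why the choice of reference model matters so much for the "~$1 million" figure.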