I've been confused over this.

I've seen a $5.5M figure for training, and corresponding commentary along the lines of what you said, but AFAICT it elides the cost of the base model.




$5.5 million is the cost of training the base model, DeepSeek V3. I haven't seen numbers for how much extra the reinforcement learning that turned it into R1 cost.


Ahhh, ty ty.


With $5.5M, you can buy around 150 H100s. Experts correct me if I’m wrong but it’s practically impossible to train a model like that with that measly amount.

So I doubt that figure includes all the cost of training.
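
A rough back-of-envelope check of that card count (a minimal sketch; the ~$30k-$40k per-H100 purchase price is an assumption, not a quoted figure):

  # How many H100s does $5.5M buy outright?
  # Per-card purchase price is assumed (~$30k-$40k depending on vendor and volume).
  budget = 5.5e6
  for unit_price in (30_000, 35_000, 40_000):
      print(f"${unit_price:,}/card -> ~{budget / unit_price:.0f} cards")
  # ~137-183 cards, so "around 150" is in the right ballpark -- and that buys
  # bare cards only, with nothing for networking, storage, power or facilities.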


It's even more. You also need to fund power and maintain infrastructure to run the GPUs. You need to build fast networks between the GPUs for RDMA. Ethernet is going to be too slow. InfiniBand is unreliable and expensive.


You’ll also need sufficient storage and fast I/O to keep them fed with data.

You also need to keep the later-generation cards from burning themselves out, because they draw so much power.

Oh also, depending on when your data centre was built, you may need its power and cooling capacity upgraded, because the new cards draw _so much_.


The cost, as given in the DeepSeek V3 paper, was expressed in terms of GPU training hours, priced at the market rental rate per hour for the ~2k GPUs they used.
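
For reference, a quick sketch of how the paper arrives at its number (figures recalled from the V3 report, roughly 2.788M H800 GPU-hours at an assumed $2/GPU-hour rental rate; treat them as approximate):

  # DeepSeek V3's headline cost = reported GPU-hours x assumed rental rate.
  # Both inputs are as recalled from the paper, approximate.
  gpu_hours = 2.788e6        # total H800 GPU-hours for the full training run
  rate_per_gpu_hour = 2.0    # assumed market rental rate, $/GPU-hour
  print(f"~${gpu_hours * rate_per_gpu_hour / 1e6:.2f}M")  # ~$5.58M, the widely quoted "$5.5M"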


Is it effectively a fine-tune?


No, it's a full model. It's just that, to put it most concisely, the figure doesn't include the actual costs.

Claude gave me a good analogy after I'd been struggling for hours: it's like only accounting for the gas bill for the grill when pricing your meals as a restaurant owner.

The thing is, that elides a lot, and you could argue it out and theoretically no one would be wrong. But $5.5 million elides so much info as to be silly.

e.g. they used 2048 H100 GPUs for 2 months. That's $72 million. And we're still not even approaching the real bill for the infrastructure. And for every success there are N more runs that failed; N = 2 would be an absurdly conservative estimate.
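
A hedged sketch of where numbers that far apart can come from, assuming the $72M means buying the 2048 cards outright at roughly $35k each, versus renting them at the paper's assumed $2/GPU-hour (both unit prices are assumptions):

  # Buying the cards outright vs. renting them for the run.
  # Per-card price and rental rate are assumptions, not quotes.
  num_gpus = 2048
  purchase_price = 35_000            # assumed $/card
  rental_rate = 2.0                  # assumed $/GPU-hour (the rate the V3 paper uses)
  hours = 2 * 30 * 24                # ~2 months

  print(f"buy:  ~${num_gpus * purchase_price / 1e6:.0f}M")       # ~$72M
  print(f"rent: ~${num_gpus * hours * rental_rate / 1e6:.1f}M")  # ~$5.9M, roughly the headline figure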

People are reading the number and thinking it says something about American AI lab efficiency; rather, it says something about how fast it is to copy when you can scaffold by training on another model's outputs. That's not a bad thing, or at least not a unique phenomenon. That's why it's hard talking about this IMHO.



