
How much memory do you need to run fp8 Llama 3 70B? Can it potentially fit on 1 H100 GPU with 96GB of RAM?

In other words, if you wanted to run 8 separate 70B models on your cluster, each fitting on 1 GPU, how much larger could your overall token output be compared to parallelizing 1 model across 8 GPUs and having things slowed down a bit by NVLink?
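A rough way to frame the comparison (every number below is a made-up placeholder, not a benchmark; plug in your own measurements):

    # Back-of-envelope throughput comparison.
    # All numbers are hypothetical placeholders, not benchmarks.

    PER_GPU_TOKENS_PER_SEC = 500   # assumed decode throughput of one 70B replica on one GPU
    NUM_GPUS = 8
    TP_EFFICIENCY = 0.85           # assumed scaling efficiency of tensor parallelism over NVLink

    # Option A: 8 independent replicas, one per GPU (pure data parallelism).
    replicas_total = NUM_GPUS * PER_GPU_TOKENS_PER_SEC

    # Option B: one model sharded across all 8 GPUs (tensor parallelism).
    # Per-token latency drops, but communication overhead eats some aggregate throughput.
    tp_total = NUM_GPUS * PER_GPU_TOKENS_PER_SEC * TP_EFFICIENCY

    print(f"8 replicas:        ~{replicas_total} tok/s aggregate")
    print(f"1 model on 8 GPUs: ~{tp_total:.0f} tok/s aggregate, lower per-request latency")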



It’s been a minute, so my memory might be off, but I think when I ran 70B at fp16 it just barely fit on a 2x A100 80GB cluster and then quickly OOMed as the context/KV cache grew.

So if I had to guess, a 96GB H100 could probably run it at fp8 as long as you don’t need a big context window. If you’re doing speculative decoding it probably won’t fit, because you also need weights and a KV cache for the draft model.
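For a rough sanity check, here's the back-of-envelope math, assuming the published Llama 3 70B architecture (80 layers, 8 KV heads, head dim 128) and an fp16 KV cache. Treat it as an estimate, not an exact footprint:

    # Rough memory estimate for Llama 3 70B: fp8 weights + fp16 KV cache.
    # Architecture numbers are from the published Llama 3 70B config;
    # runtime overheads (activations, CUDA context, fragmentation) are not included.

    params         = 70.6e9   # parameter count
    weight_bytes   = 1        # fp8
    n_layers       = 80
    n_kv_heads     = 8        # grouped-query attention
    head_dim       = 128
    kv_bytes       = 2        # fp16 cache
    context_tokens = 8192

    weights_gb = params * weight_bytes / 1e9
    # K and V, per layer, per token:
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache_gb = kv_per_token * context_tokens / 1e9

    print(f"weights:  ~{weights_gb:.0f} GB")                            # ~71 GB
    print(f"KV cache: ~{kv_cache_gb:.1f} GB at {context_tokens} ctx")   # ~2.7 GB
    print(f"total:    ~{weights_gb + kv_cache_gb:.0f} GB + runtime overhead")

So weights plus an 8k-token cache land around the mid-70s of GB, which leaves headroom on a 96GB card but not much room for a draft model or a much longer context.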


It should work, I believe. And anything that doesn't fit you can leave on your system RAM.
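If you do end up a few GB short, llama.cpp lets you offload only part of the model to the GPU and keep the rest in system RAM. A minimal sketch using the llama-cpp-python bindings (the model path and layer count are just placeholders):

    # Partial GPU offload with llama-cpp-python: layers that don't fit in VRAM
    # stay in system RAM and run on the CPU. Path and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3-70b-instruct-q8_0.gguf",  # hypothetical local GGUF file
        n_gpu_layers=70,   # offload as many of the 80 layers as fit; -1 = all
        n_ctx=8192,        # context window; the KV cache grows with this
    )

    out = llm("Explain grouped-query attention in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])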

Looks like an H100 runs about $30K online. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?


> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

Cooling might be a challenge. The H100 is passively cooled; its heatsink is designed to rely on the case fans, so you need fairly high airflow through a part that can't move any air itself.

On a server this isn't too big a problem: fans at one end push air through the chassis and the GPUs block the exit at the other end. In a desktop, though, you'd probably need to get creative with cardboard or 3D-printed shrouds to force enough air through it.



