
How much memory do you need to run fp8 Llama 3 70B? Can it potentially fit on 1 H100 GPU with 96GB of RAM?

In other words, if you wanted to run 8 separate 70B models on your cluster, each fitting on 1 GPU, how much larger could your overall token output be compared to parallelizing 1 model across 8 GPUs and having things slowed down a bit by NVLink?
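A rough way to frame the comparison (every number below is a made-up placeholder, not a benchmark; plug in your own measurements):

    # Back-of-envelope throughput comparison.
    # All numbers are hypothetical placeholders, not benchmarks.

    PER_GPU_TOKENS_PER_SEC = 500   # assumed decode throughput of one 70B replica on one GPU
    NUM_GPUS = 8
    TP_EFFICIENCY = 0.85           # assumed scaling efficiency of tensor parallelism over NVLink

    # Option A: 8 independent replicas, one per GPU (pure data parallelism).
    replicas_total = NUM_GPUS * PER_GPU_TOKENS_PER_SEC

    # Option B: one model sharded across all 8 GPUs (tensor parallelism).
    # Per-token latency drops, but communication overhead eats some aggregate throughput.
    tp_total = NUM_GPUS * PER_GPU_TOKENS_PER_SEC * TP_EFFICIENCY

    print(f"8 replicas:        ~{replicas_total} tok/s aggregate")
    print(f"1 model on 8 GPUs: ~{tp_total:.0f} tok/s aggregate, lower per-request latency")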



It’s been a minute, so my memory might be off, but I think when I ran 70B at fp16 it just barely fit on a 2x A100 80GB cluster and then quickly OOMed as the context/KV cache grew.

So if I had to guess, a 96GB H100 could probably run it at fp8 as long as you don’t need a big context window. If you’re doing speculative decoding it probably won’t fit, because you also need weights and a KV cache for the draft model.
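For a rough sanity check, here's the back-of-envelope math, assuming the published Llama 3 70B architecture (80 layers, 8 KV heads, head dim 128) and an fp16 KV cache. Treat it as an estimate, not an exact footprint:

    # Rough memory estimate for Llama 3 70B: fp8 weights + fp16 KV cache.
    # Architecture numbers are from the published Llama 3 70B config;
    # runtime overheads (activations, CUDA context, fragmentation) are not included.

    params         = 70.6e9   # parameter count
    weight_bytes   = 1        # fp8
    n_layers       = 80
    n_kv_heads     = 8        # grouped-query attention
    head_dim       = 128
    kv_bytes       = 2        # fp16 cache
    context_tokens = 8192

    weights_gb = params * weight_bytes / 1e9
    # K and V, per layer, per token:
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache_gb = kv_per_token * context_tokens / 1e9

    print(f"weights:  ~{weights_gb:.0f} GB")                            # ~71 GB
    print(f"KV cache: ~{kv_cache_gb:.1f} GB at {context_tokens} ctx")   # ~2.7 GB
    print(f"total:    ~{weights_gb + kv_cache_gb:.0f} GB + runtime overhead")

So weights plus an 8k-token cache land around the mid-70s of GB, which leaves headroom on a 96GB card but not much room for a draft model or a much longer context.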


It should work, I believe. And anything that doesn't fit you can leave on your system RAM.
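If you do end up a few GB short, llama.cpp lets you offload only part of the model to the GPU and keep the rest in system RAM. A minimal sketch using the llama-cpp-python bindings (the model path and layer count are just placeholders):

    # Partial GPU offload with llama-cpp-python: layers that don't fit in VRAM
    # stay in system RAM and run on the CPU. Path and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3-70b-instruct-q8_0.gguf",  # hypothetical local GGUF file
        n_gpu_layers=70,   # offload as many of the 80 layers as fit; -1 = all
        n_ctx=8192,        # context window; the KV cache grows with this
    )

    out = llm("Explain grouped-query attention in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])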

Looks like an H100 runs about $30K online. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?


> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

Cooling might be a challenge. The H100 is passively cooled; its heatsink is designed to rely on the case fans, so you need fairly high airflow through a part that can't move any air itself.

On a server this isn't too big a problem: fans at one end push air through the chassis and the GPUs block the exit at the other end. In a desktop, though, you'd probably need to get creative with cardboard or 3D-printed shrouds to force enough air through it.



