
The ultra-simplified napkin math is 1 GB of (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full-size model and leaves a little room for context, although not necessarily the full supported context length.
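To spell that arithmetic out (a quick illustrative sketch; the bits-per-weight value and model sizes are just examples):

    def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
        # Memory for the weights alone, ignoring KV cache and other overhead.
        return params_billion * 1e9 * (bits_per_weight / 8) / 1e9  # decimal GB

    print(weight_memory_gb(8, 4.5))    # ~4.5 GB of weights for an 8B model
    print(weight_memory_gb(70, 4.5))   # ~39.4 GB of weights for a 70B model

At ~4.5 bits per weight the weights take roughly 0.56 GB per billion parameters, so the "1 GB per billion" rule leaves around 0.44 GB per billion for everything else.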


Wouldn’t it be 1GB (billion bytes) per billion parameters when each parameter is 1 byte (FP8)?

Seems like 4-bit quantized models would use half a GB per billion parameters, because each parameter is half a byte, right?


Yes, it's more a rule of thumb than napkin math, I suppose. The difference allows space for the KV cache, which scales with both model size and context length, plus other bits and bobs like multimodal encoders, which aren't always counted in the nameplate model size.
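For a rough sense of how it grows: every layer stores a key and a value vector per KV head per token, so the cache is linear in context length. A minimal sketch (the FP16 element size is an assumption; many runtimes quantize the cache):

    def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                                 head_dim: int, bytes_per_elem: int = 2) -> int:
        # Keys + values, stored for every layer and every KV head, per token.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem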


How much memory would 100,000 tokens or a million tokens of context correspond to?
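It depends entirely on the model's layer count and KV-head layout, but as a rough illustration (assuming a hypothetical Llama-3-8B-like config: 32 layers, 8 KV heads, head dim 128, FP16 cache, no cache quantization):

    bytes_per_token = 2 * 32 * 8 * 128 * 2              # K+V * layers * KV heads * head_dim * 2 bytes, ~128 KiB/token
    print(f"{bytes_per_token * 100_000 / 1e9:.1f} GB")   # ~13.1 GB at 100k tokens
    print(f"{bytes_per_token * 1_000_000 / 1e9:.0f} GB")  # ~131 GB at 1M tokens

Quantizing the cache to 8 or 4 bits cuts those numbers proportionally, which is one reason long-context serving setups usually do it.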



