Rule of thumb is parameter_count * precision (bytes per parameter, i.e. bits / 8). Precision is usually one of 32, 16, 8, or 4 bits. 32-bit is sometimes used in training (less so these days) and rarely in inference. For a while now "full" precision has meant 16-bit (fp16, bf16); fp8 is 8-bit, int4 is 4-bit, and so on. Anything below "full" precision is also known as quantised: fp8 is a quantised version of the "full" model.
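
A minimal sketch of that napkin math (pure arithmetic, nothing model-specific):

    def model_weight_gb(params_billion, bits):
        # weight memory = parameter_count * bits / 8 bytes; using 1 GB = 1e9 bytes
        return params_billion * 1e9 * bits / 8 / 1e9

    # For a 7B model:
    #   model_weight_gb(7, 16) -> 14.0   (fp16/bf16 "full" precision)
    #   model_weight_gb(7, 8)  ->  7.0   (fp8)
    #   model_weight_gb(7, 4)  ->  3.5   (int4 / AWQ / q4_k_m)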

So quick napkin math gives you the VRAM needed just to load the model: a 7B model is ~14GB at full precision, ~7GB in fp8, and ~3.5GB in 4-bit (AWQ, int4, q4_k_m, etc). But that's only loading the weights. You also need available VRAM to run inference, and there is a lot to consider there: you need to run a forward pass over the required context, you usually keep a KV cache to speed up decoding, you might serve multiple sessions in parallel, and so on.
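
The KV cache is usually the biggest of those extra costs. A rough estimate for a standard transformer layout (the layer/head numbers below are just illustrative, roughly Llama-2-7B-shaped):

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
        # 2x for keys and values, cached for every layer, head and token
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

    # 7B-class model (32 layers, 32 KV heads, head_dim 128), fp16 cache:
    #   kv_cache_gb(32, 32, 128, 4096)  -> ~2.1 GB at 4k context
    #   kv_cache_gb(32, 32, 128, 16384) -> ~8.6 GB at 16k context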

Context length is especially important to take into account with VLMs because images take a lot of tokens. So what you can do with a 7B LLM at full precision on a 16GB GPU might not be possible with a 7B VLM, because the context of your query might not fit into the remaining ~2GB.
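
A rough worked example (the ~0.5MB-per-token figure assumes the fp16 KV cache sizing above; per-image token counts vary a lot between VLMs, so the numbers are only indicative):

    def remaining_context_tokens(vram_gb, weight_gb, kv_bytes_per_token=512 * 1024):
        # leftover VRAM divided by per-token KV cache cost (~0.5 MB/token for a
        # 7B-class model with an fp16 cache), ignoring activation overhead
        return int((vram_gb - weight_gb) * 1e9 / kv_bytes_per_token)

    # 16 GB card, 7B model in fp16 (~14 GB of weights):
    #   remaining_context_tokens(16, 14) -> ~3800 tokens
    # If each image costs several hundred to a few thousand tokens,
    # a multi-image prompt may simply not fit.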


