> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.
After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.
I vividly remember the focus with GPT-4.1 being on it speaking in a more human way and being more philosophical. I remember something like that. That model is special and isn't meant to be the next generation of their other models like 4o and o3.
Very rough (!) napkin math: for a q8 model (almost lossless), the parameter count in billions roughly equals the VRAM requirement in GB. For q4, with some performance loss, it's roughly half that. Then you add a little for the context window and overhead. So a 32B model at q4 should run comfortably on 20-24 GB.
Again, very rough numbers; there are calculators online.
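If you want that in code, here's a minimal sketch of the same arithmetic; the GB-per-billion-parameters and overhead figures are just the rules of thumb above, not exact requirements:

```python
# Rough VRAM estimate from the napkin math above.
# All numbers are rule-of-thumb assumptions, not exact requirements.

def estimate_vram_gb(params_billion: float, quant: str = "q4",
                     overhead_gb: float = 2.0) -> float:
    """~1 GB per billion params at q8, ~0.5 GB at q4, plus some
    headroom for the context window and runtime overhead."""
    gb_per_billion = {"q8": 1.0, "q4": 0.5}[quant]
    return params_billion * gb_per_billion + overhead_gb

print(estimate_vram_gb(32, "q4"))  # 18.0 -> fits in 20-24 GB
print(estimate_vram_gb(32, "q8"))  # 34.0
```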
The ultra-simplified napkin math is 1 GB of (V)RAM per 1 billion parameters at a 4-5 bit-per-weight quantization. That usually gives you most of the performance of the full-size model and leaves a little room for context, although not necessarily the full supported context length.
Yes, it's more a rule of thumb than napkin math, I suppose. The difference leaves space for the KV cache, which scales with both model size and context length, plus other bits and bobs like multimodal encoders that aren't always counted in the nameplate model size.
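For the KV cache piece specifically, a rough sketch; the layer/head/dimension numbers below are made up but plausible for a 32B-class model with grouped-query attention, so treat them as assumptions and check the actual model config:

```python
# Rough KV-cache size: K and V tensors per layer, per token.
# Layer/head/dim values are illustrative assumptions, not a specific
# model's config; read the real config for those.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for K and V; fp16 cache -> 2 bytes per element by default
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9

# Hypothetical 32B-class model with GQA at 32k context:
print(kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128,
                  context_len=32_768))  # ~8.6 GB at fp16
```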
Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size: I figure it needs at least that much VRAM/RAM, and usually the size difference between models is large enough that it doesn't matter whether the real requirement is +5%, 10%, or 15% on top. LM Studio also shows you which models your machine should be able to handle.
The absolute dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter, you can generally run the model easily (e.g. a 3B model with 8 GB of RAM). With 1 byte per parameter it's still possible, but it starts to get tricky.
Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
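As a sketch of that check (the 2-byte and 1-byte thresholds are just the rough cutoffs above, nothing more precise):

```python
# Dumb-but-useful check: bytes of RAM per parameter.
# Thresholds are the rough cutoffs from the comment above.

def fit_check(params_billion: float, ram_gb: float) -> str:
    bytes_per_param = ram_gb / params_billion  # GB and billions roughly cancel
    if bytes_per_param >= 2:
        return "should run easily"
    if bytes_per_param >= 1:
        return "possible, but starts to get tricky"
    return "probably needs heavy quantization and/or offloading"

print(fit_check(3, 8))   # should run easily
print(fit_check(7, 8))   # possible, but starts to get tricky
```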