
These performance numbers look absolutely incredible. The MoE outperforms o1 with 3B active parameters?

We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.



> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.


What if GPT-4.1 was just the wrong model to use?


If OpenAI's flagship model can't add a simple calendar event, that doesn't do much to assuage my disappointment...


I vividly remember the focus with GPT-4.1 being on speaking in a more humane, philosophical way, or something like that. That model is special and isn't meant as the next generation of their other models like 4o and o3.

You should try a different model for your task.


I think you're confusing GPT-4.5 with GPT-4.1. GPT-4.1 is their recommended model for non-reasoning API use.


You are right!


I'm dreaming of a time when commodity CPUs run LLMs for inference & serve at scale.


How do people typically do napkin math to figure out if their machine can “handle” a model?


Very rough (!) napkin math: for a q8 model (almost lossless), the parameter count in billions roughly equals the VRAM requirement in GB. For q4, with some performance loss, it's roughly half. Then add a little for the context window and overhead. So a 32B model at q4 should run comfortably on 20-24 GB.

Again, very rough numbers; there are calculators online.
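
To make that concrete, here's the same arithmetic as a tiny Python sketch (the 2 GB overhead figure below is just a placeholder for context/buffers, not a measured number):

    def estimate_vram_gb(params_b, bits_per_weight, overhead_gb=2.0):
        # weights: params_b * 1e9 params * (bits/8) bytes ~= params_b * bits/8 in GB
        weights_gb = params_b * bits_per_weight / 8
        # plus a rough allowance for the context window, buffers, etc.
        return weights_gb + overhead_gb

    print(estimate_vram_gb(32, 8))  # q8: ~34 GB
    print(estimate_vram_gb(32, 4))  # q4: ~18 GB, fits a 20-24 GB card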


The ultra-simplified napkin math is 1 GB (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full size model and leaves a little bit of room for context, although not necessarily the full supported size.


Wouldn’t it be 1GB (billion bytes) per billion parameters when each parameter is 1 byte (FP8)?

Seems like 4-bit quantized models would use half a byte per parameter, so roughly 0.5 GB per billion parameters, right?


Yes, it's more a rule of thumb than napkin math I suppose. The difference allows space for the KV cache which scales with both model size and context length, plus other bits and bobs like multimodal encoders which aren't always counted into the nameplate model size.
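
If you want slightly less napkin-y napkin math, you can split weights from KV cache. The layer/head numbers below are made up for illustration (a GQA config roughly in the 32B class); real models vary, and quantized KV caches shrink this further:

    def weights_gb(params_b, bits_per_weight):
        return params_b * bits_per_weight / 8

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        # K and V per layer, per token: 2 * n_kv_heads * head_dim elements
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    # hypothetical model: 64 layers, 8 KV heads, head_dim 128, fp16 cache
    print(weights_gb(32, 4.5))                 # ~18 GB of weights at ~4.5 bpw
    print(kv_cache_gb(64, 8, 128, 32_768))     # ~8.6 GB for a 32k context
    print(kv_cache_gb(64, 8, 128, 1_000_000))  # ~262 GB for a 1M-token cache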


How much memory would correspond to 100,000 tokens, or a million?


Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size: I assume it needs at least that much VRAM/RAM, and the size differences between models are usually large enough that it doesn't matter whether the real requirement is the file size plus 5%, 10%, or 15%. LM Studio also shows you which models your machine should handle.
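
Roughly what I do, in script form (the file path is made up; psutil is a third-party package):

    import os
    import psutil  # third-party: pip install psutil

    gguf_path = "models/some-model-q4_k_m.gguf"  # hypothetical download
    model_gb = os.path.getsize(gguf_path) / 1e9
    free_gb = psutil.virtual_memory().available / 1e9
    # leave ~10-15% of slack on top of the file size for cache/overhead
    print(f"model: {model_gb:.1f} GB, available RAM: {free_gb:.1f} GB")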


The absolutely dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter, you can generally run the model easily (e.g. a 3B model with 8 GB of RAM). At 1 byte per parameter it's still possible, but it starts to get tricky.

Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
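
That rule as a quick sanity check (the thresholds are the same rough ones as above, nothing rigorous):

    def can_probably_run(params_b, ram_gb):
        # GB of RAM per billion params == bytes of RAM per parameter
        bytes_per_param = ram_gb / params_b
        if bytes_per_param >= 2:
            return "should run easily"
        if bytes_per_param >= 1:
            return "possible, but it gets tricky"
        return "probably needs heavy quantization or offloading"

    print(can_probably_run(3, 8))  # ~2.7 bytes/param -> easy
    print(can_probably_run(8, 8))  # 1 byte/param -> tricky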



