
These performance numbers look absolutely incredible. The MoE outperforms o1 with 3B active parameters?

We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.



> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.

After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.


What if GPT-4.1 was just the wrong model to use?


If OpenAI's flagship model can't add a simple calendar event, that doesn't do much to assuage my disappointment...


I vividly remember the focus with GPT-4.1 being on speaking in a more humane, philosophical way, or something like that. That model is special and isn't meant as the next generation of their other models like 4o and o3.

You should try a different model for your task.


I think you're confusing GPT-4.5 with GPT-4.1. GPT-4.1 is their recommended model for non-reasoning API use.


You are right!


I'm dreaming of a time when commodity CPUs run LLMs for inference & serve at scale.


How do people typically do napkin math to figure out if their machine can “handle” a model?


Very rough (!) napkin math: for a q8 model (almost lossless), the parameter count in billions roughly equals the VRAM requirement in GB. For q4, with some performance loss, it's roughly half. Then add a little for the context window and overhead. So a 32B model at q4 should run comfortably on 20-24 GB.

Again, very rough numbers; there are calculators online.
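
To make that concrete, here's the same arithmetic as a tiny Python sketch (the 2 GB overhead figure below is just a placeholder for context/buffers, not a measured number):

    def estimate_vram_gb(params_b, bits_per_weight, overhead_gb=2.0):
        # weights: params_b * 1e9 params * (bits/8) bytes ~= params_b * bits/8 in GB
        weights_gb = params_b * bits_per_weight / 8
        # plus a rough allowance for the context window, buffers, etc.
        return weights_gb + overhead_gb

    print(estimate_vram_gb(32, 8))  # q8: ~34 GB
    print(estimate_vram_gb(32, 4))  # q4: ~18 GB, fits a 20-24 GB card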


The ultra-simplified napkin math is 1 GB (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full size model and leaves a little bit of room for context, although not necessarily the full supported size.


Wouldn’t it be 1GB (billion bytes) per billion parameters when each parameter is 1 byte (FP8)?

Seems like 4-bit quantized models would use half a byte per parameter, so roughly 0.5 GB per billion parameters, right?


Yes, it's more a rule of thumb than napkin math I suppose. The difference allows space for the KV cache which scales with both model size and context length, plus other bits and bobs like multimodal encoders which aren't always counted into the nameplate model size.
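
If you want slightly less napkin-y napkin math, you can split weights from KV cache. The layer/head numbers below are made up for illustration (a GQA config roughly in the 32B class); real models vary, and quantized KV caches shrink this further:

    def weights_gb(params_b, bits_per_weight):
        return params_b * bits_per_weight / 8

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        # K and V per layer, per token: 2 * n_kv_heads * head_dim elements
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    # hypothetical model: 64 layers, 8 KV heads, head_dim 128, fp16 cache
    print(weights_gb(32, 4.5))                 # ~18 GB of weights at ~4.5 bpw
    print(kv_cache_gb(64, 8, 128, 32_768))     # ~8.6 GB for a 32k context
    print(kv_cache_gb(64, 8, 128, 1_000_000))  # ~262 GB for a 1M-token cache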


How much memory would correspond to 100,000 tokens, or a million?


Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size: I assume it needs at least that much VRAM/RAM, and the size differences between models are usually large enough that it doesn't matter whether the real requirement is the file size plus 5%, 10%, or 15%. LM Studio also shows you which models your machine should handle.
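
Roughly what I do, in script form (the file path is made up; psutil is a third-party package):

    import os
    import psutil  # third-party: pip install psutil

    gguf_path = "models/some-model-q4_k_m.gguf"  # hypothetical download
    model_gb = os.path.getsize(gguf_path) / 1e9
    free_gb = psutil.virtual_memory().available / 1e9
    # leave ~10-15% of slack on top of the file size for cache/overhead
    print(f"model: {model_gb:.1f} GB, available RAM: {free_gb:.1f} GB")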


The absolutely dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter, you can generally run the model easily (e.g. a 3B model with 8 GB of RAM). At 1 byte per parameter it's still possible, but it starts to get tricky.

Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
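
That rule as a quick sanity check (the thresholds are the same rough ones as above, nothing rigorous):

    def can_probably_run(params_b, ram_gb):
        # GB of RAM per billion params == bytes of RAM per parameter
        bytes_per_param = ram_gb / params_b
        if bytes_per_param >= 2:
            return "should run easily"
        if bytes_per_param >= 1:
            return "possible, but it gets tricky"
        return "probably needs heavy quantization or offloading"

    print(can_probably_run(3, 8))  # ~2.7 bytes/param -> easy
    print(can_probably_run(8, 8))  # 1 byte/param -> tricky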



