
There are no extensive benchmarks, but I did do a Flappy Bird pass@3 test just to show that a 1.58bit model does in fact work well!

The goal was to showcase that MoEs quantized down to 1.58bit without any further training do in fact work!
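For anyone wondering what pass@3 means here: sample 3 completions and count the task as solved if any one of them works. A minimal sketch below - `generate` and `passes_test` are hypothetical placeholders, not Unsloth code:

    # Minimal sketch of a pass@3 check: sample 3 completions and count the
    # prompt as solved if any one of them passes. `generate` and `passes_test`
    # are hypothetical placeholders, not Unsloth code.

    def pass_at_3(prompt, generate, passes_test, k=3):
        for _ in range(k):
            completion = generate(prompt)   # e.g. "write Flappy Bird in Python"
            if passes_test(completion):     # e.g. does the game run and score?
                return True
        return False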


Right, and congrats on all of that. But I need to see the evals to know whether this actually makes sense versus alternatives.

Oh yep! :) I'm the main author of it! :) If you have any questions on it - ask away!

Oh so the trick is llama.cpp has VRAM, RAM and disk offloading! For good speeds, it's best to have (VRAM + RAM) >= 140GB, which will probably get you 5 tokens/s max. For reasonable speeds, (VRAM + RAM) >= 80GB will probably get you 2-3 tokens/s. If you have less, llama.cpp will resort to mmap / disk offloading, so you'll probably get < 1 token/s.

There is a PR where we offload non-MoE layers to disk if you don't have enough RAM / VRAM, and it should make things faster!
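To make the rule of thumb concrete, here's a rough sketch that just codifies the numbers above (ballpark figures, not a benchmark):

    # Rough sketch of the memory rule of thumb above - same ballpark numbers,
    # not a benchmark. llama.cpp spills to mmap / disk once VRAM + RAM run out.

    def expected_speed(vram_gb: float, ram_gb: float) -> str:
        total = vram_gb + ram_gb
        if total >= 140:
            return "~5 tokens/s max"
        if total >= 80:
            return "~2-3 tokens/s"
        return "< 1 token/s (mmap / disk offloading)"

    print(expected_speed(vram_gb=24, ram_gb=64))  # e.g. a 24GB GPU + 64GB RAM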


If building a new personal computer where one design goal is running these and future models (inference only) at decent speeds, what should be prioritized? Getting huge RAM (256GB?)? Fast RAM? A slow GPU with big VRAM vs a faster GPU with less VRAM? A super fast SSD?

That's actually a wonderful question! In terms of trends, I see the following:

1. GPUs are most likely at their limit in terms of FLOPs - float4 / FP4 is most likely the "final" low-precision data type. NVIDIA might provide 1.58bit support or FP2, but it's unlikely. If there were FP2, it might make things 1.5x faster.

2. Shrinking transistors might still have some room to go, but don't expect 2x or 4x speedups - the majority of past speedups came from tensor cores and low-bit representations.

3. We might get more interesting transformer archs since DeepSeek showcased their unique arch for R1 / V3.

Due to these, I would first wait and see - i.e. I would actually wait until the next OSS model releases, say Llama 4, Gemma 3, etc., and see what the large model labs are focusing on. Then maybe I would wait for an RTX 50 Super, a cheaper version, or even the RTX 60x series. Larger VRAM is always better.

SSD is good, but not that important - RAM is more important.


Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)

Oh thanks a lot! Appreciate it :) We're always open to collaborating with anyone!

Ye a custom chip would be insane! 1.5 bit with a scaling factor seems to be actually usable for MoEs with shared experts!
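For anyone curious what "1.5 bit with a scaling factor" looks like, here's a minimal sketch of absmean-style ternary quantization in the spirit of BitNet b1.58 - an illustration only, not Unsloth's actual kernels or llama.cpp's IQ1_S format:

    import numpy as np

    # Ternary (-1, 0, 1) quantization with a per-tensor scaling factor,
    # in the spirit of BitNet b1.58. Illustration only - not Unsloth's
    # kernels or llama.cpp's IQ1_S format.

    def quantize_ternary(w: np.ndarray, eps: float = 1e-8):
        scale = np.abs(w).mean() + eps            # the scaling factor
        q = np.clip(np.round(w / scale), -1, 1)   # every weight becomes -1, 0 or 1
        return q.astype(np.int8), scale

    def dequantize_ternary(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_ternary(w)
    print(q)
    print("mean abs error:", np.abs(w - dequantize_ternary(q, s)).mean())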

Yes exactly! I edited the blog post to make the wording a bit better!

Oh yes I reduced it by 4 just in case :) I found that sometimes the formula doesn't work, so in the worst case -4 was used - glad at least it ran!
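For context, here's a sketch of the layer-offload heuristic with the -4 safety margin (the numbers are illustrative; see the blog post for the exact formula):

    # Sketch of the GPU layer-offload heuristic with the -4 safety margin:
    # estimate how many layers fit in VRAM proportionally to the quant's file
    # size, then subtract 4 in case the estimate is too optimistic.

    def n_gpu_layers(vram_gb: float, file_size_gb: float, n_layers: int) -> int:
        estimate = int(vram_gb / file_size_gb * n_layers) - 4
        return max(estimate, 0)   # never go below zero

    # Illustrative numbers: 24GB VRAM, a ~131GB quant, 61 layers -> a handful of layers on GPU
    print(n_gpu_layers(vram_gb=24, file_size_gb=131, n_layers=61))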

That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components - i.e. attention is left at 4 or 6-bit, and just the MoE parts are 1.58bit (-1, 0, 1).

There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distillations - if you want to run the original R1, then the quants are currently the only way.

But I agree quants do affect performance - hence the trick for MoEs is to not quantize specific areas!
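A toy sketch of that selective ("dynamic") idea - pick a quant type per tensor instead of one type everywhere. The name matching is illustrative, not the real GGUF tensor names or quant kernels:

    # Toy sketch of dynamic quantization: pick a quant type per tensor instead
    # of one type for the whole model. The name matching is illustrative, not
    # the real GGUF tensor names or quant kernels.

    def choose_quant(tensor_name: str) -> str:
        if "exps" in tensor_name:   # MoE expert weights -> ternary (-1, 0, 1)
            return "1.58bit"
        return "4-6bit"             # attention, embeddings, etc. stay higher precision

    for name in ["blk.5.attn_q.weight", "blk.5.ffn_gate_exps.weight", "token_embd.weight"]:
        print(name, "->", choose_quant(name))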


Oh I love Open WebUI as well!! But glad to hear the 1.58bit version could be helpful to you!
