
There are no extensive benchmarks, but I did do a Flappy Bird pass@3 test just to show that a 1.58bit model does in fact work well!

The goal was to showcase that MoEs quantized down to 1.58bit without any further training do in fact work!
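For anyone wondering what pass@3 means here: sample 3 completions and count the task as solved if any one of them works. A minimal sketch below - `generate` and `passes_test` are hypothetical placeholders, not Unsloth code:

    # Minimal sketch of a pass@3 check: sample 3 completions and count the
    # prompt as solved if any one of them passes. `generate` and `passes_test`
    # are hypothetical placeholders, not Unsloth code.

    def pass_at_3(prompt, generate, passes_test, k=3):
        for _ in range(k):
            completion = generate(prompt)   # e.g. "write Flappy Bird in Python"
            if passes_test(completion):     # e.g. does the game run and score?
                return True
        return False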


Right, and congrats on all of that. But I need to see the evals to know whether this actually makes sense versus alternatives.

Oh yep! :) I'm the main author of it! :) If you have any questions on it - ask away!

Oh so the trick is llama.cpp has VRAM, RAM and disk offloading! For good speeds, it's best to have (VRAM + RAM) >= 140GB, which will probably get you 5 tokens/s max. For reasonable speeds, (VRAM + RAM) >= 80GB will probably get you 2-3 tokens/s. If you have less, llama.cpp will resort to mmap / disk offloading, so you'll probably get < 1 token/s.

There is a PR where we offload non-MoE layers to disk if you don't have enough RAM / VRAM, and it should make things faster!
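To make the rule of thumb concrete, here's a rough sketch that just codifies the numbers above (ballpark figures, not a benchmark):

    # Rough sketch of the memory rule of thumb above - same ballpark numbers,
    # not a benchmark. llama.cpp spills to mmap / disk once VRAM + RAM run out.

    def expected_speed(vram_gb: float, ram_gb: float) -> str:
        total = vram_gb + ram_gb
        if total >= 140:
            return "~5 tokens/s max"
        if total >= 80:
            return "~2-3 tokens/s"
        return "< 1 token/s (mmap / disk offloading)"

    print(expected_speed(vram_gb=24, ram_gb=64))  # e.g. a 24GB GPU + 64GB RAM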


If building a new personal computer where one design goal is running these and future models (inference only) at decent speeds, what should be prioritized? Getting huge RAM (256GB?)? Fast RAM? A slow GPU with big VRAM vs a faster GPU with less VRAM? A super fast SSD?

That's actually a wonderful question! In terms of trends, I see the following:

1. GPUs are most likely at their limit in terms of FLOPs - float4 / FP4 is most likely the "final" low-precision data type. NVIDIA might provide 1.58bit support or FP2, but it's unlikely. If there were FP2, it might make things 1.5x faster.

2. Shrinking transistors might still have some room to go, but don't expect 2x or 4x speedups - the majority of past speedups came from tensor cores and low-bit representations.

3. We might get more interesting transformer archs since DeepSeek showcased their unique arch for R1 / V3.

Due to these, I would first wait and see - i.e. I would actually wait until the next OSS model releases, say Llama 4, Gemma 3, etc., and see what the large model labs are focusing on. Then maybe I would wait for an RTX 50 Super, a cheaper version, or even the RTX 60x series. Larger VRAM is always better.

SSD is good, but not that important - RAM is more important.


Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)

Oh thanks a lot! Appreciate it :) We're always open to collaborating with anyone!

Ye a custom chip would be insane! 1.5 bit with a scaling factor seems to be actually usable for MoEs with shared experts!
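For anyone curious what "1.5 bit with a scaling factor" looks like, here's a minimal sketch of absmean-style ternary quantization in the spirit of BitNet b1.58 - an illustration only, not Unsloth's actual kernels or llama.cpp's IQ1_S format:

    import numpy as np

    # Ternary (-1, 0, 1) quantization with a per-tensor scaling factor,
    # in the spirit of BitNet b1.58. Illustration only - not Unsloth's
    # kernels or llama.cpp's IQ1_S format.

    def quantize_ternary(w: np.ndarray, eps: float = 1e-8):
        scale = np.abs(w).mean() + eps            # the scaling factor
        q = np.clip(np.round(w / scale), -1, 1)   # every weight becomes -1, 0 or 1
        return q.astype(np.int8), scale

    def dequantize_ternary(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_ternary(w)
    print(q)
    print("mean abs error:", np.abs(w - dequantize_ternary(q, s)).mean())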

Yes exactly! I edited the blog post to make the wording a bit better!

Oh yes I reduced it by 4 just in case :) I found that sometimes the formula doesn't work, so in the worst case -4 was used - glad at least it ran!
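For context, here's a sketch of the layer-offload heuristic with the -4 safety margin (the numbers are illustrative; see the blog post for the exact formula):

    # Sketch of the GPU layer-offload heuristic with the -4 safety margin:
    # estimate how many layers fit in VRAM proportionally to the quant's file
    # size, then subtract 4 in case the estimate is too optimistic.

    def n_gpu_layers(vram_gb: float, file_size_gb: float, n_layers: int) -> int:
        estimate = int(vram_gb / file_size_gb * n_layers) - 4
        return max(estimate, 0)   # never go below zero

    # Illustrative numbers: 24GB VRAM, a ~131GB quant, 61 layers -> a handful of layers on GPU
    print(n_gpu_layers(vram_gb=24, file_size_gb=131, n_layers=61))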

That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components - i.e. attention is left at 4 or 6-bit, and just the MoE parts are 1.58bit (-1, 0, 1).

There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distillations - if you want to run the original R1, then the quants are currently the only way.

But I agree quants do affect performance - hence the trick for MoEs is to not quantize specific areas!
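A toy sketch of that selective ("dynamic") idea - pick a quant type per tensor instead of one type everywhere. The name matching is illustrative, not the real GGUF tensor names or quant kernels:

    # Toy sketch of dynamic quantization: pick a quant type per tensor instead
    # of one type for the whole model. The name matching is illustrative, not
    # the real GGUF tensor names or quant kernels.

    def choose_quant(tensor_name: str) -> str:
        if "exps" in tensor_name:   # MoE expert weights -> ternary (-1, 0, 1)
            return "1.58bit"
        return "4-6bit"             # attention, embeddings, etc. stay higher precision

    for name in ["blk.5.attn_q.weight", "blk.5.ffn_gate_exps.weight", "token_embd.weight"]:
        print(name, "->", choose_quant(name))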


Oh I love Open WebUI as well!! But glad to hear the 1.58bit version could be helpful to you!
