Hey HN family! Sometimes quantizing all of a model's parameters to 4-bit breaks it. I uploaded mixed quants (roughly 90% of the weights in 4-bit, with the accuracy-sensitive 10% kept in 16-bit) for vision models (Llama, Qwen, Pixtral) and for QwQ to https://huggingface.co/unsloth; they retain accuracy while staying nearly as small as full 4-bit.
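
In case it helps, here's a minimal sketch of how this kind of mixed quant can be loaded with transformers + bitsandbytes. The repo name and the skipped module names are illustrative; the uploaded quants ship their own quantization config, so a plain from_pretrained on one of those repos is enough:

    import torch
    from transformers import (
        MllamaForConditionalGeneration,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    # One of the uploaded mixed quants (name is illustrative).
    model_id = "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit"

    # Quantize most linear layers to 4-bit NF4, but keep a small set of
    # accuracy-sensitive modules (e.g. the vision tower) in 16-bit.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # Despite the name, this skip list also applies when loading in 4-bit;
        # the module names here are an assumption for Llama 3.2 Vision.
        llm_int8_skip_modules=["vision_model", "multi_modal_projector"],
    )

    model = MllamaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)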
I think basically only NVIDIA is reliably supported right now; it would be nice to have broader hardware support so models can be split across devices, the way HF Accelerate and llama.cpp already allow.
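
For reference, the Accelerate-style splitting I mean already works on the CUDA side: shard what fits onto the GPU and spill the rest to CPU RAM. A sketch (the repo name and the max_memory caps are made-up numbers):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "unsloth/QwQ-32B-Preview-unsloth-bnb-4bit",  # illustrative repo name
        device_map="auto",              # let Accelerate place the layers
        max_memory={0: "20GiB", "cpu": "60GiB"},  # cap GPU 0, spill to CPU
    )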