Hey HN family! Sometimes quantizing all of a model's parameters to 4-bit breaks it. I uploaded mixed quants (roughly 90% of the weights in 4-bit, with the accuracy-sensitive 10% kept in 16-bit) for vision models (Llama, Qwen, Pixtral) and for QwQ to https://huggingface.co/unsloth; they retain accuracy while staying nearly as small as full 4-bit.
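
In case it helps, here's a minimal sketch of how this kind of mixed quant can be loaded with transformers + bitsandbytes. The repo name and the skipped module names are illustrative; the uploaded quants ship their own quantization config, so a plain from_pretrained on one of those repos is enough:

    import torch
    from transformers import (
        MllamaForConditionalGeneration,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    # One of the uploaded mixed quants (name is illustrative).
    model_id = "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit"

    # Quantize most linear layers to 4-bit NF4, but keep a small set of
    # accuracy-sensitive modules (e.g. the vision tower) in 16-bit.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # Despite the name, this skip list also applies when loading in 4-bit;
        # the module names here are an assumption for Llama 3.2 Vision.
        llm_int8_skip_modules=["vision_model", "multi_modal_projector"],
    )

    model = MllamaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)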
I think basically only NVIDIA is reliably supported right now; it would be nice to have broader hardware support so models can be split across devices, the way HF Accelerate and llama.cpp already allow.
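
For reference, the Accelerate-style splitting I mean already works on the CUDA side: shard what fits onto the GPU and spill the rest to CPU RAM. A sketch (the repo name and the max_memory caps are made-up numbers):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "unsloth/QwQ-32B-Preview-unsloth-bnb-4bit",  # illustrative repo name
        device_map="auto",              # let Accelerate place the layers
        max_memory={0: "20GiB", "cpu": "60GiB"},  # cap GPU 0, spill to CPU
    )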