
For such dynamic 2-bit quants, are there any benchmark results showing how much performance I would give up compared to the original model? Thanks.


Currently no, but I'm running them now! Some people on the Aider Discord are running benchmarks too.


@danielhanchen do you publish the benchmarks you run anywhere?


We had benchmarks for Llama 4 and Gemma 3 at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs - for others I normally refer to https://discord.com/channels/1131200896827654144/12822404236... which is the Aider Polyglot Discord - they always benchmark our quants :)


If you are running a 2-bit quant, you are not giving up performance; you are gaining 100% of it, since the alternative is usually 0%. Smaller quants are for folks who otherwise couldn't run the model at all, so you run the largest quant your hardware allows. I, for instance, often ran Q3_K_L: I don't think of it as how much performance I'm giving up, but rather that without Q3 I couldn't run the model at all. With that said, for R1 I did some tests against two public interfaces and my local Q3 crushed them. The problem with a lot of model providers is that we can never be sure what they are actually serving, and they could take shortcuts to maximize profit.
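A minimal sketch of that "largest quant that fits" rule of thumb (the bits-per-weight figures are rough llama.cpp averages and the headroom value is an assumption, not a measurement):

```python
# Approximate average bits-per-weight for common GGUF quants (varies by model).
APPROX_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.85, "Q3_K_L": 4.3, "Q2_K": 3.35}

def approx_weights_gb(params_b: float, quant: str) -> float:
    """Rough in-memory size of the weights alone, in GB."""
    return params_b * 1e9 * APPROX_BPW[quant] / 8 / 1e9

def largest_quant_that_fits(params_b: float, mem_gb: float, headroom_gb: float = 4.0):
    """Pick the highest-precision quant that still leaves room for KV cache / activations."""
    for quant in APPROX_BPW:  # ordered from highest to lowest precision
        if approx_weights_gb(params_b, quant) + headroom_gb <= mem_gb:
            return quant
    return None  # nothing fits; drop to a smaller model

print(largest_quant_that_fits(70, 48))  # 70B params into 48 GB -> "Q4_K_M"
```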


That's true only in a vacuum. For example, should I run gpt-oss-20b unquantized or gpt-oss-120b quantized? Some models have a 70B/30B spread, and that's only across a single base model; many different models at different quants could be compared for different tasks.
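To make that trade-off concrete, here is some illustrative arithmetic (the parameter counts are approximate published totals and the bits-per-weight values are rough averages, so treat the numbers as ballpark only):

```python
# Ballpark weight memory: small model at high precision vs big model at low precision.
models = {"gpt-oss-20b": 21e9, "gpt-oss-120b": 117e9}          # approx. total parameters
quants = {"~2.7-bit dynamic": 2.7, "MXFP4 (4.25 bpw)": 4.25, "Q8_0": 8.5, "BF16": 16.0}

for name, n_params in models.items():
    for quant, bpw in quants.items():
        gb = n_params * bpw / 8 / 1e9
        print(f"{name:13s} @ {quant:18s} ~{gb:5.1f} GB of weights")
```

Roughly: the 120B model at ~2.7 bpw still needs about 40 GB of weights, while the 20B model at Q8_0 needs about 22 GB, so which one you can even consider depends on your hardware before quality enters the picture.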


Definitely. As a hobbyist, I have yet to put together a good heuristic for higher-quant-lower-params vs. lower-quant-higher-params. I've mentally been drawing the line at around Q4, but now with IQ quants and other improvements in the space I'm not so sure anymore.


Yeah, I've pretty quickly thrown in the towel on trying to figure out what's 'best' for smaller-memory systems. Things are moving so quickly that whatever time I invest in that is likely to be for naught.


For GPT OSS in particular, OpenAI only released the MoE layers in MXFP4 (4-bit), so the "unquantized" version is 4-bit MoE + 16-bit attention. I uploaded "16bit" versions to https://huggingface.co/unsloth/gpt-oss-120b-GGUF, and they use 65.6GB whilst MXFP4 uses 63GB, so it's not much of a difference - same with GPT OSS 20B.
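A back-of-envelope check on why that gap is so small (this assumes the only difference between the two uploads is the non-MoE weights being stored at 16-bit rather than ~4.25-bit MXFP4, which is an assumption on my part):

```python
# Assumption: the 65.6 GB vs 63 GB gap is entirely the non-MoE (attention etc.) weights
# going from ~4.25 bits/param (MXFP4 incl. block scales) to 16 bits/param.
gap_bytes = (65.6 - 63.0) * 1e9
per_param_16bit = 16 / 8      # bytes per parameter at 16-bit
per_param_mxfp4 = 4.25 / 8    # bytes per parameter at MXFP4
non_moe_params = gap_bytes / (per_param_16bit - per_param_mxfp4)
print(f"~{non_moe_params/1e9:.1f}B non-MoE parameters")  # ~1.8B, so the MoE experts dominate
```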

llama.cpp also unfortunately cannot quantize matrices whose row length is not a multiple of 256 (GPT OSS uses 2880).
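For anyone curious, the constraint is just divisibility: llama.cpp's k-quants pack weights into 256-element super-blocks, so a tensor's row length has to be divisible by 256 (which tensors are affected depends on the model):

```python
QK_K = 256  # super-block size used by llama.cpp k-quants

for dim in (4096, 2880):
    rem = dim % QK_K
    print(dim, "quantizable" if rem == 0 else f"not quantizable (remainder {rem})")
# 4096 -> quantizable; 2880 -> not quantizable (remainder 64), so those tensors
# have to stay in a format with a smaller block size.
```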


Oh, Q3_K_L as in embed_tokens + lm_head upcast to Q8_0? I normally do Q4 embed, Q6 lm_head - would a Q8_0 be interesting?
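For a sense of scale, here is a rough estimate of what upcasting those two tensors costs (the vocab size and hidden dim are hypothetical, and the bits-per-weight are approximate block averages):

```python
# Hypothetical model: 128K vocab, 4096 hidden dim; embed_tokens and lm_head are each
# vocab x hidden. Bits-per-weight are approximate llama.cpp block averages.
vocab, hidden = 128_000, 4096
params_per_tensor = vocab * hidden

for quant, bpw in {"Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5}.items():
    gb = params_per_tensor * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.2f} GB per tensor")
# Going from Q6_K to Q8_0 on both tensors adds only ~0.25 GB here, so it's a fairly
# cheap knob if it helps quality.
```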



