
For such dynamic 2-bit quants, are there any benchmark results showing how much performance I would give up compared to the original model? Thanks.


Currently no, but I'm running them now! Some people on the Aider Discord are running benchmarks too.


@danielhanchen do you publish the benchmarks you run anywhere?


We had benchmarks for Llama 4 and Gemma 3 at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs - for others I normally refer to https://discord.com/channels/1131200896827654144/12822404236... which is the Aider Polyglot Discord - they always benchmark our quants :)


If you are running a 2-bit quant, you are not giving up performance; you are gaining 100% of it, since the alternative is usually 0%. Smaller quants are for folks who otherwise couldn't run the model at all, so you run the largest quant your hardware allows. I, for instance, often ran Q3_K_L: I don't think of it as how much performance I'm giving up, but rather that without Q3 I couldn't run the model at all. With that said, for R1 I did some tests against two public interfaces and my local Q3 crushed them. The problem with a lot of model providers is that we can never be sure what they are actually serving, and they could take shortcuts to maximize profit.
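A minimal sketch of that "largest quant that fits" rule of thumb (the bits-per-weight figures are rough llama.cpp averages and the headroom value is an assumption, not a measurement):

```python
# Approximate average bits-per-weight for common GGUF quants (varies by model).
APPROX_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.85, "Q3_K_L": 4.3, "Q2_K": 3.35}

def approx_weights_gb(params_b: float, quant: str) -> float:
    """Rough in-memory size of the weights alone, in GB."""
    return params_b * 1e9 * APPROX_BPW[quant] / 8 / 1e9

def largest_quant_that_fits(params_b: float, mem_gb: float, headroom_gb: float = 4.0):
    """Pick the highest-precision quant that still leaves room for KV cache / activations."""
    for quant in APPROX_BPW:  # ordered from highest to lowest precision
        if approx_weights_gb(params_b, quant) + headroom_gb <= mem_gb:
            return quant
    return None  # nothing fits; drop to a smaller model

print(largest_quant_that_fits(70, 48))  # 70B params into 48 GB -> "Q4_K_M"
```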


That's true only in a vacuum. For example, should I run gpt-oss-20b unquantized or gpt-oss-120b quantized? Some models have a 70B/30B spread, and that's only across a single base model; many different models at different quants could be compared for different tasks.
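To make that trade-off concrete, here is some illustrative arithmetic (the parameter counts are approximate published totals and the bits-per-weight values are rough averages, so treat the numbers as ballpark only):

```python
# Ballpark weight memory: small model at high precision vs big model at low precision.
models = {"gpt-oss-20b": 21e9, "gpt-oss-120b": 117e9}          # approx. total parameters
quants = {"~2.7-bit dynamic": 2.7, "MXFP4 (4.25 bpw)": 4.25, "Q8_0": 8.5, "BF16": 16.0}

for name, n_params in models.items():
    for quant, bpw in quants.items():
        gb = n_params * bpw / 8 / 1e9
        print(f"{name:13s} @ {quant:18s} ~{gb:5.1f} GB of weights")
```

Roughly: the 120B model at ~2.7 bpw still needs about 40 GB of weights, while the 20B model at Q8_0 needs about 22 GB, so which one you can even consider depends on your hardware before quality enters the picture.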


Definitely. As a hobbyist, I have yet to put together a good heuristic for higher-quant-lower-params vs. lower-quant-higher-params. I've mentally been drawing the line at around Q4, but now with IQ quants and other improvements in the space I'm not so sure anymore.


Yeah, I've pretty quickly thrown in the towel on trying to figure out what's 'best' for smaller-memory systems. Things are moving so quickly that whatever time I invest in that is likely to be for naught.


For GPT OSS in particular, OpenAI only released the MoE layers in MXFP4 (4-bit), so the "unquantized" version is 4-bit MoE + 16-bit attention. I uploaded "16bit" versions to https://huggingface.co/unsloth/gpt-oss-120b-GGUF, and they use 65.6GB whilst MXFP4 uses 63GB, so it's not much of a difference - same with GPT OSS 20B.
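A back-of-envelope check on why that gap is so small (this assumes the only difference between the two uploads is the non-MoE weights being stored at 16-bit rather than ~4.25-bit MXFP4, which is an assumption on my part):

```python
# Assumption: the 65.6 GB vs 63 GB gap is entirely the non-MoE (attention etc.) weights
# going from ~4.25 bits/param (MXFP4 incl. block scales) to 16 bits/param.
gap_bytes = (65.6 - 63.0) * 1e9
per_param_16bit = 16 / 8      # bytes per parameter at 16-bit
per_param_mxfp4 = 4.25 / 8    # bytes per parameter at MXFP4
non_moe_params = gap_bytes / (per_param_16bit - per_param_mxfp4)
print(f"~{non_moe_params/1e9:.1f}B non-MoE parameters")  # ~1.8B, so the MoE experts dominate
```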

llama.cpp also unfortunately cannot quantize matrices whose row length is not a multiple of 256 (GPT OSS uses 2880).
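For anyone curious, the constraint is just divisibility: llama.cpp's k-quants pack weights into 256-element super-blocks, so a tensor's row length has to be divisible by 256 (which tensors are affected depends on the model):

```python
QK_K = 256  # super-block size used by llama.cpp k-quants

for dim in (4096, 2880):
    rem = dim % QK_K
    print(dim, "quantizable" if rem == 0 else f"not quantizable (remainder {rem})")
# 4096 -> quantizable; 2880 -> not quantizable (remainder 64), so those tensors
# have to stay in a format with a smaller block size.
```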


Oh, Q3_K_L as in embed_tokens + lm_head upcast to Q8_0? I normally do Q4 embed, Q6 lm_head - would a Q8_0 be interesting?
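For a sense of scale, here is a rough estimate of what upcasting those two tensors costs (the vocab size and hidden dim are hypothetical, and the bits-per-weight are approximate block averages):

```python
# Hypothetical model: 128K vocab, 4096 hidden dim; embed_tokens and lm_head are each
# vocab x hidden. Bits-per-weight are approximate llama.cpp block averages.
vocab, hidden = 128_000, 4096
params_per_tensor = vocab * hidden

for quant, bpw in {"Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5}.items():
    gb = params_per_tensor * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.2f} GB per tensor")
# Going from Q6_K to Q8_0 on both tensors adds only ~0.25 GB here, so it's a fairly
# cheap knob if it helps quality.
```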



