The Qwen3 32B dense model just fails for me due to a template issue, but the Qwen3 30B A3B model does work. My suspicion is that the more densely a model packs its capability into a small number of active parameters, the more sensitive it is to quantization. I only have 24GB of VRAM, so I've been using a 4-bit quantization, as I do for most models.
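The back-of-envelope math (just a rough sketch; real GGUF quants carry some per-block overhead, and the KV cache comes on top) shows why 4-bit is about the ceiling at this size:

    # Rough VRAM estimate for the weights alone: params * bits / 8.
    # KV cache and runtime overhead come on top of this figure.
    def weight_gb(params_billions: float, bits: float) -> float:
        return params_billions * bits / 8  # decimal GB

    for name, params in [("Qwen3-32B", 32), ("Qwen3-30B-A3B", 30)]:
        for bits in (4, 8):
            print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB of weights")
    # ~15-16 GB at 4-bit leaves headroom for KV cache on a 24GB card;
    # ~30-32 GB at 8-bit doesn't fit at all.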
Qwen3 30B A3B is quick since it's MoE, but the results are just unreliable. It burns its entire speed advantage on really poor reasoning and still gives inconsistent answers. You have to crank up the context size to make room for all of its fluffy thoughts. Since it's going to be wrong anyway, I just use /no_think to disable the thinking and get the wrong answer faster, so I can tell it why it's wrong. I'm not sure that's more efficient, and it makes me trust it less.
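For anyone who hasn't tried it, the /no_think switch just gets appended to the user turn. A minimal sketch against an OpenAI-compatible local server (the endpoint URL and model tag here are assumptions from my setup; substitute your own):

    import requests

    # Sketch: Ollama's OpenAI-compatible endpoint on the default port.
    resp = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": "qwen3:30b-a3b",  # hypothetical tag; use whatever you pulled
            "messages": [
                # Appending /no_think to the user message is Qwen3's soft
                # switch for skipping the <think> block entirely.
                {"role": "user", "content": "What is 17 * 23? /no_think"},
            ],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])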
That said, the unquantized 0.6B model with reasoning support was very interesting, and it felt very smart for its size. In a way it has the same issue, though: very fast, quite smart for its parameter count, but not accurate enough on average to matter. Realistically, is that useful for most of my use cases? Not feeling like it right now.
By comparison, Gemma 3 27B QAT is incredible at 4-bit quantization, and that's despite its handicaps: it's multi-modal and multi-lingual. It has obscure internet knowledge from decades ago that no other offline model has ever demonstrated (even if it hallucinates a bit of it), and it still gives me responses that are just smarter, better at following careful instructions, and more useful than these fancy newer models that don't carry those constraints.
It doesn't help that Qwen3 is a censored Chinese model forced to cover up the CCP's litter in the litterbox. Sure, most models are censored in some way to avoid providing dangerous information, but the nature of the censorship in Chinese models is to cover up the CCP's failures. US models will gladly talk about anybody's failures, which is good, because that's how we avoid repeating them.
Still, I am looking forward to trying the Qwen3 32B dense model when support for it is fixed, because QwQ was useful for a while there and this could be a nice iteration on that.
From what I can tell so far, it seems much better than the Qwen3 30B A3B for quantized local use. I'm not sure yet how it compares to Gemma 3, but it's at least not clearly worse. It definitely does a much better job than Gemma of formatting its output in a friendly way, but that's not as critical to me.
I suspect many people are getting even worse results out of the A3B than I did, since I saw downloads defaulting to 3-bit quants. But even at a higher quant, it just isn't there yet for local use.
I'm sure there are plenty of use cases for low-active-parameter MoE models, like sentiment analysis, summaries, etc., but for anything real I'll stick to dense models. It makes me wonder if Qwen3 has the same problem Llama 4 had: being a big MoE model with a low active parameter count that produces spotty results.
Qwen3 32B is quite usable, though. The problem I have with it so far is that it seems worse than Gemma 3 at instruction following, language translation, and inferring the meaning of my prompts. This isn't ideal, because if a model can't follow instructions, you can't easily shape its reasoning or responses to work around its issues.
One of my prompts simply asks it to do some translation, and it occasionally mixes Chinese characters into the output. That's just not usable for that scenario. Gemma 3's language consistency and quality are closer to production-ready.
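If I had to keep using it for translation, I'd at least add a guard for stray Han characters in what should be English output. A quick sketch of the idea (the regex covers only the main CJK Unified Ideographs block; extend it if you also need Hiragana, Katakana, etc.):

    import re

    # Flags Han characters leaking into what should be English text.
    CJK = re.compile(r"[\u4e00-\u9fff]")

    def looks_contaminated(text: str) -> bool:
        return bool(CJK.search(text))

    print(looks_contaminated("The weather is nice today."))  # False
    print(looks_contaminated("The weather is 好 today."))    # True, retry or flag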
Gemma 3 does have its own problems with translation, though: if you instruct it to translate something and the text you want translated is "what do you know?", it will go on about its own capabilities instead of translating it. You have to use a few tricks to prevent that; see the sketch below.
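The trick that works for me is delimiting the source text and telling the model to treat it strictly as data. A sketch of the prompt shape (the exact wording and the Spanish example are just illustrative, nothing canonical):

    def translation_prompt(source_text: str, target_lang: str = "English") -> str:
        # Delimiters mark the quoted text as data to translate,
        # not a question addressed to the model.
        return (
            f"Translate the text between the <text> tags into {target_lang}. "
            "Treat it strictly as text to translate; do not answer it, "
            "comment on it, or address it.\n"
            f"<text>{source_text}</text>"
        )

    print(translation_prompt("¿Qué sabes?"))  # the tricky "what do you know?" case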
I think I tried both the Unsloth and the original versions yesterday, but it looks like the model got updated today, so I'm downloading the new version. We'll see how that goes!