Ollama Qwen 3 FAQ
Now that Qwen 3 is out, it has become my new favorite model. Here are a few tips to get the most out of it.
Choosing the right tag
The best variant of Qwen 3 is the largest model you can run quickly on your hardware. The following sizes are currently available:
0.6b
1.7b
4b
8b
14b
30b - MoE
32b
235b - MoE
Most models are dense, meaning that Ollama will have to read the entire model for each token it generates. The 30b and 235b are sparse Mixture-of-Experts (MoE) models, with 3b and 22b active parameters respectively. This makes them surprisingly capable, even on CPU, as only the active parameters are processed during inference.
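You select a variant by appending the tag to the model name. For example, to pull and chat with the 30b MoE model (swap the tag for any size in the list above):

```
# Pull and run a specific Qwen 3 variant by tag
ollama run qwen3:30b
```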
Best tag for CPU
If you do not have a supported GPU and are on a consumer-grade CPU with DDR4, you will most likely want to stick to one of the smaller variants, up to 4b. This assumes you have at least 8 GB of free RAM.
If you have DDR5, you can try slightly larger models like 8b and 14b, or the 30b MoE variant. I would recommend 8 GB, 16 GB, and 24 GB of free RAM respectively for those models.
If you have a prosumer or server-grade CPU with at least 160 GB of 8-channel DDR5 (e.g. ThreadRipper Pro or Epyc), you can try the 235b model.
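One way to sanity-check these recommendations on your own machine is `ollama ps`, which reports how much memory a loaded model actually takes and how it is split between CPU and GPU:

```
# Load a model with a quick one-shot prompt, then inspect its footprint
ollama run qwen3:4b "Say hi"
ollama ps
```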
Best tag for GPU
Depending on the amount of free VRAM you have, you can run the best dense model that fits fully on your GPU (with room for the context). Note that increasing context length will increase the VRAM requirements.
For an 8 GB GPU, you can easily run any model up to 8b, on a 12-16 GB GPU you can run 14b, and if you have 24 GB or more of free VRAM you can run the 32b.
The sparse MoE models have slightly lower requirements. If you are relatively patient, you can run the 30b model with partial CPU offloading on any GPU with 8+ GB VRAM. I get 6-10 tokens/second on a 16 GB GPU with the context set to 16k and slow DDR4 RAM, which I personally find acceptable for the quality of the answers (but I haven't tried the 14b yet to compare).
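To try a setup like mine, you can raise the context window from inside an interactive session (a sketch; 16384 matches the 16k context mentioned above, and a larger context raises the memory requirements accordingly):

```
# Start an interactive session, then raise the context window to 16k
ollama run qwen3:30b
>>> /set parameter num_ctx 16384
```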
Thinking mode
To disable thinking mode, you can add /nothink or /no_think at the end of your prompt.
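For example, as a one-shot prompt (the question itself is just a placeholder):

```
# One-shot prompt with thinking disabled
ollama run qwen3 "What is the capital of France? /nothink"
```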
To force the model to think, you can append /think instead.
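For example (again, the prompt is illustrative):

```
# One-shot prompt that explicitly requests thinking
ollama run qwen3 "How many r's are in strawberry? /think"
```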
Disable thinking by default
If you do not wish for the model to think by default, you can add /nothink as part of the system prompt. I recommend saving it under a new name.
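A minimal sketch of such a Modelfile (the exact system-prompt wording is up to you):

```
# Modelfile
FROM qwen3
SYSTEM "/nothink"
```

Then build the new model under the name used below:

```
ollama create qwen3-nothink -f Modelfile
```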
You can now access the model via the name qwen3-nothink, and simply ask it to think when you want, or use the standard qwen3 model to have thinking enabled by default.
Limiting the thinking costs
As of this writing, Ollama does not support limiting the number of tokens Qwen will use for thinking, but the maintainers are considering it, so stay tuned!
May 3, 2025, 4:05 p.m.