Ollama Qwen 3 FAQ

Now that Qwen 3 is out, it has become my new favorite model. Here are a few tips to get the most out of it.

Choosing the right tag

The best variant of Qwen 3 is the largest model you can run quickly on your hardware. The following sizes are currently available: 0.6b, 1.7b, 4b, 8b, 14b, and 32b (dense), plus 30b and 235b (Mixture-of-Experts).

Most variants are dense, meaning that Ollama has to read the entire model for each token it generates. The 30b and 235b variants are sparse Mixture-of-Experts (MoE) models, with 3b and 22b active parameters respectively. This makes them surprisingly capable even on CPU, since only the active parameters are processed during inference.

Best tag for CPU

If you do not have a supported GPU and are on a consumer-grade CPU with DDR4, you will most likely want to stick to one of the smaller variants, up to 4b. This assumes you have at least 8 GB free RAM.

If you have DDR5, you can try slightly larger models like 8b and 14b, or the 30b MoE variant. I would recommend 8 GB, 16 GB, and 24 GB free RAM respectively for those models.

If you have a prosumer or server-grade CPU with at least 160 GB of 8-channel DDR5 (e.g. ThreadRipper Pro or Epyc), you can try the 235b model.
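Whichever tier you land in, selecting a size is just a matter of the tag. A sketch of the commands, assuming the tag names follow the sizes listed above:

```shell
# Pick the tag that matches your hardware tier.
ollama run qwen3:4b     # consumer CPU with DDR4, 8+ GB free RAM
ollama run qwen3:30b    # DDR5 machine with ~24 GB free RAM (sparse MoE)
ollama run qwen3:235b   # 8-channel DDR5 server with 160+ GB of RAM
```

Ollama downloads the model on first run, so expect the initial invocation to take a while for the larger tags.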

Best tag for GPU

Pick the largest dense model that fits fully in your free VRAM, with room to spare for the context. Note that increasing the context length increases the VRAM requirements.

For an 8 GB GPU, you can easily run any model up to 8b; on a 12-16 GB GPU you can run the 14b; and with 24 GB or more of free VRAM you can run the 32b.
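Since context length drives VRAM use, it is worth setting it explicitly rather than relying on the default. A minimal sketch using Ollama's interactive `/set parameter` command (the 16k value and the saved name are just examples):

```shell
$ ollama run qwen3:14b
>>> /set parameter num_ctx 16384
>>> /save qwen3-14b-16k
>>> /bye
```

Saving under a new name keeps the larger context as the default for that model without affecting the base tag.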

The sparse MoE models have slightly lower requirements. If you are relatively patient, you can run the 30b model with partial CPU-offloading on any GPU with 8+ GB VRAM. I get 6-10 tokens/second on a 16 GB GPU with context set to 16k, and slow DDR4 RAM, which I personally find acceptable for the quality of the answers (but I haven't tried the 14b yet to compare).
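Ollama decides the GPU/CPU split automatically, but if the automatic offload leaves too little room for your context, you can cap the number of layers sent to the GPU yourself. The layer count below is a placeholder you would tune for your card:

```shell
$ ollama run qwen3:30b
>>> /set parameter num_gpu 24
```

Lower values free up VRAM for the context at the cost of more layers running on the CPU.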

Thinking mode

To disable thinking mode, you can add /nothink or /no_think at the end of your prompt.

>>> Hello! /nothink

To force the model to think, you can append /think instead.

>>> Hello! /nothink
<think></think>
Hi, how can I assist you today?
>>> What is two plus two? /think
<think>Okay, the user is asking "What is two plus two?" That's a straightforward math question...</think>

Two plus two equals four. This is a basic arithmetic calculation where adding two units to another two units results in a total of four units.
>>>

Disable thinking by default

If you do not wish for the model to think by default, you can add /nothink as part of the system prompt. I recommend saving it under a new name.

$ ollama run qwen3
>>> /set system You are a friendly AI assistant. /nothink
>>> /save qwen3-nothink
>>> /bye

You can now run the model under the name qwen3-nothink and append /think only when you want reasoning, or use the standard qwen3 model to keep thinking enabled by default.
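For example, a quick session with the saved variant: the first prompt answers directly, and the second opts back into reasoning for a single turn.

```shell
$ ollama run qwen3-nothink
>>> Hello!
>>> What is two plus two? /think
```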

Limiting the thinking costs

As of this writing, Ollama does not support limiting the number of tokens Qwen spends on thinking, but the maintainers are considering it, so stay tuned!

May 3, 2025, 4:05 p.m.
