Controlling Qwen’s thinking mode

I wanted a coding agent I could run on my own hardware: open a terminal, start a session, and get a responsive loop without a round-trip to someone else’s API. Qwen3.6-27B looked close. It fit the cards I had, ran fast enough, and had a thinking mode for harder problems.

Then I turned thinking on and ds4-eval-92 dropped from 72.8 to 48.9. The failure was serving control rather than model quality. Qwen’s reasoning mode needed three things the serving path had to enforce: a reliable mode switch, a reasoning budget, and a way to close reasoning before the visible answer budget disappeared.

Baseline behavior

Before I touched thinking mode, Qwen did what I wanted. In nothink, it was stable across hardware and providers:

Serving path	Mode	ds4-eval-92
lucebox, RTX 5090 Laptop, Q4_K_M	nothink	70.7
lucebox, RTX 3090 Ti, Q4_K_M	nothink	71.7
OpenRouter, opaque quant	nothink	72.8
MLX 8-bit, Mac Studio M2 Ultra	nothink	73.9

The rows that needed reasoning were the ones I cared about, so I turned thinking on and ran the same eval:

Mode	ds4-eval-92	Note
nothink	72.8	clean nothink, zero thinking tokens
think, unbudgeted	48.9	84/92 reasoned, 32 hit the length cap
think, budgeted	76.1	force-close on 44/92

That was the problem: the feature I needed made the eval worse until the serving path controlled the reasoning phase.

Mode control across serving stacks

A reliable nothink mode was the first control. When I checked response tokens, some nothink runs still contained thinking tokens. The model was reasoning after I had explicitly told it not to.

There isn’t one portable off switch. These all mean slightly different things to different stacks:

"chat_template_kwargs": {"enable_thinking": false},

"thinking": {"type": "disabled"},

"reasoning_effort": "none"

MLX honored the chat-template flag. lucebox honored its own thinking field. OpenRouter honored none of those in the path I tested. The thing that actually worked was putting /no_think directly in the prompt, which Qwen’s template recognizes natively. After that, thinking tokens dropped to zero, and the scores landed back where they belonged.

The operational rule is simple:

Don’t trust the request.
Check the response.
If a nothink run has thinking tokens in it, it wasn’t a nothink run.

Reasoning budget enforcement

An enforced reasoning budget was the second control. reasoning_effort and budget_tokens sound like controls, but they only change behavior if the serving stack enforces them. In the OpenRouter path I tested, they didn’t. Qwen kept reasoning until the response hit the token cap, and the grader often saw no parseable answer. The model had spent its output budget inside <think> and never surfaced a reply.

The per-area scores showed where the missing budget hurt:

Area	nothink	think, unbudgeted	think, budgeted
hellaswag	86	34	88
longctx	100	33	100
gsm8k	93	77	96
truthfulqa	80	51	77

hellaswag and longctx need short, committed answers, and unbounded thinking ate the budget before those answers appeared. gsm8k held up better because the reasoning was doing useful work, but it still improved once the budget was enforced. Thinking was useful once metered; unmetered thinking was the failure mode.

Metering the reasoning phase

The third control was a way to protect the visible answer. Qwen’s thinking response has two phases: reasoning inside <think> ... </think>, and the visible answer after </think>. A single max_tokens cap can’t control both independently. If the cap is loose, the model can spend it all in <think>. If the cap is tight, the model can close the thought block with no room left to answer.

What the serving stack actually needs is three numbers: a reasoning cap, a total response cap, and a reply reserve. The Qwen3.6 model-card sidecar gives you the reasoning tiers:

Tier	Reasoning budget
low	4,032
medium	16,128
high	32,256
x-high	56,832
max	81,408

The sidecar also reserves 4,096 tokens for the visible answer, and I kept underestimating that reserve. A force-close that leaves no answer budget still fails, just differently than before.

“Think less” doesn’t do much once the model is already inside the reasoning trace. The server has to count generated tokens and, when reasoning gets close to the reply reserve, pull Qwen out of <think> before the budget runs out.

A bare </think> marked the boundary for the parser but didn’t reliably move the model into answer mode. The Qwen3 technical report includes the phrase Qwen was actually trained to use when reasoning is cut short:

Considering the limited time by the user, I have to give the solution based on
the thinking directly now.
</think>

That phrase lives in the Qwen3.6 sidecar. lucebox tokenizes it at startup. When the budget hook fires, the decode loop overrides the next sampled tokens with that sequence, putting Qwen on the trained “wrap up now” path with the KV state intact and the full reasoning trace still in frame.

When I don’t control the backend, luce-bench uses a rougher fallback: watch the stream, abort when the reasoning budget is gone, and re-prompt with the same trained close. That costs an extra request, but it’s still better than letting the answer disappear into the cap.

Results and lessons

Once the budget was enforced, Qwen behaved like the model I had been trying to use from the beginning. OpenRouter went from 48.9 unbudgeted to 76.1 budgeted, and MLX 8-bit hit 83.7 with budgeted thinking, up from 73.9 nothink. The force-close fired on 44 of 92 rows on OpenRouter and 50 of 92 on MLX, and both runs recorded zero continuation failures.

Nothink still matters because it is cheap, steady, and easy to compare across providers. For harder problems, Qwen’s thinking mode needs an actual meter. Without one, “thinking” can mean “spend the answer budget in scratch space and return nothing.”

The fix was to control three layers of serving behavior:

A mode switch that did not travel across serving stacks
A budget parameter the serving path did not enforce
A token cap that could not distinguish reasoning from reply

For Qwen, reasoning mode is a resource the serving stack has to meter: when thinking started, how many tokens it has spent, how much reply space remains, and how to close the block without stranding the model mid-derivation. If the stack can’t count it, reserve for it, and close it on schedule, thinking mode is not under control. I just wanted a coding agent I could use, and that meant understanding everything between the prompt and the answer.

Notes

Benchmark numbers come from ds4-eval-92, single seed, one pinned grader (v0.2.7.dev0). Mode-switching and thinking-budget behavior comes from the Qwen3 technical report and the Qwen3.6 model-card values transcribed into lucebox.

Primary references: