Three things I had to fix before I could control Qwen’s thinking mode

June 2026 · by Erik

I just wanted a coding agent I could use, so this was not a research project or a benchmark exercise. I wanted to open a terminal, start a session, and have something that felt good to work with. Locally, on my own hardware, without a round-trip to someone else’s API. Qwen3.6-27B seemed like the answer. It ran fast enough, it fit on the cards I had, and it had a thinking mode for the harder problems. I set it up, ran a few tasks, and it felt almost right.

But when I turned thinking on, the scores dropped off a cliff. I had to investigate where exactly it was going wrong.

The baseline was fine

Before I touched thinking mode, Qwen was doing what I wanted. In nothink, it was boring in the useful way: stable across hardware, stable across providers, numbers I could reason about.

Serving path	Mode	ds4-eval-92
lucebox, RTX 5090 Laptop, Q4_K_M	nothink	70.7
lucebox, RTX 3090 Ti, Q4_K_M	nothink	71.7
OpenRouter, opaque quant	nothink	72.8
MLX 8-bit, Mac Studio M2 Ultra	nothink	73.9

The tasks I actually cared about needed reasoning, so I turned thinking on and ran the same eval on the OpenRouter path. The comparison was hard to look at.

Mode	ds4-eval-92	Note
nothink	72.8	clean nothink, zero thinking tokens
think, unbudgeted	48.9	84/92 reasoned, 32 hit the length cap
think, budgeted	76.1	force-close on 44/92

72.8 to 48.9: the feature I needed was making the model worse. I spent a while assuming I had misconfigured something obvious, but when I dug into the logs, I found three separate serving problems that had been hiding behind each other.

The model was still thinking when I told it not to

To get a clean comparison, I needed a reliable nothink baseline I could pin and trust. So I started there, and when I checked the response tokens, I found the first problem: some nothink runs had thinking tokens in them. The model was reasoning when I had explicitly told it not to.

It turned out there isn’t one portable off switch. Each of these asks for no thinking, and each means something slightly different to different stacks:

"chat_template_kwargs": {"enable_thinking": false}

"thinking": {"type": "disabled"}

"reasoning_effort": "none"

MLX honored the chat-template flag. lucebox honored its own thinking field. OpenRouter honored none of those in the path I tested. The thing that actually worked was putting /no_think directly in the prompt, which Qwen’s template recognizes natively. After that, thinking tokens dropped to zero, and the scores landed back where they belonged.
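As a minimal sketch of that prompt-level switch, assuming an OpenAI-style `messages` list with string contents: the helper name `with_no_think` is made up for illustration and is not part of any library. It appends `/no_think` to the last user turn and leaves the caller's list untouched.

```python
from copy import deepcopy

NO_THINK = "/no_think"  # soft switch placed directly in the prompt text


def with_no_think(messages):
    """Return a copy of `messages` with /no_think appended to the last user turn.

    Hypothetical helper: assumes dicts with "role" and a string "content".
    """
    out = deepcopy(messages)
    for msg in reversed(out):
        if msg.get("role") == "user":
            # Avoid stacking the switch if the prompt already carries it.
            if NO_THINK not in msg["content"]:
                msg["content"] = msg["content"].rstrip() + " " + NO_THINK
            return out
    raise ValueError("no user message to attach /no_think to")
```

Because the switch travels inside the prompt, it does not depend on which request fields a given provider chooses to forward.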

The lesson was simple enough:

Don’t trust the request.
Check the response.
If a nothink run has thinking tokens in it, it wasn’t a nothink run.
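Checking the response can be sketched roughly as follows. This is an illustration under assumptions, not the check luce-bench runs: it counts non-whitespace characters as a stand-in for tokens (a real check would use the tokenizer), and it assumes the trace arrives either inline between `<think>` tags or in a separate reasoning field whose name varies by provider.

```python
import re

# Matches a closed block, or an unclosed one that runs to the end of the text.
THINK_BLOCK = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def thinking_chars(content, reasoning=None):
    """Count non-whitespace reasoning characters in one response.

    `content` is the visible text; `reasoning` is an optional separate
    trace field (assumption: some stacks return one).
    """
    inline = "".join(m.group(1) for m in THINK_BLOCK.finditer(content or ""))
    return len("".join((inline + (reasoning or "")).split()))


def is_clean_nothink(content, reasoning=None):
    # An empty <think></think> pair is fine; any actual trace is not.
    return thinking_chars(content, reasoning) == 0
```

Running `is_clean_nothink` over every row before scoring is enough to flag a run that was labeled nothink but was not.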

The token cap was doing the wrong job

With a clean baseline in hand, I turned thinking back on and looked harder at what was happening. reasoning_effort and budget_tokens sound like controls, but they’re only controls if the serving stack actually enforces them. In the OpenRouter path I tested, it didn’t. Qwen kept reasoning until the response hit the token cap, so the grader was often seeing no parseable answer at all rather than a bad answer: the model had spent its entire output budget inside <think> and never surfaced a reply.
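To separate "wrong answer" from "no answer" in the logs, a small classifier along these lines can help. It is a sketch under assumptions: the trace is inline in the content with an explicit opening `<think>` tag, and `finish_reason` is "length" when the response hit the cap (the OpenAI-style convention). The label names are invented here.

```python
def classify(content, finish_reason):
    """Label one response so grader failures can be told apart.

    Hypothetical labels; assumes inline <think> tags in `content`.
    """
    opened = "<think>" in content
    closed = "</think>" in content
    if not opened:
        return "answered" if content.strip() else "empty"
    if not closed:
        # Still inside the trace when generation stopped.
        return "stranded_in_think" if finish_reason == "length" else "unclosed_think"
    # Everything after the last </think> is the visible reply.
    visible = content.rsplit("</think>", 1)[-1].strip()
    return "answered" if visible else "closed_no_answer"
```

Counting `stranded_in_think` rows gives the kind of figure reported in the table above as rows that hit the length cap.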

The per-area scores showed exactly where this was hurting:

Area	nothink	think, unbudgeted	think, budgeted
hellaswag	86	34	88
longctx	100	33	100
gsm8k	93	77	96
truthfulqa	80	51	77

hellaswag and longctx need short, committed answers, and unbounded thinking ate the budget before those answers appeared. gsm8k held up better because the reasoning was doing useful work, but it still improved once the budget was enforced. This was clearly a case of “unbounded thinking is a bad serving policy,” rather than “thinking hurts Qwen.”

One cap can’t control both phases

The third problem was why budget enforcement was hard in the first place. Qwen’s thinking response has two phases: reasoning inside <think> ... </think>, and the visible answer after </think>. A single max_tokens cap can’t control both independently; if the cap is loose, the model can spend it all in <think>; if the cap is tight, the model can close the thought block with no room left to answer.

What the serving stack actually needs is three numbers: a reasoning cap, a total response cap, and a reply reserve. The Qwen3.6 model-card sidecar gives you the reasoning tiers:

Tier	Reasoning budget
low	4,032
medium	16,128
high	32,256
x-high	56,832
max	81,408

The sidecar also reserves 4,096 tokens for the visible answer, and I kept underestimating that reserve. A force-close that leaves no answer budget still fails, just differently than before.
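The arithmetic behind the three numbers can be sketched as below. The tier values and the 4,096 reserve are the ones listed above; the helper `plan_for`, the `BudgetPlan` type, and the choice to compute the total cap as reasoning plus close phrase plus reserve are assumptions of this sketch, not something taken from the sidecar.

```python
from dataclasses import dataclass

# Reasoning tiers and reply reserve as listed in this post.
REASONING_BUDGET = {
    "low": 4_032,
    "medium": 16_128,
    "high": 32_256,
    "x-high": 56_832,
    "max": 81_408,
}
REPLY_RESERVE = 4_096


@dataclass(frozen=True)
class BudgetPlan:
    reasoning_cap: int  # tokens allowed inside <think>
    reply_reserve: int  # tokens held back for the visible answer
    total_cap: int      # what would go in max_tokens


def plan_for(tier, close_len=0):
    """Hypothetical helper: derive the three numbers for one tier.

    `close_len` is the token length of any forced-close phrase, which is
    spent between the reasoning and the reply (assumption of this sketch).
    """
    reasoning = REASONING_BUDGET[tier]
    total = reasoning + close_len + REPLY_RESERVE
    return BudgetPlan(reasoning, REPLY_RESERVE, total)
```

For the low tier this gives 4,032 + 4,096 = 8,128 total tokens. Setting max_tokens to 4,032 alone would reproduce the failure described above: the thought block closes with nothing left for the answer.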

Forcing the close

“Think less” doesn’t do much once the model is already inside the reasoning trace. The server has to count generated tokens and, when reasoning reaches its cap and only the reply reserve is left, pull Qwen out of <think> before the budget runs out.

A bare </think> marked the boundary for the parser but didn’t reliably move the model into answer mode. The Qwen3 technical report includes the phrase Qwen was actually trained to use when reasoning is cut short:

Considering the limited time by the user, I have to give the solution based on the thinking directly now. </think>

That phrase lives in the Qwen3.6 sidecar. lucebox tokenizes it at startup. When the budget hook fires, the decode loop overrides the next sampled tokens with that sequence, putting Qwen on the trained “wrap up now” path with the KV state intact and the full reasoning trace still in frame.
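The override can be illustrated with a toy decode loop. This is not lucebox's code: `sample_next` stands in for one model step, the token ids are arbitrary, and the loop assumes generation starts inside `<think>` and that the pre-tokenized close sequence ends with the `</think>` token.

```python
def decode_with_budget(sample_next, close_ids, think_end_id, eos_id,
                       reasoning_cap, total_cap):
    """Toy decode loop: force `close_ids` once the reasoning cap is spent.

    All names are illustrative. Returns (tokens, force_closed).
    """
    assert close_ids and close_ids[-1] == think_end_id
    tokens, forced = [], []
    in_think = True        # assumption: generation starts inside <think>
    reasoning_used = 0
    force_closed = False
    while len(tokens) < total_cap:
        # While forced tokens are queued, they override the sampler.
        tok = forced.pop(0) if forced else sample_next(tokens)
        tokens.append(tok)
        if tok == eos_id:
            break
        if in_think:
            if tok == think_end_id:
                in_think = False   # block closed, by the model or the override
            else:
                reasoning_used += 1
                if reasoning_used >= reasoning_cap and not force_closed:
                    forced = list(close_ids)
                    force_closed = True
    return tokens, force_closed
```

Because the forced tokens are appended to the same sequence, the model continues from its own trace plus the close phrase, which is the property the paragraph above describes as keeping the KV state intact.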

When I don’t control the backend, luce-bench uses a rougher fallback: watch the stream, abort when the reasoning budget is gone, and re-prompt with the same trained close. That costs an extra request, but it’s still better than letting the answer disappear into the cap.
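A rough sketch of that fallback, under stated assumptions: `stream_chunks` yields text pieces from the first request, `continue_from(prefix)` is a hypothetical callable that issues a second request continuing an assistant turn from `prefix`, and `count_tokens` is whatever tokenizer count is available. None of these are luce-bench's real interfaces, and the whitespace around the close phrase is a guess.

```python
# The close phrase as quoted above; exact whitespace is an assumption.
TRAINED_CLOSE = ("Considering the limited time by the user, I have to give the "
                 "solution based on the thinking directly now. </think>")


def budgeted_reply(stream_chunks, continue_from, reasoning_cap, count_tokens):
    """Client-side fallback: abort the stream, then re-prompt with the close.

    Returns (full_text, force_closed). All callables are hypothetical.
    """
    text = ""
    closed = False
    for chunk in stream_chunks:
        text += chunk
        closed = closed or "</think>" in text
        if not closed and count_tokens(text) >= reasoning_cap:
            break  # stop reading; a real client would also cancel the request
    else:
        return text, False  # the stream finished without hitting the budget
    prefix = text + "\n\n" + TRAINED_CLOSE
    return prefix + continue_from(prefix), True
```

The second request is where the extra cost comes from: the provider has to re-read the whole trace as prompt before it can produce the answer.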

What it looked like when it worked

Once the budget was enforced, Qwen looked like the model I had been trying to use from the beginning. OpenRouter went from 48.9 unbudgeted to 76.1 budgeted, and MLX 8-bit hit 83.7 with budgeted thinking, up from 73.9 nothink. The force-close fired on 44 of 92 rows on OpenRouter and 50 of 92 on MLX, and both runs recorded zero continuation failures.

Nothink is still useful because it is cheap, steady, and easy to compare across providers. But for the harder problems, Qwen’s thinking mode needs an actual meter, because without one, “thinking” just means “spend the answer budget in scratch space and return nothing.”

What I learned

The failure was in three layers of serving behavior I hadn’t fully controlled:

A mode switch that wasn’t portable across providers
A budget parameter that wasn’t being enforced
A token cap that couldn’t distinguish reasoning from reply

Once I had those three things under control, I had the model I’d been trying to use all along.

For Qwen, reasoning mode is a resource the serving stack has to meter: when thinking started, how many tokens it has spent, how much reply space remains, and how to close the block without stranding the model mid-derivation. If the stack can’t count it, reserve for it, and close it on schedule, then thinking mode isn’t really under control. I just wanted a coding agent I could use, and it turned out that meant understanding everything between the prompt and the answer.

Notes

Benchmark numbers come from ds4-eval-92, single seed, one pinned grader (v0.2.7.dev0). Mode-switching and thinking-budget behavior comes from the Qwen3 technical report and the Qwen3.6 model-card values transcribed into lucebox.

Primary references: