I wanted a coding agent I could run on my own hardware: open a terminal, start a session, and get a responsive loop without a round-trip to someone else’s API. Qwen3.6-27B looked close. It fit the cards I had, ran fast enough, and had a thinking mode for harder problems.
Then I turned thinking on and ds4-eval-92 dropped from 72.8 to 48.9. The failure was serving control rather than model quality. Qwen’s reasoning mode needed three things the serving path had to enforce: a reliable mode switch, a reasoning budget, and a way to close reasoning before the visible answer budget disappeared.
Baseline behavior
Before I touched thinking mode, Qwen did what I wanted. In nothink, it was stable across hardware and providers:
| Serving path | Mode | ds4-eval-92 |
|---|---|---|
| lucebox, RTX 5090 Laptop, Q4_K_M | nothink | 70.7 |
| lucebox, RTX 3090 Ti, Q4_K_M | nothink | 71.7 |
| OpenRouter, opaque quant | nothink | 72.8 |
| MLX 8-bit, Mac Studio M2 Ultra | nothink | 73.9 |
The rows that needed reasoning were the ones I cared about, so I turned thinking on and ran the same eval:
| Mode | ds4-eval-92 | Note |
|---|---|---|
| nothink | 72.8 | clean nothink, zero thinking tokens |
| think, unbudgeted | 48.9 | 84/92 reasoned, 32 hit the length cap |
| think, budgeted | 76.1 | force-close on 44/92 |
That was the problem: the feature I needed made the eval worse until the serving path controlled the reasoning phase.
Mode control across serving stacks
A reliable nothink mode was the first control. When I checked response tokens, some nothink runs still contained thinking tokens. The model was reasoning after I had explicitly told it not to.
There isn’t one portable off switch. These all mean slightly different things to different stacks:
"chat_template_kwargs": {"enable_thinking": false},
"thinking": {"type": "disabled"},
"reasoning_effort": "none"
MLX honored the chat-template flag. lucebox honored its own thinking field. OpenRouter honored none of those in the path I tested. The thing that actually worked was putting /no_think directly in the prompt, which Qwen’s template recognizes natively. After that, thinking tokens dropped to zero, and the scores landed back where they belonged.
The operational rule is simple:
- Don’t trust the request.
- Check the response.
- If a nothink run has thinking tokens in it, it wasn’t a nothink run.
Reasoning budget enforcement
An enforced reasoning budget was the second control. reasoning_effort and budget_tokens sound like controls, but they only change behavior if the serving stack enforces them. In the OpenRouter path I tested, they didn’t. Qwen kept reasoning until the response hit the token cap, and the grader often saw no parseable answer. The model had spent its output budget inside <think> and never surfaced a reply.
The per-area scores showed where the missing budget hurt:
| Area | nothink | think, unbudgeted | think, budgeted |
|---|---|---|---|
| hellaswag | 86 | 34 | 88 |
| longctx | 100 | 33 | 100 |
| gsm8k | 93 | 77 | 96 |
| truthfulqa | 80 | 51 | 77 |
hellaswag and longctx need short, committed answers, and unbounded thinking ate the budget before those answers appeared. gsm8k held up better because the reasoning was doing useful work, but it still improved once the budget was enforced. Thinking was useful once metered; unmetered thinking was the failure mode.
Metering the reasoning phase
The third control was a way to protect the visible answer. Qwen’s thinking response has two phases: reasoning inside <think> ... </think>, and the visible answer after </think>. A single max_tokens cap can’t control both independently. If the cap is loose, the model can spend it all in <think>. If the cap is tight, the model can close the thought block with no room left to answer.
What the serving stack actually needs is three numbers: a reasoning cap, a total response cap, and a reply reserve. The Qwen3.6 model-card sidecar gives you the reasoning tiers:
| Tier | Reasoning budget |
|---|---|
| low | 4,032 |
| medium | 16,128 |
| high | 32,256 |
| x-high | 56,832 |
| max | 81,408 |
The sidecar also reserves 4,096 tokens for the visible answer, and I kept underestimating that reserve. A force-close that leaves no answer budget still fails, just differently than before.
“Think less” doesn’t do much once the model is already inside the reasoning trace. The server has to count generated tokens and, when reasoning gets close to the reply reserve, pull Qwen out of <think> before the budget runs out.
A bare </think> marked the boundary for the parser but didn’t reliably move the model into answer mode. The Qwen3 technical report includes the phrase Qwen was actually trained to use when reasoning is cut short:
Considering the limited time by the user, I have to give the solution based on
the thinking directly now.
</think>
That phrase lives in the Qwen3.6 sidecar. lucebox tokenizes it at startup. When the budget hook fires, the decode loop overrides the next sampled tokens with that sequence, putting Qwen on the trained “wrap up now” path with the KV state intact and the full reasoning trace still in frame.
When I don’t control the backend, luce-bench uses a rougher fallback: watch the stream, abort when the reasoning budget is gone, and re-prompt with the same trained close. That costs an extra request, but it’s still better than letting the answer disappear into the cap.
Results and lessons
Once the budget was enforced, Qwen behaved like the model I had been trying to use from the beginning. OpenRouter went from 48.9 unbudgeted to 76.1 budgeted, and MLX 8-bit hit 83.7 with budgeted thinking, up from 73.9 nothink. The force-close fired on 44 of 92 rows on OpenRouter and 50 of 92 on MLX, and both runs recorded zero continuation failures.
Nothink still matters because it is cheap, steady, and easy to compare across providers. For harder problems, Qwen’s thinking mode needs an actual meter. Without one, “thinking” can mean “spend the answer budget in scratch space and return nothing.”
The fix was to control three layers of serving behavior:
- A mode switch that did not travel across serving stacks
- A budget parameter the serving path did not enforce
- A token cap that could not distinguish reasoning from reply
For Qwen, reasoning mode is a resource the serving stack has to meter: when thinking started, how many tokens it has spent, how much reply space remains, and how to close the block without stranding the model mid-derivation. If the stack can’t count it, reserve for it, and close it on schedule, thinking mode is not under control. I just wanted a coding agent I could use, and that meant understanding everything between the prompt and the answer.
Notes
Benchmark numbers come from ds4-eval-92, single seed, one pinned grader (v0.2.7.dev0). Mode-switching and thinking-budget behavior comes from the Qwen3 technical report and the Qwen3.6 model-card values transcribed into lucebox.
Primary references: