Why fast models can still feel slow: The agentic stack is the product, not the model

For a while, I really thought I was missing something. Okay, let me start from the very beginning. My desk has been losing the battle against hardware for years. A 3090 arrived first, then some Apple Silicon, then borrowed time on bigger boxes whenever I could arrange it. I told myself it was research. Mostly, it was the same itch every engineer gets: what if I just ran it myself?

Every week seemed to bring another benchmark showing how much faster local models had become. People were posting screenshots of 5090s pushing absurd token rates and getting impressive results from 3090s, Apple Silicon, or DGX Spark. Hardware was getting better, and the models were getting better too.

And yet every time I sat down with a local coding agent, the same question surfaced: why does this still feel slower than Codex?

The strange part was that the model itself didn’t seem slow, and once it started talking, it was fine. Sometimes it was fast, sometimes very fast. It felt slow between things: before the next tool call landed, before the next patch materialized, before anything useful happened at all.

So I stopped looking at benchmark charts and started looking at logs.

Why my lived experience didn’t match benchmarks

The fast-local-model reports weren’t misleading, which is part of why the mismatch took a while to find. Sites like LocalMaxxing have plenty of real measurements from real machines. Some RTX 5090 runs show short-prompt decode speeds in the 140–240 tok/s range. A 3090 can comfortably run useful Qwen-class coding models. Apple Silicon and systems like DGX Spark have reached the point where dismissing local inference as “too slow” no longer holds.

But eventually, I noticed that most of the charts I was looking at were measuring the same shape of workload:

small prompt -> long answer -> output tokens per second

That’s a clean decode test. It answers how fast the model can keep talking once it starts, but not what I was waiting on, because the painful gap was before decode started. A model can stream beautifully and still make an agent feel sluggish if every turn has to push a large context back through the server before the first token appears.

I was measuring the wrong thing and calling it speed.

The session keeps getting heavier

A coding agent isn’t that benchmark workload. Every turn carries a large amount of history forward: the system prompt, tool definitions, repository context, previous messages, tool outputs, patches, test failures, and whatever new instruction comes next.

By the time you’re deep into a coding session, most of turn N is actually turn N-1 repeated with a little more context added.

That makes coding agents prefix-reuse workloads. If the server can recognize the shared prefix and skip reprocessing it, the next step starts fast. If it can’t, every tool loop pays the full prefill cost again. Decode still matters; it’s just not where the time was going.
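
To make the prefix-reuse idea concrete, here is a minimal sketch of how much prompt a server has to process before the first output token, with and without reuse. The function names and token IDs are made up for illustration, and real engines typically match prefixes at the cache-block level, not token by token, so treat this as the shape of the idea and not any particular server's implementation.

```python
from typing import Sequence


def shared_prefix_len(prev: Sequence[int], curr: Sequence[int]) -> int:
    """Count the leading tokens that are identical in both prompts."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def prefill_tokens(prev: Sequence[int], curr: Sequence[int], reuse: bool) -> int:
    """Tokens that must be processed before the first output token.

    With prefix reuse, only the part after the shared prefix is new work.
    Without it, the whole prompt is processed again.
    """
    if not reuse:
        return len(curr)
    return len(curr) - shared_prefix_len(prev, curr)


# Hypothetical token IDs: turn N is turn N-1 plus a small tool result.
turn_prev = list(range(1000))
turn_curr = turn_prev + [7, 7, 7, 7, 7]

print(prefill_tokens(turn_prev, turn_curr, reuse=False))  # 1005
print(prefill_tokens(turn_prev, turn_curr, reuse=True))   # 5
```

The same sketch also shows why reuse is fragile: if anything near the top of the prompt changes between turns, the shared prefix collapses and almost the whole prompt becomes new work again.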

Structured usage records from my own sessions made this concrete. They were counters from real sessions, not a controlled replay. Across the Codex corpus on this machine, median per-turn input was around 105k tokens. Cached-input ratio sat at 99.2% median; cached tokens made up 96.5% of aggregate input. The recent Codex slice was nearly identical: ~97k median per-turn input, 99.0% median cached-input ratio.

Claude’s numbers were blunter still: 68.4B cache-read input tokens, with cache reads accounting for 97.2% of cache-inclusive input.
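
The median per-turn ratio and the aggregate ratio above are different statistics, and they can disagree. The sketch below shows one way that happens, using invented per-turn counters (the tuple layout and the numbers are assumptions, not values from my usage records): a few cold turns barely move the median but pull the aggregate down.

```python
from statistics import median

# Hypothetical (input_tokens, cached_input_tokens) pairs, one per turn.
turns = [
    (100_000, 99_500),
    (105_000, 104_200),
    (110_000, 109_300),
    (40_000, 0),  # a cold turn where nothing was served from cache
]

# Median of the per-turn ratios: each turn counts equally.
per_turn_ratio = [cached / total for total, cached in turns]
median_ratio = median(per_turn_ratio)

# Aggregate ratio: every token counts equally, so cold turns weigh in fully.
aggregate_ratio = sum(c for _, c in turns) / sum(t for t, _ in turns)

print(f"median per-turn cached ratio: {median_ratio:.3f}")
print(f"aggregate cached ratio:       {aggregate_ratio:.3f}")
```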

The model could stream tokens once it got going and still feel sluggish in practice because much of the waiting happened before the first token appeared.

Context depth shows up in the clock

The Fizeau terminal-bench logs put timing numbers on what the usage records implied. Across 7,247 OpenRouter turns, median time to first token (TTFT) scaled with input depth:

Input tokens	Median TTFT
0–10k	0.86s
10–30k	1.11s
30–60k	2.03s
60–120k	3.26s

The dominant Qwen model on OpenRouter showed the same pattern, with TTFT climbing with context while decode stayed comparatively flat:

Input tokens	Median TTFT	Median decode
0–10k	0.80s	50.2 tok/s
10–30k	1.11s	51.2 tok/s
30–60k	1.67s	45.1 tok/s
60–120k	4.15s	41.7 tok/s

This doesn’t isolate cache on versus cache off, and I don’t want to overstate what it proves. But it does illustrate the thing I was waiting on: as context deepens, the next turn takes longer to start. For an agent running dozens of steps, that startup latency is as much a part of the product as the model weights.
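
A rough back-of-the-envelope version of that point, using the shallowest and deepest rows of the OpenRouter Qwen table above. The 150-token reply length is an assumption (roughly a short tool call), and adding a median TTFT to a median decode rate is only an approximation of a real turn, so read the output as illustrative.

```python
def turn_seconds(ttft_s: float, decode_tok_s: float, output_tokens: int) -> float:
    """Estimated wall time for one turn: wait for the first token, then stream."""
    return ttft_s + output_tokens / decode_tok_s


# Medians from the OpenRouter Qwen table above.
shallow = {"ttft_s": 0.80, "decode_tok_s": 50.2}  # 0–10k input tokens
deep = {"ttft_s": 4.15, "decode_tok_s": 41.7}     # 60–120k input tokens

OUTPUT_TOKENS = 150  # assumption: a short, tool-call-sized reply

for name, row in (("0–10k", shallow), ("60–120k", deep)):
    total = turn_seconds(row["ttft_s"], row["decode_tok_s"], OUTPUT_TOKENS)
    share = row["ttft_s"] / total
    print(f"{name}: {total:.2f}s per turn, {share:.0%} of it before the first token")
```

Under those assumptions the estimate goes from about 3.8 seconds per turn to about 7.7, and the wait before the first token grows from roughly a fifth of the turn to more than half of it.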

Different stacks felt different

I ran 23 overlapping terminal-bench tasks against GPT-5.5 directly through OpenAI, logging 1,384 turns. TTFT held around 1.0–1.3 seconds even at 60k+ context. Decode was also substantially higher than the Qwen-on-OpenRouter baseline.

Comparing paths side by side:

These rows are observed serving paths from logged runs, not a controlled benchmark. Each row reports the deepest bucket reached for that path, so TTFT and decode p50 are not directly comparable down the column; rows with n=5 are directional.

Serving path	Tasks (n)	Deepest bucket	TTFT p50	Decode p50
OpenRouter Qwen3.6 27B	87 tasks	60–120k	4.15s	41.7 tok/s
OpenRouter Claude 4.6 Sonnet	5 tasks	30–60k	2.36s	1,184.2 tok/s (logged)
OpenRouter GPT-5.4 mini	5 tasks	10–30k	0.90s	160.5 tok/s
OpenAI GPT-5.5	23 tasks	60–120k	1.34s	179.5 tok/s
OMLX Qwen3.6 27B 8-bit	78 tasks	60–120k	23.34s	11.2 tok/s
llama-server Qwen3.6 GGUF	76 tasks	120k+	4.13s	9.3 tok/s
DS4	82 tasks	60–120k	271.34s (logged)	18.3 tok/s

Here, “logged” marks a value retained as a run-log observation, not treated as a clean comparable benchmark metric.

These paths didn’t feel the same to work with. vLLM, OMLX, llama-server, OpenRouter, GPT-5.5: similar agent workloads, very different latency and throughput profiles. That’s stack behavior. Model quality alone doesn’t account for it.

The benchmark was answering a different question

Once I had these numbers, the public throughput reports started making sense in a new way. A “write a story” test has the small-prompt, long-answer shape: it’s a clean measurement of decode after the model is already talking. Long-prompt single-request tests are closer, since they at least exercise prefill and TTFT, but they still miss repeated prefix reuse across turns.

The agent workload is prompt-heavy, prefix-reuse-heavy, and latency-sensitive at every step. Terminal-Bench, SWE-bench-style agent runs, and captured terminal-agent logs are shaped more like the real thing. Turns out, the standard benchmarks were just answering a different question.

What the evidence supports

The evidence points somewhere narrower than “local is bad” or “hosted is magic.”

Real coding-agent sessions are large and cache-heavy. Context depth changes both TTFT and decode behavior. Provider and runtime choices change user-visible latency in ways that are measurable. Local serving stacks can be excellent at decode throughput and still produce sluggish agent loops.

What the logs don’t yet prove is a clean cache-on/cache-off latency delta from the same replayed prompts. A cleaner test would replay identical agent sessions with prefix reuse enabled and disabled, then plot prompt tokens, reused prefix tokens, first-content latency, wall time, and decode rate turn by turn. That experiment hasn’t been run. The claims here are scoped to what the session logs actually show.
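
If that replay were run, the per-turn comparison could be organized roughly like this. Everything here is hypothetical: the record type, the field names, and the numbers are placeholders that show the bookkeeping, not results.

```python
from dataclasses import dataclass


@dataclass
class ReplayTurn:
    """One replayed turn; the fields mirror the quantities listed above."""
    turn: int
    prompt_tokens: int
    reused_prefix_tokens: int
    first_content_s: float
    wall_s: float
    decode_tok_s: float


def first_content_delta(cache_on, cache_off):
    """Per-turn (turn, seconds saved by reuse) pairs, matched by turn index."""
    off_by_turn = {t.turn: t for t in cache_off}
    return [
        (t.turn, off_by_turn[t.turn].first_content_s - t.first_content_s)
        for t in cache_on
        if t.turn in off_by_turn
    ]


# Placeholder values, not measurements.
cache_on = [
    ReplayTurn(1, 20_000, 0, 2.0, 6.0, 40.0),
    ReplayTurn(2, 24_000, 20_000, 1.0, 5.0, 40.0),
]
cache_off = [
    ReplayTurn(1, 20_000, 0, 2.0, 6.0, 40.0),
    ReplayTurn(2, 24_000, 0, 3.0, 7.0, 40.0),
]

for turn, saved in first_content_delta(cache_on, cache_off):
    print(f"turn {turn}: {saved:.1f}s saved before first content")
```

Matching by turn index only works because both runs replay the same session; with live agents the two runs would diverge after the first different model output.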

What I learned: The layer local stacks have to win on

There’s a real constraint in the open-weights world.

We can’t own the model’s reinforcement learning or download the exact Codex or Claude tool loop and fine-tune it onto local weights after the fact. That gap is genuine.

But we can own everything around the model: the inference engine, the harness contract, cache configuration, model metadata, autotuning, evals, replay tests. That’s the layer where local stacks have to compete: not by pretending decode throughput is the whole story, and not by dismissing the model quality gap. The model matters. It just isn’t the whole product.

The fast-local-model reports bothered me because they were right. They just didn’t explain what I was waiting on. For coding agents, the workload is long-context, prefix-reuse-heavy, and sensitive to the pause before each new turn starts. Systems that feel responsive may have better weights, true, but they also sit on serving stacks built around caching, prefill behavior, tool protocols, and the harness loop.

That’s the part the local stack has to match.

Notes

Local session measurements came from structured Codex and Claude usage records: token counts and cache counters, not transcript content. Fizeau measurements came from structured terminal-bench run logs: request timestamp, first streamed delta, response timestamp, input tokens, output tokens.
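
For readers who want to reproduce the bucketed tables from similar logs, here is a minimal sketch of the derivation. The field names are hypothetical stand-ins for the logged quantities listed above, the decode-rate definition (output tokens over the streaming window only) is an assumption, and so is the side on which a boundary value such as exactly 10,000 tokens lands.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

# Bucket edges follow the tables in this post.
EDGES = [10_000, 30_000, 60_000, 120_000]
LABELS = ["0–10k", "10–30k", "30–60k", "60–120k", "120k+"]


@dataclass
class TurnLog:
    """One logged turn. Field names are hypothetical; times are in seconds."""
    request_ts: float
    first_delta_ts: float
    response_ts: float
    input_tokens: int
    output_tokens: int


def ttft_s(t: TurnLog) -> float:
    """Time to first token: request sent until the first streamed delta."""
    return t.first_delta_ts - t.request_ts


def decode_tok_s(t: TurnLog) -> float:
    """Output tokens per second over the streaming window (assumed definition)."""
    window = t.response_ts - t.first_delta_ts
    return t.output_tokens / window if window > 0 else float("nan")


def bucket(input_tokens: int) -> str:
    """Map an input-token count to its bucket label."""
    return LABELS[bisect_right(EDGES, input_tokens)]


def median_ttft_by_bucket(turns: list[TurnLog]) -> dict[str, float]:
    """Group turns by input depth and take the median TTFT of each group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for t in turns:
        grouped[bucket(t.input_tokens)].append(ttft_s(t))
    return {label: median(grouped[label]) for label in LABELS if label in grouped}


# Made-up rows for illustration, not taken from the logs.
sample = [
    TurnLog(0.0, 1.0, 5.0, 8_000, 200),
    TurnLog(10.0, 12.0, 16.0, 50_000, 200),
    TurnLog(20.0, 23.0, 27.0, 55_000, 100),
]
print(median_ttft_by_bucket(sample))
print(decode_tok_s(sample[0]))
```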

External references: