Why fast models can still feel slow

My desk has been losing the hardware battle for years: a 3090, then Apple Silicon, then borrowed time on bigger boxes whenever I could arrange it. I told myself it was research. Mostly, it was the same itch every engineer gets: what if I just ran it myself?

The local-model benchmarks said the idea should work. People were posting 5090 decode rates, useful 3090 runs, Apple Silicon results, and DGX Spark numbers. Hardware was getting better, and the models were getting better too.

And yet every time I sat down with a local coding agent, the same question surfaced: why does this still feel slower than Codex?

Once the model started talking, it did not seem slow. The delay lived between actions: before the next tool call landed, before the next patch materialized, before anything useful happened at all.

That mismatch is the thesis: decode speed is a bad proxy for agent performance. A coding agent feels fast only when the whole loop moves quickly: prefill, cache reuse, tool calls, routing, the serving runtime, and model generation.

Challenges in measuring LLM performance

The fast-local-model reports were measuring something useful, but not the whole agent workload. Sites like LocalMaxxing publish measurements from real machines. Some RTX 5090 runs show short-prompt decode speeds in the 140–240 tok/s range. A 3090 can run useful Qwen-class coding models. Apple Silicon and systems like DGX Spark have reached the point where dismissing local inference as “too slow” no longer holds.

Most of the charts I was looking at measured the same workload shape:

small prompt -> long answer -> output tokens per second

As a decode test, that is clean: how fast can the model keep talking once it starts? That scope misses the wait before decode, where an agent often spends its time. A model can stream quickly and still make an agent feel sluggish if every turn has to push a large context back through the server before the first token appears.

I was measuring the wrong thing and calling it speed.
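
A back-of-the-envelope model makes the gap concrete: the wall time of one turn is roughly the time to first token plus output tokens divided by decode rate. The sketch below is only an illustration. The function name and every number in it are hypothetical, not measurements from any of the runs discussed here.

```python
def turn_seconds(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    """Approximate wall time for one turn: wait for the first token, then stream."""
    if decode_tok_s <= 0:
        raise ValueError("decode rate must be positive")
    return ttft_s + output_tokens / decode_tok_s


# Hypothetical numbers, chosen only to show the shape of the problem.
# A short tool call (about 80 output tokens) with fast decode but a slow start:
fast_decode_slow_start = turn_seconds(ttft_s=5.0, output_tokens=80, decode_tok_s=200.0)
# The same tool call with a quick start and a decode rate four times slower:
slow_decode_fast_start = turn_seconds(ttft_s=1.0, output_tokens=80, decode_tok_s=50.0)

print(f"{fast_decode_slow_start:.2f}s vs {slow_decode_fast_start:.2f}s")
```

When outputs are short, as tool calls usually are, the turn with a quarter of the decode rate still finishes first because the startup wait dominates.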

Anatomy of an agentic loop

A coding agent is not a single prompt followed by a long answer. Every turn carries history forward: the system prompt, tool definitions, repository context, previous messages, tool outputs, patches, test failures, and whatever new instruction comes next.

By the time you’re deep into a coding session, most of turn N is actually turn N-1 repeated with a little more context added.

That makes coding agents prefix-reuse workloads. If the server can recognize the shared prefix and skip reprocessing it, the next step starts fast. If it can’t, every tool loop pays the full prefill cost again. Decode still matters; it’s just not where the time was going.
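
The effect of prefix reuse on prefill work can be sketched with toy token lists. This is a simplified model, assuming exact token-by-token prefix matching. Real serving stacks typically cache in blocks and have their own eviction rules, and the function names here are hypothetical.

```python
from typing import Sequence


def shared_prefix_len(prev: Sequence[int], curr: Sequence[int]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def prefill_work(prev: Sequence[int], curr: Sequence[int], cache_enabled: bool) -> int:
    """Tokens that must be prefilled for `curr`, given `prev` was just served."""
    reused = shared_prefix_len(prev, curr) if cache_enabled else 0
    return len(curr) - reused


# Toy token IDs: turn N is turn N-1 with a short tool result appended.
turn_prev = list(range(1000))
turn_curr = turn_prev + [7, 7, 7, 7, 7]

with_cache = prefill_work(turn_prev, turn_curr, cache_enabled=True)
without_cache = prefill_work(turn_prev, turn_curr, cache_enabled=False)
print(with_cache, without_cache)
```

In this toy case the cached path prefills only the five new tokens, while the uncached path reprocesses all 1,005. The same logic shows why a change near the start of the prompt is costly: everything after the first differing token falls outside the shared prefix.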

Structured usage records from my own sessions showed the scale. These are counters from real sessions, not a controlled replay. Across the Codex corpus on this machine, median per-turn input was around 105k tokens. The median cached-input ratio was 99.2%, and cached tokens made up 96.5% of aggregate input. The recent Codex slice was nearly identical: ~97k median per-turn input and a 99.0% median cached-input ratio.

Claude’s numbers were blunter still: 68.4B cache-read input tokens, with cache reads accounting for 97.2% of cache-inclusive input.
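
The median and aggregate cached-input figures measure different things, which is one plausible reason they can diverge: a few cold turns barely move the median but pull the aggregate down. The sketch below shows both calculations on made-up records. The field names and values are assumptions for illustration, not the actual log schema or data.

```python
from statistics import median

# Hypothetical per-turn usage records; field names and values are assumptions.
turns = [
    {"input_tokens": 100_000, "cached_tokens": 99_500},
    {"input_tokens": 105_000, "cached_tokens": 104_200},
    {"input_tokens": 110_000, "cached_tokens": 109_000},
    {"input_tokens": 90_000, "cached_tokens": 0},  # cold start: nothing cached
]

# Median of per-turn ratios: each turn counts once, regardless of size.
per_turn_ratio = [t["cached_tokens"] / t["input_tokens"] for t in turns]
median_ratio = median(per_turn_ratio)

# Aggregate ratio: total cached tokens over total input tokens.
aggregate_ratio = (
    sum(t["cached_tokens"] for t in turns) / sum(t["input_tokens"] for t in turns)
)

print(f"median per-turn: {median_ratio:.1%}, aggregate: {aggregate_ratio:.1%}")
```

Reporting both is useful: the median describes a typical turn, while the aggregate reflects how much total prefill the cache actually absorbed.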

The model could stream tokens once it got going and still feel sluggish because much of the waiting happened before the first token appeared.

Prefill, cache, and time to first token

Prefill is the missing term in most decode-speed comparisons. The Fizeau TerminalBench logs put timing numbers on what the usage records implied. Across 7,247 OpenRouter turns, median time to first token (TTFT) scaled with input depth:

Input tokens	Median TTFT
0–10k	0.86s
10–30k	1.11s
30–60k	2.03s
60–120k	3.26s

The dominant Qwen model on OpenRouter showed the same pattern. TTFT climbed with context while decode stayed comparatively flat:

Input tokens	Median TTFT	Median decode
0–10k	0.80s	50.2 tok/s
10–30k	1.11s	51.2 tok/s
30–60k	1.67s	45.1 tok/s
60–120k	4.15s	41.7 tok/s

This does not isolate cache on versus cache off, so it should not be read as a clean cache-ablation result. It does show the wait I was feeling: as context deepens, the next turn takes longer to start. For an agent running dozens of steps, startup latency is part of the product.
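
For readers who want to build the same kind of table from their own logs, the bucketing is simple. This is a minimal sketch that assumes each logged turn has an input token count and a TTFT in seconds. The field names, the half-open bucket edges, and the sample rows are all assumptions, not the format of the logs used here.

```python
from collections import defaultdict
from statistics import median

# Half-open buckets [low, high) in input tokens; labels use ASCII hyphens.
BUCKETS = [
    (0, 10_000, "0-10k"),
    (10_000, 30_000, "10-30k"),
    (30_000, 60_000, "30-60k"),
    (60_000, 120_000, "60-120k"),
]


def bucket_label(input_tokens: int) -> str:
    """Map an input token count to its depth bucket."""
    for low, high, label in BUCKETS:
        if low <= input_tokens < high:
            return label
    return "120k+"


def median_ttft_by_bucket(turns: list[dict]) -> dict[str, float]:
    """Group turns by input depth and take the median TTFT of each group."""
    grouped = defaultdict(list)
    for t in turns:
        grouped[bucket_label(t["input_tokens"])].append(t["ttft_s"])
    return {label: median(values) for label, values in grouped.items()}


# Made-up sample rows, only to exercise the function.
sample = [
    {"input_tokens": 4_000, "ttft_s": 0.7},
    {"input_tokens": 8_000, "ttft_s": 0.9},
    {"input_tokens": 45_000, "ttft_s": 2.1},
    {"input_tokens": 90_000, "ttft_s": 3.0},
    {"input_tokens": 110_000, "ttft_s": 4.0},
]
result = median_ttft_by_bucket(sample)
print(result)
```

Medians are a reasonable default here because a handful of stalled requests can distort a mean, though tail percentiles would matter too for a loop that runs dozens of steps.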

Comparing inference stacks with TerminalBench

TerminalBench also made runtime differences visible. I ran 23 overlapping tasks against GPT-5.5 directly through OpenAI, logging 1,384 turns. TTFT held around 1.0–1.3 seconds even at 60k+ context. Decode p50 was 179.5 tok/s in the 60–120k bucket, compared with 41.7 tok/s for Qwen on OpenRouter in the same depth bucket.

These rows are observed serving paths from logged runs, not a controlled benchmark. Each row reports the deepest bucket reached for that path, so TTFT and decode p50 are not directly comparable down the column; rows with n=5 are directional.

Serving path	Tasks (n)	Deepest bucket	TTFT p50	Decode p50
OpenRouter Qwen3.6 27B	87	60–120k	4.15s	41.7 tok/s
OpenRouter Claude 4.6 Sonnet	5	30–60k	2.36s	1,184.2 tok/s (logged)
OpenRouter GPT-5.4 mini	5	10–30k	0.90s	160.5 tok/s
OpenAI GPT-5.5	23	60–120k	1.34s	179.5 tok/s
OMLX Qwen3.6 27B 8-bit	78	60–120k	23.34s	11.2 tok/s
llama-server Qwen3.6 GGUF	76	120k+	4.13s	9.3 tok/s
DS4	82	60–120k	271.34s (logged)	18.3 tok/s

Here, “logged” marks a value kept as it appeared in the run log. Treat it as an observation, not as a clean, comparable benchmark metric.

These paths did not feel the same to work with. vLLM, OMLX, llama-server, OpenRouter, GPT-5.5: similar agent workloads, different latency and throughput profiles. Model quality matters, but the serving stack changes the loop around the model.

Metrics for agentic workloads

Public throughput reports answer a narrower question than coding-agent users ask. A “write a story” test measures decode after the model is already talking. Long-prompt single-request tests are closer because they exercise prefill and TTFT, but they still miss repeated prefix reuse across turns.

Agentic workloads are prompt-heavy, prefix-reuse-heavy, and latency-sensitive at every step. TerminalBench, SWE-bench-style agent runs, and captured terminal-agent logs are shaped more like the workload I felt at the keyboard.

A better metric set would report prompt tokens, reused prefix tokens, first-content latency, wall time, tool-call latency, output tokens, and decode rate turn by turn. A cleaner next experiment would replay identical agent sessions with prefix reuse enabled and disabled, then plot those values across the run. I have not run that experiment yet. These claims are scoped to what the session logs show.
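
A per-turn record for that metric set could look something like the sketch below. It is one possible shape, not an existing schema: the class name, field names, and the choice to derive decode rate from wall time minus first-content latency are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TurnMetrics:
    """One row per agent turn; all field names are hypothetical."""
    prompt_tokens: int
    reused_prefix_tokens: int
    first_content_s: float  # request sent until first content token
    wall_s: float           # request sent until response complete
    tool_call_s: float      # time spent running the tool after the response
    output_tokens: int

    @property
    def decode_tok_s(self) -> float:
        """Output tokens per second once streaming has started."""
        streaming_s = self.wall_s - self.first_content_s
        return self.output_tokens / streaming_s if streaming_s > 0 else 0.0

    @property
    def reuse_ratio(self) -> float:
        """Share of the prompt that was served from a reused prefix."""
        if self.prompt_tokens == 0:
            return 0.0
        return self.reused_prefix_tokens / self.prompt_tokens


# Made-up values for a single deep-context turn.
turn = TurnMetrics(
    prompt_tokens=100_000,
    reused_prefix_tokens=99_000,
    first_content_s=1.5,
    wall_s=3.5,
    tool_call_s=0.8,
    output_tokens=200,
)
print(f"{turn.decode_tok_s:.1f} tok/s, {turn.reuse_ratio:.0%} reused")
```

Keeping first-content latency and decode rate as separate fields is the point of the design: it stops a good streaming number from hiding a long wait before the first token.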

The evidence supports a narrower conclusion than “local is bad” or “hosted is magic.” Coding-agent sessions are large and cache-heavy. Context depth changes both TTFT and decode behavior. Provider and runtime choices change user-visible latency in measurable ways. Local serving stacks can have good decode throughput and still produce sluggish agent loops.

We can’t own the model’s reinforcement learning or download the exact Codex or Claude tool loop and fine-tune it onto local weights after the fact. That gap matters.

But we can own the layer around the model: the inference engine, harness contract, cache configuration, model metadata, autotuning, evals, and replay tests. Decode throughput is one constraint; model quality is another. The competitive layer for local stacks is the system around the weights.

The fast-local-model reports bothered me because they were right. They just didn’t explain what I was waiting on. For coding agents, the workload is long-context, prefix-reuse-heavy, and sensitive to the pause before each new turn starts. Systems that feel responsive may well have better weights, but they also run on serving stacks built around caching, prefill behavior, tool protocols, and the harness loop.

Notes