The bugs that break agents aren't in the models

Anthropic just published a postmortem that validates what many of us suspected: Claude Code's quality issues weren't the model's fault. Three separate bugs in the harness were degrading results for months. Users noticed. Anthropic initially dismissed it as perception drift. The users were right.

This is the kind of story that matters for production agent systems. Not benchmark scores. Not parameter counts. The quiet, cumulative damage of infrastructure bugs interacting with non-deterministic outputs in ways that are nearly impossible to isolate.

The harness problem

The most damaging bug was deceptively simple. On March 26, Anthropic shipped a change to clear "older thinking" from sessions idle over an hour. This was supposed to reduce latency when users resumed. Instead, due to a logic error, it triggered every turn for the rest of the session.

The effect was subtle but cumulative. Claude appeared forgetful. Repetitive. Users reported the model "getting worse" over time. Which it was — not because weights degraded, but because context windows were being silently truncated mid-conversation.

I keep Claude Code sessions open for days. Eleven right now, according to ps aux. This bug was hitting my most-used workflow pattern, and I couldn't articulate why it felt broken. That's the insidious part. You don't get an error. You get a vague sense that the tool is less capable than it was last week. You blame the model. Everyone does.

Why this keeps happening

Agent harnesses are where the complexity lives. The model is a function: tokens in, tokens out, probabilities attached. The harness is state management, retry logic, token accounting, context pruning, tool routing, error recovery, streaming parsing, and a dozen other concerns that each seem simple in isolation.

They interact badly.

The Claude Code postmortem describes three separate issues, each "complex but material." This is engineer speak for "we couldn't simplify the failure mode enough to catch it in testing." The bugs required specific session durations, specific interaction patterns, specific state transitions to manifest. Your integration tests don't cover day-three-of-the-same-session. Your load tests don't simulate interrupted flows with hour-long gaps.

Production agents need production-grade harness observability. Not just latency and cost metrics. Semantic drift detection. Context window utilization over session lifetime. Tool call success rates correlated with session age. The signals that tell you your system is degrading before users start complaining on Reddit.

The deeper pattern

This isn't an Anthropic-specific problem. Every major agent platform has suffered harness bugs that looked like model degradation. OpenAI's early function calling implementations had subtle JSON parsing edge cases that caused silent failures. Google's Gemini API had context window miscalculations that truncated system prompts. These were infrastructure bugs. Users blamed the models.

The model is the most visible component. It's also the most tested, most benchmarked, most scrutinized component. When your agent behaves strangely, the model is the least likely culprit. Check your retry logic. Check your context assembly. Check whether you're accidentally streaming partial JSON and parsing it anyway.

Anthropic's transparency here is commendable. Most vendors would have quietly fixed the bugs and let the "model got worse" narrative fade. But the lesson isn't about Anthropic's engineering culture. It's about the structural reality of agent systems.

The frontier isn't in the foundation models anymore. It's in the orchestration layer. The teams that treat harness engineering as second-class infrastructure will keep shipping bugs that eat their model quality gains. The teams that instrument obsessively, that assume the harness is guilty until proven innocent, will build agents that actually work in production.

Your users can tell when something's wrong. They just can't tell you what. Build systems that answer that question before they have to ask.

The bugs that break agents aren't in the models

The bugs that break agents aren't in the models

The harness problem

Why this keeps happening

The deeper pattern

Comments

More from this blog

Voice Agents Are Finally Real. Your Architecture Isn't.

A Million Tokens Changes Nothing If Your Agent Can't Remember Yesterday

The Line Between Vibe Coding and Production Is Dissolving

Correctness Before Corrections: What vLLM's RL Migration Teaches Us About Agent Reliability

The Line Between Vibe Coding and Production Is Dissolving

Command Palette

The bugs that break agents aren't in the models

The harness problem

Why this keeps happening

The deeper pattern

Comments

More from this blog