AI Development

The Harness Is Everything: What Cursor, Claude Code, Codex, and Perplexity Actually Built

Clarvia Team
Mar 22, 2026

You are not using AI wrong because you haven't found the right model. You are using AI wrong because you haven't built the right environment.


What's Inside

  1. The Problem Nobody Talks About -- Why raw model capability is not enough, and why the context window is not RAM
  2. The SWE-Agent Paper -- The Princeton research that proved interface design matters more than model choice
  3. Anthropic's Harness Engineering -- Two-agent architecture, feature lists, and the context window boundary problem
  4. OpenAI's Codex -- Zero lines of manual code, a million-line product, and the death of monolithic AGENTS.md
  5. The Wider Ecosystem -- Stripe's 1,300 weekly PRs, Perplexity's 19-model orchestration, DeepMind's Aletheia
  6. Design Patterns That Repeat -- Five patterns every harness shares
  7. The Skill That Transfers -- What this means for engineers and how to build your own
  8. The Commoditization Thesis -- Why the harness is where durable competitive advantage lives

Stripe merged 1,300 pull requests last week -- human-reviewed, but containing no human-written code. OpenAI built a million-line internal product in five months with three engineers. Anthropic proved that even their strongest model falls short of building a production web app without the right scaffolding -- then built the scaffolding that made it work. In December 2025, Princeton researchers showed that Claude Opus 4.5 scores 42% on a scientific reproducibility benchmark with a generic scaffold and 78% with Claude Code's harness. Same model. Same weights. A 36-point swing from nothing but the environment.

The difference between these teams and everyone else is not the model. It is not the temperature, the max tokens, or the system prompt. It is not even the prompt, though the industry has burned years of collective life arguing about prompts.

The difference is the harness.

This word gets used loosely. A harness is not a system prompt. It is not a wrapper around an API call. It is not an eval framework or a chatbot with memory. A harness is the complete designed environment inside which a language model operates: the tools it can call, the format of information it receives, how its history is compressed and managed, the guardrails that catch its mistakes before they cascade, and the scaffolding that allows it to hand off work to its future self without losing coherence.

When you look at what every serious team in this space actually built -- Anthropic, OpenAI, Princeton NLP, Stripe, Google DeepMind, Perplexity -- the same pattern emerges.

The model is the engine. The harness is everything else. And everything else is what determines whether the engine produces anything worth shipping.

This is a detailed technical breakdown of how that idea became the defining engineering discipline of late 2025 and early 2026. It covers the research, the real implementations, the failure modes that motivated the design decisions, and the patterns that repeat whether you are building a coding agent, a research agent, or a long-running autonomous software engineer.

By the end, you will understand not just what a harness is, but why building one correctly is now the most valuable engineering skill in the industry.


Part One: The Problem Nobody Talks About

Why Raw Capability Is Not Enough

In March 2026, the numbers are unambiguous. On SWE-bench Pro -- the current uncontaminated benchmark for coding agents, adopted after OpenAI declared the older SWE-bench Verified dataset compromised by training-data leakage -- a frontier model running through a basic scaffold scores roughly 23%. The same model through an optimized harness scores 46% or higher. A 2x gap. Not from a better model. From a better environment.

This pattern was first demonstrated at scale in May 2024, when the Princeton NLP group published "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" and introduced the concept of the Agent-Computer Interface, or ACI. On a curated subset of 300 real GitHub issues, GPT-4 Turbo with a standard bash shell resolved roughly 7%. The same model with their purpose-built ACI resolved 18%. The finding was striking then. It has only become more dramatic since -- because as models have gotten stronger, the gap between good and bad scaffolding has widened, not narrowed.

This should not have been surprising. We have known for decades that the right tools make engineers dramatically more productive. A developer with a modern IDE, debugger, version control, and CI/CD pipeline is orders of magnitude more effective than the same developer working in a raw terminal with only a text editor. The IDE does not make the developer smarter. It removes friction, surfaces information at the right moment, catches errors early, and organizes work into navigable units.

Language models are the same. They are not general reasoners working from some infinite internal knowledge base. They are sophisticated pattern-matching engines that operate on tokens in a context window. Everything they know in a given moment is determined by what is in that context window, and everything they produce is conditioned on how that context is structured.

The format of the input is not decoration. It is the cognitive architecture of the agent.

The interface is not a convenience layer. For an LM agent, the interface is the mind.

The Context Window Is Not RAM

The naive mental model of an AI agent treats the context window like RAM. You load data in, the model processes it, you get output. More context equals better performance. This mental model is wrong in ways that will ruin your agent if you build around it.

The context window is actually closer to the agent's entire working consciousness for a given session. Every token in that window costs computation. Every irrelevant piece of information competes for attention with the relevant information. The model does not have a selective attention mechanism that cleanly ignores noise. The noise is in the room, and it affects the reasoning.

This has specific, measurable consequences. When you run grep on a large codebase from inside an agent loop and return ten thousand lines of matches, you have not given the agent more information to work with. You have flooded its working memory with irrelevant data that will degrade the quality of every subsequent step until the context is cleared.

The SWE-agent researchers documented these failure modes meticulously. A standard bash interface caused agents to thrash. They would issue grep commands that returned thousands of lines, lose track of what they were looking for, issue more grep commands, gradually fill up their context with noise, and eventually either produce a wrong answer or stop making progress entirely. The problem was not model intelligence. The problem was that the interface had no mechanism for protecting the agent from itself.

The ACI solution was to build a search tool that returned a maximum of 50 results. If your search exceeded that limit, the tool suppressed the output and told the agent to narrow its query. This single design decision -- almost insultingly simple in retrospect -- was one of the highest-leverage changes in the entire paper. It transformed a context-flooding failure mode into a natural refinement loop.

The agent could not proceed by being vague; it had to be specific. The tool forced better behavior.
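The capped-search idea is simple enough to sketch in a few lines. This is an illustrative reconstruction, not the SWE-agent implementation: the function name, the `.py`-only file filter, and the message wording are assumptions made for brevity.

```python
from pathlib import Path

MAX_RESULTS = 50  # the cap used in the SWE-agent paper

def search_dir(root: str, term: str) -> str:
    """Search files under root; suppress output when matches exceed the cap."""
    hits = []
    for path in Path(root).rglob("*.py"):  # restricted to .py for this sketch
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        if term in text:
            hits.append(str(path))
    if not hits:
        return f'No matches for "{term}".'
    if len(hits) > MAX_RESULTS:
        # Return guidance instead of flooding the context window.
        return (f'Found {len(hits)} matching files -- too many to display. '
                f'Please narrow your search term.')
    # List matching files only; no surrounding context.
    return "\n".join(hits)
```

The key design choice is the third branch: over the cap, the tool returns a short instruction rather than truncated output, turning a flooding failure into a refinement prompt.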

The same model through a basic scaffold scores roughly 23%. Through an optimized harness: 46% or higher. A 2x gap -- not from a better model, but from a better environment.

Part Two: The SWE-Agent Paper and the Birth of the ACI

What an Agent-Computer Interface Actually Is

The ACI is defined as an abstraction layer between a language model agent and a computer environment. The analogy to Human-Computer Interface (HCI) research is intentional. Just as HCI asks how to design interfaces that match human cognitive architecture -- visual pattern recognition, spatial memory, parallel attention across a screen -- ACI research asks how to design interfaces that match LM cognitive architecture: sequential token processing, sensitivity to context order and formatting, limited working memory, and a tendency to anchor on whatever information appears most prominently in the prompt.

The SWE-agent ACI had four components, and each one reflects a specific insight about how language models fail when given raw computer access.

Search and Navigation

The search component replaced standard grep and find with purpose-built tools: find_file, search_file, and search_dir. Results were capped at 50. If a query exceeded that limit, the tool returned a message explaining there were too many results and prompting the agent to refine. The tool listed which files matched rather than showing detailed surrounding context, because showing more context "proved to be too confusing for the model."

This sounds trivial. In practice, it was among the most consequential decisions in the paper.

Agents under cognitive load behave like humans under cognitive load: when uncertain, they keep doing more of what they are already doing. A human lost in a large codebase searches more broadly, generating more noise, and agents do the same. The capped search interrupted this pattern by creating a forcing function -- you must be specific -- pushing the agent toward deliberate, targeted behavior.

The File Viewer

The file viewer showed 100 lines at a time -- a Goldilocks number the researchers found through ablation testing. Fewer lines caused agents to lose context about surrounding code and make editing mistakes (a 30-line window resolved 14.3% of issues). More lines caused agents to lose track of where they were (400 lines: 17.0%; the full file: 12.7%). The sweet spot was a 100-line window, at an 18.0% resolve rate.

The viewer was stateful, maintaining position across interactions. And critically, it prepended explicit line numbers to every visible line. This sounds cosmetic. It was not. When an agent needs to edit lines 47 through 52, it needs to read those numbers directly rather than counting or performing arithmetic. Removing that cognitive task freed capacity for actual problem-solving.
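A stateful viewer with explicit line numbers can be sketched as follows. This is a minimal illustration of the concept, not the paper's code; the class name, header format, and scrolling behavior are assumptions.

```python
class FileViewer:
    """Stateful 100-line window over a file, with explicit line numbers."""
    WINDOW = 100  # the ablation study's sweet spot

    def __init__(self, path: str):
        with open(path) as f:
            self.lines = f.read().splitlines()
        self.start = 0  # current window position, 0-indexed

    def render(self) -> str:
        end = min(self.start + self.WINDOW, len(self.lines))
        # Prepend real line numbers so the agent never has to count.
        body = "\n".join(
            f"{i + 1}: {self.lines[i]}" for i in range(self.start, end)
        )
        header = (f"[File: {len(self.lines)} lines total, "
                  f"showing {self.start + 1}-{end}]")
        return header + "\n" + body

    def scroll_down(self) -> str:
        self.start = min(self.start + self.WINDOW, max(len(self.lines) - 1, 0))
        return self.render()
```

The header tells the agent where it is in the file, and the numbered lines let it specify edit ranges by reading rather than arithmetic.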

The File Editor With Linting

The editor's key innovation was immediate feedback. It accepted a start line, end line, and replacement text as a single operation. After every edit, the tool automatically ran a linter on the modified file. If the edit introduced a syntax error, the edit was rejected before it was applied, and the agent received a clear error message showing both the original code and the failed attempt.

This closed the feedback loop that causes cascading failures in naive implementations. Without a linter, an agent introduces a syntax error, runs the test suite, sees a failure that seems unrelated, spends multiple steps chasing the wrong problem, and eventually exhausts its context window chasing a ghost. With the linter integrated into the editor, syntax errors are caught at the moment of introduction, and the fix is localized before the problem propagates.
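The edit-then-lint gate can be sketched in a few lines of Python. This is a hedged reconstruction: the function name and message format are invented, and `ast.parse` stands in for the paper's actual linter.

```python
import ast

def edit_file(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive); reject edits that
    break the file's syntax, echoing old and new code back to the agent."""
    with open(path) as f:
        original = f.read().splitlines()
    candidate = original[:start - 1] + replacement.splitlines() + original[end:]
    source = "\n".join(candidate)
    try:
        ast.parse(source)  # stand-in for the ACI's linter pass
    except SyntaxError as err:
        old = "\n".join(original[start - 1:end])
        # The edit is never applied; the agent sees both versions.
        return (f"Edit rejected: {err.msg} (line {err.lineno}).\n"
                f"--- original ---\n{old}\n--- your edit ---\n{replacement}")
    with open(path, "w") as f:
        f.write(source + "\n")
    return "Edit applied."
```

The essential property is that a syntax-breaking edit never touches disk: the agent gets the error at the moment of introduction, before it can contaminate the test run.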

Compare this to raw bash. With sed or output redirection, there is no integrated feedback. Edits execute silently. Multi-line changes require complex argument formatting prone to mistakes. The agent might successfully run the command and introduce a subtle formatting error that the linter would have caught, then spend the next ten steps wondering why the tests fail.

Context Management

The fourth component addressed the accumulation of stale context over long sessions. As an agent works through a task, its history fills with old observations, intermediate states, and exploratory steps that no longer reflect reality. All of that history takes space in the context window and can actively mislead by providing outdated information.

The ACI's context management collapsed observations preceding the last five turns into single-line summaries. This kept the active context focused on recent, relevant information while preserving a compressed record of the overall trajectory.
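The collapse-older-observations rule is mechanical enough to sketch. This is an illustrative version under assumptions: the turn representation (role/content dicts) and the summary format are invented, and a real harness would summarize more intelligently than taking the first line.

```python
def compact_history(turns: list[dict], keep_last: int = 5) -> list[dict]:
    """Collapse observations older than the last `keep_last` turns into
    one-line summaries, preserving a compressed trajectory record."""
    compacted = []
    for i, turn in enumerate(turns):
        recent = i >= len(turns) - keep_last
        if recent or turn["role"] != "observation":
            compacted.append(turn)  # keep actions and recent turns verbatim
        else:
            first_line = turn["content"].splitlines()[0][:80]
            compacted.append({
                "role": "observation",
                "content": f"[collapsed observation: {first_line} ...]",
            })
    return compacted
```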

What the Numbers Actually Mean

The SWE-agent paper benchmarked against SWE-bench, a collection of real GitHub issues from popular Python repositories. Using GPT-4 Turbo with the purpose-built ACI, the system resolved 12.47% of the full 2,294-issue test set -- a massive jump from the prior state-of-the-art of 3.8% achieved by non-interactive, RAG-based approaches.

On the smaller SWE-bench Lite subset (300 instances), the ablation study isolated each component's contribution. The shell-only baseline resolved roughly 7.3% of issues. The full ACI resolved 18.0% -- a 10.7 percentage point absolute improvement from interface design alone. The linter integration was consistently among the highest-leverage components. The capped search was critical for preventing context flooding. The stateful file viewer with line numbers meaningfully outperformed both raw cat and simpler viewer designs.

These numbers look small by March 2026 standards -- top systems now exceed 50% on SWE-bench Pro and briefly crossed 80% on the older SWE-bench Verified before that benchmark was declared contaminated. But the relative insight has only sharpened. On SWE-bench Pro in early 2026, switching the harness around the same frontier model produces a 22+ point gap -- larger, in absolute terms, than the gap between any two frontier models on the same scaffold. The principle the SWE-agent paper demonstrated in 2024 is not just still valid. It is more consequential than ever.

The paper was published at NeurIPS 2024. The implications extend well beyond coding. Any long-horizon agent task involves the same fundamental challenges: navigating large information spaces, maintaining coherent state across many steps, catching and recovering from errors, and managing limited context window attention. The specific tools change. The underlying architecture of the problem does not.

Shell-only baseline: 7.3%. Full ACI: 18.0%. A 10.7 percentage point improvement from interface design alone -- no model changes whatsoever.

Part Three: Anthropic's Harness Engineering

Why the Context Window Boundary Is the Hard Problem

The SWE-agent paper addressed interface design for a single session. Anthropic's engineering team, building Claude Code and the Claude Agent SDK, encountered a different problem: what happens when a task is too large to complete in one context window?

This is not a niche edge case. Most real software projects are too large to fit in any context window. Even with a million-token window, you cannot hold a full production application in mind simultaneously. Human engineers solve this through external memory, documentation, version control, and accumulated understanding that builds over weeks. An agent starting a fresh session has none of that.

The naive solution is compaction -- summarizing old context when the window fills. The Claude Agent SDK includes this capability, automatically compressing conversation history when token usage approaches the limit. But compaction alone is not enough.

In November 2025, two days after releasing Claude Opus 4.5, Anthropic published "Effective harnesses for long-running agents" -- a blog post that documented what they had learned about multi-session agent work. Their central finding: out of the box, even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows would fall short of building a production-quality web app from a high-level prompt.

The failures clustered around two patterns.

Pattern one: attempting everything at once. Given a prompt like "build a clone of claude.ai," the agent would try to one-shot the entire application. It would begin implementing feature after feature without completing or testing any of them, run out of context in the middle, and leave the next session to start with a half-implemented mess, no documentation of what had been done, and no indication of what state the code was in.

Pattern two: premature victory. After some features had been built, a subsequent agent instance would look around, see progress, and conclude the job was done. It would declare victory on a partially-completed application and stop. This is not stupidity -- it is a reasonable inference from incomplete information. The agent had no structured way to know what "done" actually meant.

Both failures share a root cause: no persistent, structured understanding of project state that survives the context window boundary.

The Two-Agent Architecture

Anthropic's solution was a two-part architecture that has since become a template for long-running agentic work.

The initializer agent runs once. Its entire purpose is to set up the environment that all future coding agents will operate in. It does not write features. It creates scaffolding.

The initializer produces three key outputs. First, an init.sh script that reliably starts the development environment -- saving every subsequent session from spending tokens figuring out how to boot the application. Second, a comprehensive feature list file in JSON format. In the claude.ai clone experiment, this meant over 200 features, each described as an end-to-end user behavior ("a user can open a new chat, type in a query, press enter, and see an AI response"), each initially marked as failing. Third, a claude-progress.txt file and an initial git commit, giving every future session a fast way to orient itself.

The coding agent runs in every subsequent session with a different prompt: work on one feature at a time, leave the environment in a clean state, update the progress file and git history before the session ends.
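The initializer-plus-coding-agent split amounts to a small orchestration loop around the feature list. The sketch below shows the structure only -- `run_session` is a placeholder for launching an agent in a fresh context window, and the prompts and file name are illustrative, not Anthropic's exact design.

```python
import json

def orchestrate(run_session, features_path: str = "features.json",
                max_sessions: int = 100) -> None:
    """Drive one initializer session, then coding sessions until every
    feature in the JSON feature list passes (or the budget runs out)."""
    # Runs once: writes init.sh, the feature list, and the progress file.
    run_session("initializer")
    for _ in range(max_sessions):
        features = json.loads(open(features_path).read())
        remaining = [f for f in features if not f["passes"]]
        if not remaining:
            break  # explicit, unambiguous "done" -- no inference from code
        # Each session gets a fresh context window and one feature of focus.
        run_session("coding agent: implement and verify one feature, e.g. "
                    f"{remaining[0]['description']}")
```

Note that "done" is defined entirely by the feature list, never by the orchestrator or the agent reasoning about the code.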

The Feature List as Cognitive Anchor

The feature list deserves special attention. Without it, an agent must infer project completeness from the code itself. Code can exist that is not functional. Functionality can exist that is incomplete. An agent that reads the code and reasons about what is done will get the wrong answer often enough to be a serious problem.

The feature list makes completeness explicit and unambiguous. Each feature has a passes field that is either true or false. There is no ambiguity. There is no inference required.

Anthropic made a deliberate decision to store this as JSON rather than Markdown. The reason is behavioral: the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files. JSON has a rigid structure that resists casual editing. You want the feature list to be something agents update carefully, not casually rewrite.

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify a new conversation is created",
    "Check that chat area shows welcome state",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}

The instruction: it is unacceptable to remove or edit tests, because doing so could lead to missing or buggy functionality. The JSON structure reinforces that instruction architecturally.
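One way to back that instruction with a mechanical check -- a sketch of the idea, not anything Anthropic describes implementing -- is a validator that accepts only `passes` flips and rejects any edit that removes or rewrites a feature:

```python
def validate_feature_edit(old: list[dict], new: list[dict]) -> bool:
    """Allow only `passes` flips: same features, same steps, none removed."""
    if len(old) != len(new):
        return False  # features were added or deleted
    for a, b in zip(old, new):
        # Everything except the passes flag must be byte-for-byte identical.
        if ({k: v for k, v in a.items() if k != "passes"} !=
                {k: v for k, v in b.items() if k != "passes"}):
            return False
    return True
```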

Testing: The Failure Mode Nobody Likes to Talk About

Anthropic documented a failure mode that shows up in virtually every serious agentic coding project: agents marking features as complete without properly verifying them end-to-end. An agent would make a code change, run a unit test or a curl command, see a passing result, and mark the feature as done. But the feature would not work when tested through the browser as a user would.

The solution was browser automation via MCP -- at the time, Anthropic used the Puppeteer MCP server (since deprecated in favor of Microsoft's Playwright MCP, now the standard). The tool allowed Claude to navigate the application, click buttons, fill forms, and verify that features worked as a user would experience them. The performance improvement was dramatic. Bugs invisible from the code alone became obvious when the agent could see what a user would see.

One limitation they noted: the agent could not see browser-native alert modals through the MCP bridge, and features relying on these modals tended to be buggier. An honest admission that underlines the general principle: the quality of an agent's work is bounded by the quality of its feedback loops. If your agent cannot observe the consequences of its actions in the domain that matters, it will optimize for proxy metrics that do not correlate with correctness.

The Startup Sequence

Every coding agent session began with a standardized startup: run pwd to confirm the working directory, read the progress file and git log, read the feature list and choose the highest-priority incomplete feature, run init.sh, run an end-to-end test to verify the application works. Only after all of this would the agent begin a new feature. If the startup test revealed breakage, the agent fixed it before touching anything new.

This prevented the compounding problem where an agent starts a new feature on top of a broken foundation, making the underlying issue harder to isolate with every step.
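The startup sequence can be sketched as a small preamble that gathers orientation context before any new work begins. This is an assumed shape, not Anthropic's code: the function name, the smoke-test script name `e2e_smoke_test.sh`, and the output format are all invented for illustration.

```python
import subprocess

def startup_check(repo: str) -> str:
    """Run the standardized session preamble; return orientation context."""
    def sh(cmd: str) -> str:
        out = subprocess.run(cmd, shell=True, cwd=repo,
                             capture_output=True, text=True)
        return out.stdout.strip()

    context = [
        "cwd: " + sh("pwd"),
        "progress:\n" + sh("cat claude-progress.txt"),
        "recent commits:\n" + sh("git log --oneline -10"),
    ]
    # Boot the app and smoke-test it before touching any new feature.
    sh("sh init.sh")
    smoke = subprocess.run("sh e2e_smoke_test.sh", shell=True, cwd=repo)
    if smoke.returncode != 0:
        context.append("WARNING: end-to-end test failing; fix before new work.")
    return "\n\n".join(context)
```

The returned string is what a fresh session would read first: current state, recent history, and an explicit flag if the foundation is already broken.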

Anthropic's key insight: even Opus 4.5 falls short without scaffolding. The two-agent architecture -- initializer + coding agent -- became the template for long-running agentic work.

Part Four: OpenAI's Codex -- Zero Lines of Manual Code

The Experiment

In late August 2025, OpenAI's Codex team started a git repository with a single constraint: no manually-written code. Every line -- application logic, tests, CI configuration, documentation, observability tooling, internal utilities -- would be written by Codex agents. Humans would steer. Agents would execute.

Five months later, the repository contained on the order of a million lines of code. Roughly 1,500 pull requests had been opened and merged. A small team of three engineers had driven most of it, averaging 3.5 pull requests per engineer per day. As the team grew to seven, per-engineer throughput actually increased. The product had hundreds of internal users, including daily power users.

Ryan Lopopolo, Member of the Technical Staff at OpenAI, published the article describing this experience on February 11, 2026. The central message: the bottleneck was never model capability. It was always environment design.

The Job Became Different

The most important observation is about how engineering work itself changed. When your primary job is no longer writing code, what are you doing?

You are designing environments. You are specifying intent. You are building feedback loops. You are asking, constantly, not "how do I fix this bug?" but "what capability is missing from the environment that is causing this class of bug to appear?"

When something failed, the fix was almost never "try harder." It was almost always "what structural piece of the environment is missing?" This is a profound shift. You stop debugging code. You start debugging the system that produces code.

The Death of the Monolithic AGENTS.md

Early in the project, the team tried a single large instruction file containing everything the agent needed to know. It failed in four specific ways:

  1. Context crowding. A giant instruction file crowds out the task, the code, and the relevant documentation.
  2. Guidance becomes non-guidance. When everything is marked as important, nothing is. The agent pattern-matches locally instead of navigating intentionally.
  3. Instant rot. A monolithic manual becomes a graveyard of stale rules as the codebase evolves.
  4. Impossible to verify. A single blob does not lend itself to coverage checks, freshness tracking, or cross-linking.

The solution was a structured docs/ directory treated as the system of record, with AGENTS.md reduced to roughly 100 lines -- a table of contents, not an encyclopedia. The repository eventually contained 88 AGENTS.md files, one per major subsystem, each pointing to deeper sources of truth. Progressive disclosure: agents started with a small, stable entry point and were taught where to look next.

Making the Application Visible to the Agent

As throughput increased, the bottleneck shifted from code generation to verification. The team was producing code faster than human QA could validate.

The solution was application legibility -- making the system directly observable by agents. They made the application bootable per git worktree, so Codex could launch an isolated instance for each change. They wired the Chrome DevTools Protocol into the agent runtime for DOM snapshots, screenshots, and browser navigation. They built a full local observability stack: logs queryable via LogQL, metrics via PromQL, and distributed tracing -- each agent task running on a fully isolated version of the application with its own observability data, torn down on completion.

Agents could debug production-like issues using the same tools a human engineer would use, rather than inferring behavior from code alone.

Mechanical Architecture Enforcement

Human code review does not scale to agent-driven development. When an agent opens 3.5 PRs per engineer per day, review cannot be the primary quality mechanism.

The solution: encode architectural constraints as mechanical checks. The application was structured around rigid layers -- Types, Config, Repo, Service, Runtime, UI -- with strictly validated dependency directions. Custom linters, themselves written by Codex, enforced these constraints. When a linter caught a violation, the error message included remediation instructions formatted for injection into agent context:

"Error: Service layer cannot import from UI layer. Move this logic to a Provider or restructure the dependency. See docs/ARCHITECTURE.md#layers."

The linters had 100% test coverage. The principle: enforce boundaries while allowing significant freedom within them. Care deeply about dependency directions and data validation at interfaces. Do not dictate how specific features are implemented within those boundaries.
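A dependency-direction linter of this kind is straightforward to sketch with the standard library. The layer names below mirror the article's list but in lowercase, and the function name and message wording are assumptions; OpenAI's actual linters are not public.

```python
import ast

# Allowed dependency direction, lower layers first (illustrative names).
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

def check_imports(source: str, module_layer: str) -> list[str]:
    """Flag imports that point from a lower layer to a higher one; format
    each violation as remediation guidance for injection into agent context."""
    rank = {name: i for i, name in enumerate(LAYERS)}
    errors = []
    for node in ast.walk(ast.parse(source)):
        targets = []
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        for target in targets:
            top = target.split(".")[0]
            if top in rank and rank[top] > rank[module_layer]:
                errors.append(
                    f"Error: {module_layer} layer cannot import from {top} "
                    f"layer. Move this logic down or restructure the "
                    f"dependency. See docs/ARCHITECTURE.md#layers.")
    return errors
```

The error string is the point: it is written for the agent that will fix it, naming the violated invariant and where to read more, rather than for a human scanning CI logs.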

They also encoded "golden principles" -- opinionated, mechanical rules enforced by recurring cleanup background tasks that scanned for deviations, opened targeted refactoring PRs, and auto-merged most of them. Before this approach, the team spent every Friday -- 20% of the week -- cleaning up "AI slop." After: most cleanup PRs could be reviewed in under a minute.

Throughput Changes the Rules

When agent throughput dramatically exceeds human attention capacity, conventional engineering norms become counterproductive. PRs sitting in review block agent work. Test flakes investigated individually consume human attention that should go elsewhere.

OpenAI's team deliberately adopted minimal blocking merge gates. Pull requests were short-lived. Test flakes were addressed with follow-up runs rather than blocking progress. Their philosophy: corrections are cheap and waiting is expensive. The right tradeoff looks irresponsible in a low-throughput environment and obvious in a high-throughput one.

Three engineers. Five months. A million lines of code. 1,500 merged PRs. The bottleneck was never model capability -- it was always environment design.

Part Five: The Wider Ecosystem -- Who Else Got It

Stripe: 1,300 PRs Per Week, Fully Unattended

In February 2026, Stripe revealed its internal "Minions" system: fully unattended coding agents merging 1,000 to 1,300 pull requests per week. Each Minion gets its own isolated VM with no internet access and no production access. During Stripe's "Atlas Fix-It Week," Minions resolved 30% of all bugs autonomously. Growth was 30% week-over-week.

The harness pattern: complete isolation per agent (VMs, not just worktrees), no network access (preventing a class of security and reliability failures), and a workflow that treats agent-generated PRs as first-class outputs requiring the same CI gates as human PRs but without blocking on human review for every merge.

Perplexity Computer: 19 Models, One Orchestration Layer

On February 25, 2026, Perplexity launched Computer -- a multi-model agent orchestration platform that coordinates 19 different AI models. Claude Opus 4.6 serves as the central reasoning engine. Gemini handles deep research. Grok runs lightweight tasks. ChatGPT 5.2 provides long-context recall.

This is the harness thesis made commercial. No single model does everything. The orchestration layer -- which routes tasks to the optimal model, manages sub-agents, handles lifecycle -- is the product. Perplexity's bet is explicitly that value accrues to the orchestration layer, not to any individual model. The model is interchangeable. The harness is the moat.

Google DeepMind: Aletheia

Also in February 2026, Google DeepMind published Aletheia, an autonomous mathematics research agent with a three-part agentic harness: Generator, Verifier, and Reviser. It achieved 95.1% accuracy on IMO-Proof Bench Advanced and produced a research paper with zero human intervention.

The Generator/Verifier/Reviser separation is a textbook harness pattern. The model is the same across all three roles. The role definition, the feedback loop between verification and revision, and the structured handoff protocol -- that is the harness. That is what made it work.
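The Generator/Verifier/Reviser handoff reduces to a small loop, shown below as a structural sketch: the three callables would each be prompts to the same model in different roles, and the function name and budget parameter are invented, not DeepMind's published design.

```python
def generate_verify_revise(generate, verify, revise, problem, max_rounds=5):
    """Same model, three roles: the harness is the role definitions and
    the feedback loop between verification and revision."""
    draft = generate(problem)
    for _ in range(max_rounds):
        issues = verify(problem, draft)
        if not issues:
            return draft  # verified; nothing left to revise
        draft = revise(problem, draft, issues)
    return None  # no verified result within budget -- fail loudly, not quietly
```

Returning `None` rather than the last unverified draft is the harness decision that matters: an unverified proof is treated as no proof.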

"In harness engineering, the agent is a commodity -- the harness is the differentiator." -- awesome-agent-harness

Part Six: The Design Patterns That Repeat

Across every system -- Princeton, Anthropic, OpenAI, Stripe, Google DeepMind, Perplexity -- several patterns appear repeatedly. They are not coincidences. They are engineering solutions to problems that emerge whenever you deploy agents at scale.

Pattern 1: Progressive Disclosure

Do not give the agent everything upfront. Give it the minimum to orient itself and pointers to find more when needed.

This appears in SWE-agent's capped search (force the agent to refine), in OpenAI's docs/ architecture (a short map pointing to deeper truth), in Anthropic's startup sequence (read the progress file first), and in every harness framework that implements structured context layering.

The cognitive reason: context is finite, and attention is not uniformly distributed. A short, focused entry point that points to richer context elsewhere is more effective than a comprehensive dump that dilutes attention across everything.

The practical reason: a short entry point stays accurate. A monolithic document rots.

Pattern 2: Isolation Per Agent

One agent, one workspace. Git worktrees at minimum. Full VMs at Stripe's scale.

When multiple agents work in parallel on the same codebase, they will conflict without isolation. Even with sequential agents, you want the ability to validate changes in an isolated environment before they affect the main codebase. This is how CI/CD works for human engineers, and it is exactly the right model for agent orchestration.
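At the worktree end of the spectrum, per-agent isolation is one `git worktree add` per task. The wrapper below is a minimal sketch; the naming scheme for paths and branches is an assumption.

```python
import subprocess

def create_agent_workspace(repo: str, task_id: str) -> str:
    """Give one agent its own worktree and branch so parallel agents never
    touch each other's files (VM-level isolation is stronger still)."""
    path = f"{repo}-agent-{task_id}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{task_id}", path],
        cwd=repo, check=True,
    )
    return path  # the agent's sandbox; removable with `git worktree remove`
```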

Pattern 3: Repository as System of Record

Agents are blind to informal knowledge. Anything in a Slack thread, a Google Doc, or someone's head does not exist for the agent. The only reliable source for context is the repository.

This shows up as the feature list in Anthropic's harness, the structured docs/ directory at OpenAI, AGENTS.md files across frameworks, and spec tools in the ecosystem taxonomy. Specifications, requirements, architectural decisions, and constraints must be encoded into machine-readable files before execution begins.

Documentation is no longer just for human readers. It is the mechanism through which human intent becomes legible to agents. Documentation that is ambiguous, stale, or stored outside the repository actively impairs agent performance.

Pattern 4: Mechanical Enforcement Over Human Review

When an agent generates 3.5 PRs per engineer per day, human code review is a bottleneck, not a quality gate. Custom linters, structural tests, and CI pipelines replace much of what review does.

The key design principle: enforce invariants, not implementations. Care deeply about dependency directions, boundary crossing, and data validation at interfaces. Do not care which specific library the agent uses or exactly how a function is decomposed, as long as it satisfies the contract.

A linter that catches an architectural violation and returns remediation instructions in the error message is more effective than a code reviewer catching the same violation three days later.
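A dependency-direction check of this kind fits in a page. The layer names and message wording below are invented for illustration; the point is that the error text itself carries the remediation, so the agent can fix the violation in the same session it was caught.

```python
import ast

# Illustrative layering rule: lower layers must not import higher ones.
LAYER = {"db": 0, "services": 1, "api": 2}

def check_dependencies(module_layer: str, source: str) -> list[str]:
    """Flag imports that point 'upward' through the layers, returning
    remediation instructions in the error message itself."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = (
                [a.name for a in node.names]
                if isinstance(node, ast.Import)
                else [node.module or ""]
            )
            for name in names:
                top = name.split(".")[0]
                if top in LAYER and LAYER[top] > LAYER[module_layer]:
                    errors.append(
                        f"{module_layer} imports {top}: dependencies must "
                        f"point downward. Move the shared logic into "
                        f"{module_layer} or invert the dependency via an "
                        f"interface."
                    )
    return errors
```

Run in CI, this enforces the invariant (dependency direction) while saying nothing about which library the agent picked or how it decomposed its functions.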

Pattern 5: Tight Feedback Loops

Every high-performing harness closes the feedback loop as tightly as possible. Syntax errors caught at edit time. Runtime errors surfaced through queryable observability. UI bugs caught through browser automation. Test failures returned with context about what broke and where.

The alternative -- agents writing code that gets tested externally with failure messages feeding back in a later session -- is slower, more expensive in tokens, and more likely to produce cascading failures. Every point where the gap between action and consequence can be reduced is a point where performance improves.

This is the harness version of the classic principle about catching errors early. For agents, it applies with even more force because errors that are not caught immediately accumulate in context and degrade every subsequent step.
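The tightest possible loop is validation at write time. As a minimal sketch (the in-memory `path_contents` store and function name are illustrative), an edit tool can refuse to persist code that does not even parse, returning the error with file and line context immediately:

```python
def apply_edit(path_contents: dict, path: str, new_source: str):
    """Reject an edit at write time if it doesn't parse, returning the
    error with location context instead of letting it surface later."""
    try:
        compile(new_source, path, "exec")  # syntax check at edit time
    except SyntaxError as e:
        return False, f"{path}:{e.lineno}: {e.msg} -- edit rejected"
    path_contents[path] = new_source
    return True, "ok"
```

The broken code never enters the workspace, so it never enters the context of a later step -- the error is handled in the turn that caused it.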


Part Seven: The Skill That Transfers

What This Means for Engineers

The harness engineering discipline is systems thinking applied to agent environments. It requires understanding language-model cognitive architecture well enough to design environments that work with it rather than against it. State management, feedback loops, error recovery, context optimization -- familiar from distributed systems engineering, applied to a new domain.

The engineers most effective in this paradigm are not the ones with the best prompting skills. They are the ones who understand how the whole system works: how context flows, where it gets corrupted, how feedback loops can be tightened, how state can be preserved across sessions, and how constraints can be enforced without micromanaging behavior.

These are not new skills in the abstract. System design, API design, error handling, testing strategy. What is new is the domain.

The Questions You Should Be Asking

When your agent system underperforms, the harness engineering mindset produces different questions:

Instead of "how do I write a better prompt?" -- "What information does the agent need that it currently cannot access?"

Instead of "why is the model making this mistake?" -- "What feedback loop is missing that would catch this before it propagates?"

Instead of "why isn't the agent doing what I told it to?" -- "What constraint in the environment is preventing correct behavior?"

This shift changes where you invest engineering effort. A better prompt that solves this specific failure is local and temporary. A better tool that prevents a category of failures is general and permanent. The harness is where permanent investment lives.

Building Your Own

You do not need OpenAI's observability stack or Anthropic's two-agent architecture to benefit. The minimal effective harness:

A persistent progress file. Read at session start, written at session end. Prevents "declare victory too early" and ensures continuity across context windows.

A structured task list. Not a vague project description -- a specific, enumerated list of verifiable completion criteria. Each item describes a user-visible behavior that can be tested end-to-end. Mark status only after verification.

Version control as first-class ceremony. Every session ends with a commit. The agent does not consider work done until the code is committed and the progress file is updated.

Browser automation if you are building anything with a UI. Playwright MCP is the current standard -- it gives your agent the ability to navigate, click, fill forms, and take screenshots. The difference between an agent that can only read code and an agent that can use the application is the same as the difference between a developer who can only read code and a developer who can run it.

The Environment Audit

If you have an agent system that underperforms, do an environment audit before reaching for a better model:

  • What information does the agent need that it cannot access?
  • Where does the agent regularly get stuck or make mistakes?
  • What feedback is missing that would catch those mistakes?
  • Where is context getting polluted with irrelevant information?
  • What constraints rely on agent judgment instead of mechanical enforcement?

Each question points to a harness improvement. Missing information becomes a new tool. Missing feedback becomes a new test or linter. Context pollution becomes a new management strategy. Unenforced constraints become mechanical checks.

This is the virtuous cycle: every failure is a signal about what the environment needs. Every improvement reduces that failure across all future sessions.


Part Eight: The Commoditization of Everything Below the Harness

There is an uncomfortable implication that deserves to be stated plainly.

In December 2025, Princeton researcher Sayash Kapoor -- one of the original SWE-bench authors -- published a CORE-Bench update. Claude Opus 4.5 scored 42% on the scientific reproducibility benchmark when run through a generic scaffold. The same model, same weights, same everything -- run through Claude Code's harness -- scored 78%. A 36-point swing. From the same model. By March 2026, on SWE-bench Pro, three different agent systems running identical Opus 4.5 scored between 49.8% and 51.8% -- the spread coming entirely from how each system managed context, tools, and feedback loops.

If the execution layer is a commodity, then the long-term competitive moat is not in the model. It is in the harness.

Organizations that invest in harness engineering -- the scaffolding, feedback loops, observability, spec tooling, and orchestration that allow agents to do reliable work at scale -- will have a durable advantage over those focused primarily on which model to use or how to prompt it.

OpenAI's Codex team built the equivalent of a custom development platform for their specific codebase. Anthropic built a harness architecture that enables months of incremental progress on complex applications. The SWE-agent team built an interface that produced dramatically better results from the same model. Stripe built isolated VM environments that merge over a thousand PRs weekly. Perplexity built an orchestration layer across 19 models. None of these advantages came from the model. They all came from the environment.

There is a pattern in how transformative technologies get misunderstood. The thing that captures public attention -- the raw capability, the benchmark score -- is rarely what determines who wins. The infrastructure layer is where durable value gets created and captured.

The web was transformative not because HTML existed but because search engines and browsers made the web navigable. Mobile was transformative not because smartphones existed but because app stores and developer tools made it possible to build on them at scale. In both cases, the platform layer that organized the underlying capability was where the durable value lived.

AI agents are following the same pattern. The capability exists. The question is who builds the environments that make it reliable, controllable, and continuously improvable.

The model is what thinks. The harness is what it thinks about. Getting that distinction right is the entire game.

Build accordingly.


Sources: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., NeurIPS 2024); Effective harnesses for long-running agents (Anthropic Engineering, November 2025); Harness engineering: leveraging Codex in an agent-first world (Lopopolo, OpenAI, February 2026); awesome-agent-harness (GitHub); Stripe Engineering: Minions Part 1 & 2 (February 2026); Google DeepMind: Aletheia (February 2026); Perplexity Computer launch (February 25, 2026); CORE-Bench update (Kapoor, Princeton, December 2025); SWE-bench Pro / SEAL Leaderboard (Scale AI, March 2026); Why SWE-bench Verified no longer measures frontier coding capabilities (OpenAI, February 23, 2026); Auggie tops SWE-Bench Pro (Augment Code, February 2026).
