
Harness Engineering: Why the Scaffolding Around the Model Is the Real Source of Agent Power

Frontier models are converging on the same capabilities. What now separates a demo from a system you can trust is the harness — the engineered scaffolding of context, tools, loops, verification, and memory wrapped around the model. This is the discipline of building it, and the kind of agent it produces.

Jithin Kumar Palepu | May 31, 2026 | 19 min read

Here is the uncomfortable truth most agent demos hide: the model is no longer the hard part. Swap GPT for Claude for Gemini in a working agent and the behaviour barely changes. Swap the harness — the code that decides what the model sees, what it can do, how its work is checked, and what it remembers — and the agent either ships or collapses.

For three years, the industry optimised the wrong variable. We chased benchmark points on the model and treated everything around it as glue code. But as flagship models converge — comparable reasoning, comparable tool use, comparable context windows — the marginal value of a slightly better model has shrunk, and the marginal value of a better harness has exploded. The agents that feel magical in 2026 are not running on secret models. They are running on extraordinary harnesses.

This piece defines harness engineering as a discipline, dissects the anatomy of a real harness component by component, looks at how Claude Code, Cursor, and Devin actually build theirs, and introduces a term we will use going forward at AgentFlow: hyper agents — the class of agent that emerges when the harness, not the model, becomes the source of capability.

What is an agent harness?

An agent harness is everything around the model call. The model itself is a stateless function: text in, text out. It has no memory between calls, no ability to act in the world, no way to know whether its last answer was correct, and no sense of a goal that spans more than the tokens currently in front of it. Every one of those gaps is filled by code you write. That code is the harness.

Concretely, a harness is responsible for:

Context assembly — deciding which tokens enter the window on every turn: instructions, relevant files, tool results, prior steps, retrieved knowledge.
The action space — the set of tools the model can call, how they are described, and how their results are returned.
The control loop — the cycle that runs the model, executes its chosen action, feeds back the observation, and decides whether to continue or stop.
Verification — checking output against reality (tests, types, schemas, the environment) and feeding failures back in.
Memory — persisting state across turns and sessions so the agent accumulates rather than forgets.
Orchestration — splitting work across sub-agents, running them in parallel, and merging their results.
Guardrails — permissions, sandboxing, and human-in-the-loop checkpoints that bound what can go wrong.

Why did the harness become the moat?

Three shifts moved the centre of gravity from weights to scaffolding.

Models commoditised at the top. The gap between the best frontier model and the second-best — and increasingly the best open model — has narrowed to the point where most production tasks are not model-limited. If your agent fails, it is rarely because the model could not reason. It is because it was handed the wrong context, given a clumsy tool, or never told its answer was wrong.

The expensive failures live outside the model. An agent that hallucinates a file path, edits the wrong function, loops forever, or silently corrupts state is failing at the harness layer. These are engineering failures — bad context, missing verification, no guardrails — not intelligence failures.

The harness is where proprietary value accrues. You cannot out-train Anthropic or Google. But you can build a harness that encodes your domain, your tools, your verification, and your data — and that is defensible. It is the same realisation we explored in context engineering and in the case for compound AI systems: the system around the model is the product.

The model supplies intelligence. The harness supplies competence. In 2026, almost everyone has access to the intelligence. Almost no one has built the competence.

The anatomy of a harness

Let us take the harness apart. Each component below is a self-contained problem with its own failure modes and its own body of technique. Get all of them right and the whole becomes far more than the sum of its parts.

1. Context engineering: the window is the scarce resource

The context window is the agent's entire working memory, and it is finite. Every token spent on irrelevant boilerplate is a token not spent on the file that actually matters. Worse, models degrade as context grows — relevant facts buried in a long window get missed, a phenomenon often called “context rot.” The harness's first job is to put the right tokens in front of the model at the right time, and nothing else.

Good context engineering is a set of deliberate moves:

Retrieve, don't dump. Pull in the specific files, rows, or documents relevant to the current step rather than the whole repository or knowledge base.
Compact aggressively. Summarise old turns once they are no longer load-bearing, keeping a running digest instead of the full transcript.
Structure the window. Put stable instructions where they are cached, volatile state where the model expects it, and the current task last, where attention is strongest.
Offload to the environment. Let the file system, the database, or a scratchpad hold state instead of carrying everything in tokens.

This is its own deep topic — we covered the patterns in detail in our guide to context engineering. The headline for harness builders: treat the window as a budget you actively manage on every turn, not a bucket you fill.
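
As a rough illustration of treating the window as a budget, here is a minimal sketch of one assembly step. The `Chunk` structure, the relevance scores, and the whitespace-based `count_tokens` are all assumptions made for the example; a real harness would use its model's tokenizer and its own retrieval scoring.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One candidate piece of context (hypothetical structure)."""
    text: str
    relevance: float  # higher means more useful for the current step


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words. A real harness would use
    # the tokenizer that matches its model.
    return len(text.split())


def assemble_context(instructions: str, task: str, chunks: list[Chunk], budget: int) -> str:
    """Build the window: stable instructions first, retrieved chunks in the
    middle, current task last, never exceeding `budget` tokens."""
    remaining = budget - count_tokens(instructions) - count_tokens(task)
    if remaining < 0:
        raise ValueError("instructions and task alone exceed the budget")

    selected = []
    # Retrieve, don't dump: take the most relevant chunks that still fit.
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = count_tokens(chunk.text)
        if cost <= remaining:
            selected.append(chunk.text)
            remaining -= cost

    return "\n\n".join([instructions, *selected, task])
```

The ordering mirrors the "structure the window" move: the stable prefix stays cacheable, and the task sits at the end. Compaction and offloading would slot in before this step, shrinking what has to compete for the budget.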

2. The action space: tools are an interface, not a feature

Tools are how the model acts — running code, querying a database, reading a file, calling an API, controlling a browser. The research lineage runs from Toolformer, which showed models could learn to call APIs, to today's standard where tool use is native. In late 2024 Anthropic introduced the Model Context Protocol (MCP), an open standard for connecting models to tools and data sources, and it has since become a common substrate across the ecosystem.

But the hard part of tools is not wiring them up — it is designing them. A tool is an interface the model has to use correctly from a one-line description, with no documentation and no chance to ask questions. The principles that make a good API for humans make a good tool for models, only more so:

Few, deep tools beat many shallow ones. A handful of powerful, well-named tools is easier for the model to wield than forty overlapping ones.
Make the description the contract. The model only knows what the schema and description say. Ambiguity there becomes misuse at runtime.
Return useful errors. A tool result that says exactly what went wrong, and ideally what to try next, lets the model self-correct on the next turn; a bare stack trace or a silent failure often does not.
Decide between tool calls and code. Sometimes the right action space is “write and run code” rather than a fixed menu — a trade-off we unpack in tool calling vs. code generation.
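
To make "the description is the contract" and "return useful errors" concrete, here is a small sketch of a single file-reading tool. The `Tool` dataclass, the result dictionary shape, and the `hint` field are illustrative assumptions, not the schema of any particular framework or of MCP.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Tool:
    """Hypothetical tool record: a name, a contract, and a callable."""
    name: str
    description: str  # the only documentation the model ever sees
    run: Callable[[dict], dict]


def read_file(args: dict) -> dict:
    """Return file contents, or an error the model can act on."""
    path = args.get("path")
    if not isinstance(path, str) or not path:
        return {"ok": False, "error": "Missing required string argument 'path'."}

    target = Path(path)
    if not target.is_file():
        # Say what went wrong and offer something to try next.
        parent = target.parent
        nearby = sorted(p.name for p in parent.iterdir())[:10] if parent.is_dir() else []
        return {
            "ok": False,
            "error": f"No file at '{path}'.",
            "hint": f"Entries in the parent directory: {nearby}",
        }
    return {"ok": True, "content": target.read_text(encoding="utf-8")}


READ_FILE = Tool(
    name="read_file",
    description=(
        "Read a UTF-8 text file. Arguments: {'path': str}. "
        "Returns {'ok': True, 'content': str} on success, or "
        "{'ok': False, 'error': str, 'hint': str} on failure."
    ),
    run=read_file,
)
```

The failure branch is where most of the design effort goes: a hallucinated path comes back with a list of real neighbours, which gives the model something to correct against on its next turn.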

3. The control loop: where an agent becomes an agent

A single model call answers a question. A loop around that call is what turns it into an agent. The canonical pattern was named by ReAct (Yao et al., 2022): interleave reasoning and acting so the model thinks, takes an action, observes the result, and thinks again.

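# Pseudocode: model, execute, update, is_goal_satisfied and the
# starting context are all supplied by the harness.
done = False
while not done:
    thought = model(context)            # reason about the goal
    action  = thought.tool_call         # choose a tool
    result  = execute(action)           # act in the environment
    context = update(context, result)   # observe + compact
    done    = is_goal_satisfied(context)  # verification gates the exit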

Every interesting design decision lives inside this loop: how context is rebuilt each turn (context engineering), what execute is allowed to do (the action space and guardrails), and — most neglected of all — how is_goal_satisfied is actually decided. That last one is verification, and it is what separates a toy from a tool.

The loop also needs failure handling the model cannot provide itself: step budgets so it cannot run forever, detection of repeated identical actions so it does not get stuck, and timeouts on tools that hang. This is closer to distributed-systems engineering than to prompting — and it is exactly the kind of reliability work that distinguishes a real harness. Designing this loop deliberately — rather than operating the agent by hand — is its own emerging discipline; see loop engineering for the primitives and the failure modes.
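
The three guards named above (a step budget, repeated-action detection, and tool timeouts) can be sketched in a few dozen lines. Everything here is an assumption for illustration: `choose_action` stands in for the model call, `tools` is a plain dictionary of callables, and the thresholds are arbitrary defaults.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class LoopAborted(Exception):
    """Raised when the harness, not the model, decides to stop."""


def run_with_timeout(fn, args, seconds):
    # Run the tool in a worker thread and stop waiting after `seconds`.
    # Caveat: the thread itself keeps running. Killing a truly hung tool
    # needs a subprocess or similar isolation.
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, args).result(timeout=seconds)
    except FutureTimeout:
        return {"ok": False, "error": f"tool timed out after {seconds}s"}
    finally:
        pool.shutdown(wait=False)


def run_agent(choose_action, tools, max_steps=20, max_repeats=3, tool_timeout=5.0):
    """`choose_action(history)` returns (tool_name, args), or None when the
    model believes it is finished. Returns the list of executed steps."""
    history = []
    last_key, repeats = None, 0
    for _ in range(max_steps):  # step budget
        action = choose_action(history)
        if action is None:
            return history
        name, args = action

        # Loop detection: the same tool with the same arguments, back to back.
        key = (name, repr(sorted(args.items())))
        repeats = repeats + 1 if key == last_key else 1
        last_key = key
        if repeats >= max_repeats:
            raise LoopAborted(f"{name} requested {repeats} times in a row with identical arguments")

        result = run_with_timeout(tools[name], args, tool_timeout)
        history.append((name, args, result))
    raise LoopAborted(f"step budget of {max_steps} exhausted")
```

A timeout is returned to the model as an ordinary observation, so it can try something else, while a stuck loop or an exhausted budget is raised to the caller, because at that point the model has shown it cannot recover on its own.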

4. Verification and self-correction: closing the loop on reality

A model is a confident guesser. Left unchecked, it will report success whether or not it succeeded. The single highest-leverage thing a harness can do is give the agent an external source of truth and force it to reckon with that truth before declaring victory.

For coding agents this is unusually clean: the compiler, the type checker, the linter, and the test suite are objective oracles. A well-built harness runs them automatically and feeds failures straight back into the loop. The agent does not get to say “done” — the tests do. Reflexion (Shinn et al., 2023) formalised the payoff: when an agent verbally reflects on a failure and carries that reflection into the next attempt, its success rate climbs sharply across retries.

Where there is no natural oracle — open-ended writing, research, judgement calls — the harness has to manufacture one: a separate critic model, a rubric, a schema validator, or a human checkpoint. The discipline is the same. Never let the actor be the sole judge of its own work.
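
For the coding case, a verify-and-retry loop might look like the sketch below. It assumes the oracle is a snippet of assertions run in a fresh Python interpreter, and that `generate(task, feedback)` is a stand-in for the model call; both are simplifications, and a production harness would execute candidates inside a sandbox, not directly on the host.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_oracle(code: str, test: str) -> tuple[bool, str]:
    """Run candidate code plus its test in a fresh interpreter.
    Returns (passed, failure_output)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=30,
        )
    return proc.returncode == 0, proc.stderr.strip()


def solve(task: str, generate, test: str, max_attempts: int = 3) -> str:
    """Ask for code, check it against the oracle, and feed failures back."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        passed, output = run_oracle(code, test)
        if passed:
            return code  # the test, not the model, declared success
        # Carry the concrete failure into the next attempt.
        feedback = f"Attempt {attempt} failed:\n{output}"
    raise RuntimeError(f"no passing solution after {max_attempts} attempts")
```

The actor never sees a path to "done" that bypasses `run_oracle`. Swapping the oracle for a schema validator, a critic model, or a human review changes the check but not the shape of the loop.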

5. Memory: turning sessions into experience

The model forgets everything between calls. The harness decides what survives. Short-term memory is the management of the current context window; long-term memory is what persists across sessions — facts, preferences, prior solutions, learned procedures.

The most striking demonstration remains Voyager (Wang et al., 2023), an agent that played Minecraft by writing executable skills and saving the working ones to a growing skill library. Each new task could build on previously mastered skills, so the agent compounded its abilities over time — lifelong learning implemented entirely in the harness, with a frozen model. We go deeper on the architectures in memory systems for AI agents.
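
The skill-library idea can be approximated with very little code. The sketch below is loosely inspired by that design and is not a reproduction of Voyager's implementation: the JSON file layout, the `SkillLibrary` class, and the word-overlap search are all assumptions chosen to keep the example self-contained (a real system would more likely retrieve by embedding similarity).

```python
import json
from pathlib import Path


class SkillLibrary:
    """A file-backed store of verified skills, keyed by name."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.skills = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.skills = {}

    def add(self, name, description, code):
        # Call this only after the skill has passed verification,
        # so the library accumulates working code and nothing else.
        self.skills[name] = {"description": description, "code": code}
        self.path.write_text(json.dumps(self.skills, indent=2), encoding="utf-8")

    def search(self, query, limit=3):
        """Rank skills by word overlap between the query and each description."""
        words = set(query.lower().split())
        scored = []
        for name, skill in self.skills.items():
            overlap = len(words & set(skill["description"].lower().split()))
            if overlap:
                scored.append((overlap, name))
        # Best overlap first, ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]
```

Because the store lives on disk, a new session with the same frozen model starts with everything earlier sessions learned: the harness retrieves matching skills and places their code into the context window before the model plans its next step.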

6. Sub-agents, orchestration, and parallelism

A single context window is a bottleneck. Complex work means more context than one window can hold and more steps than one loop can keep coherent. The answer is to decompose: spawn sub-agents, each with a fresh window scoped to one piece of the problem, run them in parallel, and merge their results.

This is powerful and treacherous in equal measure. More agents means more coordination, more places to fail, and more cost — and a single-agent design is often the right call. We walked through the real trade-offs in the architecture wars: multi-agent vs. single-agent, and the related question of when to reach for an autonomous agent at all versus a fixed pipeline in agents vs. workflows. The harness lesson: orchestration is a feature you add when the work genuinely exceeds one window — not a default.
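
When the work does outgrow one window, the fan-out and merge step can stay small. In this sketch, `worker` is a hypothetical stand-in for a sub-agent that runs its own loop in a fresh context, and threads are assumed to be adequate because the real work would be network-bound model calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagents(subtasks, worker, max_workers=4):
    """Run `worker(subtask)` for each subtask in parallel and return one
    result record per subtask, in input order."""
    def safe(subtask):
        try:
            return {"task": subtask, "ok": True, "output": worker(subtask)}
        except Exception as exc:  # one failed sub-agent should not sink the rest
            return {"task": subtask, "ok": False, "output": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, subtasks))


def merge(results):
    """Collapse sub-agent results into a short digest for the parent's window."""
    lines = []
    for r in results:
        status = "done" if r["ok"] else "FAILED"
        lines.append(f"[{status}] {r['task']}: {r['output']}")
    return "\n".join(lines)
```

The parent agent sees only the digest, never the sub-agents' full transcripts, which is the point of giving each one its own window. The coordination cost shows up in `merge`: deciding what a failed or partial result should mean for the overall task is a design problem this sketch leaves open.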

7. Guardrails: bounding the blast radius

An agent that can act can act wrongly. The harness is where you make mistakes survivable: scoped permissions so the agent can only touch what it should, sandboxing so destructive actions are contained, and human-in-the-loop checkpoints before anything irreversible. Guardrails are not friction bolted on at the end — they are what makes it responsible to give an agent real capability in the first place.
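
Two of these guardrails, scoped permissions and a human checkpoint, fit in a short sketch. The `Guard` class, the action names, and the `approve` callback are hypothetical; real sandboxing (containers, virtual machines, restricted users) sits below this layer and is not shown.

```python
from pathlib import Path


class PermissionDenied(Exception):
    """Raised when an action falls outside what the agent is allowed to do."""


class Guard:
    def __init__(self, workspace, approve, irreversible=("delete_file", "run_shell")):
        self.workspace = Path(workspace).resolve()
        self.approve = approve  # callable(action, args) -> bool, e.g. a human prompt
        self.irreversible = set(irreversible)

    def check_path(self, path):
        """Reject any path that escapes the workspace, including via '..'
        or an absolute path."""
        resolved = (self.workspace / path).resolve()
        if resolved != self.workspace and self.workspace not in resolved.parents:
            raise PermissionDenied(f"{path} is outside the workspace")
        return resolved

    def authorize(self, action, args):
        """Call before executing any tool. Raises PermissionDenied on refusal."""
        if "path" in args:
            self.check_path(args["path"])
        if action in self.irreversible and not self.approve(action, args):
            raise PermissionDenied(f"{action} was not approved")
```

In a loop like the one in section 3, `authorize` would run immediately before the tool executes, and a `PermissionDenied` would be returned to the model as an observation so it can choose a safer action.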

Harness engineering as a discipline

Put the components together and a craft emerges with its own sense of taste. A few principles separate harnesses that scale from ones that merely demo:

Design the context, not just the prompt. The prompt is one slice of the window. The whole window is the unit of design, and it is rebuilt every turn.
Give the agent ground truth. Wherever an objective check exists — tests, types, schemas, the environment — wire it into the loop and let it gate “done.”
Make tools deep and few. A small set of powerful, well-described tools beats a sprawling menu the model misuses.
Engineer for failure. Budgets, retries, timeouts, loop detection, and graceful degradation are the difference between a demo and a system.
Keep state outside the model. Files, databases, and memory stores are more reliable than tokens — and free up the window for thinking.
Add orchestration last. Reach for sub-agents only when the work truly outgrows a single window.

Real harnesses, dissected

The clearest place to see harness engineering at work is in coding agents, where the same frontier models produce wildly different results depending on the scaffolding.

Claude Code & the Claude Agent SDK

Anthropic's coding agent and Agent SDK are, in effect, a harness in a box: a control loop, a curated set of tools (read, edit, run shell, search), permission prompts before risky actions, automatic test execution, sub-agents for parallel work, and context compaction baked in. Anthropic's own field guide, Building Effective Agents, reads as a harness-design document: start simple, add complexity only when it earns its keep.

Why it matters — A productised harness you build on, rather than reinventing the loop, tools, and guardrails yourself.

Cursor

Cursor wraps a model in an IDE-shaped harness: it retrieves the right files into context automatically, applies edits with a specialised apply model, and runs an agent loop that can execute and observe.

Why it matters — Capability that comes overwhelmingly from the scaffolding, not a bespoke model.

Devin

Cognition's Devin gives its agent a full sandboxed workstation — shell, editor, and browser — plus a long-horizon planning loop, so it can work on a task over an extended run rather than a single exchange.

Why it matters — When the harness is the product, the model is just one component inside it.

ECC — the operator layer above the harness

Some harness engineering is now packaged and reusable. ECC is an open-source operator layer — skills, agents, hooks, memory persistence, and security scanning — that installs into Claude Code, Codex, Cursor, and other harnesses to make them follow your patterns. It's the clearest sign yet that “engineer your harness” has become a product category of its own.

Why it matters — If the harness is the car, this is the driver who already knows the route.

Same models. Radically different agents. The variable that changed was the harness.

From harness to hyper agents

Which brings us to the term. As harnesses mature, a qualitatively different kind of agent becomes possible — one whose capability comes mostly from the scaffolding rather than the raw model. We call it a hyper agent.

Four properties distinguish a hyper agent from an ordinary one:

Long-horizon. It works across hours or days and many sub-tasks, not a single turn — sustained by memory and compaction rather than a giant context window.
Self-orchestrating. It decomposes its own work, spawns sub-agents, and runs them in parallel without a human routing every step.
Rich substrate. It sits on a deep set of tools and an actively engineered context pipeline, so it can act broadly and see precisely.
Self-verifying. Verify-and-retry is built into its loop — it checks its work against reality and corrects before it ever reports done.

None of these properties is a model capability. Every one is a harness capability. That is the whole point: hyper agents are not waiting on the next model release. They are waiting on better engineering — which means they are buildable today, by teams willing to treat the harness as the product. It is the natural endpoint of the trajectory we sketched in the agency spectrum and the future of agentic systems.

How to start engineering your harness

If you are building agents, the shift is concrete. Stop asking “which model?” first and start asking these:

What context does the model need at each step, and how will I get exactly that — no more — into the window?
What is the smallest set of deep tools that covers the action space, and is each one described well enough to use blind?
What is my ground truth, and how does it gate “done” inside the loop?
How does the agent fail safely — budgets, retries, timeouts, permissions, sandboxing?
What must persist across turns and sessions, and where does it live outside the model?
Only then: does this work genuinely need orchestration, or will one well-built loop do?

Answer those well and the model almost stops mattering. That is the paradox at the heart of harness engineering: the better your scaffolding, the less the intelligence inside it has to carry — and the more reliably the whole system performs.

The bottom line

The model is the engine, but the harness is the car. For years we have been shopping for engines and ignoring that almost no one had built a car worth putting one in. As models converge, harness engineering becomes the discipline that decides who ships agents people trust, and hyper agents are what that discipline produces when it is taken seriously. The intelligence is a commodity. The competence is yours to build.

