Skip to main content
🌐

This article is also available in Chinese閱讀中文版 →

AI in Practice · · 9 min read

Agent Harness: What Actually Determines Whether AI Delivers or Disappoints

Agent Harness: What Actually Determines Whether AI Delivers or Disappoints

Same AI. Same intelligence. But wrap a different system around it, and the output is unrecognizable.

Most people have experienced this firsthand — ask ChatGPT a question, and sometimes it’s brilliant, sometimes it’s terrible. Same model. So what’s actually different?

In 2026, a consensus is forming across the AI industry: what determines whether AI can reliably deliver results isn’t the model itself — it’s the system wrapped around the model.

They call that system the agent harness.


Have We Been Comparing the Wrong Thing?

If you’ve been following AI over the past year, you’ve probably had this conversation:

“Is GPT or Claude better?” “Has Gemini caught up?” “Is o3’s reasoning better than Sonnet?”

Checking leaderboards, comparing benchmark scores, switching between tools — we’ve all done it.

But in 2026, something shifted.

Sean Goedecke, a GitHub engineer who worked directly on GitHub Copilot, put it this way:

“A lot of recent improvements aren’t just the models getting better. It’s the system around them that improved.”

Martin Fowler — one of the most respected voices in software engineering — wrote an article in February 2026 coining the term Harness Engineering.

Anthropic (the company behind Claude) published Effective Harnesses for Long-Running Agents.

OpenAI’s Codex team used this approach to generate over 1 million lines of production code — zero manually typed.

They’re all saying the same thing: what determines whether AI can reliably complete work isn’t the model itself — it’s what’s wrapped around it.

They call it the agent harness.


Where Does the Word Come From?

A harness is equipment placed on a horse to direct its raw power toward useful work. Reins, saddle, breastplate, blinders. A horse is incredibly powerful, but without a harness, it might run in any direction, get spooked by a noise, or stop to eat grass halfway through the job. The harness doesn’t make the horse stronger — it makes the horse’s strength reliably convert into useful work.

AI models work the same way. Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro — these models have extraordinary cognitive ability. But opening a ChatGPT chat window and typing a question is like riding a horse bareback — sometimes it goes well, sometimes it veers off, sometimes it runs in circles. Most people have experienced:

  • Asking the same question twice and getting completely different answers
  • AI confidently providing wrong information
  • Losing context halfway through a task
  • Code that runs but has bugs the AI doesn’t notice

These problems aren’t because the model isn’t smart enough. It’s because the model is running bare — no harness.

The model is the engine. The harness is the entire car.


What Is an Agent Harness?

In one sentence: the complete system wrapped around an AI model that turns raw cognitive ability into reliable output.

Note the words “complete system.” An agent harness isn’t a prompt, an instruction, or a setting. It’s an entire technology stack.

A question that comes up a lot: “Why does Claude Code feel so much smarter than Claude on the web? Is it using a different model?”

The answer: same model. Claude Chat and Claude Code can run the exact same Sonnet or Opus. The difference is entirely in the harness.

Web-based Claude receives text, returns text, done. But Claude Code does an entire additional stack of work behind the scenes:

  1. Reads the entire project structure to understand context
  2. Decides what to do based on the instruction
  3. Uses tools (read files, write files, run commands) to take real action
  4. Checks whether the result is correct
  5. If not, self-corrects and tries again
  6. Follows a set of rules throughout (don’t delete critical files, don’t leak secrets…)

That entire workflow is the harness. Same brain — one taking an exam with no notes, the other with notebooks, a calculator, scratch paper, and a review checklist. The output quality is naturally different.


What Are the Six Layers of an Agent Harness?

The concept sounds abstract, but it breaks down into six concrete layers:

LayerPlain EnglishWhat It Does
1. LoopKeeps cyclingObserve → decide → act → verify → update, repeating until done
2. ToolsCan take actionLets AI do more than talk — read files, run code, call APIs, search the web
3. ContextWhat it seesControls what information AI receives and what it doesn’t, preventing overload
4. PersistenceRemembersMaintains state across conversations and executions
5. VerificationSelf-checksRuns tests, checks syntax, self-reviews after completing work
6. ConstraintsHas boundariesWhat it can’t do, which files it can’t touch, spending limits

All six layers are systems engineering improvements, not model improvements. No need to wait for the next generation of models — just build a better harness. I’ve written about how your tools set your ceiling — the harness framework explains exactly why.

OpenClaw is a great example. OpenClaw isn’t a model — it runs Claude, GPT, or other LLMs under the hood. What OpenClaw actually sells is a complete harness: SOUL.md defines the agent’s role and constraints (layer six), a Memory system maintains context across conversations (layer four), an Agent Loop keeps it running continuously (layer one), and shell/API tools let it take real action (layer two). Harnesses can also stack. OpenClaw provides the base harness, and users can add their own layer on top via SKILL.md — encoding workflows as SOPs so the AI executes established patterns instead of interpreting ambiguous instructions from scratch every time. The more precise the SKILL.md, the more stable the AI’s output — that’s harness design in action.

The best proof comes from an experiment LangChain ran in February 2026. They tested their coding agent on an industry benchmark and scored 52.8%, ranking outside the Top 30. Then they changed nothing about the model — same model, same API — and only modified the harness. Result: 66.5%, ranking jumped to Top 5. What specifically did they change?

  1. Added a self-verification loop — before the AI says “I’m done,” it’s forced to run through a checklist
  2. Injected environmental context — scanned the directory structure and fed it to the AI before starting
  3. Anti-drift detection — detected when the AI was repeatedly editing the same file (stuck in a loop)
  4. Tuned reasoning budget — more thinking time for planning and verification, less for implementation
  5. Failure analysis — automatically analyzed failure patterns across runs

Not a single change was “use a stronger model.” All harness-level improvements.

Vercel’s case is even more direct. They cut the number of tools available to their AI agent from 15 down to just 2. Result?

Accuracy went from 80% to 100%. Token usage dropped 37%. Speed increased 3.5x.

Less is more. Constraints are part of the harness.


Why Is Everyone Suddenly Talking About Agent Harnesses in 2026?

The agent harness isn’t a new concept, but why is everyone suddenly talking about it? Because the models are good enough now. Before 2024, model capability was the bottleneck — no harness could save a model that couldn’t handle basic logic. In 2025, GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Pro raised baseline capability to a usable level.

Then everyone noticed: the gap between models keeps shrinking, but the same model under different harnesses performs like entirely different species. Two products using the same model feel completely different — different rhythm, different stability, different quality. Same brain, different harness.

That’s why Aakash Gupta wrote a piece titled: “2025 Was Agents. 2026 Is Agent Harnesses.”

The frontier is shifting from “prompt engineering” to “harness engineering” — the systems engineering of designing loops, tool chains, verification workflows, and context strategies.


What Does the Agent Harness Have to Do With Your Life?

This isn’t just a technical observation. Even without writing code or using AI tools, the harness concept offers a powerful lens. Consider this analogy:

  • Your brain = the model (raw cognitive ability)
  • Your environment, habits, tools, feedback loops = the harness

When most people want to be more productive, what do they do? Read more, learn more skills, train willpower — they’re trying to upgrade the model.

But the harness perspective tells us: the fastest change isn’t upgrading the brain — it’s upgrading the environment.

Put the phone in another room and focus improves immediately. Not because willpower got stronger, but because the environment’s constraints (layer six of the harness) changed. Design a daily writing routine at a fixed time and output stabilizes. Not because of sudden talent, but because a loop (layer one of the harness) was established.

Without designing our own harness, we live inside systems designed by others. Social media algorithms, corporate KPI structures, phone notification systems — all harnesses designed by someone else, and not for our goals. I wrote about this idea in building an AI second brain — the core concept is the same: actively designing our own information environment rather than passively accepting what’s fed to us.

Once One Agent’s Harness Is Done, What’s Next?

One agent’s harness is dialed in, and the next question arrives — what if there isn’t just one AI agent, but ten? Who does what? Who reports to whom? How is the budget allocated? Who ensures quality? What happens when agents duplicate work? What happens when one gets stuck?

These questions sound familiar because they’re exactly what every company, every team leader, every PM deals with daily. Except now, the employees are AI.

In the next article, I’ll look at this from a PM’s perspective — what happens when AI agents need “team management,” and how an open-source project called Paperclip is bringing the logic of managing people to managing AI agents.


Enjoyed this? Connect with me on LinkedIn — I’m open to collaboration, consulting, and new opportunities.

#AI #agent harness #harness engineering #agentic harness #agentic coding #Claude Code #AI Agent

FAQ

What is an agent harness?
An agent harness is the complete system wrapped around an AI model that turns raw cognitive ability into reliable output. The word comes from horse harnesses — equipment that directs a horse's raw power toward useful work. In AI, the model is the engine; the harness is the entire car. It consists of six layers: loop, tools, context management, persistence, verification, and constraints.
Why does the agent harness matter more than the model?
LangChain used the same model (gpt-5.2-codex), changed only the harness, and jumped from 52.8% to 66.5% on Terminal Bench 2.0 — going from outside Top 30 to Top 5. Vercel cut their agent's tools from 15 to just 2, and accuracy went from 80% to 100%. The real differentiator isn't the model — it's the system design.
What is harness engineering?
Harness engineering is a term coined by Martin Fowler in February 2026 — it's not about writing prompts, but designing the complete system: execution loops, tool chains, verification workflows, and context strategies. OpenAI's Codex team used this approach to generate over 1 million lines of code with zero manual typing.

Related Posts