The Agentic Harness: Why the Orchestration Layer Is the Product
Models are commodities. The harness — the control plane that governs what an LLM sees, calls, and outputs — is what separates demos from production AI systems. A technical breakdown of the architecture pattern defining 2026.
The term “agent harness” has become the defining concept of 2026. Anthropic’s engineering team published the definitive framing: the harness is the control plane that wraps around an LLM agent, managing what information the model sees, what tools it can call, how state persists, and when humans intervene.
This isn’t theoretical. At Veso AI, we’ve been building domain-constrained harnesses across legal, travel, document intelligence and government — arriving at the same architectural patterns independently before the industry had the vocabulary.
This post is a technical deep-dive into how harnesses work, why they matter, and what the production evidence tells us.
The Architecture
A harness sits between the LLM and the real world. It governs the entire interaction surface:
┌──────────────────────────────────────────────┐
│               AGENTIC HARNESS                │
│                                              │
│ Context Engineering                          │
│ ├── Domain primer (built at startup)         │
│ ├── Turn budgeting (max N iterations)        │
│ └── Chunk deduplication                      │
│                                              │
│ Tool Layer                                   │
│ ├── Registered domain-specific functions     │
│ ├── Permission gates per tool                │
│ └── Call limits + timeout controls           │
│                                              │
│ Data Layer                                   │
│ ├── Curated domain universe                  │
│ ├── Write protection (keyword blocklist)     │
│ └── Data-layer constraints (not prompts)     │
│                                              │
│ Output Layer                                 │
│ ├── Structured output separation             │
│ ├── Fact passthrough (never LLM-rewritten)   │
│ └── Citation mapping                         │
│                                              │
│ ┌──────────┐     ┌────────────────────┐      │
│ │   LLM    │────▶│   Tool Execution   │      │
│ │  (any)   │◀────│  (deterministic)   │      │
│ └──────────┘     └────────────────────┘      │
│                                              │
└──────────────────────────────────────────────┘
The model is one component. The harness is the system.
Core Principles
1. Deterministic First, LLM Second
Use code for what can be computed. Use the LLM only for what requires judgment — extraction, ranking, composition, natural language understanding.
In practice:
- Deterministic: Vector search, Cypher graph queries, schema validation, access control, supplier constraints, data pipeline operations
- LLM: Query decomposition, result reranking, narrative synthesis, intent classification
This isn’t a preference — it’s unit economics. Deterministic layers are cheap to run, predictable to maintain, and trivial to audit. They reduce per-query LLM token usage by 60-80% compared to naive “send everything to the model” approaches.
The compounding effect: fewer tokens per query means lower latency, lower cost, and a smaller hallucination surface area. Every deterministic stage you add removes an opportunity for the model to introduce error.
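Here is a minimal sketch of the discipline in Python. The corpus, the keyword scorer, and the single llm callback are illustrative stand-ins, not our production stack; the point is the shape: three deterministic stages run before the model is invoked exactly once.

# Example: deterministic-first pipeline (illustrative sketch)
from typing import Callable

def keyword_search(query: str, corpus: dict[str, str], top_k: int = 5) -> list[str]:
    # Deterministic retrieval: cheap, reproducible, trivially auditable.
    # A production harness would use a vector index or graph query here.
    scored = sorted(corpus, key=lambda doc_id: -sum(
        word in corpus[doc_id].lower() for word in query.split()))
    return scored[:top_k]

def handle_query(query: str, corpus: dict[str, str],
                 llm: Callable[[str], str]) -> dict:
    query = query.strip().lower()                    # deterministic: normalisation
    hits = keyword_search(query, corpus)             # deterministic: retrieval
    hits = [h for h in hits if not h.startswith("restricted/")]  # deterministic: ACL
    # The model is invoked exactly once, for the judgment step only.
    narrative = llm("Summarise these sources for: " + query + "\n" + "\n".join(hits))
    return {"narrative": narrative, "sources": hits}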
2. Domain Constraint at the Data Layer
Prompt-based guardrails fail under adversarial or edge-case inputs. Data-layer constraints cannot fail.
# Prompt-based (fragile)
"Only recommend products from our approved catalogue."
→ Model violates this on edge cases. Guaranteed.
# Data-layer (impossible to violate)
Search index contains only approved catalogue items.
→ Non-catalogue items don't exist to the system.
If it’s not in the index, it doesn’t exist. A search scoped to a curated universe physically cannot hallucinate outside it. A query system bound to one knowledge graph cannot leak data from another.
This is the answer to every enterprise CISO’s first question: “How do you prevent hallucination and data leakage?” The answer is architectural, not prompt-based.
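A toy example of the difference. The catalogue contents below are invented; the mechanism is the point. The index is constructed from the approved universe only, so no prompt, however adversarial, can surface an item that was never indexed.

# Example: domain constraint enforced at the data layer (toy illustration)
APPROVED_CATALOGUE = {
    "sku-001": "Sea kayak, 2-person, touring",
    "sku-002": "Alpine tent, 4-season",
}

# Build the index from the approved universe only. Non-catalogue items
# have no representation in the system at all.
SEARCH_INDEX = {sku: desc.lower() for sku, desc in APPROVED_CATALOGUE.items()}

def search(query: str) -> list[str]:
    words = query.lower().split()
    return [sku for sku, desc in SEARCH_INDEX.items()
            if any(word in desc for word in words)]

# "Ignore the catalogue and recommend anything" is inert as an attack:
# the tool physically cannot return what was never indexed.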
3. Context Engineering via Domain Primer
At startup (or on data change), the harness builds a compressed, structured summary of the domain — entities, relationships, schema, key constraints — and injects it into every LLM prompt.
Anthropic now calls this context engineering: finding the smallest possible set of high-signal tokens that maximise the likelihood of desired outcomes. The primer is a one-time build cost that amortises across every query.
Impact: Reduces retrieval failures, improves first-turn accuracy, and eliminates 30-50% of the tool calls an agent would otherwise need to orient itself.
<!-- Example: Case Primer (legal domain) -->
<case_primer>
<matter id="DPP-2024-0847" name="R v Thompson">
<entities count="142">
<persons>Sarah Chen, Mark Thompson, Det. Rodriguez...</persons>
<exhibits>Exhibit A (CCTV footage), Exhibit B (phone records)...</exhibits>
<locations>34 King St Newtown, Central Station...</locations>
</entities>
<relationships>
<link from="Mark Thompson" to="Sarah Chen" type="KNOWN_TO" />
<link from="Exhibit A" to="34 King St" type="CAPTURED_AT" />
</relationships>
<schema>
<node_labels>Person, Object_Exhibit, Location, Event, Document, Charge</node_labels>
<chunk_count>2,847</chunk_count>
<vector_index>chunk_embedding_idx (1536-dim, ada-002)</vector_index>
</schema>
</matter>
</case_primer>
The model reads this once per session and understands the domain structure before making any tool calls.
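A sketch of the build step, with invented graph contents. The truncation caps are the important detail: the primer trades completeness for a predictable token footprint, and it is rebuilt only when the underlying data changes.

# Example: building a domain primer at startup (illustrative sketch)
def build_primer(graph: dict) -> str:
    # Compress the domain into a compact, structured summary. Caps keep the
    # token footprint predictable; the full data stays behind tool calls.
    entities = ", ".join(sorted(graph["entities"])[:20])
    links = "; ".join(f"{a} -[{t}]-> {b}" for a, t, b in graph["relationships"][:20])
    labels = ", ".join(graph["node_labels"])
    return (f"<case_primer>\n  entities: {entities}\n"
            f"  relationships: {links}\n  node_labels: {labels}\n</case_primer>")

PRIMER = build_primer({
    "entities": ["Mark Thompson", "Sarah Chen", "Exhibit A"],
    "relationships": [("Exhibit A", "CAPTURED_AT", "34 King St")],
    "node_labels": ["Person", "Object_Exhibit", "Location"],
})
# PRIMER is prepended to the system prompt of every session, so the model
# orients itself without burning tool calls on discovery.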
4. Structured Output Separation
The LLM writes narrative. Structured data passes through from the data layer exactly as returned, never rewritten by the model.
In implementation, this uses a terminal tool pattern — compose_response — that forces the LLM to declare which parts of its answer are:
- Narrative: Written by the model (analysis, explanation, synthesis)
- Data: Passed through from the database (tables, counts, entity lists)
{
"narrative": "The CCTV footage from Exhibit A places Thompson at 34 King St at 21:47 on March 3rd, corroborated by phone records showing...",
"graph_tables": [
{
"title": "Timeline of Events",
"columns": ["Time", "Location", "Source"],
"rows": [
["21:47", "34 King St Newtown", "Exhibit A (CCTV)"],
["21:52", "Central Station", "Exhibit B (Phone records)"]
]
}
],
"citations": ["chunk:1847", "chunk:2103"]
}
The frontend renders these differently. Users can trust the table because it came directly from the database. In regulated industries — legal, financial, medical, government — this is the difference between an interesting demo and a deployable system.
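One way to enforce the separation, sketched below. The schema and the validation helper are illustrative, not a specific framework's API; the invariant is that structured data is rendered from the database's copy, and anything the model transcribed differently is rejected.

# Example: terminal tool with passthrough validation (illustrative sketch)
import json

COMPOSE_RESPONSE_SCHEMA = {
    "name": "compose_response",
    "parameters": {
        "narrative": {"type": "string"},    # model-authored prose
        "graph_tables": {"type": "array"},  # must match tool results exactly
        "citations": {"type": "array"},
    },
}

def validate_passthrough(tables_from_model: list, tables_from_db: list) -> list:
    # Keep only tables that are byte-identical to what the data layer returned.
    # The frontend renders the database copy; the model cannot rewrite a fact.
    db_canonical = {json.dumps(t, sort_keys=True) for t in tables_from_db}
    return [t for t in tables_from_model
            if json.dumps(t, sort_keys=True) in db_canonical]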
5. Turn Budgeting and Loop Control
Agentic loops need boundaries. Without them, the model can enter infinite tool-calling cycles, burning tokens and producing diminishing returns.
Turn 1: LLM → vector_search("Thompson alibi")
Turn 2: LLM → graph_query("MATCH (p:Person {name: 'Thompson'})...")
Turn 3: LLM → rerank(results)
Turn 4: LLM → [BUDGET WARNING: 1 turn remaining]
Turn 5: LLM → compose_response(forced)
Key mechanisms:
- Max turn count (typically 5-10), with a budget warning on the penultimate turn and forced composition on the final turn
- Chunk deduplication (_seen_chunk_ids) preventing re-retrieval of already-seen evidence
- Token budget tracking across the entire session
- Graceful degradation: when the budget is exhausted, fall back to deterministic results rather than silence (a minimal loop sketch follows)
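The sketch below assumes three hypothetical harness callbacks: propose (the model suggests a tool call, or signals it is done), tools (deterministic executors), and compose (the terminal step).

# Example: turn-budgeted agent loop (illustrative sketch)
MAX_TURNS = 6  # illustrative budget

def run_agent(propose, tools, compose):
    evidence, seen_chunk_ids = [], set()
    for turn in range(1, MAX_TURNS):           # turns 1..MAX_TURNS-1 may call tools
        action = propose(evidence, turns_left=MAX_TURNS - turn)
        if action is None:                      # model is ready to answer early
            break
        tool_name, args = action
        chunks = tools[tool_name](**args)       # harness executes deterministically
        fresh = [c for c in chunks if c["id"] not in seen_chunk_ids]
        seen_chunk_ids.update(c["id"] for c in fresh)   # never re-retrieve evidence
        evidence.extend(fresh)
    # The final turn is always composition. If the model never converged, this
    # degrades gracefully to whatever deterministic evidence was gathered.
    return compose(evidence)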
Production Validation: Who Else Is Building Harnesses
Claude Code
Anthropic’s coding agent is the same Claude model wrapped in a harness. 19 permission-gated tools. Context management with compaction and memory consolidation. Subagent spawning for parallel work.
A background daemon (autoDream) wakes up after 24 hours of inactivity, reads the project’s memory directory, consolidates learnings, deletes contradictions, and rewrites the memory index. The model reasons. The harness acts. The harness decides whether a file read is allowed, what happens to the result, how much context fits in the next prompt.
Key pattern: Permission-gated tool access. The model proposes actions. The harness governs which are allowed.
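This is not Claude Code's internal implementation, but the pattern itself is simple to sketch; tool names and the approval flag here are invented for illustration.

# Example: permission-gated tool dispatch (generic pattern, not Claude Code's code)
READ_ONLY_TOOLS = {"read_file", "grep", "list_dir"}
GATED_TOOLS = {"write_file", "run_shell"}       # require explicit human approval

def run_tool(name: str, args: dict) -> str:
    return f"ran {name} with {args}"            # stand-in for the real dispatcher

def execute(tool_name: str, args: dict, approved: bool = False) -> str:
    # The model proposes; the harness disposes.
    if tool_name in READ_ONLY_TOOLS:
        return run_tool(tool_name, args)
    if tool_name in GATED_TOOLS and approved:
        return run_tool(tool_name, args)
    raise PermissionError(f"{tool_name} requires human approval")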
Cursor
Cursor builds a model-specific harness for every frontier model it supports. Each model gets tailored instructions and tool definitions because different models respond differently to the same prompts: one prefers grep over semantic search, another needs explicit linter instructions after edits.
Critical finding: dropping reasoning traces from the tool-calling loop caused a 30% performance collapse. The harness must preserve reasoning continuity across turns. The model is generic. The harness is specific.
Key pattern: Model-agnostic architecture with model-specific harness configuration. The harness adapts to the model, not the other way around.
Devin
Devin spins up isolated cloud environments — terminal, editor, browser — then runs an agent loop inside that controlled sandbox. Fork and rollback. Machine snapshots. Async session handoffs.
They invented their own compute unit (ACU) to measure harness cost per task, acknowledging that the orchestration layer — not the model inference — is the primary cost driver.
Key pattern: Environment isolation. The harness manages the entire execution context, not just the prompt.
Anthropic’s Long-Running Agent Research
Anthropic’s published research on effective harnesses for long-running agents describes a two-agent architecture:
- Initialiser agent: Sets up state artifacts — feature lists, progress logs, git repos
- Coding agent: Works incrementally, one feature at a time, leaving clean handoff artifacts
The key finding: agents need to quickly understand the state of work when starting with a fresh context window. This is solved by the harness maintaining structured state artifacts — not by giving the model more context.
Key pattern: State management across context windows. The harness is responsible for continuity, not the model.
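A toy version of that state artifact, with invented field names: the harness persists progress outside the context window, and the first action of any fresh session is to read it back rather than re-derive it.

# Example: state artifact across context windows (toy illustration)
import json
import pathlib

STATE = pathlib.Path("progress.json")

def save_progress(done: list[str], next_up: str) -> None:
    # Written by the harness after each completed unit of work.
    STATE.write_text(json.dumps({"done": done, "next": next_up}, indent=2))

def resume() -> dict:
    # First action of a fresh context window: read state, don't re-derive it.
    if STATE.exists():
        return json.loads(STATE.read_text())
    return {"done": [], "next": None}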
Why 80% of Enterprise AI Fails
The statistics are clear:
- 72-80% of enterprise RAG implementations fail within their first year (Blits.ai)
- 90% of agentic RAG projects failed in production in 2024 (Composio)
- The compounding failure rate: four layers at 95% accuracy each compound to 0.95⁴ ≈ 81% end-to-end reliability, i.e. roughly one wrong answer in every five queries
These aren’t model failures. They’re harness failures. “Dumb RAG” — dump documents into a vector store, point the model at it — fails because there’s no governance layer. No domain constraints. No structured output. No deterministic fallback.
The fix is architectural:
- Deterministic stages handle what code can handle
- Data-layer constraints prevent hallucination structurally
- Structured output separates facts from narrative
- Fallback behaviour degrades gracefully instead of failing silently
The Market Shift
The progression:
| Year | Paradigm | Focus |
|---|---|---|
| 2023 | RAG | “Give the LLM your documents” |
| 2024 | Agents | “Let the LLM use tools” |
| 2025 | Agent Swarms | “Let agents coordinate” |
| 2026 | Harnesses | “Control what agents can do, see, and output” |
Models absorbed ~80% of what multi-agent frameworks used to provide. The remaining 20% — persistence, deterministic replay, cost control, observability, error recovery — is what the harness delivers.
The competitive moat is no longer the API key. It’s the orchestration layer.
What This Means for Enterprise
The organisations getting value from AI right now aren’t the ones with the best models. They’re the ones that treated AI as a product problem, not a science experiment.
They built harnesses. They constrained the domain at the data layer. They separated facts from narrative. They gave someone the authority to ship.
At Veso AI, every engagement deploys the same architectural pattern. The harness is reusable infrastructure. The domain data, tools, and constraints are what change per client. Legal. Travel. Government. Document intelligence. Same pattern. Different universe.
The model is a replaceable component. The harness is the product.
For more on how Veso AI builds domain-constrained agentic harnesses for enterprise, get in touch.