Interface

Programmatic Tool Calling

Code Mode — the agent writes a script that calls tools over RPC, and only the script's output returns to context.

The default agentic loop pays a tax that compounds with every tool: each call round-trips through the context window. The model emits a tool call, the harness runs it, the entire result — a 40KB file, a 200-row query, a page of search hits — is appended to the conversation, and the model reads all of it on the next turn just to decide what to do next. A five-step pipeline pays for every intermediate result five times over: once to receive it, again on every subsequent turn it stays resident.

Programmatic Tool Calling inverts this. Instead of the model being the control loop, the model writes the control loop — a short script that calls tools as ordinary functions — and the harness runs the whole script in one turn. Intermediate results live in the script’s local variables, never in the context window. Only what the script chooses to print comes back. The industry name that stuck is Code Mode.

The Shape

The script calls web_search, gets thirty hits, filters to five URLs, calls web_extract, parses the articles, and prints a synthesis. The model never sees the thirty hits or the five raw articles — it sees the synthesis. The same work in a classic loop would deposit all of it into context across five turns.

Why this is a context pattern, not just an ergonomics one: the savings are in tokens that never enter the window, not in fewer characters typed. It composes directly with Memory — Code Mode is the cleanest way to keep large, transient tool output off the L1 working set entirely — and with Cost, where it is one of the few levers that reduces input tokens rather than trading model quality for price.

Why It Converged

This is not one vendor’s trick. Three independent lines arrived at the same design within months of each other, which is the signature of a physics-driven pattern rather than a fashion.

Anthropic documented “code execution with MCP” (late 2025): instead of exposing MCP tools as direct function calls, expose them as a code API the model scripts against, so tool definitions and intermediate results stay out of context. Anthropic’s published framing reports large token reductions on multi-tool workflows.
Cloudflare shipped Code Mode on Workers: convert MCP tools into a TypeScript API and let the model write code against it, executed in a workerd isolate. Their argument is that frontier models are far better at writing idiomatic code calling a familiar API than at emitting long chains of synthetic tool-call JSON they have seen comparatively little of in training.
The CodeAct line (HuggingFace smolagents, the Executable Code Actions research) makes code the action representation itself: the agent’s “action” is a Python snippet, executed each step, with results fed back. Reported benchmarks show fewer steps and higher success than JSON tool-calling on multi-tool tasks.

Treat the specific token-reduction percentages and ship dates above as time-sensitive vendor claims — re-verify against primary sources before citing a number. The architectural convergence is the durable point.

The common thread: models reason in code more reliably than in long sequences of structured tool-call objects, and code keeps intermediate state local. Both properties point at the same implementation.

Reference Implementation: Hermes `execute_code`

Hermes Agent (Nous Research) ships Programmatic Tool Calling as a first-class core tool, execute_code, and its design is a clean checklist of the decisions this pattern forces. The codebase calls it PTC and is explicit that “only the script’s stdout is returned to the LLM; intermediate tool results never enter the context window.”

The load-bearing design choices:

Decision	Hermes’ answer	Why it matters
What returns to context	`stdout` only.	The whole point. If intermediate results leaked back, the pattern would be pure overhead.
Transport	Unix domain socket locally; file-based RPC (request/response files polled by a parent thread) for remote backends where no shared socket exists. Loopback TCP fallback on Windows.	The RPC channel has to survive the sandbox boundary. A socket the parent owns doesn’t reach into a Modal microVM; files shipped through the backend’s exec channel do.
Sandboxed tool surface	A frozen allowlist of 7 tools (`web_search`, `web_extract`, `read_file`, `write_file`, `search_files`, `patch`, `terminal`), intersected with the session’s enabled tools. The script gets stubs only for that intersection.	A script with the full toolset is a second agent with no oversight. Narrowing the surface is the same least-privilege discipline as sub-agent tool isolation.
Call ceiling	A `max_tool_calls` counter enforced by the RPC server; over-limit calls return a structured error, not a crash.	A scripted loop can fan out without the per-call human-visible turn that normally bounds it. The ceiling is the backstop.
Budget accounting	`execute_code` iterations are refunded to the agent’s iteration budget so a script that makes ten tool calls doesn’t burn ten of the agent’s ~90 turns.	The budget models reasoning turns; scripted calls aren’t reasoning turns. Charging them would punish the efficient path.
Generated API	The harness emits a `hermes_tools.py` stub module — typed functions with docstrings — that the script imports. The model writes `hermes_tools.web_search(...)`, not raw RPC.	Echoes the Cloudflare insight: give the model a clean code API, not a protocol to hand-assemble.

The remote-backend variant is the part most implementations miss. Code Mode is trivial when the script and the harness share a process; it becomes an engineering problem the moment the script runs inside an isolated sandbox — which, per Sandboxing, is exactly where untrusted-input agents should run it. File-based RPC over the backend’s own exec channel is the portable answer.

When To Reach For It

Situation	Classic tool loop	Code Mode
One tool call, result drives the next message	✅ Right tool	Overkill
Fan-out → filter → fan-out → synthesize	Floods context with intermediates	✅ Collapses to one turn, one summary
Large tool outputs the model only needs to transform, not read	Pays for the full payload every turn it stays resident	✅ Payload stays in script vars
Deterministic multi-step pipeline (extract, join, aggregate)	Model re-derives control flow each turn	✅ Control flow is code, runs once
Step where the model must reason about a result before continuing	✅ The result belongs in context	The model can’t reason about what it didn’t see

The decision rule: if the model needs to read an intermediate result to decide the next step, that result belongs in context — use the classic loop. If the model only needs the final product of a known pipeline, script it. Code Mode is for transformation, not deliberation.

Implementation Checklist

If you add Programmatic Tool Calling to a harness, the seven decisions Hermes makes are the seven you will face:

Return discipline. Only the script’s stdout crosses back. Suppress tool-handler stdout so it can’t leak into the channel.
Transport that crosses your sandbox boundary. In-process socket for local; file or pipe RPC for isolated backends; a fallback for platforms where your primary transport is flaky.
A narrowed tool allowlist for scripted calls — never the full agent toolset. Intersect with what the session already has enabled.
A hard call ceiling, returned as structured tool-result data, not an exception.
Budget refunding so scripted calls don’t consume the reasoning-turn budget.
A generated, typed code API the model imports — not a raw RPC surface it must hand-assemble.
Graceful degradation: disable the tool, or fall back, on platforms where the transport can’t run (Hermes disables native UDS on Windows and uses loopback TCP).

What Not To Do

Returning intermediate results “for transparency.” It defeats the entire pattern. If you need an audit trail, log the scripted calls out-of-band; do not feed them back into context.
Exposing the full toolset to the script. A script with delegate_task, memory, and unrestricted terminal is an unsupervised agent. Narrow the surface; see the sub-agent isolation rule — the same logic applies.
No call ceiling. A scripted while loop with a tool call inside it and no cap is a token incident waiting for a bad model sample. Bound it.
Scripting deliberation. Forcing a step that genuinely needs the model’s judgment into a silent script trades reliability for a token saving you didn’t need. Reserve Code Mode for pipelines whose control flow is known in advance.
Charging scripted calls against the reasoning budget. It makes the efficient path look expensive in your own accounting and pushes the model back toward the classic loop.

Programmatic Tool Calling is the rare optimization that improves cost, latency, and context pressure at once — but only on the workloads it fits. Used as a default it hides the model’s reasoning; used as a scalpel on transformation-heavy pipelines it is one of the highest-leverage patterns in the guide. See Anti-Patterns → Tool-Result Flooding for the failure mode it cures.