Programmatic Tool Calling
Code Mode — the agent writes a script that calls tools over RPC, and only the script's output returns to context.
The default agentic loop pays a tax that compounds with every tool: each call round-trips through the context window. The model emits a tool call, the harness runs it, the entire result — a 40KB file, a 200-row query, a page of search hits — is appended to the conversation, and the model reads all of it on the next turn just to decide what to do next. A five-step pipeline pays for every intermediate result five times over: once to receive it, again on every subsequent turn it stays resident.
Programmatic Tool Calling inverts this. Instead of the model being the control loop, the model writes the control loop — a short script that calls tools as ordinary functions — and the harness runs the whole script in one turn. Intermediate results live in the script’s local variables, never in the context window. Only what the script chooses to print comes back. The industry name that stuck is Code Mode.
The Shape
The script calls web_search, gets thirty hits, filters to five URLs, calls web_extract, parses the articles, and prints a synthesis. The model never sees the thirty hits or the five raw articles — it sees the synthesis. The same work in a classic loop would deposit all of it into context across five turns.
Why this is a context pattern, not just an ergonomics one: the savings are in tokens that never enter the window, not in fewer characters typed. It composes directly with Memory — Code Mode is the cleanest way to keep large, transient tool output off the L1 working set entirely — and with Cost, where it is one of the few levers that reduces input tokens rather than trading model quality for price.
Why It Converged
This is not one vendor’s trick. Three independent lines arrived at the same design within months of each other, which is the signature of a physics-driven pattern rather than a fashion.
- Anthropic documented “code execution with MCP” (late 2025): instead of exposing MCP tools as direct function calls, expose them as a code API the model scripts against, so tool definitions and intermediate results stay out of context. Anthropic’s published framing reports large token reductions on multi-tool workflows.
- Cloudflare shipped Code Mode on Workers: convert MCP tools into a TypeScript API and let the model write code against it, executed in a
workerdisolate. Their argument is that frontier models are far better at writing idiomatic code calling a familiar API than at emitting long chains of synthetic tool-call JSON they have seen comparatively little of in training. - The CodeAct line (HuggingFace
smolagents, the Executable Code Actions research) makes code the action representation itself: the agent’s “action” is a Python snippet, executed each step, with results fed back. Reported benchmarks show fewer steps and higher success than JSON tool-calling on multi-tool tasks.
Treat the specific token-reduction percentages and ship dates above as time-sensitive vendor claims — re-verify against primary sources before citing a number. The architectural convergence is the durable point.
The common thread: models reason in code more reliably than in long sequences of structured tool-call objects, and code keeps intermediate state local. Both properties point at the same implementation.
Reference Implementation: Hermes execute_code
Hermes Agent (Nous Research) ships Programmatic Tool Calling as a first-class core tool, execute_code, and its design is a clean checklist of the decisions this pattern forces. The codebase calls it PTC and is explicit that “only the script’s stdout is returned to the LLM; intermediate tool results never enter the context window.”
The load-bearing design choices:
| Decision | Hermes’ answer | Why it matters |
|---|---|---|
| What returns to context | stdout only. | The whole point. If intermediate results leaked back, the pattern would be pure overhead. |
| Transport | Unix domain socket locally; file-based RPC (request/response files polled by a parent thread) for remote backends where no shared socket exists. Loopback TCP fallback on Windows. | The RPC channel has to survive the sandbox boundary. A socket the parent owns doesn’t reach into a Modal microVM; files shipped through the backend’s exec channel do. |
| Sandboxed tool surface | A frozen allowlist of 7 tools (web_search, web_extract, read_file, write_file, search_files, patch, terminal), intersected with the session’s enabled tools. The script gets stubs only for that intersection. | A script with the full toolset is a second agent with no oversight. Narrowing the surface is the same least-privilege discipline as sub-agent tool isolation. |
| Call ceiling | A max_tool_calls counter enforced by the RPC server; over-limit calls return a structured error, not a crash. | A scripted loop can fan out without the per-call human-visible turn that normally bounds it. The ceiling is the backstop. |
| Budget accounting | execute_code iterations are refunded to the agent’s iteration budget so a script that makes ten tool calls doesn’t burn ten of the agent’s ~90 turns. | The budget models reasoning turns; scripted calls aren’t reasoning turns. Charging them would punish the efficient path. |
| Generated API | The harness emits a hermes_tools.py stub module — typed functions with docstrings — that the script imports. The model writes hermes_tools.web_search(...), not raw RPC. | Echoes the Cloudflare insight: give the model a clean code API, not a protocol to hand-assemble. |
The remote-backend variant is the part most implementations miss. Code Mode is trivial when the script and the harness share a process; it becomes an engineering problem the moment the script runs inside an isolated sandbox — which, per Sandboxing, is exactly where untrusted-input agents should run it. File-based RPC over the backend’s own exec channel is the portable answer.
When To Reach For It
| Situation | Classic tool loop | Code Mode |
|---|---|---|
| One tool call, result drives the next message | ✅ Right tool | Overkill |
| Fan-out → filter → fan-out → synthesize | Floods context with intermediates | ✅ Collapses to one turn, one summary |
| Large tool outputs the model only needs to transform, not read | Pays for the full payload every turn it stays resident | ✅ Payload stays in script vars |
| Deterministic multi-step pipeline (extract, join, aggregate) | Model re-derives control flow each turn | ✅ Control flow is code, runs once |
| Step where the model must reason about a result before continuing | ✅ The result belongs in context | The model can’t reason about what it didn’t see |
The decision rule: if the model needs to read an intermediate result to decide the next step, that result belongs in context — use the classic loop. If the model only needs the final product of a known pipeline, script it. Code Mode is for transformation, not deliberation.
Implementation Checklist
If you add Programmatic Tool Calling to a harness, the seven decisions Hermes makes are the seven you will face:
- Return discipline. Only the script’s stdout crosses back. Suppress tool-handler stdout so it can’t leak into the channel.
- Transport that crosses your sandbox boundary. In-process socket for local; file or pipe RPC for isolated backends; a fallback for platforms where your primary transport is flaky.
- A narrowed tool allowlist for scripted calls — never the full agent toolset. Intersect with what the session already has enabled.
- A hard call ceiling, returned as structured tool-result data, not an exception.
- Budget refunding so scripted calls don’t consume the reasoning-turn budget.
- A generated, typed code API the model imports — not a raw RPC surface it must hand-assemble.
- Graceful degradation: disable the tool, or fall back, on platforms where the transport can’t run (Hermes disables native UDS on Windows and uses loopback TCP).
What Not To Do
- Returning intermediate results “for transparency.” It defeats the entire pattern. If you need an audit trail, log the scripted calls out-of-band; do not feed them back into context.
- Exposing the full toolset to the script. A script with
delegate_task,memory, and unrestrictedterminalis an unsupervised agent. Narrow the surface; see the sub-agent isolation rule — the same logic applies. - No call ceiling. A scripted
whileloop with a tool call inside it and no cap is a token incident waiting for a bad model sample. Bound it. - Scripting deliberation. Forcing a step that genuinely needs the model’s judgment into a silent script trades reliability for a token saving you didn’t need. Reserve Code Mode for pipelines whose control flow is known in advance.
- Charging scripted calls against the reasoning budget. It makes the efficient path look expensive in your own accounting and pushes the model back toward the classic loop.
Programmatic Tool Calling is the rare optimization that improves cost, latency, and context pressure at once — but only on the workloads it fits. Used as a default it hides the model’s reasoning; used as a scalpel on transformation-heavy pipelines it is one of the highest-leverage patterns in the guide. See Anti-Patterns → Tool-Result Flooding for the failure mode it cures.