Six months on, the industry is converging: code-execution over tool registries. Anthropic’s data and Cloudflare’s Code Mode show the path. Here’s how.
When this series first documented the shift away from static, pre-registered tool calling (see Part 1, Part 2, Part 3), it argued that schema/version drift, context bloat and tooling fragility would force the move toward agents that write and run code in sandboxes. Now, Anthropic’s blog “Code execution with MCP” confirms this direction.
In the post, Anthropic describes what happens when agents are connected to hundreds or thousands of tools via Model Context Protocol (MCP) servers:
“Agents must process hundreds of thousands of tokens before even reading a request.”
The conclusion:
Loading all tool definitions into context and passing each intermediate result through the model becomes a bottleneck. The remedy? Have the model write code to call the MCP servers as APIs rather than invoking each tool directly.
The result? A 98.7% reduction in token usage by switching from tool calls to code execution.
That’s not an optimization — that’s a paradigm shift.
This is the exact transformation we’ve been exploring throughout this series — the fundamental rearchitecting of how AI agents interact with tools and systems.
This reflects the transformation described in the series: orchestration moves out of prompts and into executable code, governed by runtime policy. Static, pre-registered tool registries (including MCP tool collections) struggle in real-world enterprise systems. They lead to schema drift, version rot, context window bloat from large tool definitions, multi-hop tool chains that compound cost and failure risk, and governance/security challenges from dynamic tool access.
The alternative approach: code-first, policy-gated execution. Agents generate small programs; intent is validated through AST or code inspection; execution happens in sandboxes; everything is audited.
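As a concrete illustration of the intent-validation step, here is a minimal sketch of pre-execution code inspection using the TypeScript compiler API. The allowed-module list, rules, and function name are illustrative assumptions, not a prescribed implementation:

```typescript
import * as ts from "typescript";

// Minimal sketch: before any agent-generated program reaches the sandbox,
// check that it only imports approved capability modules and does not load
// code dynamically. The allowlist below is illustrative.
const ALLOWED_MODULES = new Set(["./servers/salesforce", "./servers/slack"]);

export function inspectIntent(source: string): string[] {
  const violations: string[] = [];
  const file = ts.createSourceFile("agent.ts", source, ts.ScriptTarget.ES2022, true);

  const visit = (node: ts.Node): void => {
    // Flag imports of modules that were not granted to this agent
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      if (!ALLOWED_MODULES.has(node.moduleSpecifier.text)) {
        violations.push(`unapproved import: ${node.moduleSpecifier.text}`);
      }
    }
    // Reject dynamic code loading outright
    if (ts.isCallExpression(node)) {
      const callee = node.expression.getText(file);
      if (callee === "eval" || callee === "require" || callee === "Function") {
        violations.push(`dynamic code loading is not allowed: ${callee}`);
      }
    }
    ts.forEachChild(node, visit);
  };

  visit(file);
  return violations;
}
```

Static inspection of this kind is a first gate, not a sandbox; it works alongside the runtime policy enforcement described later in this piece.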
What Anthropic Just Confirmed
Their post details how MCP — their own protocol for tool connectivity — breaks down at scale:
- Tool definition overload: Loading thousands of tool definitions into context consumed hundreds of thousands of tokens before the agent could even start working
- Intermediate result bloat: Multi-step workflows (for example: fetch a document from Google Drive → update a Salesforce record) required large transcripts to traverse the model, increasing both latency and token cost.
- The solution: Present MCP servers as filesystem APIs and let agents write code to orchestrate them (see the sketch below).
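In practice, “MCP servers as filesystem APIs” means the agent imports thin wrappers and writes ordinary code against them, so only the calls it actually needs enter its context. A rough sketch in the spirit of Anthropic’s example follows; the module layout, wrapper names, and IDs are illustrative:

```typescript
// Hypothetical layout: each MCP server is exposed as an importable module
// (e.g. ./servers/google-drive), so the agent loads only what it needs.
import * as gdrive from "./servers/google-drive";
import * as salesforce from "./servers/salesforce";

const doc = await gdrive.getDocument({ documentId: "abc123" });

// The document body never passes through the model's context; it flows
// directly from one call to the next inside the sandbox.
await salesforce.updateRecord({
  objectType: "SalesMeeting",
  recordId: "00Q5f000001abcXYZ",
  data: { Notes: doc.content },
});
```

The model sees only the short program it wrote and the final result; intermediate data moves between calls inside the sandbox.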
These results align closely with the pattern the series described: keep prompts lean, orchestration in code, and enforce security at runtime.
The Promise Was Seductive
When the Model Context Protocol (MCP) emerged, it felt like the answer we’d been waiting for.
A unified way for AI agents to interact with tools. No more custom wrappers for every API. Register a tool once, let the agent discover it at runtime, call it when needed. Clean. Simple. Elegant.
Enterprise architects looked at their sprawling internal systems — billing platforms, CRMs, HR databases, data warehouses, CI/CD pipelines — and thought: “Perfect. We’ll just surface these as MCP tools.”
It made sense. On paper.
But here’s what actually happened.
The Reality: You Already Have APIs
Walk into any mid-to-large organization and ask about their internal systems. You’ll find:
- REST APIs that have been running for years
- GraphQL endpoints serving dozens of clients
- gRPC services handling millions of requests daily
- Batch processing pipelines with well-defined interfaces
- Stream processors with documented schemas
These systems work. They’re battle-tested. They have monitoring, alerting, SLAs. Teams know how to operate them.
Now, turning each of these into an “MCP tool” meant:
Duplicating schema definitions — You already have an OpenAPI spec. Now you need a tool manifest that describes the same thing.
Running separate MCP servers — Just to wrap endpoints that already exist. Another process to deploy, monitor, scale.
Maintaining versioning twice — Once for the real API. Again for the tool layer.
Managing discovery twice — Your service mesh already handles discovery. Now MCP needs its own.
Handling authentication twice — Your API gateway does auth. Now the MCP layer needs it too.
Look at that list again.
You didn’t simplify anything. You added a layer. An entire parallel infrastructure that promises to reduce complexity while actively introducing it.
This is the abstraction tax. And it’s expensive.
Cloudflare discovered this exact problem. In their post “Code Mode: the better way to use MCP,” they put it bluntly:
“It turns out we’ve all been using MCP wrong.”
Their engineering team found that when agents need to string together multiple calls using traditional MCP tools, the output of each tool call must feed into the LLM’s context just to be copied over to the inputs of the next call, wasting time, energy, and tokens.
This is the serialization tax in action. Every intermediate result bloats your context window. Every hop adds latency. The model becomes a data bus instead of a reasoning engine.
The Schema Drift Problem Gets Exponentially Worse
Here’s where theory meets reality in the worst possible way.
MCP tools assume stability. You define a tool: inputs, outputs, behavior. The agent learns to use it. Beautiful.
But internal systems don’t sit still.
Week 1: Your billing API adds a new discount_code field.
Week 2: The product team renames user_id to customer_id for consistency.
Week 3: Auth team updates scopes; that endpoint now requires billing:write instead of billing:admin.
Week 4: The new payment provider integration means three fields are deprecated and two new ones are required.
Week 5: Someone merges two endpoints into one because they were always called together anyway.
This is normal. This is how systems evolve. This is healthy.
But now every change cascades:
- Update the API
- Update the tool registration
- Update the tool examples
- Update the discovery logic
- Update the agent prompts that reference this tool
- Update the documentation
- Hope no cached tool definitions are floating around
- Deal with version mismatches between what the agent expects and what the API provides
What was supposed to be “change the API, agents adapt” becomes “change the API, then update six other things, then debug why the agent is still calling the old version.”
This is schema drift, multiplied.
The “tool” stopped being an interface. It became a maintenance nightmare.
Your Context Window Is Not a Runtime
One of the most insidious problems with MCP-style tool invocation is where the integration logic lives.
It lives in the prompt.
The model’s context window gets stuffed with:
- Tool definitions — Names, descriptions, and input schemas (and sometimes output schemas) for tools discovered across connected MCP servers, surfaced so the model can decide what to call.
- Result schemas (optional) — Some tools declare an outputSchema so clients (and models) know the structure of structured results; this can also be surfaced in context alongside definitions.
- Error-handling instructions — Not part of MCP itself, but many client prompts add guidance on failures/retries so the model knows how to proceed between tool calls.
- Retry logic — Again, outside the MCP schema; typical agent loops feed each tool result back into the model and ask it what to do next, so prompts often include retry/backoff guidance.
Register many tools, and a large share of context can be consumed before the agent even starts real work — because clients load catalogs up front and then pipe intermediate results back through the model. A context window is working memory, not an execution environment.
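To make that concrete, here is roughly what a single definition looks like once it lands in context. The shape follows MCP’s tool convention (name, description, inputSchema); the specific tool and fields are invented for illustration. Multiply this by a hundred or more registered tools and the catalog alone dominates the prompt:

```typescript
// One of potentially hundreds of entries a client loads up front.
const updateAccountTool = {
  name: "salesforce_update_account",
  description:
    "Updates a Salesforce account record. Requires the 18-character account ID " +
    "and a map of fields to change. Returns the updated record. Call " +
    "salesforce_get_account first if you do not know the current field values.",
  inputSchema: {
    type: "object",
    properties: {
      accountId: { type: "string", description: "18-character Salesforce account ID" },
      fields: { type: "object", description: "Field name/value pairs to update" },
    },
    required: ["accountId", "fields"],
  },
};
```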
The Three Consequences
1. Cost and latency spiral
Every request now processes tens of thousands of tokens just to know what tools exist. The model reads massive manifests before it can think about the user’s question.
Moving orchestration to code execution with MCP cut one worked example from ~150k down to ~2k tokens (≈98.7% reduction).
2. Prompt truncation becomes a real risk
Context is finite and behaves like scarce working memory; as more definitions/results are stuffed into it, retrieval precision degrades (“context rot”). Anthropic’s guidance is to curate context and prefer just-in-time loading rather than preloading everything — precisely to avoid drowning the model’s attention budget.
Or worse: you start truncating the actual conversation history to make room for tool definitions. The agent forgets what the user asked for three messages ago because **it needs to remember how to call the billing API.**
3. The model becomes the integration layer
In direct tool-calling mode, every intermediate result is routed through the model for the next step. A simple task: “Get a document from Google Drive, update a Salesforce record, notify the team in Slack.”
With MCP tools, that becomes:
```
Agent → Tool: gdrive.getDocument
Tool → Agent: [50KB document content]
Agent processes document
Agent → Tool: salesforce.updateRecord
Tool → Agent: [2KB result with record ID]
Agent composes message
Agent → Tool: slack.postMessage
Tool → Agent: [1KB confirmation]
Agent summarizes to user
```
Look at that flow. The model is:
- The orchestrator (deciding what happens next)
- The data bus (shuttling results between tools)
- The transformer (converting outputs to inputs)
- The error handler (catching failures and retrying)
Anthropic’s post shows this exact pattern leading to 50,000-token intermediate transcripts for simple multi-step operations.
The context window was never meant to be an integration platform or a runtime environment. It’s working memory. It’s where the agent thinks.
But with tool registries, it becomes the place where data lives between hops, where transformation happens, where the entire orchestration state resides.
This is fundamentally backward.
Cloudflare measured this exact pattern and concluded:
“When the LLM can write code, it can skip all that, and only read back the final results it needs.”
Their finding:
“LLMs are better at writing code to call MCP, than at calling MCP directly.”
Why?
The special tokens used in tool calls are things LLMs have never seen in the wild. They must be specially trained to use tools, based on synthetic training data. They aren’t always that good at it.
LLMs have seen real-world code from millions of open source projects.
Cloudflare’s metaphor is perfect:
Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.
Security Teams Hated It
When MCP adoption started, security and governance teams had a feeling something was wrong. The conversations went like this:
Security: "So the agent can call any registered tool?"Engineering: "Yes, that's the point. Dynamic discovery."Security: "How do we know which tool it will call?"Engineering: "The model decides based on the task."Security: "And we approve this… how?"Engineering: "You approve the tools that get registered."Security: "So we approve 150 tools, and the agent picks which ones to use at runtime?"Engineering: "Exactly."Security: "…"
Here’s what they couldn’t articulate yet, but felt intuitively:
The Surface Area Exploded
Every tool becomes a runtime endpoint that needs its own controls and observability:
- Audit logging (who called it, when, with what parameters)
- Permission boundaries (can this agent use this tool?)
- Rate limiting (how often can it be called?)
- Egress control (what data can leave through this tool?)
- Version management (which tool version is running?)
- Incident response (what happens when a tool misbehaves?)
These aren’t “nice-to-haves” — they’re baseline requirements that the MCP spec and standard security guidance call out (validate inputs, enforce access controls, rate-limit, sanitize outputs, apply least privilege).
With traditional APIs, you secure the API. One thing to audit, one place to set permissions, one system to monitor.
With MCP tools wrapping those APIs, you now secure:
- The MCP server
- The tool registration process
- The discovery mechanism
- The invocation layer
- The underlying API (still need that)
Five attack surfaces where there used to be one.
Approval Processes Were Mis-Scoped
Security teams are used to approving specific integrations: “Service A can call endpoint X on Service B with these credentials under these conditions.” Clear boundaries. Explicit flows. Risk-assessable.
But with tool registries, the approval question fundamentally changes: “The agent can call any of these 150 tools, we don’t know in what order, depending on how it interprets the user’s request, and it might chain them in ways we haven’t seen yet.”
This isn’t actually impossible to approve — it’s just that we’re approving the wrong thing.
The problem is we’re trying to approve tools when what actually matters is flows. Which data sources can flow to which data sinks, under what conditions, with what constraints. When you approve a registry of 150 tools, you’re implicitly approving every possible composition of those tools — and that’s where the risk lives.
This is exactly what OWASP calls “Excessive Agency” in their Gen AI Security framework: giving an agent too much functionality and autonomy to pre-approve cleanly, because the model can freely compose allowed tools in ways you haven’t anticipated.
The Real Risk: Allowed + Allowed = Breach
Consider this scenario that plays out in production systems: Your agent has access to your HR system (Workday, BambooHR, whatever you use) where it can look up employee records — names, job titles, salary information, performance ratings. It also has access to Slack where it can post messages to various channels. Both are “approved.” Both seem fine in isolation.
Then someone asks the agent: “Summarize this quarter’s performance reviews and share key themes with the team in #general.”
The agent does exactly what you’d expect:
- Queries Workday for Q4 performance reviews (allowed)
- Reads through the reviews and extracts patterns (working as designed)
- Posts to #general: “Key themes: Sarah Chen exceeded expectations with 15% raise to $145K. Mike Johnson needs improvement on client communication. Jennifer Williams promoted to Senior with compensation adjustment to $160K…” (catastrophic)
Why did approvals fail? Because you approved the Workday integration (needed for legitimate HR inquiries) and you approved the Slack integration (needed for team communication). But you never explicitly approved the flow: “Employee salary and performance data from Workday → Public Slack channels where the whole company can see it.”
How to scope it right: Instead of approving tools, approve data flows. “Employee compensation and performance review data cannot be sent to Slack unless it’s been anonymized and a manager has reviewed it.” Or more broadly: “Any data classified as ‘confidential employee information’ is blocked from going to communication tools by default.”
This requires checking what kind of data is moving through your system — exactly what enterprises already do with Data Loss Prevention (DLP) tools like Microsoft Purview. When an employee tries to email a spreadsheet with 500 social security numbers, Purview blocks it. Same principle here.
Why code execution makes this easier: When orchestration happens in a governed runtime, you get a single enforcement point. The **runtime** can inspect what the agent is trying to do: “You’re reading from Workday’s compensation table and trying to post to Slack — blocked.” You’re not trying to encode these policies in prompts or hoping the model respects them — you’re enforcing them in infrastructure.
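A minimal sketch of what such a flow check could look like inside the runtime, assuming data read by the agent is tagged with a classification when it enters the sandbox. The classifications, sink names, and policy shape are illustrative:

```typescript
// Data classifications and sinks are illustrative, not a real product API.
type Classification = "public" | "internal" | "confidential-employee-data";

const blockedSinks: Record<Classification, string[]> = {
  "public": [],
  "internal": ["sendgrid"],
  "confidential-employee-data": ["slack.public", "sendgrid", "logs"],
};

// Called before any write the agent's code attempts: the runtime knows where
// the data came from and where it is about to go.
export function checkFlow(classification: Classification, sink: string): void {
  if (blockedSinks[classification].includes(sink)) {
    throw new Error(`Blocked flow: ${classification} -> ${sink}`);
  }
}

// Data read from Workday's compensation table is tagged
// "confidential-employee-data"; a subsequent post to a public Slack channel
// fails here, before any network call is made:
// checkFlow("confidential-employee-data", "slack.public"); // throws
```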
The Confused Deputy Problem
Here’s another pattern: An agent has permission to read API keys from your secrets vault (HashiCorp Vault, AWS Secrets Manager, whatever you use) because it needs legitimate access to third-party services. It also has permission to process customer refunds through your billing system (Stripe, Chargebee) for customer service workflows. Both capabilities are necessary and both are “approved.”
Then a subtle prompt injection or a logical error causes the agent to do this:
- Someone asks: “Check our payment processor status”
- Agent fetches the Stripe master admin key from the vault (allowed — it needs this to check API status)
- Due to a malicious prompt, it then processes a $50,000 refund using that admin key (technically allowed — it has refund permissions)
- But this particular refund should never have been authorized — the agent just used elevated credentials to do something it shouldn’t
The agent became a “confused deputy” — it got tricked into using high-privilege credentials for an operation that should have required additional approval.
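One way a governed runtime can blunt this pattern is to track which credentials an execution has touched and require explicit, out-of-band approval before any high-impact operation that uses them. A rough sketch, with illustrative names and thresholds:

```typescript
// Sketch: the runtime remembers which credentials this execution has pulled
// from the vault and forces a human approval step before high-impact
// operations that use them. Names and the threshold are illustrative.
interface ExecutionContext {
  credentialsUsed: Set<string>; // e.g. "stripe.master-admin-key"
  approvals: Set<string>;       // approvals granted for this specific run
}

export function guardRefund(ctx: ExecutionContext, amountUsd: number): void {
  const usedAdminKey = ctx.credentialsUsed.has("stripe.master-admin-key");
  if ((usedAdminKey || amountUsd > 1000) && !ctx.approvals.has("refund:high-risk")) {
    throw new Error("Refund blocked: elevated credentials or amount requires human approval");
  }
}
```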
Audit Trails Fragmented
The approval challenge connects directly to another failure mode: when something goes wrong, you can’t figure out what happened.
A traditional API call produces one log entry. Request comes in, response goes out, trace ID connects it to upstream and downstream services. When you need to debug or investigate, you follow the trace. Done.
With MCP tool invocation chains, the story fragments across multiple systems:
- Agent reasoning lives in model logs (why it chose this approach)
- Tool discovery lives in MCP server logs (what capabilities it found)
- Tool invocation lives in MCP server logs again (in a different format, what it called)
- The actual API call lives in your API gateway logs (what happened on the backend)
- Result processing lives back in model logs (how it interpreted the response)
Now an incident occurs — a developer reports that internal API keys appeared in a public Slack channel — and you need to reconstruct what happened. You’re stitching together five different log sources with five different timestamp formats, five different retention policies, and five different access control systems. Security teams know this pain: the longer it takes to reconstruct an incident, the more damage occurs.
Why this became impossible: You approved capabilities but never instrumented the end-to-end path. Each system logged its own slice, but nobody owns the complete story.
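A governed runtime changes this almost for free, because every outbound call already passes through one choke point. Below is a minimal sketch of wrapping each exposed API with a shared trace ID; the JSON logger and wrapper shape are illustrative, not a specific product’s API:

```typescript
// Sketch: every API surfaced to the agent is wrapped once, in the runtime,
// so a single trace ID ties the generated code, each outbound call, and the
// result together in one log stream.
export function withAudit<T extends (...args: any[]) => Promise<any>>(
  traceId: string,
  service: string,
  method: string,
  fn: T,
): T {
  return (async (...args: any[]) => {
    console.log(JSON.stringify({ traceId, service, method, args, ts: Date.now() }));
    const result = await fn(...args);
    console.log(JSON.stringify({ traceId, service, method, ok: true, ts: Date.now() }));
    return result;
  }) as T;
}

// Usage (illustrative): one trace ID per agent execution, attached to every call.
// const getAccount = withAudit(traceId, "salesforce", "getAccount", salesforce.getAccount);
```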
The Warnings Were There All Along
And then the patterns started emerging:
Tool Registries Ballooned
Teams would start with 5 tools. Then 15. Then 40. Then 100.
Each incremental addition made sense in isolation:
- “Just add one tool for the new API”
- “Just register this slight variation”
- “Just create a tool for this edge case”
But collectively, they created bloat. Similar tools with similar schemas and incremental differences.
The agent couldn’t tell them apart. Neither could the developers.
Performance Degraded Predictably
Teams noticed agents getting slower. Not because the model got worse — because:
- Context windows filled with tool definitions
- Multi-hop tool chains introduced latency at each step
- The model spent more time parsing tool manifests than thinking about the problem
One team measured: Their agent took 3.2 seconds to complete a task with direct API calls. The same task with MCP tools: 14.7 seconds. Same model, same task, 4.6x slower.
Why? Tool discovery overhead, serialization between hops, context processing cost.
Governance Teams Started Asking the Right Questions
“Why are we approving 150 tools instead of just securing the runtime?”
That’s the question that broke everything open.
Because the answer was: “We’re not sure. That’s how MCP works.”
And the response was: “Then MCP doesn’t work for us.”
There’s a Better Way — and It Was Always Obvious
For enterprises, the solution isn’t “register every API as a tool.” It’s simpler than that, and it builds on infrastructure you already have.
1. Expose Your Real APIs (They Already Exist)
You have REST endpoints. You have GraphQL schemas. You have gRPC services.
Don’t wrap them. Don’t duplicate them. Don’t create a parallel tool layer.
Just… use them. They work. They’re tested. They’re monitored. They’re governed.
2. Provide a Secure Runtime Where Agents Write Code
Instead of:
```
Agent → Tool: salesforce.getAccount
Tool → Agent: [account data]
Agent → Tool: salesforce.updateAccount
Tool → Agent: [success]
```
Do this:
```typescript
const account = await salesforce.getAccount(accountId);
account.status = 'active';
account.lastModified = new Date();
await salesforce.updateAccount(account);
```
One execution. No hops. No serialization through context. No token cost for intermediate results.
The code runs in a sandbox. It has access to real APIs. It’s fast.
3. Apply Policy at the Runtime Layer
This is where the governance model fundamentally improves. Instead of trying to approve 150 tools and hoping the agent chains them safely, you approve capabilities and enforce them at the runtime boundary.
```typescript
// Runtime policy for customer support agent
{
  permissions: {
    salesforce: ['read', 'write'], // Can read and modify customer records
    billing: ['read'],             // Can view invoices, cannot refund
    sendgrid: ['send'],            // Can send emails
    slack: ['post'],               // Can post to specific channels
    admin: []                      // No admin operations
  },
  dataFlows: {
    // Employee data cannot go to customer-facing channels
    'workday.employeeData': {
      allowedDestinations: ['slack.internal'],
      blockedDestinations: ['sendgrid', 'slack.customer-support']
    },
    // Customer PII must be redacted before logging
    'salesforce.customerPII': {
      redactInLogs: true
    }
  },
  // Can only make outbound requests to these domains
  egress: ['salesforce.com', 'api.stripe.com', 'api.sendgrid.com'],
  auditLog: true,      // Every API call is logged with trace ID
  maxDuration: '30s',  // Code execution timeout
  maxMemory: '256MB',  // Memory limit
  maxApiCalls: 50      // Rate limit per execution
}
```
The agent writes code. The runtime enforces policy. Security teams approve capabilities, not tools.
When the agent writes code to read from Salesforce and post to Slack, the runtime checks:
- Does this agent have salesforce: ['read'] permission? Yes.
- Does this agent have slack: ['post'] permission? Yes.
- Is slack.customer-support in the allowed destinations for salesforce.customerPII? No. Blocked.
The code never executes. The security violation is caught before any data moves.
Why this is better than tool approvals:
Security teams aren’t trying to predict “which combinations of these 150 tools might be dangerous.” They’re specifying “customer PII from Salesforce cannot go to external email or public Slack channels.”
This is the security model they already use for service-to-service communication. It’s **Zero Trust** applied to agent operations.
Simple. Auditable. Enforceable.
4. Keep Tool Surface Small; Delegate to Runtime
You might still have a handful of tools for truly generic operations:
- execute_code — Run sandboxed code
- search_docs — Semantic search over documentation
- analyze_data — Run analytical queries
Three tools. Not 150.
The complexity moves to where it belongs: in code that executes in a governed runtime, not in prompt definitions that bloat your context.
Where complexity lives now:
With tool registries: In the prompt. 150 tool definitions consuming 50,000+ tokens. The model must parse all of them, understand which combinations are valid, remember which parameters each requires, know how to chain them together. Before it even starts thinking about the user’s question.
With code execution: In the agent-generated code. The prompt stays lean — just the capability definitions and the task at hand. The model writes a program using patterns it knows extremely well because it’s seen millions of real-world code examples. That code is:
- Parseable by static analysis tools (you can inspect intent before execution)
- Auditable (explicit statements you can log and review)
- Enforceable (the runtime blocks operations that violate policy)
- Familiar to security teams (it’s code review, not prompt archaeology)
The complexity didn’t disappear. It moved to where it’s manageable: executable statements in a governed runtime, not scattered across prompt definitions and model context.
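Tying the pieces together, a single generic execute_code entry point might look roughly like this: static inspection first, then sandboxed execution under the policy limits shown earlier. Both helpers are placeholders for whatever inspection and sandboxing mechanism you actually use:

```typescript
// Sketch of the one generic tool the agent actually sees.
declare function inspectIntent(code: string): string[]; // static checks (see earlier sketch)
declare function runInSandbox(
  code: string,
  opts: { traceId: string; timeoutMs: number; memoryMb: number },
): Promise<string>; // isolate, container, or Wasm runtime enforcing the policy

export async function executeCode(code: string, traceId: string): Promise<string> {
  const violations = inspectIntent(code);
  if (violations.length > 0) {
    return `Rejected before execution: ${violations.join("; ")}`;
  }
  return runInSandbox(code, { traceId, timeoutMs: 30_000, memoryMb: 256 });
}
```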
Where MCP Still Makes Sense
The protocol solves real problems in specific contexts. Understanding where it works well helps clarify where it doesn’t.
1. Integrating External Ecosystems
You’re connecting to third-party SaaS vendors with divergent interfaces. An abstraction layer adds value here. One protocol to learn instead of fifty vendor-specific patterns.
Why it works: The tools are external (you don’t control them), relatively stable (vendor APIs don’t change daily), and you’re genuinely gaining abstraction value. The overhead of the protocol layer is justified by the complexity it hides.
2. Stable, Well-Defined Tool Sets
You have 5–10 tools that rarely change. They’re well-documented. They’re highly reusable. The tool registry is an asset, not a liability.
Think of something like “calculator,” “web_search,” “image_generation” — generic capabilities that are stable over time and useful across countless scenarios. The cost of loading these definitions once is justified across many uses.
Why it works: The ratio of “tool definition overhead” to “tool usage value” makes sense. You pay the token cost to load 10 well-designed, highly reusable tools, and you get massive leverage from them. Not 150 tools where the agent uses 3 per task.
What Anthropic’s Findings Mean
When Anthropic published their code execution post, they documented the overhead problem that becomes prohibitive at scale:
Tool definition bloat: Loading tool definitions into context can consume hundreds of thousands of tokens in complex scenarios — overhead that exists before the agent even begins working on the actual task.
Intermediate result processing: Multi-step operations require passing data through the model repeatedly. Simple workflows like “transcribe a meeting, extract action items, update project tracker” can mean processing tens of thousands of additional tokens just to shuttle intermediate results between tool calls.
The breakthrough: Switching from tool calls to code execution achieved a 98.7% reduction in token usage.
This isn’t theoretical optimization. This is measured, in production, by the company that built MCP. They didn’t set out to prove tool registries were broken — they encountered the scaling problem while building real systems and documented what they found.
The Arc: From April to Now
Six months ago, in Part 1 of this series, the argument was:
“The future isn’t a giant toolbox you have to predefine. It’s agents that generate the tools they need, on the fly.”
The comparison was to DLL Hell — fragile tool registries, version mismatches, cascading failures.
The solution: **code-first orchestration where agents write programs in secure sandboxes instead of selecting from pre-registered tools.**
That’s now production architecture. Cloudflare’s Code Mode converts tools into TypeScript APIs and lets agents write code. Anthropic presents MCP servers as filesystem APIs with code execution. OpenAI’s Code Interpreter remains their most successful pattern. The model writes programs because it’s trained on millions of real code examples, not synthetic tool calls. Different companies, same conclusion: this is how agents scale.
Conclusion
MCP promised to simplify agent-tool interaction. For internal systems, it delivered the opposite:
- Added maintenance burden (schema duplication)
- Introduced drift (versioning across layers)
- Bloated context windows (tool definitions everywhere)
- Complicated governance (security at the wrong layer)
- Degraded performance (multi-hop serialization)
The better path: Runtime-governed code execution.
Agents write programs. Runtimes enforce policy. APIs stay as they are. Security gets a control plane they understand.
MCP wasn’t wrong in intent. It was solving the wrong problem. External tool integration? Yes. Internal system orchestration? No.
Internal systems don’t need more registries.
They need a governed runtime where agents can do what they do best — write logic — while your systems stay secure, efficient, and ready to evolve.
The evidence from Anthropic and Cloudflare aligns with this direction. The question now is how fast organizations adapt.