Most production failures are system failures, not model failures.
The Real Problem
In November 2025, a multi-agent research tool slipped into a recursive loop that ran for 11 days before anyone noticed — resulting in a $47,000 API bill. No stop condition, no budget limit, no alert.
When agent systems fail in production, the root cause is rarely a poor model response. Most failures trace back to missing controls: no safeguards against duplicate actions, no rules for shared data, and no clear stopping conditions when something goes wrong.
**Consider a common scenario.** An agent helps with operational work. It opens a support ticket, but the tool call times out. The agent retries. Now there are two tickets. Another agent sees both and escalates them. The result: duplicate work, duplicate customer contact, and cleanup costs that exceed the original issue.
A quieter failure is even more costly. A tool call times out after the external system has already processed the request. The agent interprets the timeout as “nothing happened” and retries. Now you have a duplicate refund, a duplicate message, or a duplicate job. The incident is not caused by a bad model output. It is caused by uncertain outcomes combined with unsafe retry logic.
In many organizations, the dominant production failures are control failures: no way to prevent duplicate actions, no rules for updating shared data, and no mechanism to pause or escalate when tool results are uncertain.
Most production failures are system failures, not model failures.
Operating Contracts: The Foundation of Reliability
MIT’s 2025 State of AI in Business report found 95% of AI pilots fail to reach production. The causes: trust breakdown, integration fragility, cost overruns, lack of observability — all control layer gaps.
Every agent workflow needs an explicit definition of “done” that the system can enforce. These operating contracts include:
- Acceptance criteria and stop conditions — what must be true, and when to stop
- Decision rules and handoff rules — who has authority, when to escalate
- Budgets — limits on time, cost, and tool calls
- Tool boundaries — scope of access, approvals, reversibility
- State rules — who owns data, how conflicts resolve
- Audit trail — what gets recorded for replay
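To make this concrete, here is a minimal sketch of such a contract as data the orchestrator can enforce. The class and field names (`OperatingContract`, `max_tool_calls`, and so on) are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class OperatingContract:
    """Machine-checkable definition of 'done' for one agent workflow (illustrative fields)."""
    acceptance_criteria: list[str]          # conditions that must verifiably hold before "done"
    max_steps: int = 20                     # stop condition: hard cap on reason/act iterations
    max_tool_calls: int = 50                # budget: total external calls allowed
    max_cost_usd: float = 25.0              # budget: spend ceiling for the whole run
    max_wall_clock_s: int = 3600            # budget: time limit
    allowed_tools: dict[str, str] = field(default_factory=dict)  # tool name -> "read" | "write"
    approval_required: set[str] = field(default_factory=set)     # irreversible actions gated by a human
    escalate_to: str = "human_queue"        # handoff target when criteria cannot be verified

# Example contract for the ticketing scenario above.
contract = OperatingContract(
    acceptance_criteria=["ticket created exactly once", "ticket linked to incident"],
    allowed_tools={"ticketing.search": "read", "ticketing.create": "write"},
    approval_required={"ticketing.create"},
)
```

The point is not the specific fields but that every limit lives in data the control layer can check, rather than in prose the model may or may not follow.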
This is the boundary between a demo and a production system. Once agents interact with real systems, teams encounter the same problems regardless of architecture: runaway costs, duplicate actions, wrong decisions shipped, low trust.
The Four Coordination Patterns
Most multi-agent systems in production converge on a small set of coordination patterns. Hybrid approaches exist, but these four appear repeatedly.
The four coordination patterns most multi-agent systems converge on.
1. ReAct (Reasoning and Acting)
Best suited for investigative work where each step depends on what the agent just learned. The loop is simple: observe context, decide the next step, call a tool, read results, repeat. A customer service agent investigating a billing dispute works this way — pulling records, checking logs, looking up policy, each step informing the next.
This works when acceptance criteria are verifiable (“ticket created once and linked to incident”). It breaks when completion criteria are vague — loops expand, retries create duplicates, and retrieval quietly feeds bad context forward.
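A rough sketch of that loop with the contract's limits enforced outside the model; `llm_decide`, `call_tool`, and the shape of `decision` are placeholders for whatever model and tool layer you use.

```python
def react_loop(task, contract, llm_decide, call_tool):
    """Observe -> decide -> act -> read results, until criteria hold or a limit trips."""
    context = [task]
    for _ in range(contract.max_steps):             # stop condition enforced by the loop, not the model
        decision = llm_decide(context)              # model proposes the next step (placeholder object)
        if decision.kind == "finish":
            if decision.criteria_met:
                return decision.answer
            return escalate(task, "completion criteria not verifiable")
        if decision.tool not in contract.allowed_tools:
            return escalate(task, f"tool {decision.tool} outside contract scope")
        observation = call_tool(decision.tool, decision.args)
        context.append(observation)                 # each result informs the next decision
    return escalate(task, "max_steps reached")      # budget exhausted: hand off instead of looping

def escalate(task, reason):
    # Placeholder: route the case to a human queue with the reason attached.
    return {"status": "escalated", "task": task, "reason": reason}
```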
2. Manager-Worker
Designed for throughput — getting more done in parallel. A manager breaks a goal into tasks, workers execute simultaneously, the manager consolidates results. Vendor due diligence works this way — parallel workers pull financials, check reputation, review security certs, and analyze references while the manager assembles findings.
This works when tasks are genuinely separable and each worker has a clear lane. It breaks when workers write to shared state without coordination — two agents updating the same CRM field differently, or both triggering customer outreach because no one owned that action.
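A minimal fan-out/fan-in sketch where each worker writes only to its own result slot, so no two workers touch the same field; `plan_subtasks`, `run_worker`, and `consolidate` are assumed placeholders, and subtasks are modeled as simple (name, payload) pairs.

```python
from concurrent.futures import ThreadPoolExecutor

def manager_worker(goal, plan_subtasks, run_worker, consolidate):
    """Fan out separable subtasks, fan results back in; each worker owns its own output slot."""
    subtasks = plan_subtasks(goal)                   # manager decomposes the goal into (name, payload) pairs
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(subtasks), 1)) as pool:
        futures = {pool.submit(run_worker, payload): name for name, payload in subtasks}
        for future, name in futures.items():
            results[name] = future.result()          # one writer per key: no shared-field contention
    return consolidate(goal, results)                # manager assembles the findings
```

The design choice that matters here is the one-writer-per-key rule; the parallelism itself is incidental.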
3. Critic/Reviewer
Used when output needs to meet a standard before it ships. A generator produces output, a reviewer checks it against acceptance criteria, the loop continues until it passes or escalates. Drafting regulated customer communications works this way — one agent writes, another checks compliance and brand guidelines, failures go back with specific feedback.
This works when the reviewer has something concrete to check: required fields, sources cited, no prohibited claims.
It breaks when criteria are subjective — “make it more professional” leads to endless revision cycles where you pay in latency and cost without gaining reliability.
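A sketch of a bounded review loop against concrete, machine-checkable criteria; the check functions shown are illustrative examples, not a real compliance ruleset.

```python
def review_loop(brief, generate, checks, max_revisions=3):
    """Generate, check against concrete criteria, revise with specific feedback, or escalate."""
    draft = generate(brief, feedback=None)
    for attempt in range(max_revisions + 1):
        unmet = [name for name, check in checks.items() if not check(draft)]
        if not unmet:
            return {"status": "approved", "draft": draft}
        if attempt < max_revisions:
            draft = generate(brief, feedback=unmet)   # failures go back with specific reasons
    return {"status": "escalated", "draft": draft, "unmet": unmet}

# Checks are concrete and machine-verifiable, not matters of taste (illustrative examples).
checks = {
    "has_required_disclaimer": lambda d: "past performance" in d.lower(),
    "no_prohibited_claims": lambda d: "guaranteed return" not in d.lower(),
}
```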
4. Mixture-of-Agents (Parallel Propose + Arbiter)
Useful for decisions under uncertainty — when multiple perspectives help before committing. Multiple agents produce alternatives in parallel, an arbiter selects or aggregates the final output. Investment recommendations work this way — three agents analyze the same data, an arbiter weighs consensus and disagreement, and produces a final recommendation with confidence levels.
A clarification: Mixture-of-Agents (MoA) means inference-time ensembling with explicit selection or aggregation — not Mixture-of-Experts (MoE), which is internal single-model routing.
This works when selection is tied to evidence and explicit rules. It breaks when termination criteria are loose — parallelism multiplies costs, and shared context errors produce multiple confident options that are wrong in the same way.
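A compact sketch of parallel propose plus arbiter with a cost ceiling; `proposers` and `arbiter` are placeholders, and the budget mirrors the contract sketch above.

```python
def mixture_of_agents(case, proposers, arbiter, max_cost_usd):
    """Collect alternative proposals, then select or aggregate with explicit rules."""
    proposals, spent = [], 0.0
    for propose in proposers:                        # could equally fan out in parallel
        result = propose(case)                       # each returns a recommendation plus cited evidence
        spent += result.cost_usd
        proposals.append(result)
        if spent > max_cost_usd:                     # parallelism multiplies cost: enforce the budget
            break
    # Arbiter weighs agreement and disagreement and ties the pick to evidence, not confidence alone.
    return arbiter(case, proposals)
```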
Coordination patterns define how agents collaborate, but they do not remove the need for operating contracts. Once tools and shared state are involved, failure modes converge across patterns, so controls remain mandatory.
Many “agent patterns” discussed online are simply layered capabilities that change behavior and cost, not substitutes for coordination or governance.
Agentic RAG adds retrieval to ReAct or Manager-Worker loops — the agent pulls information from documents or databases as part of its reasoning. The recurring failure: treating retrieved snippets as ground truth without checking provenance, freshness, or relevance.
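One way to guard against that is a simple pre-filter on provenance and freshness before snippets enter the agent's context. The snippet fields and the 90-day threshold below are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def vet_snippets(snippets, trusted_sources, max_age_days=90):
    """Keep only retrieved snippets with known provenance and acceptable freshness (illustrative policy)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    vetted = []
    for s in snippets:                               # assumed shape: {"text", "source", "updated_at"}
        if s.get("source") not in trusted_sources:
            continue                                 # unknown provenance: do not treat as ground truth
        updated_at = s.get("updated_at")             # assumed to be a timezone-aware datetime
        if updated_at and updated_at < cutoff:
            continue                                 # stale: drop rather than feed forward
        vetted.append(s)
    return vetted
```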
Self-Reflection adds a reviewer loop to any pattern, implemented as a separate critic or an internal check step. It can improve quality, but increases latency and cost. Without clear acceptance criteria and a stop condition, it becomes expensive looping.
Code-as-Action is ReAct where actions are code execution rather than tool calls. Powerful for data analysis and automation, but it raises the bar on isolation, permissions, and rollback. The production failure mode is rarely “bad code” — it is unsafe execution and weak reversibility.
Memory and tool protocols expand what agents can remember and affect. Memory is not truth — it is a cache. Without write rules, agent notes overwrite systems of record, different agents read different versions of truth, and yesterday’s context leaks into today’s case. Standardized connector protocols like MCP make it easier to add tools quickly, but faster tool expansion increases surface area for failure unless scopes, rate limits, and audit are already enforced.
Orchestration: The Control Layer
Enforcement should be deterministic by default.
Orchestration is the control layer for agent work. It assigns work, enforces constraints, gates risky actions, and records what happened.
The model can propose routing or draft plans, but budgets, permissions, stop conditions, and audit logging should not depend on the model’s judgment. If the orchestrator cannot prove a constraint is satisfied, the agent does not proceed.
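A minimal sketch of that deterministic gate, reusing the contract fields from the earlier sketch; `proposed` and `run_state` are assumed placeholder objects carrying the proposed action and the run's running totals.

```python
def gate_action(proposed, contract, run_state):
    """Deterministic checks the model cannot override: scope, budgets, approvals (illustrative fields)."""
    access = contract.allowed_tools.get(proposed.tool)        # "read" | "write" | None
    if access is None:
        return "deny: tool outside contract scope"
    if run_state.tool_calls + 1 > contract.max_tool_calls:
        return "deny: tool-call budget exceeded"
    if run_state.cost_usd + proposed.estimated_cost_usd > contract.max_cost_usd:
        return "deny: cost budget exceeded"
    if access == "write" and proposed.tool in contract.approval_required and not proposed.approved_by:
        return "hold: human approval required"
    return "allow"                                            # only now does the orchestrator execute the call
```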
Keeping State Under Control
Research from Galileo found coordination failures account for 37% of multi-agent system breakdowns — state inconsistency, deadlocks, and resource contention.
Shared state is where most multi-agent systems quietly fall apart.
When state lives in prompts, scattered caches, or ad hoc memory notes, agents start reading different versions of truth. One agent updates a customer record while another is still working from stale data. The result is drift — small inconsistencies that compound until someone notices the wrong email went out or the same refund was issued twice.
Production systems treat workflow state as the single source of truth, make write ownership explicit, and handle conflicts deterministically. When that discipline is missing, parallel workers overwrite each other and incident reviews turn into “who touched this last.”
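A small sketch of what that discipline can look like: one state object, explicit field ownership, and a version check that rejects writes based on stale reads. The class and its rules are illustrative, not a prescribed design.

```python
class WorkflowState:
    """Single source of truth with explicit write ownership and versioned conflict detection."""
    def __init__(self, owners):
        self._data = {}                  # field -> (value, version)
        self._owners = owners            # field -> the one agent allowed to write it

    def read(self, field):
        value, version = self._data.get(field, (None, 0))
        return value, version

    def write(self, agent, field, value, expected_version):
        if self._owners.get(field) != agent:
            raise PermissionError(f"{agent} does not own {field}")
        _, version = self._data.get(field, (None, 0))
        if version != expected_version:              # someone else wrote since this agent last read
            raise RuntimeError(f"conflict on {field}: re-read before writing")
        self._data[field] = (value, version + 1)
```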
Preventing Duplicate Actions
Tool calls are where agents touch the real world — creating tickets, changing records, sending messages, issuing refunds. These operations fail in messy ways: timeouts, partial success, eventual consistency.
The common failure: an agent makes a call, the system times out, but the action actually went through. The agent retries, and now you have two tickets, two refunds, two emails sent. The customer gets confused, the team scrambles to clean up, and trust erodes.
An action ledger solves this. Each external action gets a stable key and a recorded outcome — attempted or confirmed. When something times out, the system checks the ledger and the external system before trying again. No more blind retries.
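A minimal sketch of that ledger, in the spirit of the idempotency keys payment APIs such as Stripe document; `external.exists` and `external.call` are placeholders for your tool layer, and the ledger here is just a dict.

```python
import hashlib

def action_key(tool, args):
    """Stable key for one intended external action (same intent -> same key)."""
    return hashlib.sha256(f"{tool}:{sorted(args.items())}".encode()).hexdigest()

def perform_once(ledger, external, tool, args):
    """Record intent before calling out; on retry, check the ledger and the external system first."""
    key = action_key(tool, args)
    entry = ledger.get(key)
    if entry == "confirmed":
        return "already done"                        # never repeat a confirmed action
    if entry == "attempted" and external.exists(tool, args):
        ledger[key] = "confirmed"                    # the earlier timeout actually succeeded
        return "already done"
    ledger[key] = "attempted"
    result = external.call(tool, args)               # may still time out; the ledger survives
    ledger[key] = "confirmed"
    return result
```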
Knowing When to Stop
Agents do not naturally stop — they keep going until something tells them to. If that “something” is vague, you get two outcomes: the agent stops too early and ships something broken, or it keeps iterating and burns through your budget.
Production systems need clear completion rules. What conditions must be true for this task to be done? If those conditions cannot be verified, or if key information is missing, the agent should hand off to a human rather than guess.
Stop conditions that work in production:
- Maximum steps, maximum tool calls, maximum wall-clock time
- Bounded retries with verification (check external state before retrying)
- Abstain policy: insufficient evidence triggers escalation to a human
Budgets provide a backstop — set limits on how many steps an agent can take, how many tools it can call, how much it can spend, and how long it can run. When something fails or returns an unclear result, the agent should escalate, not keep trying alternate paths indefinitely.
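The abstain policy can be sketched as a completion check that escalates rather than guesses when a criterion cannot be verified; `verify` is a placeholder that returns True, False, or None for "cannot tell", and the criteria come from the contract sketch earlier.

```python
def check_done(contract, evidence, verify):
    """Verify completion criteria against evidence; abstain and escalate when verification is impossible."""
    results = {c: verify(c, evidence) for c in contract.acceptance_criteria}
    if all(v is True for v in results.values()):
        return {"status": "done"}
    if None in results.values():
        return {"status": "escalate", "reason": "insufficient evidence to verify completion"}
    return {"status": "continue", "unmet": [c for c, v in results.items() if v is False]}
```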
Controlling What Agents Can Touch
The scope of what an agent can access determines how bad a failure can get.
Read operations are low risk — let agents query broadly to gather context. Write operations are where damage happens. Agents should only have write access to what the workflow actually needs.
Irreversible actions — sending emails, issuing refunds, modifying customer records — need approval gates and, where possible, a way to undo them if something goes wrong.
Security follows the same principle. Treat retrieved content and user input as untrusted. Validate what agents send to external systems. Isolate data between customers. Gate outbound actions. Have kill switches ready and logs detailed enough to reconstruct what happened.
Building the Audit Trail
Post-incident, you either have the data to understand what happened, or you are guessing.
Logging the final output is not enough. The audit trail needs every tool call, every parameter, every result, every state change, every decision point. With that in place, teams can replay real incidents with fixed inputs and figure out exactly where things went wrong.
This is what turns reliability from an art into engineering. Instead of tweaking prompts and hoping, you run the same case through the same logic and verify the fix actually works.
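A bare-bones sketch of such a trail as an append-only JSONL file; the class name and format are assumptions, and a real system would record far richer payloads for each tool call, result, and state change.

```python
import json
import time

class AuditTrail:
    """Append-only record of events, detailed enough to replay a case step by step."""
    def __init__(self, path):
        self.path = path

    def record(self, event_type, payload):
        entry = {"ts": time.time(), "type": event_type, "payload": payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry, default=str) + "\n")

    def replay(self):
        """Yield events in order so a fixed-input rerun can be compared against the incident."""
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)
```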
Summary: What Your Control Layer Needs
A production-grade control layer does not need to be large. It needs to be strict in a few places:
- Shared state with clear write ownership and conflict handling
- Action ledger to track and prevent duplicates
- Clear completion criteria that can be verified
- Stop conditions and enforced budgets
- Scoped tool access, approvals for irreversible actions, rollback paths where feasible
- Audit trail that supports replay
To know if it is working, track what matters: spend per completed case, duplicate action rate, correction rate (how often humans undo agent work), and stability under replay (same evidence produces same decision).
Stage autonomy gradually — recommend first, then execute-with-approval, then narrow autonomous execution where reversibility is strong. Keep kill switches ready.
Scale agent work by scaling controls, not agent count.
References
- Yao, S., et al. (2022). “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv:2210.03629
- Wu, Q., et al. (2023). “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” arXiv:2308.08155
- Shinn, N., et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366
- Google Cloud (2025). “Choose a Design Pattern for Your Agentic AI System.” Cloud Architecture Center
- Microsoft Azure (2025). “AI Agent Orchestration Patterns.” Azure Architecture Center
- Model Context Protocol (2025). “MCP Specification.” modelcontextprotocol.io
- Stripe. “Idempotent Requests.” API Documentation
- Australian Government (2025). “Risk Analysis Tools for Governed LLM-Based Multi-Agent Systems.” Gradient Institute Report