Most production failures are system failures, not model failures.
The Real Problem
In November 2025, a multi-agent research tool slipped into a recursive loop that ran for 11 days before anyone noticed — resulting in a $47,000 API bill. No stop condition, no budget limit, no alert.
When agent systems fail in production, the root cause is rarely a poor model response. Most failures trace back to missing controls: no safeguards against duplicate actions, no rules for shared data, and no clear stopping conditions when something goes wrong.
**Consider a common scenario.** An agent helps with operational work. It opens a support ticket, but the tool call times out. The agent retries. Now there are two tickets. Another agent sees both and escalates them. The result: duplicate work, duplicate customer contact, and cleanup costs that exceed the original issue.
A quieter failure is even more costly. A tool call times out after the external system has already processed the request. The agent interprets the timeout as “nothing happened” and retries. Now you have a duplicate refund, a duplicate message, or a duplicate job. The incident is not caused by a bad model output. It is caused by uncertain outcomes combined with unsafe retry logic.
In many organizations, the dominant production failures are control failures: no way to prevent duplicate actions, no rules for updating shared data, and no mechanism to pause or escalate when tool results are uncertain.
Most production failures are system failures, not model failures.
Operating Contracts: The Foundation of Reliability
MIT’s 2025 State of AI in Business report found 95% of AI pilots fail to reach production. The causes: trust breakdown, integration fragility, cost overruns, lack of observability — all control layer gaps.
Every agent workflow needs an explicit definition of “done” that the system can enforce. These operating contracts include:
- Acceptance criteria and stop conditions — what must be true, and when to stop
- Decision rules and handoff rules — who has authority, when to escalate
- Budgets — limits on time, cost, and tool calls
- Tool boundaries — scope of access, approvals, reversibility
- State rules — who owns data, how conflicts resolve
- Audit trail — what gets recorded for replay
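To make this concrete, here is a minimal sketch of such a contract as data the orchestrator can enforce. The class and field names (`OperatingContract`, `max_tool_calls`, and so on) are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class OperatingContract:
    """Machine-checkable definition of 'done' for one agent workflow (illustrative fields)."""
    acceptance_criteria: list[str]          # conditions that must verifiably hold before "done"
    max_steps: int = 20                     # stop condition: hard cap on reason/act iterations
    max_tool_calls: int = 50                # budget: total external calls allowed
    max_cost_usd: float = 25.0              # budget: spend ceiling for the whole run
    max_wall_clock_s: int = 3600            # budget: time limit
    allowed_tools: dict[str, str] = field(default_factory=dict)  # tool name -> "read" | "write"
    approval_required: set[str] = field(default_factory=set)     # irreversible actions gated by a human
    escalate_to: str = "human_queue"        # handoff target when criteria cannot be verified

# Example contract for the ticketing scenario above.
contract = OperatingContract(
    acceptance_criteria=["ticket created exactly once", "ticket linked to incident"],
    allowed_tools={"ticketing.search": "read", "ticketing.create": "write"},
    approval_required={"ticketing.create"},
)
```

The point is not the specific fields but that every limit lives in data the control layer can check, rather than in prose the model may or may not follow.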
This is the boundary between a demo and a production system. Once agents interact with real systems, teams encounter the same problems regardless of architecture: runaway costs, duplicate actions, wrong decisions shipped, low trust.
The Four Coordination Patterns
Most multi-agent systems in production converge on a small set of coordination patterns. Hybrid approaches exist, but these four appear repeatedly.
The four coordination patterns most multi-agent systems converge on.
1. ReAct (Reasoning and Acting)
Best suited for investigative work where each step depends on what the agent just learned. The loop is simple: observe context, decide the next step, call a tool, read results, repeat. A customer service agent investigating a billing dispute works this way — pulling records, checking logs, looking up policy, each step informing the next.
This works when acceptance criteria are verifiable (“ticket created once and linked to incident”). It breaks when completion criteria are vague — loops expand, retries create duplicates, and retrieval quietly feeds bad context forward.
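A rough sketch of that loop with the contract's limits enforced outside the model; `llm_decide`, `call_tool`, and the shape of `decision` are placeholders for whatever model and tool layer you use.

```python
def react_loop(task, contract, llm_decide, call_tool):
    """Observe -> decide -> act -> read results, until criteria hold or a limit trips."""
    context = [task]
    for _ in range(contract.max_steps):             # stop condition enforced by the loop, not the model
        decision = llm_decide(context)              # model proposes the next step (placeholder object)
        if decision.kind == "finish":
            if decision.criteria_met:
                return decision.answer
            return escalate(task, "completion criteria not verifiable")
        if decision.tool not in contract.allowed_tools:
            return escalate(task, f"tool {decision.tool} outside contract scope")
        observation = call_tool(decision.tool, decision.args)
        context.append(observation)                 # each result informs the next decision
    return escalate(task, "max_steps reached")      # budget exhausted: hand off instead of looping

def escalate(task, reason):
    # Placeholder: route the case to a human queue with the reason attached.
    return {"status": "escalated", "task": task, "reason": reason}
```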
2. Manager-Worker
Designed for throughput — getting more done in parallel. A manager breaks a goal into tasks, workers execute simultaneously, the manager consolidates results. Vendor due diligence works this way — parallel workers pull financials, check reputation, review security certs, and analyze references while the manager assembles findings.
This works when tasks are genuinely separable and each worker has a clear lane. It breaks when workers write to shared state without coordination — two agents updating the same CRM field differently, or both triggering customer outreach because no one owned that action.
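A minimal fan-out/fan-in sketch where each worker writes only to its own result slot, so no two workers touch the same field; `plan_subtasks`, `run_worker`, and `consolidate` are assumed placeholders, and subtasks are modeled as simple (name, payload) pairs.

```python
from concurrent.futures import ThreadPoolExecutor

def manager_worker(goal, plan_subtasks, run_worker, consolidate):
    """Fan out separable subtasks, fan results back in; each worker owns its own output slot."""
    subtasks = plan_subtasks(goal)                   # manager decomposes the goal into (name, payload) pairs
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(subtasks), 1)) as pool:
        futures = {pool.submit(run_worker, payload): name for name, payload in subtasks}
        for future, name in futures.items():
            results[name] = future.result()          # one writer per key: no shared-field contention
    return consolidate(goal, results)                # manager assembles the findings
```

The design choice that matters here is the one-writer-per-key rule; the parallelism itself is incidental.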
3. Critic/Reviewer
Used when output needs to meet a standard before it ships. A generator produces output, a reviewer checks it against acceptance criteria, the loop continues until it passes or escalates. Drafting regulated customer communications works this way — one agent writes, another checks compliance and brand guidelines, failures go back with specific feedback.
This works when the reviewer has something concrete to check: required fields, sources cited, no prohibited claims.
It breaks when criteria are subjective — “make it more professional” leads to endless revision cycles where you pay in latency and cost without gaining reliability.
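A sketch of a bounded review loop against concrete, machine-checkable criteria; the check functions shown are illustrative examples, not a real compliance ruleset.

```python
def review_loop(brief, generate, checks, max_revisions=3):
    """Generate, check against concrete criteria, revise with specific feedback, or escalate."""
    draft = generate(brief, feedback=None)
    for attempt in range(max_revisions + 1):
        unmet = [name for name, check in checks.items() if not check(draft)]
        if not unmet:
            return {"status": "approved", "draft": draft}
        if attempt < max_revisions:
            draft = generate(brief, feedback=unmet)   # failures go back with specific reasons
    return {"status": "escalated", "draft": draft, "unmet": unmet}

# Checks are concrete and machine-verifiable, not matters of taste (illustrative examples).
checks = {
    "has_required_disclaimer": lambda d: "past performance" in d.lower(),
    "no_prohibited_claims": lambda d: "guaranteed return" not in d.lower(),
}
```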
4. Mixture-of-Agents (Parallel Propose + Arbiter)
Useful for decisions under uncertainty — when multiple perspectives help before committing. Multiple agents produce alternatives in parallel, an arbiter selects or aggregates the final output. Investment recommendations work this way — three agents analyze the same data, an arbiter weighs consensus and disagreement, and produces a final recommendation with confidence levels.
A clarification: Mixture-of-Agents (MoA) means inference-time ensembling with explicit selection or aggregation — not Mixture-of-Experts (MoE), which is internal single-model routing.
This works when selection is tied to evidence and explicit rules. It breaks when termination criteria are loose — parallelism multiplies costs, and shared context errors produce multiple confident options that are wrong in the same way.
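A compact sketch of parallel propose plus arbiter with a cost ceiling; `proposers` and `arbiter` are placeholders, and the budget mirrors the contract sketch above.

```python
def mixture_of_agents(case, proposers, arbiter, max_cost_usd):
    """Collect alternative proposals, then select or aggregate with explicit rules."""
    proposals, spent = [], 0.0
    for propose in proposers:                        # could equally fan out in parallel
        result = propose(case)                       # each returns a recommendation plus cited evidence
        spent += result.cost_usd
        proposals.append(result)
        if spent > max_cost_usd:                     # parallelism multiplies cost: enforce the budget
            break
    # Arbiter weighs agreement and disagreement and ties the pick to evidence, not confidence alone.
    return arbiter(case, proposals)
```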
Coordination patterns define how agents collaborate, but they do not remove the need for operating contracts. Once tools and shared state are involved, failure modes converge across patterns, so controls remain mandatory.
Many “agent patterns” discussed online are simply layered capabilities that change behavior and cost, not substitutes for coordination or governance.
Agentic RAG adds retrieval to ReAct or Manager-Worker loops — the agent pulls information from documents or databases as part of its reasoning. The recurring failure: treating retrieved snippets as ground truth without checking provenance, freshness, or relevance.
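One way to guard against that is a simple pre-filter on provenance and freshness before snippets enter the agent's context. The snippet fields and the 90-day threshold below are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def vet_snippets(snippets, trusted_sources, max_age_days=90):
    """Keep only retrieved snippets with known provenance and acceptable freshness (illustrative policy)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    vetted = []
    for s in snippets:                               # assumed shape: {"text", "source", "updated_at"}
        if s.get("source") not in trusted_sources:
            continue                                 # unknown provenance: do not treat as ground truth
        updated_at = s.get("updated_at")             # assumed to be a timezone-aware datetime
        if updated_at and updated_at < cutoff:
            continue                                 # stale: drop rather than feed forward
        vetted.append(s)
    return vetted
```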
Self-Reflection adds a reviewer loop to any pattern, implemented as a separate critic or an internal check step. It can improve quality, but increases latency and cost. Without clear acceptance criteria and a stop condition, it becomes expensive looping.
Code-as-Action is ReAct where actions are code execution rather than tool calls. Powerful for data analysis and automation, but it raises the bar on isolation, permissions, and rollback. The production failure mode is rarely “bad code” — it is unsafe execution and weak reversibility.
Memory and tool protocols expand what agents can remember and affect. Memory is not truth — it is a cache. Without write rules, agent notes overwrite systems of record, different agents read different versions of truth, and yesterday’s context leaks into today’s case. Standardized connector protocols like MCP make it easier to add tools quickly, but faster tool expansion increases surface area for failure unless scopes, rate limits, and audit are already enforced.
Orchestration: The Control Layer
Enforcement should be deterministic by default.
Orchestration is the control layer for agent work. It assigns work, enforces constraints, gates risky actions, and records what happened.
The model can propose routing or draft plans, but budgets, permissions, stop conditions, and audit logging should not depend on the model’s judgment. If the orchestrator cannot prove a constraint is satisfied, the agent does not proceed.
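A minimal sketch of that deterministic gate, reusing the contract fields from the earlier sketch; `proposed` and `run_state` are assumed placeholder objects carrying the proposed action and the run's running totals.

```python
def gate_action(proposed, contract, run_state):
    """Deterministic checks the model cannot override: scope, budgets, approvals (illustrative fields)."""
    access = contract.allowed_tools.get(proposed.tool)        # "read" | "write" | None
    if access is None:
        return "deny: tool outside contract scope"
    if run_state.tool_calls + 1 > contract.max_tool_calls:
        return "deny: tool-call budget exceeded"
    if run_state.cost_usd + proposed.estimated_cost_usd > contract.max_cost_usd:
        return "deny: cost budget exceeded"
    if access == "write" and proposed.tool in contract.approval_required and not proposed.approved_by:
        return "hold: human approval required"
    return "allow"                                            # only now does the orchestrator execute the call
```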
Keeping State Under Control
Research from Galileo found coordination failures account for 37% of multi-agent system breakdowns — state inconsistency, deadlocks, and resource contention.
Shared state is where most multi-agent systems quietly fall apart.
When state lives in prompts, scattered caches, or ad hoc memory notes, agents start reading different versions of truth. One agent updates a customer record while another is still working from stale data. The result is drift — small inconsistencies that compound until someone notices the wrong email went out or the same refund was issued twice.
Production systems treat workflow state as the single source of truth, make write ownership explicit, and handle conflicts deterministically. When that discipline is missing, parallel workers overwrite each other and incident reviews turn into “who touched this last.”
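A small sketch of what that discipline can look like: one state object, explicit field ownership, and a version check that rejects writes based on stale reads. The class and its rules are illustrative, not a prescribed design.

```python
class WorkflowState:
    """Single source of truth with explicit write ownership and versioned conflict detection."""
    def __init__(self, owners):
        self._data = {}                  # field -> (value, version)
        self._owners = owners            # field -> the one agent allowed to write it

    def read(self, field):
        value, version = self._data.get(field, (None, 0))
        return value, version

    def write(self, agent, field, value, expected_version):
        if self._owners.get(field) != agent:
            raise PermissionError(f"{agent} does not own {field}")
        _, version = self._data.get(field, (None, 0))
        if version != expected_version:              # someone else wrote since this agent last read
            raise RuntimeError(f"conflict on {field}: re-read before writing")
        self._data[field] = (value, version + 1)
```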
Preventing Duplicate Actions
Tool calls are where agents touch the real world — creating tickets, changing records, sending messages, issuing refunds. These operations fail in messy ways: timeouts, partial success, eventual consistency.
The common failure: an agent makes a call, the system times out, but the action actually went through. The agent retries, and now you have two tickets, two refunds, two emails sent. The customer gets confused, the team scrambles to clean up, and trust erodes.
An action ledger solves this. Each external action gets a stable key and a recorded outcome — attempted or confirmed. When something times out, the system checks the ledger and the external system before trying again. No more blind retries.
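A minimal sketch of that ledger, in the spirit of the idempotency keys payment APIs such as Stripe document; `external.exists` and `external.call` are placeholders for your tool layer, and the ledger here is just a dict.

```python
import hashlib

def action_key(tool, args):
    """Stable key for one intended external action (same intent -> same key)."""
    return hashlib.sha256(f"{tool}:{sorted(args.items())}".encode()).hexdigest()

def perform_once(ledger, external, tool, args):
    """Record intent before calling out; on retry, check the ledger and the external system first."""
    key = action_key(tool, args)
    entry = ledger.get(key)
    if entry == "confirmed":
        return "already done"                        # never repeat a confirmed action
    if entry == "attempted" and external.exists(tool, args):
        ledger[key] = "confirmed"                    # the earlier timeout actually succeeded
        return "already done"
    ledger[key] = "attempted"
    result = external.call(tool, args)               # may still time out; the ledger survives
    ledger[key] = "confirmed"
    return result
```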
Knowing When to Stop
Agents do not naturally stop — they keep going until something tells them to. If that “something” is vague, you get two outcomes: the agent stops too early and ships something broken, or it keeps iterating and burns through your budget.
Production systems need clear completion rules. What conditions must be true for this task to be done? If those conditions cannot be verified, or if key information is missing, the agent should hand off to a human rather than guess.
Stop conditions that work in production:
- Maximum steps, maximum tool calls, maximum wall-clock time
- Bounded retries with verification (check external state before retrying)
- Abstain policy: insufficient evidence triggers escalation to a human
Budgets provide a backstop — set limits on how many steps an agent can take, how many tools it can call, how much it can spend, and how long it can run. When something fails or returns an unclear result, the agent should escalate, not keep trying alternate paths indefinitely.
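The abstain policy can be sketched as a completion check that escalates rather than guesses when a criterion cannot be verified; `verify` is a placeholder that returns True, False, or None for "cannot tell", and the criteria come from the contract sketch earlier.

```python
def check_done(contract, evidence, verify):
    """Verify completion criteria against evidence; abstain and escalate when verification is impossible."""
    results = {c: verify(c, evidence) for c in contract.acceptance_criteria}
    if all(v is True for v in results.values()):
        return {"status": "done"}
    if None in results.values():
        return {"status": "escalate", "reason": "insufficient evidence to verify completion"}
    return {"status": "continue", "unmet": [c for c, v in results.items() if v is False]}
```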
Controlling What Agents Can Touch
The scope of what an agent can access determines how bad a failure can get.
Read operations are low risk — let agents query broadly to gather context. Write operations are where damage happens. Agents should only have write access to what the workflow actually needs.
Irreversible actions — sending emails, issuing refunds, modifying customer records — need approval gates and, where possible, a way to undo them if something goes wrong.
Security follows the same principle. Treat retrieved content and user input as untrusted. Validate what agents send to external systems. Isolate data between customers. Gate outbound actions. Have kill switches ready and logs detailed enough to reconstruct what happened.
Building the Audit Trail
Post-incident, you either have the data to understand what happened, or you are guessing.
Logging the final output is not enough. The audit trail needs every tool call, every parameter, every result, every state change, every decision point. With that in place, teams can replay real incidents with fixed inputs and figure out exactly where things went wrong.
This is what turns reliability from an art into engineering. Instead of tweaking prompts and hoping, you run the same case through the same logic and verify the fix actually works.
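A bare-bones sketch of such a trail as an append-only JSONL file; the class name and format are assumptions, and a real system would record far richer payloads for each tool call, result, and state change.

```python
import json
import time

class AuditTrail:
    """Append-only record of events, detailed enough to replay a case step by step."""
    def __init__(self, path):
        self.path = path

    def record(self, event_type, payload):
        entry = {"ts": time.time(), "type": event_type, "payload": payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry, default=str) + "\n")

    def replay(self):
        """Yield events in order so a fixed-input rerun can be compared against the incident."""
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)
```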
Summary: What Your Control Layer Needs
A production-grade control layer does not need to be large. It needs to be strict in a few places:
- Shared state with clear write ownership and conflict handling
- Action ledger to track and prevent duplicates
- Clear completion criteria that can be verified
- Stop conditions and enforced budgets
- Scoped tool access, approvals for irreversible actions, rollback paths where feasible
- Audit trail that supports replay
To know if it is working, track what matters: spend per completed case, duplicate action rate, correction rate (how often humans undo agent work), and stability under replay (same evidence produces same decision).
Stage autonomy gradually — recommend first, then execute-with-approval, then narrow autonomous execution where reversibility is strong. Keep kill switches ready.
Scale agent work by scaling controls, not agent count.
References
- Yao, S., et al. (2022). “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv:2210.03629
- Wu, Q., et al. (2023). “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” arXiv:2308.08155
- Shinn, N., et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366
- Google Cloud (2025). “Choose a Design Pattern for Your Agentic AI System.” Cloud Architecture Center
- Microsoft Azure (2025). “AI Agent Orchestration Patterns.” Azure Architecture Center
- Model Context Protocol (2025). “MCP Specification.” modelcontextprotocol.io
- Stripe. “Idempotent Requests.” API Documentation
- Australian Government (2025). “Risk Analysis Tools for Governed LLM-Based Multi-Agent Systems.” Gradient Institute Report