From ReAct to LLM orchestration — how intelligent routing transforms monolithic agents into production-grade multi-agent systems that actually work
When you’re building agentic AI systems that need to scale beyond toy examples, you quickly discover that the routing mechanism — how tasks get distributed across your system — becomes the make-or-break factor. This article explores how router-based agents work at multiple levels, from the micro-decisions within a single agent to orchestrating dozens of specialized agents across production systems.


Introduction: Why Routing Matters in Agentic AI
Think about how your brain works when you’re solving a complex problem. You don’t throw your entire cognitive capacity at every tiny decision. Instead, you route different aspects of the problem to specialized mental processes — visual processing here, logical reasoning there, memory recall somewhere else. This natural division of labor is precisely what makes router-based agents so powerful in AI systems.
A router in agentic AI acts as an intelligent traffic controller, but unlike traditional API routing or load balancing, these routers make semantic decisions. They understand what a task actually means and route it accordingly. Here’s what sets them apart:
- Task semantics: Understanding the intent and context behind each request
- Agent capabilities: Matching tasks to agents with the right expertise
- System constraints: Working within resource limitations and state dependencies
- Cost-performance trade-offs: Choosing the right model size for the job
- Transparency: Making decisions that can be traced and audited
As systems evolve from single-agent implementations to complex multi-agent orchestrations, the routing mechanism becomes the critical factor that determines whether your system scales gracefully or collapses under its own complexity.
The Monolithic Agent Problem
Let’s be honest about why we need routers in the first place. Traditional monolithic agent architectures hit hard limits in production:
Token bloat becomes unavoidable. When a single agent handles everything from simple math to complex research, it carries around context that’s irrelevant 90% of the time. You’re paying for those tokens on every request.
Semantic drift creeps in. An agent trying to be a jack-of-all-trades ends up being a master of none. Its reasoning degrades as competing objectives blur its focus.
Failure cascades spread like dominoes. One agent fails, and your entire system comes crashing down.
Cost explosion hits fast. You’re using GPT-4 for tasks that GPT-4o-mini or even a simple rule could handle just fine.
Auditability vanishes. Try explaining to compliance officers why your monolithic agent made a particular decision when all you have is a black box.
Routers solve these problems through what we call scope management — each agent gets a focused responsibility with clear, limited scope. Think of it as the Unix philosophy applied to AI: make each agent do one thing well.
Three Levels of Router Architecture
Routers operate at three distinct levels in agentic systems, each with different characteristics and implementation patterns:
Level 1: Intra-Agent Routers
These are the micro-decisions happening inside individual agents. When an agent implements the ReAct pattern (Reasoning and Acting), it’s essentially routing between different action paths based on its current state. The agent might decide: “Should I search the web, or do I have enough information to answer directly? Should I use chain-of-thought reasoning, or is this straightforward enough for a direct response?”
Level 2: Inter-Agent Routers
This is where specialized agents hand off work to each other within a workflow. Imagine a research pipeline where one agent gathers raw data, another analyzes it, and a third synthesizes findings. The routers coordinate these handoffs, determining which downstream agent should receive work based on what came before.
Level 3: Orchestrator-Level Routers
At the top level, master agents or orchestrators implement what’s known as the Supervisor pattern. They decompose user requests into subtasks and delegate to appropriate specialized agents. This is the architectural pattern that lets you build systems with dozens of specialized agents without descending into chaos.
Implementing Intra-Agent Routing: The ReAct Pattern
The ReAct pattern, introduced by Yao et al. in their 2023 paper “ReAct: Synergizing Reasoning and Acting in Language Models,” represents the foundation of intra-agent routing. It’s elegant in its simplicity: agents alternate between thinking (reasoning), doing (acting), and learning (observing).

Here’s how it works in practice. Your agent receives a task and enters a loop:
- Thought: The agent reasons about the current situation — what information does it have, what’s missing, what should it do next?
- Action: Based on that reasoning, it executes a specific action like calling a tool or API
- Observation: It examines the result and updates its understanding
- Repeat: The cycle continues until the task is complete
This might seem verbose compared to just having an agent directly answer a question, but the benefits are substantial. Every step is traceable, which is crucial for debugging and compliance. The agent can course-correct when an action doesn’t produce expected results. And crucially, it can integrate external information dynamically rather than relying solely on its training data.
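Stripped of any framework, the loop itself fits in a few lines. The sketch below is illustrative only: llm, parse_action, and run_tool are hypothetical placeholders for your model client, output parser, and tool registry.

# Minimal ReAct-style loop (illustrative sketch; llm, parse_action, and run_tool are hypothetical helpers)
def react_loop(task: str, max_steps: int = 5) -> str:
    scratchpad = []  # holds the Thought / Action / Observation trace
    for _ in range(max_steps):
        # Thought + Action: the model reasons over the trace and proposes the next step
        step = llm.invoke(f"Task: {task}\nTrace so far: {scratchpad}\nWhat next?")
        action = parse_action(step)          # e.g. {"tool": "web_search", "input": "..."} or {"final": "..."}
        if "final" in action:                # the model decided it can answer directly
            return action["final"]
        # Observation: execute the chosen tool and feed the result back into the trace
        observation = run_tool(action["tool"], action["input"])
        scratchpad.append({"thought": step, "action": action, "observation": observation})
    return "Stopped after max_steps iterations without a final answer"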
Two Approaches to Intra-Agent Routing
LLM-Based Routing: The default approach where the language model itself decides which tool to invoke next. This is flexible and handles ambiguous situations well, but it’s token-intensive and introduces latency with each decision.
Confidence-Based Routing: A more deterministic approach where you evaluate confidence scores for different paths and follow the highest-confidence option. This adds overhead but can improve decision quality and provides better auditability.
The code patterns for these are straightforward with frameworks like LangGraph:
from langgraph.graph import StateGraph
from pydantic import BaseModel

class AgentState(BaseModel):
    messages: list
    current_task: str
    reasoning_trace: list

def agent_node(state: AgentState):
    # LLM decides which tool to invoke
    response = llm.invoke([
        {"role": "system", "content": AGENT_PROMPT},
        {"role": "user", "content": state.current_task}
    ])

    # Parse and execute tool decision
    tool_call = response.tool_calls[0]
    tool_result = tools[tool_call["name"]].invoke(tool_call["args"])

    # Track reasoning for auditability
    state.reasoning_trace.append({
        "decision": tool_call["name"],
        "reasoning": response.content
    })
    return state

The key insight here is that intra-agent routing isn’t just about efficiency — it’s about creating agents that can explain their decisions step by step.
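The LangGraph node above shows the LLM-based variant. For the confidence-based variant, a rough sketch looks like the following, where score_path is a hypothetical scorer (a lightweight classifier or heuristic), not part of any framework.

# Sketch of confidence-based routing; score_path is a hypothetical scorer you would supply
def route_by_confidence(task: str, paths: list[str], threshold: float = 0.6) -> str:
    # Score every candidate path, e.g. {"web_search": 0.82, "answer_directly": 0.41}
    scores = {path: score_path(task, path) for path in paths}
    best_path, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        # No path is confident enough, so escalate instead of guessing
        return "ask_for_clarification"
    return best_path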
Inter-Agent Routing: Coordinating Specialized Experts
Once you have multiple specialized agents, you need patterns for routing work between them. This is where things get interesting because you’re no longer just routing within a single reasoning loop — you’re orchestrating a dance between different expert systems.

The Semantic Routing Pattern
Semantic routing uses the meaning of the request to determine which agent should handle it. Unlike keyword matching or rule-based routing, semantic routing understands context and nuance.
Here’s a practical example. Imagine you’re building a customer service system with three specialized agents:
- A product expert that knows your catalog inside and out
- A policy agent that handles returns, warranties, and terms
- A technical support agent for troubleshooting
When a customer asks “My new laptop won’t charge,” a keyword-based router might get confused — is this about products or technical support? Semantic routing understands that this is a technical issue requiring troubleshooting, not a product inquiry.
The implementation looks like this:
from langchain.embeddings import OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity

class SemanticRouter:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.agent_profiles = {
            "product_expert": "Handles product catalogs, features, comparisons, availability",
            "policy_agent": "Manages returns, refunds, warranties, terms of service",
            "technical_support": "Troubleshoots technical issues, setup, configuration"
        }
        # Pre-compute embeddings for agent profiles
        self.profile_embeddings = {
            name: self.embeddings.embed_query(description)
            for name, description in self.agent_profiles.items()
        }

    def route(self, user_request: str) -> str:
        # Embed the user request
        request_embedding = self.embeddings.embed_query(user_request)

        # Find most similar agent profile
        similarities = {
            agent: cosine_similarity([request_embedding], [profile_emb])[0][0]
            for agent, profile_emb in self.profile_embeddings.items()
        }
        return max(similarities, key=similarities.get)

This approach scales well because adding new agents just means adding new profile descriptions — no retraining or complex rule updates required.
The Hierarchical Routing Pattern
Sometimes routing decisions need to happen in stages. You might have a first-level router that determines the domain (sales, support, billing), then second-level routers within each domain that select specific agents.
Think of it like a hospital’s triage system. First, you determine which department (emergency, outpatient, specialist). Then within that department, which specific doctor or resource is needed. This hierarchical approach reduces decision complexity at each level and makes the system more maintainable.
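One way to sketch this is as two stages of the semantic routing idea from the previous section: a domain-level router picks the department, then a per-domain router picks the specific agent. The domain and agent names here are illustrative.

# Two-stage hierarchical routing sketch, reusing the SemanticRouter idea from above
class HierarchicalRouter:
    def __init__(self, domain_router, agent_routers: dict):
        self.domain_router = domain_router      # first level: picks "sales" / "support" / "billing"
        self.agent_routers = agent_routers      # second level: one router per domain

    def route(self, user_request: str) -> str:
        domain = self.domain_router.route(user_request)          # e.g. "support"
        agent = self.agent_routers[domain].route(user_request)   # e.g. "technical_support"
        return agent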
Orchestrator-Level Routing: The Supervisor Pattern
The Supervisor pattern, widely implemented in frameworks like LangGraph and now codified in the langgraph-supervisor library, represents the most sophisticated level of routing. A supervisor agent acts as a project manager, breaking down complex user requests into subtasks and delegating them to specialized worker agents.

Here’s what makes the Supervisor pattern powerful:
Dynamic task decomposition: The supervisor analyzes the user’s request and determines what subtasks are needed, in what order, and which agents should handle them. This isn’t hardcoded — it adapts based on the specific request.
Failure handling: If a worker agent fails, the supervisor can reroute the work, try a different approach, or gracefully escalate to a human. The system doesn’t just crash.
Result synthesis: Worker agents might return partial results in different formats. The supervisor collects these, resolves any conflicts, and synthesizes a coherent final response.
State management: The supervisor maintains the overall state of the workflow, tracking what’s been completed, what’s in progress, and what’s still needed.
A practical implementation looks like this:
from langgraph_supervisor import create_supervisor
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4")

# Create specialized agents
research_agent = create_react_agent(
    model=model,
    tools=[web_search, document_retrieval],
    name="research_expert",
    prompt="You are a research expert. Gather comprehensive information."
)

analysis_agent = create_react_agent(
    model=model,
    tools=[data_analysis, statistical_tools],
    name="analysis_expert",
    prompt="You are a data analyst. Find patterns and insights."
)

# Create supervisor that orchestrates them
workflow = create_supervisor(
    [research_agent, analysis_agent],
    model=model,
    prompt="""You are a team supervisor managing research and analysis experts.
For each user request:
1. Determine which agents are needed
2. Delegate subtasks to appropriate agents
3. Collect and synthesize results
4. Provide a coherent final answer"""
)

# Compile with memory for stateful interactions
app = workflow.compile(checkpointer=InMemorySaver())
The beauty of this pattern is that it scales. Need to add a new specialized agent? Just add it to the supervisor’s agent list. The supervisor handles the orchestration logic, and individual agents remain focused on their expertise.
LLM-Based Routing: Intelligence at the Core
One of the most powerful — and often overlooked — capabilities in modern multi-agent systems is using LLMs themselves as the routing decision-makers. Instead of hardcoding rules or using simple pattern matching, you let the language model’s intelligence determine which agent should handle each task. This represents a fundamental shift in how we think about orchestration.

Why LLM-Based Routing Changes Everything
Traditional routing relies on predefined rules: “If the request contains ‘SQL’, route to the database agent.” But what happens when someone asks, “Show me sales trends from last quarter”? That requires database access, but doesn’t mention SQL. Rule-based routing breaks down.
LLM-based routing understands intent. It recognizes that “sales trends from last quarter” implies:
- Data retrieval (database agent)
- Possibly visualization (charting agent)
- Temporal analysis (analytics agent)
The LLM can reason about these implications and make intelligent routing decisions that adapt to context, nuance, and even ambiguity.
The Dynamic Supervisor Pattern with LLM Routing
Here’s how to implement a supervisor that uses an LLM to make routing decisions:
from langchain_openai import ChatOpenAI
from pydantic import BaseModel
from typing import Literal, Optional

class RoutingDecision(BaseModel):
    """Structured output for routing decisions"""
    selected_agent: Literal["research", "analysis", "writing", "coding", "database"]
    reasoning: str
    confidence: float
    backup_agent: Optional[str] = None

class IntelligentSupervisor:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4", temperature=0)
        # Define agent capabilities clearly
        self.agent_profiles = {
            "research": "Gathers information from web, documents, and APIs. Expert at finding current data and synthesizing sources.",
            "analysis": "Performs statistical analysis, finds patterns, creates insights from data. Strong with numbers and trends.",
            "writing": "Creates polished written content, reports, and documentation. Excellent at clear communication.",
            "coding": "Writes, debugs, and optimizes code across multiple languages. Handles technical implementation.",
            "database": "Queries databases, manages data, performs CRUD operations. SQL and database expertise."
        }

    def route(self, user_request: str, context: dict = None) -> RoutingDecision:
        """
        Let the LLM intelligently decide which agent should handle the request.
        Context can include conversation history, previous results, user preferences.
        """
        # Build rich context for the LLM
        context_str = ""
        if context:
            context_str = f"\nContext: {context}"

        routing_prompt = f"""You are an intelligent routing coordinator for a multi-agent system.

Available agents and their capabilities:
{self._format_agent_profiles()}

User request: {user_request}
{context_str}

Analyze the request carefully:
1. What is the user really asking for?
2. What expertise is needed to fulfill this request?
3. Which agent is best positioned to help?
4. How confident are you in this routing decision?
5. What would be a good backup if the primary agent struggles?

Provide a routing decision with clear reasoning."""

        # LLM makes the routing decision with structured output
        decision = self.llm.with_structured_output(RoutingDecision).invoke(routing_prompt)

        # Log for observability
        print("Routing Decision:")
        print(f"  Agent: {decision.selected_agent}")
        print(f"  Confidence: {decision.confidence:.2%}")
        print(f"  Reasoning: {decision.reasoning}")
        if decision.backup_agent:
            print(f"  Backup: {decision.backup_agent}")

        return decision

    def _format_agent_profiles(self) -> str:
        return "\n".join([
            f"- {name}: {profile}"
            for name, profile in self.agent_profiles.items()
        ])

# Usage example
supervisor = IntelligentSupervisor()

# The LLM understands nuance and context
decision = supervisor.route(
    "I need to understand our customer churn rate and what's driving it",
    context={"previous_task": "Retrieved customer data from database"}
)
# Routes to: analysis (with database as backup)

decision = supervisor.route(
    "Can you explain how gradient descent works in simple terms?"
)
# Routes to: writing (technical explanation task)

decision = supervisor.route(
    "Find recent papers on transformer attention mechanisms"
)
# Routes to: research

Adaptive Orchestration: LLM Routing in Real-Time
The most sophisticated systems don’t just route once — they continuously adapt based on how the task unfolds. The LLM observes results from one agent and dynamically decides what should happen next:
from typing import Dict, Any, List
from dataclasses import dataclass
from langchain_openai import ChatOpenAI

@dataclass
class TaskState:
    original_request: str
    completed_steps: List[Dict[str, Any]]
    current_context: Dict[str, Any]
    is_complete: bool

class AdaptiveOrchestrator:
    """
    LLM continuously decides routing based on evolving context.
    This is more powerful than static workflows.
    """
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")
        self.max_iterations = 10

    async def orchestrate(self, user_request: str) -> Dict[str, Any]:
        """
        The LLM adaptively routes through agents until the task is complete.
        Each routing decision considers all previous context.
        """
        state = TaskState(
            original_request=user_request,
            completed_steps=[],
            current_context={},
            is_complete=False
        )

        iteration = 0
        while not state.is_complete and iteration < self.max_iterations:
            iteration += 1

            # LLM decides: what should happen next?
            next_step = await self._decide_next_step(state)
            print(f"\nIteration {iteration}:")
            print(f"  Next Agent: {next_step.agent}")
            print(f"  Task: {next_step.task}")
            print(f"  Reasoning: {next_step.reasoning}")

            # Execute the agent
            result = await self._execute_agent(next_step.agent, next_step.task, state)

            # Track what was done
            state.completed_steps.append({
                "iteration": iteration,
                "agent": next_step.agent,
                "task": next_step.task,
                "result": result
            })

            # Update context with results
            state.current_context.update(result)

            # LLM evaluates: are we done?
            state.is_complete = await self._evaluate_completion(state)

        return self._synthesize_final_result(state)

    async def _decide_next_step(self, state: TaskState):
        """
        LLM analyzes current state and decides what to do next.
        This is the core intelligence of adaptive routing.
        """
        decision_prompt = f"""You are orchestrating a complex task across multiple specialized agents.

Original request: {state.original_request}

What has been done so far:
{self._format_completed_steps(state.completed_steps)}

Current context and data available:
{state.current_context}

Analyze the situation:
1. What progress has been made?
2. What still needs to be done to fully satisfy the original request?
3. Which agent should work on the next piece?
4. What specific task should that agent perform?
5. Are we ready to synthesize a final answer, or do we need more work?

Provide your decision on what should happen next."""

        response = await self.llm.ainvoke(decision_prompt)
        # Parse LLM decision
        return self._parse_orchestration_decision(response)

    async def _evaluate_completion(self, state: TaskState) -> bool:
        """
        LLM evaluates whether the original request is fully satisfied.
        """
        evaluation_prompt = f"""Review this multi-agent workflow execution:

Original request: {state.original_request}

Work completed:
{self._format_completed_steps(state.completed_steps)}

Available data and results:
{state.current_context}

Question: Has the original request been fully satisfied?

Consider:
- Is there enough information to provide a complete answer?
- Have all aspects of the request been addressed?
- Is additional work needed, or can we synthesize a final response?

Answer with: COMPLETE or CONTINUE (and explain why)"""

        response = await self.llm.ainvoke(evaluation_prompt)
        return "COMPLETE" in response.content.upper()

    def _format_completed_steps(self, steps: List[Dict]) -> str:
        if not steps:
            return "Nothing completed yet."
        return "\n".join([
            f"Step {s['iteration']}: {s['agent']} - {s['task']}\n  Result: {s['result']}"
            for s in steps
        ])

Parallel Agent Selection with LLM Intelligence
Sometimes a request benefits from multiple perspectives simultaneously. An LLM can recognize these situations and orchestrate parallel execution:
import asyncio
from typing import Any, Dict, List
from langchain_openai import ChatOpenAI

class ParallelIntelligentRouter:
    """
    LLM decides when to route to multiple agents in parallel.
    Useful for complex requests requiring diverse expertise.
    """
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")

    async def route_intelligently(self, request: str) -> Dict[str, Any]:
        """
        LLM determines:
        1. Should this go to one agent or multiple?
        2. If multiple, which ones and why?
        3. How should results be synthesized?
        """
        analysis_prompt = f"""Analyze this request for multi-agent routing:

Request: {request}

Available agents: research, analysis, writing, coding, fact_checking

Decide:
1. Should this request be handled by ONE agent or MULTIPLE agents in parallel?
2. If multiple, which agents should work on it simultaneously?
3. What unique perspective or capability does each selected agent bring?
4. How should their outputs be combined?

Provide a routing strategy."""

        strategy = await self.llm.ainvoke(analysis_prompt)

        if "MULTIPLE" in strategy.content:
            # LLM identified this needs parallel processing
            agents_to_invoke = self._parse_agent_list(strategy.content)
            print(f"Parallel Routing to: {', '.join(agents_to_invoke)}")

            # Execute all agents in parallel
            results = await asyncio.gather(*[
                self._execute_agent(agent, request)
                for agent in agents_to_invoke
            ])

            # LLM synthesizes the diverse results
            final_result = await self._synthesize_parallel_results(
                request, agents_to_invoke, results
            )
            return final_result
        else:
            # Single agent routing
            return await self._single_agent_routing(request)

    async def _synthesize_parallel_results(
        self, request: str, agents: List[str], results: List[Any]
    ) -> str:
        """
        LLM intelligently combines results from multiple agents.
        """
        synthesis_prompt = f"""Original request: {request}

Multiple agents provided their perspectives:
{self._format_parallel_results(agents, results)}

Your task: Synthesize these diverse perspectives into a single, coherent response.
- Identify where agents agree and where they differ
- Highlight unique insights from each agent
- Resolve any contradictions
- Provide a unified, high-quality answer"""

        synthesis = await self.llm.ainvoke(synthesis_prompt)
        return synthesis.content

Learning from Routing Decisions: The Evolution Pattern
The most advanced systems use feedback to improve routing over time. The LLM learns which routing decisions lead to successful outcomes:
from datetime import datetime
from collections import defaultdict
from typing import Dict
from langchain_openai import ChatOpenAI

class EvolvingRouter:
    """
    Tracks routing decisions and outcomes to improve over time.
    LLM uses historical patterns to make better routing choices.
    """
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")
        self.routing_history = []
        self.success_patterns = defaultdict(list)

    def route_with_learning(self, request: str) -> RoutingDecision:
        """
        LLM considers historical success patterns when routing.
        Gets smarter over time.
        """
        # Classify request type for pattern matching
        request_type = self._classify_request(request)

        # Get successful routing patterns for this type
        successful_patterns = self.success_patterns.get(request_type, [])

        routing_prompt = f"""You are routing a request with historical context:

Request: {request}
Request Type: {request_type}

Historical data - successful routing patterns for similar requests:
{self._format_success_patterns(successful_patterns)}

Agent performance metrics:
{self._get_agent_performance_metrics()}

Based on both your understanding of the request AND historical evidence,
decide which agent should handle this. Explain your reasoning."""

        decision = self.llm.with_structured_output(RoutingDecision).invoke(routing_prompt)

        # Track this routing decision
        self.routing_history.append({
            "timestamp": datetime.now(),
            "request": request,
            "request_type": request_type,
            "selected_agent": decision.selected_agent,
            "reasoning": decision.reasoning,
            "confidence": decision.confidence
        })
        return decision

    def learn_from_outcome(
        self,
        request: str,
        routing_decision: RoutingDecision,
        outcome_quality: float,
        user_feedback: str = None
    ):
        """
        Record whether this routing decision was successful.
        Future routing decisions benefit from this knowledge.
        """
        request_type = self._classify_request(request)

        # Store successful patterns (quality > 0.7)
        if outcome_quality > 0.7:
            self.success_patterns[request_type].append({
                "agent": routing_decision.selected_agent,
                "request_sample": request[:100],
                "quality": outcome_quality,
                "reasoning": routing_decision.reasoning
            })
            print(f"Learned: {request_type} → {routing_decision.selected_agent} works well")
        elif outcome_quality < 0.3:
            print(f"Learned: {request_type} → {routing_decision.selected_agent} performed poorly")

    def _classify_request(self, request: str) -> str:
        """Use LLM to classify request into categories"""
        classification = self.llm.invoke(
            f"Classify this request into one category: {request}\n\n"
            "Categories: data_analysis, information_retrieval, content_creation, "
            "technical_implementation, problem_solving\n\nCategory:"
        )
        return classification.content.strip().lower()

    def _get_agent_performance_metrics(self) -> Dict[str, float]:
        """Calculate agent success rates from history"""
        # Simplified - in production, track detailed metrics
        return {
            "research": 0.85,
            "analysis": 0.78,
            "writing": 0.92,
            "coding": 0.81
        }

Production Considerations for LLM Routing
When deploying LLM-based routing in production, several considerations become critical:
1. Latency Management
LLM routing adds latency. Mitigate this with:
- Caching routing decisions for similar requests (see the sketch after this list)
- Using faster models (GPT-4o-mini) for simple routing, GPT-4 for complex
- Parallel routing evaluation when multiple agents might work
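The caching tactic can be as simple as memoizing the router on a normalized form of the request. This is a minimal sketch; llm_router_route stands in for whatever LLM routing call your system actually makes, and production systems often key the cache on an embedding hash rather than normalized text.

# Sketch of a routing-decision cache keyed on a normalized request
from functools import lru_cache

def normalize(request: str) -> str:
    # Crude normalization so near-identical requests hit the same cache entry
    return " ".join(request.lower().split())

@lru_cache(maxsize=10_000)
def cached_route(normalized_request: str) -> str:
    # Only falls through to the slow, paid LLM router on a cache miss
    return llm_router_route(normalized_request)   # hypothetical call into your LLM router

def route(request: str) -> str:
    return cached_route(normalize(request))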
2. Cost Optimization
Every routing decision costs money. Smart strategies:
- Use rule-based routing for obvious cases, LLM for ambiguous ones
- Batch routing decisions when possible
- Track routing costs separately from agent execution costs
3. Reliability and Fallbacks
LLMs can fail or return unexpected outputs:
import logging

logger = logging.getLogger(__name__)

class RobustLLMRouter:
    """Production-ready LLM routing with comprehensive error handling"""

    def route_with_fallbacks(self, request: str) -> str:
        try:
            # Try LLM routing with structured output
            decision = self.llm.with_structured_output(RoutingDecision).invoke(
                self.create_routing_prompt(request)
            )

            # Validate decision
            if not self._validate_decision(decision):
                raise ValueError("Invalid routing decision")

            # Check confidence threshold
            if decision.confidence < 0.5:
                # Use fallback router for low-confidence decisions
                return self._fallback_routing(request)

            return decision.selected_agent

        except Exception as e:
            logger.error(f"LLM routing failed: {e}")
            # Fall back to rule-based routing
            return self._rule_based_fallback(request)

    def _validate_decision(self, decision: RoutingDecision) -> bool:
        """Ensure LLM decision is valid and safe"""
        # Check agent exists
        if decision.selected_agent not in self.available_agents:
            return False
        # Check confidence is reasonable
        if not (0 <= decision.confidence <= 1):
            return False
        # Check reasoning is provided
        if not decision.reasoning or len(decision.reasoning) < 10:
            return False
        return True

4. Observability and Debugging
LLM routing decisions must be traceable:
import time
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional
from langchain_openai import ChatOpenAI

@dataclass
class RoutingTrace:
    """Complete audit trail for LLM routing decision"""
    timestamp: datetime
    request: str
    request_embedding: List[float]  # For similarity search
    # The fields below are filled in as the trace is built, so they get empty defaults
    routing_prompt: str = ""
    llm_response: str = ""
    parsed_decision: Optional[RoutingDecision] = None
    validation_passed: bool = False
    selected_agent: str = ""
    execution_result: Any = None
    quality_score: float = 0.0
    cost_usd: float = 0.0
    latency_ms: int = 0

class ObservableLLMRouter:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")
        self.traces = []

    def route_with_tracing(self, request: str) -> str:
        start_time = time.time()

        # Create trace
        trace = RoutingTrace(
            timestamp=datetime.now(),
            request=request,
            request_embedding=self._embed(request)
        )

        # Store prompt used
        trace.routing_prompt = self.create_routing_prompt(request)

        # Get LLM decision
        response = self.llm.invoke(trace.routing_prompt)
        trace.llm_response = response.content

        # Parse and validate
        trace.parsed_decision = self._parse_decision(response)
        trace.validation_passed = self._validate_decision(trace.parsed_decision)
        trace.selected_agent = trace.parsed_decision.selected_agent

        # Record metrics
        trace.latency_ms = int((time.time() - start_time) * 1000)
        trace.cost_usd = self._calculate_cost(response)

        # Store trace for analysis
        self.traces.append(trace)
        return trace.selected_agent
When to Use LLM-Based Routing
LLM routing isn’t always the answer. Use it when:
- Requests are ambiguous or conversational
- Context and intent matter more than keywords
- Agent boundaries overlap or are fuzzy
- You need the system to generalize to new request types
- Explainability of routing decisions is important
Don’t use it when:
- Latency is critical (< 100ms response time)
- Routing logic is simple and deterministic
- Cost optimization is the primary concern
- Request types are well-defined and limited
The best production systems use a hybrid approach:
- Fast rule-based routing for obvious cases
- LLM routing for ambiguous or complex requests
- Learned patterns from historical data to improve both
Hybrid Production Architecture: Combining the Best of All Patterns
The most successful production AI systems don’t pick just one routing strategy — they combine multiple approaches in a tiered architecture that optimizes for both cost and intelligence. This hybrid approach represents the current state-of-the-art in production deployments.

The Tiered Routing Strategy
Real-world systems like Netflix, Stripe, and Amazon use a three-tiered approach:
Tier 1: Rule-Based Routing (60% of traffic)
- Handles straightforward, well-defined requests
- Zero latency overhead, zero cost
- Examples: Direct catalog browsing, API endpoints, known command patterns
- Implementation: Simple pattern matching or keyword detection
Tier 2: Semantic Routing (30% of traffic)
- Handles conversational queries with clear intent
- Low cost (~$0.001 per route), fast (50–150ms)
- Examples: Product searches, category navigation, basic support queries
- Implementation: Embedding-based similarity matching
Tier 3: LLM-Based Routing (10% of traffic)
- Handles complex, ambiguous, or novel requests
- Higher cost ($0.01–0.05 per route), slower (300–800ms)
- Examples: Strategic questions, multi-dimensional analysis, edge cases
- Implementation: LLM reasoning with structured outputs
Real-World Example: Netflix Content Recommendation
A Netflix-style recommendation flow illustrates how hybrid routing works at scale:
from typing import Optional

class HybridContentRouter:
    """
    Netflix-style hybrid routing for content recommendations.
    Combines rule-based, semantic, and LLM routing.
    """
    def __init__(self):
        self.rule_patterns = self._load_rule_patterns()
        self.semantic_router = SemanticRouter()
        self.llm_router = IntelligentLLMRouter()
        self.metrics = RoutingMetrics()

    def route(self, user_interaction: dict) -> dict:
        """Route content request using tiered strategy"""
        # Tier 1: Try rule-based routing first (fastest, cheapest)
        rule_match = self._check_rule_patterns(user_interaction)
        if rule_match:
            self.metrics.record('rule_based')
            return self.handle_direct_browse(rule_match)

        # Tier 2: Try semantic routing (balanced)
        if self._is_semantic_suitable(user_interaction):
            self.metrics.record('semantic')
            agent = self.semantic_router.route(user_interaction['query'])
            return self.execute_agent(agent, user_interaction)

        # Tier 3: Fall back to LLM routing (most flexible)
        self.metrics.record('llm_based')
        routing_decision = self.llm_router.route(user_interaction['query'])
        return self.execute_complex_routing(routing_decision)

    def _check_rule_patterns(self, interaction: dict) -> Optional[str]:
        """Fast rule-based classification"""
        # Direct browse actions
        if interaction.get('action') == 'browse_category':
            return 'catalog_browse'
        # Continue watching
        if interaction.get('action') == 'continue_watching':
            return 'resume_playback'
        # Genre selection
        if interaction.get('genre') in self.rule_patterns['known_genres']:
            return 'genre_filter'
        return None

    def _is_semantic_suitable(self, interaction: dict) -> bool:
        """Determine if semantic routing is appropriate"""
        query = interaction.get('query', '')
        # Well-formed search queries
        if len(query.split()) <= 5 and any(
            keyword in query.lower()
            for keyword in ['show', 'movie', 'documentary', 'comedy', 'drama']
        ):
            return True
        return False

# Usage in production
router = HybridContentRouter()

# Example 1: Direct browse (Rule-based - Tier 1)
result = router.route({
    'action': 'browse_category',
    'category': 'action'
})
# Routes instantly to catalog, 0ms overhead

# Example 2: Genre search (Semantic - Tier 2)
result = router.route({
    'query': 'show me sci-fi movies'
})
# Routes via embeddings, ~100ms

# Example 3: Complex preference (LLM - Tier 3)
result = router.route({
    'query': 'movies like Inception but lighter and more uplifting'
})
# Routes via LLM reasoning, ~500ms but handles complexity

Cost-Performance Analysis: Hybrid vs. Single Strategy
Let’s look at real numbers for a system handling 1 million requests per month (a short calculation after the two scenarios reproduces the arithmetic):
Scenario A: All LLM Routing (Naive Approach)
- 1M requests × $0.03 per route = $30,000/month
- Average latency: 500ms
- Quality: Excellent for all requests
Scenario B: Hybrid Routing (Production Optimized)
- 600K rule-based × $0 = $0
- 300K semantic × $0.001 = $300
- 100K LLM × $0.03 = $3,000
- Total: $3,300/month (89% savings)
- Average latency: 150ms (70% faster)
- Quality: Excellent where it matters
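These totals follow directly from the per-tier volumes and unit costs; a few lines of arithmetic reproduce them.

# Reproducing the back-of-the-envelope cost comparison above
requests = 1_000_000
all_llm = requests * 0.03                                    # $30,000 per month
hybrid = 600_000 * 0.0 + 300_000 * 0.001 + 100_000 * 0.03    # $0 + $300 + $3,000 = $3,300
savings = 1 - hybrid / all_llm                               # 0.89, i.e. 89% savings
print(f"All-LLM: ${all_llm:,.0f}  Hybrid: ${hybrid:,.0f}  Savings: {savings:.0%}")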
Implementing Hybrid Routing with Fallbacks
Production systems need robust fallback chains:
import time
from datetime import datetime
from typing import List, Optional

# RuleBasedRouter, SemanticRouter, LLMRouter, DefaultAgent, AgentResponse,
# and the timeout() context manager are assumed to be defined elsewhere.

class ProductionHybridRouter:
    """
    Production-grade hybrid router with comprehensive fallbacks.
    Used in enterprise systems handling millions of requests.
    """
    def __init__(self):
        self.rule_router = RuleBasedRouter()
        self.semantic_router = SemanticRouter()
        self.llm_router = LLMRouter()
        self.fallback_agent = DefaultAgent()

        # Track success rates for adaptive routing
        self.success_rates = {
            'rule_based': 0.98,
            'semantic': 0.96,
            'llm': 0.94
        }

    def route_with_fallbacks(self, request: str) -> AgentResponse:
        """
        Route with comprehensive fallback chain.
        Never fails - degrades gracefully.
        """
        start_time = time.time()
        routing_path = []

        try:
            # Attempt 1: Rule-based (fastest)
            result = self._try_rule_based(request)
            if result and result.confidence > 0.9:
                routing_path.append('rule_based')
                return self._finalize_response(result, routing_path, start_time)
        except Exception as e:
            self._log_failure('rule_based', e)

        try:
            # Attempt 2: Semantic (balanced)
            result = self._try_semantic(request)
            if result and result.confidence > 0.7:
                routing_path.append('semantic')
                return self._finalize_response(result, routing_path, start_time)
        except Exception as e:
            self._log_failure('semantic', e)

        try:
            # Attempt 3: LLM (most capable)
            result = self._try_llm(request)
            if result and result.confidence > 0.5:
                routing_path.append('llm')
                return self._finalize_response(result, routing_path, start_time)
        except Exception as e:
            self._log_failure('llm', e)

        # Attempt 4: Fallback agent (always succeeds)
        routing_path.append('fallback')
        result = self.fallback_agent.handle(request)
        return self._finalize_response(result, routing_path, start_time)

    def _try_rule_based(self, request: str) -> Optional[AgentResponse]:
        """Try rule-based routing with timeout"""
        with timeout(seconds=0.1):  # Must be fast
            return self.rule_router.route(request)

    def _try_semantic(self, request: str) -> Optional[AgentResponse]:
        """Try semantic routing with timeout"""
        with timeout(seconds=0.5):  # Moderate timeout
            return self.semantic_router.route(request)

    def _try_llm(self, request: str) -> Optional[AgentResponse]:
        """Try LLM routing with timeout"""
        with timeout(seconds=2.0):  # Longer timeout OK
            return self.llm_router.route(request)

    def _finalize_response(
        self,
        result: AgentResponse,
        routing_path: List[str],
        start_time: float
    ) -> AgentResponse:
        """Add metadata and log metrics"""
        latency = time.time() - start_time
        result.metadata = {
            'routing_path': routing_path,
            'latency_ms': int(latency * 1000),
            'final_router': routing_path[-1],
            'timestamp': datetime.now().isoformat()
        }
        # Log for monitoring and optimization
        self._log_successful_route(routing_path, latency, result.confidence)
        return result

Adaptive Routing: Learning from History
The most sophisticated hybrid systems learn which routing strategy works best for different request types:
from collections import defaultdict

class AdaptiveHybridRouter:
    """
    Learns optimal routing strategy based on historical performance.
    Continuously improves routing decisions.
    """
    def __init__(self):
        self.routers = {
            'rule': RuleBasedRouter(),
            'semantic': SemanticRouter(),
            'llm': LLMRouter()
        }
        # Track performance by request pattern
        self.pattern_performance = defaultdict(lambda: {
            'rule': {'success': 0, 'attempts': 0},
            'semantic': {'success': 0, 'attempts': 0},
            'llm': {'success': 0, 'attempts': 0}
        })

    def route(self, request: str) -> AgentResponse:
        """Route based on learned patterns"""
        # Classify request pattern
        pattern = self._classify_pattern(request)

        # Get historical success rates for this pattern
        perf = self.pattern_performance[pattern]

        # Calculate expected success rate for each router
        strategies = []
        for router_type, stats in perf.items():
            if stats['attempts'] > 10:  # Need minimum data
                success_rate = stats['success'] / stats['attempts']
                strategies.append((router_type, success_rate))

        # Sort by success rate (highest first)
        strategies.sort(key=lambda x: x[1], reverse=True)

        # Try strategies in order of historical success
        for router_type, expected_success in strategies:
            try:
                result = self.routers[router_type].route(request)
                if result and result.confidence > 0.7:
                    self._record_success(pattern, router_type)
                    return result
            except Exception:
                self._record_failure(pattern, router_type)

        # Fallback to LLM if all learned strategies fail
        return self.routers['llm'].route(request)

    def _classify_pattern(self, request: str) -> str:
        """Classify request into pattern category"""
        # Simple classification - production systems use ML here
        if len(request.split()) <= 3:
            return 'short_query'
        elif '?' in request:
            return 'question'
        elif any(word in request.lower() for word in ['find', 'show', 'get']):
            return 'retrieval'
        else:
            return 'general'

Production Deployment Strategy
When deploying hybrid routing to production:
Phase 1: Shadow Mode (Week 1–2)
- Run hybrid router alongside existing system
- Don’t route real traffic yet
- Compare results and collect metrics
- Identify discrepancies
Phase 2: Canary Deployment (Week 3–4)
- Route 5% of traffic through hybrid system
- Monitor error rates, latency, cost
- Gradually increase to 10%, 25%, 50%
Phase 3: Full Deployment (Week 5–6)
- Route 100% of traffic
- Maintain old system as fallback
- Monitor cost savings and performance
Phase 4: Optimization (Ongoing)
- Analyze routing patterns
- Adjust tier percentages based on actual data
- Add new rules based on common patterns
- Retrain semantic models on production data
Key Metrics to Track
from collections import Counter, defaultdict

class HybridRoutingMetrics:
    """Track comprehensive metrics for hybrid routing"""

    def __init__(self):
        self.metrics = {
            'routing_distribution': Counter(),
            'latency_by_tier': defaultdict(list),
            'cost_by_tier': defaultdict(float),
            'success_rate_by_tier': defaultdict(lambda: {'success': 0, 'total': 0})
        }

    def record_routing(self, tier: str, latency_ms: int, cost: float, success: bool):
        """Record metrics for a routing decision"""
        self.metrics['routing_distribution'][tier] += 1
        self.metrics['latency_by_tier'][tier].append(latency_ms)
        self.metrics['cost_by_tier'][tier] += cost
        self.metrics['success_rate_by_tier'][tier]['total'] += 1
        if success:
            self.metrics['success_rate_by_tier'][tier]['success'] += 1

    def get_summary(self) -> dict:
        """Generate summary report"""
        total_requests = sum(self.metrics['routing_distribution'].values())
        total_cost = sum(self.metrics['cost_by_tier'].values())

        return {
            'total_requests': total_requests,
            'total_cost': total_cost,
            'cost_per_request': total_cost / total_requests if total_requests > 0 else 0,
            'distribution': {
                tier: (count / total_requests * 100)
                for tier, count in self.metrics['routing_distribution'].items()
            },
            'avg_latency': {
                tier: sum(latencies) / len(latencies) if latencies else 0
                for tier, latencies in self.metrics['latency_by_tier'].items()
            },
            'success_rates': {
                tier: (stats['success'] / stats['total'] * 100) if stats['total'] > 0 else 0
                for tier, stats in self.metrics['success_rate_by_tier'].items()
            }
        }

When Hybrid Routing Makes Sense
Use a hybrid approach when:
- High volume — Processing thousands of requests per day
- Mixed complexity — Some requests are simple, others complex
- Cost matters — Budget constraints require optimization
- Performance critical — Latency requirements vary by request type
- Quality important — Can’t sacrifice accuracy for speed
The hybrid approach is now the de facto standard for production AI systems. It combines the speed of rules, the intelligence of embeddings, and the flexibility of LLMs into a single, optimized architecture that delivers the best possible cost-performance ratio.
Cost-Aware Routing: Right-Sizing Your Model Selection
Here’s a dirty secret about production AI systems: most of the time, you’re overpaying for compute. You’re using Claude Opus or GPT-4 when GPT-4o-mini would work just fine. Cost-aware routing addresses this by matching model tier to task complexity.
The principle is simple: classify tasks by complexity, then route to the cheapest model that can handle that complexity reliably. But the implementation requires careful thought.
Implementing Cost-Aware Routing
from enum import Enum
from dataclasses import dataclass

class ModelTier(Enum):
    SMALL = "gpt-4o-mini"
    MEDIUM = "gpt-4"
    LARGE = "claude-opus-4"

@dataclass
class ModelCost:
    tier: ModelTier
    cost_per_1k_tokens: float

COSTS = {
    ModelTier.SMALL: ModelCost(ModelTier.SMALL, 0.002),
    ModelTier.MEDIUM: ModelCost(ModelTier.MEDIUM, 0.03),
    ModelTier.LARGE: ModelCost(ModelTier.LARGE, 0.15)
}

class CostAwareRouter:
    def classify_complexity(self, task: str) -> ModelTier:
        """Classify task complexity using a small, cheap model"""
        complexity_prompt = f"""
        Analyze this task and classify its complexity:

        Task: {task}

        Classification criteria:
        - SIMPLE: Basic lookup, simple calculations, straightforward questions
        - MEDIUM: Multi-step reasoning, moderate analysis, standard workflows
        - COMPLEX: Deep reasoning, creative tasks, ambiguous requirements

        Return only: SIMPLE, MEDIUM, or COMPLEX
        """
        # Use cheapest model for classification
        response = cheap_llm.invoke(complexity_prompt)
        complexity = response.content.strip()

        # Map complexity to model tier
        return {
            "SIMPLE": ModelTier.SMALL,
            "MEDIUM": ModelTier.MEDIUM,
            "COMPLEX": ModelTier.LARGE
        }[complexity]

    def route_with_cost_tracking(self, task: str) -> tuple[str, float]:
        """Route task and track actual cost"""
        # Determine appropriate model tier
        tier = self.classify_complexity(task)
        model = self.get_model(tier)

        # Execute task and track tokens
        response = model.invoke(task)
        tokens_used = response.usage.total_tokens

        # Calculate actual cost
        cost = (tokens_used / 1000) * COSTS[tier].cost_per_1k_tokens

        # Log for cost analysis
        self.log_routing_decision(task, tier, cost)

        return response.content, cost

The key insight is that you use a cheap model to classify complexity, then route to the appropriate tier. Yes, this adds one extra API call, but the savings on the main task typically far outweigh this overhead.
Fallback Strategies
Cost-aware routin