Yuriy Pevzner’s Post
49m
Microservices work perfectly fine while you’re just returning simple JSON. But the moment you start real-time token streaming from multiple AI agents simultaneously, distributed architecture turns into hell.

Why? Because TTFT (Time To First Token) does not forgive network hops. Picture a typical microservices chain where agents orchestrate LLM APIs:

Agent -> (gRPC) -> Internal Gateway -> (Stream) -> Orchestrator -> (WS) -> Client

Every link adds serialization, latency, and another open connection to maintain. Now multiply that by 5-10 agents speaking at once. You don’t get a flexible system; you get a distributed nightmare:

1. Race Conditions: Try merging three network streams in the right order without lag.
2. Backpressure: If the client is slow, that signal has to travel back through 4 services to the model.
3. Total Overhead: Splitting simple I/O-bound logic (waiting on LLM APIs) into distributed services is pure engineering waste.

This is exactly where the Modular Monolith beats distributed systems hands down. Inside a single process, physics works for you, not against you:

— Instead of gRPC streams — native async generators.
— Instead of network overhead — instant yield.
— Instead of pod orchestration — in-memory event multiplexing.

Technically, it comes down to subscribing to generators and aggregating their events into a single socket (see the sketch at the end of this post). Since the workload is mostly I/O bound (waiting on APIs), Python’s asyncio handles it effortlessly in one process.

But the benefits don’t stop at latency. There are massive engineering bonuses:

1. Shared Context Efficiency: Multi-agent systems often need shared access to large contexts (conversation history, RAG results). In microservices, you are constantly serializing and shipping megabytes of context JSON between nodes just so another agent can "see" it. In a monolith, you pass a pointer in memory. Zero-copy, zero latency.

2. Debugging Sanity: Tracing why a stream broke in the middle of a 5-hop microservice chain requires an advanced distributed-tracing setup (and lots of patience). In a monolith, a broken stream is just a single stack trace in a centralized log. You fix the bug instead of debugging the network.

3. Gateway Simplicity: In microservices, your API Gateway inevitably mutates into a business-logic monster (an Orchestrator) that is a nightmare to scale. In a monolith, the Gateway stays a ‘dumb pipe’ Load Balancer that never breaks.

In the AI world, where users count milliseconds to the first token, the monolith isn’t legacy code. It’s the pragmatic choice of an engineer who knows how to calculate a Latency Budget.

Or has anyone actually learned to push streams through a service mesh without pain?
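To make the "subscription to generators" idea concrete, here is a minimal asyncio sketch of in-process stream multiplexing. It is not code from the post: fake_agent, pump and multiplex are illustrative names, and a real version would read tokens from actual LLM clients and write to a WebSocket instead of print. The bounded queue also illustrates the backpressure point: a slow consumer simply makes every producer wait on put().

import asyncio
from typing import AsyncIterator

async def fake_agent(name: str, delay: float) -> AsyncIterator[str]:
    # Stands in for one agent streaming tokens from an LLM API.
    for i in range(5):
        await asyncio.sleep(delay)  # waiting on the model: pure I/O
        yield f"{name}: token {i}"

async def pump(agent: AsyncIterator[str], out: asyncio.Queue) -> None:
    # Subscribe to one generator and push its events into the shared queue.
    async for token in agent:
        # Bounded queue = free backpressure: if the consumer (the client
        # socket) is slow, this await blocks and the producer slows down,
        # all inside the same process.
        await out.put(token)

async def multiplex(agents: list) -> AsyncIterator[str]:
    # Merge several token streams into one, in arrival order.
    out: asyncio.Queue = asyncio.Queue(maxsize=64)
    tasks = [asyncio.create_task(pump(a, out)) for a in agents]

    async def close_when_done() -> None:
        await asyncio.gather(*tasks)
        await out.put(None)  # sentinel: every agent has finished

    closer = asyncio.create_task(close_when_done())
    try:
        while (event := await out.get()) is not None:
            yield event
    finally:
        closer.cancel()
        for t in tasks:
            t.cancel()

async def main() -> None:
    agents = [fake_agent("planner", 0.05),
              fake_agent("coder", 0.03),
              fake_agent("critic", 0.07)]
    async for event in multiplex(agents):
        print(event)  # in production this would be ws.send(event)

asyncio.run(main())

The same pattern covers the shared-context point: every agent coroutine can read the same in-memory conversation object directly, with no serialization step in between.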