The Model Context Protocol (MCP) is evolving. With the recent adoption of Streamable HTTP in early 2025, the protocol is poised for mainstream success, moving beyond developer command lines and into the world of “MCP Servers as a Service.”
This growth brings a new, exciting challenge: scaling. As your MCP service becomes more popular, you’ll need to run it on multiple servers. That means you’ll need a load balancer.
This guide will show you how to use HAProxy, an open source load balancer, to build a scalable, resilient and compliant load-balancing layer for your Streamable MCP servers.
**Key Takeaways:**
- **Session persistence is crucial:** MCP uses an `mcp-session-id` header to maintain a continuous conversation. We’ll use HAProxy’s stick tables to ensure requests from the same session always land on the same backend server.
- **Edge validation is smart:** MCP has specific rules about which `Accept` headers clients must send. We can validate these rules at the HAProxy layer, protecting your backend servers from invalid traffic.
Why Streamable HTTP Changes Everything
Until now, most MCP interactions have happened locally. The original Server-Sent Events (SSE) endpoint proved difficult to implement, so developers relied on local command-line tools. This was fine for testing and development, but it was a major roadblock for wider adoption.
While the initial HTTP+SSE method served its purpose, it presented significant limitations as enterprise adoption expanded. This dual-channel approach, which required separate HTTP POST requests for client-to-server messages and a dedicated Server-Sent Events (SSE) endpoint for server-to-client streaming, introduced complexities that hindered MCP deployments at scale.
Streamable HTTP changes the game. It provides a standardized, easy-to-use transport layer that works just like any other modern web API. This unlocks the potential for cloud providers and SaaS companies to offer managed MCP services, making the technology accessible to a much broader audience.

As with any successful service, the journey quickly goes from “Can we make this work?” to “How do we keep this running for thousands of users?” The answer is load balancing.
The Load Balancing Challenge
The MCP specification gives us two key requirements we must handle at the load balancer.
1. Session Stickiness
The spec states: *“The server MAY assign a session ID, and the client MUST use it in subsequent requests as a header called mcp-session-id.”*
This is the most critical rule for load balancing MCP servers. A user’s session is a continuous conversation. If one request is sent to Server A and the next is sent to Server B, the context is lost, and the session is broken. We must ensure that once a client is assigned to a server, all of the subsequent requests in that session “stick” to that same server.
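To make the flow concrete, here is a sketch of the exchange; the hostname, session ID and message bodies are illustrative, not prescribed by the spec:

```
# First request: the client initializes; no session exists yet.
POST /mcp HTTP/1.1
Host: mcp.example.com
Accept: application/json, text/event-stream
Content-Type: application/json

{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"example-client","version":"1.0"}}}

# The server's response assigns the session ID...
HTTP/1.1 200 OK
Content-Type: application/json
mcp-session-id: 1868a90c-2d3f-4e8a-9a8c-7b3f0e2a9c11

# ...and the client must echo it on every subsequent request.
POST /mcp HTTP/1.1
Host: mcp.example.com
Accept: application/json, text/event-stream
mcp-session-id: 1868a90c-2d3f-4e8a-9a8c-7b3f0e2a9c11
```

It is that second request, and every one after it, that the load balancer must route back to the server that issued the session ID.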
2. Protocol Validation
The spec also defines strict rules for the Accept header based on the HTTP method:
- For HTTP GET requests, the client must include an `Accept` header containing `text/event-stream`.
- For HTTP POST requests, the client must include an `Accept` header containing both `application/json` and `text/event-stream`.
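For example, a spec-compliant request that opens a server-push stream would look like this (hostname illustrative):

```
GET /mcp HTTP/1.1
Host: mcp.example.com
Accept: text/event-stream
```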
Handling this at the load balancer is a powerful optimization. By denying invalid requests at the edge, we prevent malformed traffic from ever reaching our MCP servers, reducing their workload and making the entire system more robust.
The Solution: An HAProxy Configuration
HAProxy’s single-process, event-driven architecture makes it ideal for handling tens of thousands of concurrent, stateful connections with minimal resource overhead, which is crucial for managing persistent AI conversations at scale.
We can use this popular open source tool to fulfill the two requirements above. Let’s build the configuration.
1. Achieving Session Stickiness With Stick Tables
HAProxy’s stick tables are a powerful in-memory key-value store. When a new client connects, HAProxy assigns it to a specific backend server and records that assignment in the table; for the rest of the session, every request from that client is sent to the same backend. This is called session stickiness.
Because stick tables are an integral part of HAProxy’s in-memory core, they operate with extremely low latency. This avoids the performance penalty and complexity of querying an external session store, ensuring that session lookups are never a bottleneck, even under heavy load.
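One caveat: because the table lives in memory, it is local to each HAProxy instance. If you run a redundant pair of load balancers, HAProxy can replicate the table between them with a peers section. The names and addresses below are illustrative; you would also append `peers mcp_peers` to the `stick-table` line shown in the next section:

```
# Illustrative peers section for replicating stick-table entries
# between two HAProxy instances. Each peer name must match the
# local instance's hostname (or the name passed with -L).
peers mcp_peers
    peer haproxy1 10.0.0.5:10000
    peer haproxy2 10.0.0.6:10000
```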
We’ll configure HAProxy to do exactly that using the `mcp-session-id` header. Our backend configuration will look like this:
```
backend mcp_servers
    # Define a stick table to track sessions.
    stick-table type string len 64 size 1m expire 1h

    # For subsequent requests, stick to a server if the client
    # sends the mcp-session-id header we already know.
    stick on hdr(mcp-session-id)

    # For the first response, learn and store the session ID
    # that the server sends back.
    stick store-response res.hdr(mcp-session-id)

    # Define our backend MCP servers.
    server mcp_server_1 10.0.1.10:8000 check
    server mcp_server_2 10.0.1.11:8000 check
```
Let’s break down the key directives:
- `stick-table`: This creates our list of sessions. We define its size and how long to keep inactive session records.
- `stick on hdr(mcp-session-id)`: This tells HAProxy to inspect the incoming request. If it finds the `mcp-session-id` header, it looks up that ID in our table and forwards the request to the associated server.
- `stick store-response res.hdr(mcp-session-id)`: This is how we learn the session ID in the first place. When an MCP server responds to a new client, it includes the new session ID. This directive captures that value from the response and stores it in our table, linking it to the server that generated it.
With these three lines, we have perfect session persistence.
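To confirm that sessions are actually being learned, you can dump the stick table through HAProxy’s Runtime API. This assumes a stats socket is configured in the global section; the socket path and the sample output are illustrative:

```
$ echo "show table mcp_servers" | socat stdio /var/run/haproxy.sock
# table: mcp_servers, type: string, size:1048576, used:1
0x55d2c0a1b2c0: key=1868a90c-2d3f-4e8a-9a8c-7b3f0e2a9c11 use=0 exp=3599742 server_id=1
```

Each entry maps a learned session ID to the `server_id` of the backend server that issued it.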
2. Adding Request Validation at the Edge
Next, let’s offload protocol validation from our MCP servers. We can use access control lists (ACLs) to check for the required Accept headers.
Handling this validation at the load balancer simplifies the architecture. By enforcing protocol rules at the edge, HAProxy prevents invalid traffic from consuming backend resources, a task that might otherwise require a separate, dedicated API gateway layer. This reduces latency, cost, and complexity.
We’ll add these rules to our frontend section:
```
frontend mcp_frontend
    bind :80

    # ACL to check if Accept header contains 'text/event-stream'
    acl accept_events req.hdr(accept) -m sub text/event-stream

    # ACL to check if Accept header contains 'application/json'
    acl accept_json req.hdr(accept) -m sub application/json

    # Block invalid GET requests
    http-request deny if METH_GET !accept_events

    # Block invalid POST requests
    http-request deny if METH_POST !accept_events or METH_POST !accept_json

    default_backend mcp_servers
```
This logic is clear and efficient:
- We define two ACLs, `accept_events` and `accept_json`, which simply check whether the specified strings are substrings (`-m sub`) of the `Accept` header.
- We then create two `http-request deny` rules that use these ACLs.
- The first rule blocks any GET request that is missing `text/event-stream`.
- The second rule blocks any POST request that is missing either `text/event-stream` or `application/json`.
Our stick table logic then takes over and forwards any request that passes these checks to our backend.
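By default, `http-request deny` returns a bare 403. On HAProxy 2.2 or newer you can reject with a more descriptive status and body; the 406 status and the message text here are a suggestion, not something the MCP spec mandates:

```
# Reject with 406 Not Acceptable and a short JSON explanation
# (requires HAProxy 2.2+; message text is illustrative).
http-request deny deny_status 406 content-type application/json string '{"error":"Accept header must include text/event-stream"}' if METH_GET !accept_events
```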
The Complete Configuration
Here is the complete, copy-paste-ready `haproxy.cfg` file that combines everything into a robust load-balancing solution for Streamable MCP. Please note that this guide focuses on our open source project, not our commercial enterprise load balancer; however, this configuration will work for both.
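The `global` and `defaults` sections below are a typical baseline rather than part of the snippets above; in particular, the long client and server timeouts are there so long-lived SSE streams aren’t cut off mid-conversation. Adjust them for your environment.

```
global
    log stdout format raw local0
    maxconn 20000
    # Expose the Runtime API used earlier to inspect the stick table.
    stats socket /var/run/haproxy.sock mode 660 level admin

defaults
    mode http
    log global
    option httplog
    timeout connect 5s
    # Generous timeouts so long-lived SSE streams are not cut off.
    timeout client 1h
    timeout server 1h

frontend mcp_frontend
    bind :80

    # ACL to check if Accept header contains 'text/event-stream'
    acl accept_events req.hdr(accept) -m sub text/event-stream

    # ACL to check if Accept header contains 'application/json'
    acl accept_json req.hdr(accept) -m sub application/json

    # Block invalid GET requests
    http-request deny if METH_GET !accept_events

    # Block invalid POST requests
    http-request deny if METH_POST !accept_events or METH_POST !accept_json

    default_backend mcp_servers

backend mcp_servers
    # Define a stick table to track sessions.
    stick-table type string len 64 size 1m expire 1h

    # For subsequent requests, stick to a server if the client
    # sends the mcp-session-id header we already know.
    stick on hdr(mcp-session-id)

    # For the first response, learn and store the session ID
    # that the server sends back.
    stick store-response res.hdr(mcp-session-id)

    # Define our backend MCP servers.
    server mcp_server_1 10.0.1.10:8000 check
    server mcp_server_2 10.0.1.11:8000 check
```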
The introduction of Streamable HTTP is a major milestone for the Model Context Protocol, paving the way for scalable, enterprise-grade applications. With HAProxy, you can build a load-balancing layer that distributes traffic intelligently and enforces protocol rules at the edge, enhancing your AI gateway strategy. By implementing session persistence and request validation, you can ensure your MCP service is fast, scalable and resilient enough for the mainstream.