It started with a seemingly simple problem: intermittent timeouts in our payment authorization service. About 1 in 28,000 requests to an external HTTP service would timeout after 2 seconds, even though everything looked healthy.
The weird part? The external service was responding in under 300ms. We had the logs to prove it.
The response arrived at our pod. Istio logged it. But somehow, our application never processed it.
When you see intermittent timeouts, the first instinct is to blame infrastructure. We’ve all been there.
**CPU Throttling?** “Maybe we’re being throttled by Kubernetes?” We checked container CPU metrics. Everything was well within limits. No throttling events.
**Garbage Collection?** “Classic GC pause?” We analyzed GC logs and JFR recordings. Sure, there were some GC pauses, but nothing that would explain a 2-second freeze. And crucially, during the problematic requests, other threads were processing requests normally. If it were GC, everything would freeze.
**Resource Exhaustion?** “Connection pool exhausted?” Our HTTP client metrics showed pool_waiters at 0 and pool_num_waited at 0. Plenty of connections available.
We ruled out infrastructure. The problem was somewhere else.
Since Istio confirmed the response arrived at the pod, maybe something was wrong with the service mesh?
We compared successful and failed requests in the Istio access logs. Both requests completed successfully at the network level. Same response sizes. Similar durations. The external service was doing its job perfectly.
The response was reaching our pod. But something inside our application wasn’t picking it up.
Frustrated with infrastructure red herrings, we dove into application logs. And that’s when we noticed something strange.
A Netty thread was blocked for almost 2 seconds. Right during our timeout window.
For those unfamiliar with Netty: it’s an event-driven network framework. Each Netty thread runs an “event loop” that handles I/O for multiple network connections. **You should never block these threads.** If you do, all connections assigned to that thread stop processing.
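To make that rule concrete, here’s a tiny, generic Netty sketch (not our service’s code - the single-threaded NioEventLoopGroup and the queued tasks just stand in for real connection I/O):
import io.netty.channel.nio.NioEventLoopGroup

object BlockedEventLoop extends App {
  val group = new NioEventLoopGroup(1) // a single event loop thread
  val loop  = group.next()

  // Simulates blocking inside an I/O callback: the loop thread is stuck here
  loop.execute(() => Thread.sleep(2000))
  // Stands in for another connection's read/write - it waits the full 2 seconds
  loop.execute(() => println("I only run after the block is over"))

  Thread.sleep(2500)
  group.shutdownGracefully()
}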
But why would our code block a Netty thread?
Our service uses Finagle, Twitter’s RPC framework. Finagle is built on Futures - its whole point is to be non-blocking and asynchronous.
Here’s where it gets interesting. When a Future completes, Finagle needs to run the attached callbacks (the functions passed to flatMap, map, onSuccess). For performance, Finagle uses something called LocalScheduler, which executes those callbacks on the same thread that completed the Future.
Let me repeat that: if a Netty thread receives a network response and completes a Future, the callback runs on that Netty thread.
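A minimal sketch with raw twitter-util (assuming it’s on the classpath; the object and thread names are illustrative) shows exactly that behavior:
import com.twitter.util.Promise

object WhoRunsTheCallback extends App {
  val p = new Promise[Int]
  val f = p.map { _ =>
    println(s"callback ran on: ${Thread.currentThread.getName}")
  }

  // With the default LocalScheduler, the thread that satisfies the Promise
  // is the thread that runs the callback registered above.
  val completer = new Thread(() => p.setValue(42), "completer-thread")
  completer.start()
  completer.join()
  // prints: callback ran on: completer-thread
}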
This is usually fine. Callbacks should be fast and non-blocking. But what if someone puts blocking code in a callback?
Here’s a simplified version of what our code was doing (in Scala for clarity):
// Make two calls in parallel
val fastCall: Future[Response] = httpClient(fastRequest)
val slowCall: Future[Response] = httpClient(slowRequest)

// Chain processing using flatMap
fastCall.flatMap { fastResponse =>
  // This callback runs on the Netty thread!
  // Now we need the slow response too...
  val slowResponse = Await.result(slowCall, 2.seconds) // BLOCKING!
  // Process both responses
  Future.value(process(fastResponse, slowResponse))
}
Spot the problem?
1. fastCall completes - a Netty thread receives the response
2. The flatMap callback starts executing on that Netty thread
3. Inside the callback, we call Await.result which blocks the thread
4. The thread waits for slowCall to complete
5. But what if slowCall’s response needs to be processed by the same Netty thread?
Deadlock.
At this point, you might be thinking: “Who puts blocking code inside a flatMap? That’s obviously wrong!”
Fair point. But here’s the thing: nobody did it on purpose. It was an emergent property of abstraction layers.
We write most of our code in Clojure, not Scala. Clojure has its own concurrency primitives, and dealing with Finagle Futures directly in Clojure would make the code verbose and hard to read.
So years ago, someone made a pragmatic decision: our internal HTTP client library would hide the Finagle Future. It would make the HTTP call, wait for the response internally, and return the plain response object. Simple, clean, idiomatic Clojure:
;; Our HTTP library - returns the response directly, not a Future
(defn http-get [url]
  (let [finagle-future (finagle-http-client url)]
    (await finagle-future))) ;; Blocks internally, returns response

;; Usage is simple and clean
(let [response (http-get "http://external-service/api")]
  (process response))
This worked great for years. Most of our services are pure HTTP, and the blocking happens on regular worker threads - no problem.
But this particular service is different. For historical reasons, it uses Finagle Thrift for some internal calls. When you use Finagle Thrift directly, you get Finagle Futures, and you naturally use flatMap to chain operations:
;; Thrift call returns a Finagle Future
(-> (thrift-client/call-service request)
    (flatMap (fn [thrift-response]
               ;; Now we need to make an HTTP call...
               ;; Our HTTP library is right there, let's use it!
               (let [http-response (http-get "http://other-service/api")]
                 ;; ... process both responses ...
                 ))))
See what happened? The developer used the existing HTTP library - the one that’s been working fine for years. They didn’t realize that inside a flatMap callback, they were on a Netty thread. The abstraction that made life easier in 99% of cases became a trap in this 1% case.
The blocking code wasn’t written inside the flatMap. It was imported from a library that had been safe everywhere else.
This is a classic example of a leaky abstraction. Our HTTP library abstracted away the async nature of Finagle. But abstractions can’t hide everything. The threading model leaked through in the worst possible way - an intermittent deadlock that only happened under specific conditions.
The lesson? When mixing abstraction levels (synchronous libraries with async frameworks), you need to understand what’s happening underneath. The abstraction might be lying to you.
This explains why the issue was intermittent. With multiple Netty threads (typically 2x CPU cores), the deadlock only happens when:
1. The fast call’s response is processed by Thread-X
2. The slow call’s connection is also assigned to Thread-X
Network connections are assigned to Netty threads when they’re established. With connection pooling, the same connection might be reused across different requests. So the probability depends on the number of Netty threads, connection pool behavior, and timing of requests.
With 16 Netty threads, the chance of the slow call’s connection landing on the very thread that is running the fast call’s callback is roughly 1 in 16 (6.25%) for each potentially problematic request. Low enough to be rare, high enough to be annoying.
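A quick back-of-the-envelope simulation supports the 1-in-16 figure; it assumes the two connections are assigned to event loop threads uniformly and independently, which real connection pooling only approximates:
import scala.util.Random

object CollisionOdds extends App {
  val numThreads = 16
  val trials     = 1000000
  // Count how often two independent, uniform thread assignments collide
  val collisions = (1 to trials).count { _ =>
    Random.nextInt(numThreads) == Random.nextInt(numThreads)
  }
  val observed = 100.0 * collisions / trials
  println(f"same-thread collisions: $observed%.2f%% (expected ${100.0 / numThreads}%.2f%%)")
}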
The solution is simple: never block on Netty threads.
If you need to wait for something, offload the blocking operation to a dedicated thread pool:
import java.util.concurrent.Executors
import com.twitter.util.{Await, FuturePool}

// Create a thread pool for blocking operations
val blockingPool = FuturePool(Executors.newCachedThreadPool())

fastCall.flatMap { fastResponse =>
  // This callback still runs on the Netty thread, but we return immediately
  blockingPool {
    // This code runs on a different thread - safe to block!
    val slowResponse = Await.result(slowCall, 2.seconds)
    process(fastResponse, slowResponse)
  }
}
The key insight is that FuturePool returns a Future immediately and executes the blocking code on a separate thread. The Netty thread is freed to continue processing network I/O.
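For this particular shape there’s an even cleaner option, sketched below with the fastCall and slowCall names from the earlier snippet: don’t block anywhere, and combine the two Futures with Future.join (assuming process itself is cheap and non-blocking):
import com.twitter.util.Future

// Wait for both responses without tying up any thread
val combined = Future.join(fastCall, slowCall).map {
  case (fastResponse, slowResponse) => process(fastResponse, slowResponse)
}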
You might wonder: why does Finagle execute callbacks on the same thread? Isn’t that dangerous?
It’s actually a performance optimization. Context switching between threads is expensive. If your callback is fast (which it should be!), running it on the same thread avoids thread scheduling overhead, cache misses from switching contexts, and memory barriers.
The Finagle documentation warns: “Never block in a Future callback.” But warnings are easy to miss, especially when working with libraries that wrap Finagle.
To prove our theory, we created a minimal reproduction. Forcing Finagle to use a single Netty thread makes the deadlock 100% reproducible:
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request}
import com.twitter.util.{Await, Future}

object DeadlockDemo extends App {
  // Force a single Netty worker thread
  System.setProperty("com.twitter.finagle.netty4.numWorkers", "1")

  val client = Http.client.newService("httpbin.org:80")
  val fast = Request(Method.Get, "/delay/0")
  val slow = Request(Method.Get, "/delay/2")

  val result = client(fast).flatMap { _ =>
    println(s"[${Thread.currentThread.getName}] In flatMap callback")
    println(s"[${Thread.currentThread.getName}] About to block...")
    // This will deadlock!
    val response = Await.result(client(slow), 5.seconds)
    Future.value(response)
  }

  // This times out
  Await.result(result, 10.seconds)
}
Output:
[finagle/netty4-1-1] In flatMap callback
[finagle/netty4-1-1] About to block...
... hangs for 5 seconds ...
Exception: TimeoutException
1. Netty Threads are Sacred
Event loop threads in Netty should never be blocked. They handle I/O for potentially thousands of connections. Block one, and you block all those connections.
2. Understand Your Framework’s Threading Model
Finagle’s LocalScheduler is a performance optimization that assumes callbacks are non-blocking. If you don’t know about it, you can easily introduce deadlocks.
3. Infrastructure Isn’t Always the Problem
When facing intermittent issues, it’s tempting to blame infrastructure - more CPU, more memory, better network. But sometimes the problem is in the code, hiding in plain sight.
4. Logs Are Your Best Friend
The breakthrough came from carefully reading thread names in logs. The pattern “finagle/netty4-1-X” told us exactly where to look.
5. Abstractions Can Become Traps
Our HTTP library worked perfectly for years by hiding Finagle’s async nature. But when mixed with code that did use Finagle directly, the abstraction became a trap. A library that’s safe in one context can be dangerous in another.
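One cheap guardrail - a sketch, not our actual fix - is to make the blocking wrapper fail fast when it’s called from an event loop thread, using the thread-name prefix we saw in the logs as a heuristic:
import com.twitter.util.{Await, Duration, Future}

def blockingAwait[A](f: Future[A], timeout: Duration): A = {
  val name = Thread.currentThread.getName
  // Heuristic: Finagle's Netty worker threads are named "finagle/netty4-..."
  require(!name.startsWith("finagle/netty"),
    s"blockingAwait called on event loop thread $name; use the async API instead")
  Await.result(f, timeout)
}
A loud exception on the rare bad path beats a silent 2-second deadlock.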
6. Be Careful When Mixing Abstraction Levels
When you have sync wrappers around async frameworks, document where they’re safe to use. Better yet, consider providing both sync and async versions, so developers can choose consciously.
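For example, a wrapper could expose both flavors side by side (SafeHttpClient, get, and getAsync are illustrative names, not our real library):
import com.twitter.finagle.Service
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Duration, Future}

class SafeHttpClient(underlying: Service[Request, Response]) {
  /** Async version: safe anywhere, including inside Future callbacks. */
  def getAsync(request: Request): Future[Response] = underlying(request)

  /** Blocking version: convenient on plain worker threads, but must never
    * be called from a Netty/event loop thread. */
  def get(request: Request, timeout: Duration): Response =
    Await.result(underlying(request), timeout)
}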
What started as “we need more infrastructure” ended with a PR changing a few lines of code. The fix was simple, but finding it required understanding how Netty’s event loop model works, how Finagle schedules Future callbacks, how connection pooling affects thread assignment, and why probabilistic bugs are the worst to debug.