Resiliency is one of those words that gets thrown around a lot in system design discussions. Teams say they want “resilient systems.” Job postings list “building resilient microservices” as a requirement. But what does resiliency actually mean?
The Four Concepts of Resiliency
David D. Woods, a researcher in resilience engineering, has categorized resiliency into four distinct concepts. This framing is useful because it helps us think beyond just “how do I keep my services from crashing.”
1. Robustness
Robustness is the ability to absorb expected problems. A host fails, a network connection times out, a database goes down for maintenance. These are things we know can happen, and we put measures in place to handle them.
This is where most of our technical patterns live: timeouts, retries, circuit breakers, redundancy. When someone talks about “building resilient systems,” they’re usually talking about robustness.
However, robustness requires prior knowledge. You’re putting measures in place for problems you can foresee. You knew that a database could go down, so you added read replicas. You knew that network calls could time out, so you added retry logic.
2. Rebound
Rebound is how well you recover after something goes wrong. Not preventing the outage, but dealing with the aftermath.
Do you have backups? Are they actually tested? When an outage happens, do people know their roles? Who’s the incident commander? How quickly do you communicate with users? Do you have runbooks for common failure scenarios?
I’ve seen teams pour enormous effort into preventing outages, then be completely unprepared when one actually happens. They scramble to figure out who should be doing what while the system is down. Having an agreed plan of action in place beforehand makes recovery much faster.
3. Graceful Extensibility
This is where things get interesting. Graceful extensibility is how well you deal with unexpected situations — things you didn’t plan for.
Robustness and rebound deal with the expected. But what happens when you hit something completely new? A failure mode you never considered? A combination of events that your monitoring doesn’t even detect?
Usually, automation can’t handle surprises. Your circuit breakers won’t trip for a failure pattern they weren’t designed for. Your runbooks won’t help with a scenario nobody anticipated.
Graceful extensibility comes from people. People with the right skills, experience, and authority to handle novel situations as they arise. Organizations that are too rigid — where people have to follow strict procedures and can’t make judgment calls — will struggle here.
This is also why over-optimizing can backfire. If you’ve automated everything and reduced staff to the minimum, you’ve gained efficiency but lost graceful extensibility. When the unexpected happens, there’s nobody left who can adapt.
4. Sustained Adaptability
This is the long game. Sustained adaptability is the ability to continuously adapt to changing environments, requirements, and demands.
The key word is sustained. It’s not about adapting once. It’s about building a culture where you’re constantly questioning, constantly learning, constantly improving.
A quote from David Woods captures this well:
No matter how good we have done before, no matter how successful we’ve been, the future could be different, and we might not be well adapted. We might be precarious and fragile in the face of that new future.
Just because you haven’t had a catastrophic outage doesn’t mean you can’t have one. Past success doesn’t guarantee future resilience. Sustained adaptability requires you to keep asking “what don’t we know?” even when things are going well.
A Modern Perspective
Why does this framing matter in 2026?
Because microservices, cloud infrastructure, and Kubernetes have made robustness much easier. We have circuit breakers built into service meshes. We have orchestrators that restart pods when they die and auto-scalers that add instances under load. We have observability tools that show us exactly where requests are failing.
And yet systems still have major outages. Teams still get caught off guard. Why?
Because robustness is only one piece of the puzzle. You can have the most sophisticated technical resilience patterns in the world, and still be unprepared when something truly unexpected happens. Or be unable to recover quickly. Or fail to learn from incidents.
The AI systems we’re building now add another layer of complexity. LLM inference calls have unpredictable latency. Agentic workflows chain multiple services together, compounding failure probabilities. RAG systems add vector database dependencies. Each new component is another place where things can go wrong in ways we haven’t anticipated yet.
Technical patterns help with robustness. But resiliency as a whole is a property of people, processes, and culture — not just software.
Robustness Patterns (The Technical Stuff)
That said, robustness still matters. Here’s a quick overview of the key patterns. I’m keeping these short because I’ve covered some of them in more depth in other articles.
Timeouts
Put timeouts on every network call.
A missing timeout is implicitly saying “I’m willing to wait forever.” You’re not. Systems that respond slowly are often worse than systems that fail fast — they tie up resources, cause cascading delays, and make everything else slow.
Don’t just think about individual call timeouts either. Consider the overall operation. If rendering a page should take at most 2 seconds, and you’ve already spent 1.5 seconds, maybe you should skip that last external call rather than risk going over.
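To make that concrete, here’s a minimal sketch in Python using the requests library. The 2-second page budget, the 1-second per-call cap, and the URLs are made-up values for illustration; the point is that every call gets a timeout, and the remaining budget can shrink a call’s timeout or skip the call entirely.

```
import time
import requests

PAGE_BUDGET_SECONDS = 2.0  # hypothetical overall budget for rendering the page

def fetch_with_budget(url, deadline, fallback=None):
    """Call a downstream service without exceeding the remaining page budget."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        # Budget already spent: skip the call rather than risk the whole page.
        return fallback
    try:
        # Per-call timeout, capped by whatever budget is left.
        response = requests.get(url, timeout=min(1.0, remaining))
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        return fallback

deadline = time.monotonic() + PAGE_BUDGET_SECONDS
products = fetch_with_budget("https://catalog.internal/products", deadline, fallback=[])
reviews = fetch_with_budget("https://reviews.internal/top", deadline, fallback=[])
```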
Circuit Breakers
When a downstream service starts failing, stop hammering it. A circuit breaker tracks failures and, after a threshold, “trips” — causing subsequent calls to fail immediately without even attempting the network call.
This does two things: it protects your service from wasting resources on calls that will fail anyway, and it gives the failing service a chance to recover without being overwhelmed by retry traffic. It also limits how far the failure can cascade.
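Here’s a toy circuit breaker in Python to show the mechanics. It’s a sketch, not a production implementation; in practice you’d usually reach for an existing library or the breaker built into your service mesh, and the threshold and cooldown values here are arbitrary.

```
import time

class CircuitBreaker:
    """Open after N consecutive failures, then fail fast until a cooldown passes."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail immediately, no network call attempted.
                raise RuntimeError("circuit open, failing fast")
            # Cooldown elapsed: allow one trial call ("half-open").
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failure_count = 0  # any success resets the count
            return result
```

The trial call after the cooldown (the “half-open” state) is what lets the breaker close again once the downstream service recovers.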
I wrote more about the principle of failing fast in my failing fast article.
Bulkheads
Isolation. Don’t let a problem in one area bring down everything else.
The classic example: connection pools. If you have a single HTTP connection pool for all downstream services, one slow service can exhaust all the connections and block calls to healthy services too. Use separate pools for each dependency.
The same principle applies at higher levels. Separate microservices, separate databases, separate infrastructure. The more isolated your components, the less a failure in one can cascade to others.
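Here’s what the connection-pool version can look like in Python, giving each downstream dependency its own bounded pool via requests’ HTTPAdapter. The pool sizes and client names are illustrative.

```
import requests
from requests.adapters import HTTPAdapter

def make_client(pool_size):
    """Build an HTTP client with its own bounded connection pool."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size,
                          pool_maxsize=pool_size,
                          pool_block=True)  # block instead of opening extra connections
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# One isolated pool per downstream dependency: if the payment service gets slow
# and ties up all 10 of its connections, calls to inventory are unaffected.
payment_client = make_client(pool_size=10)
inventory_client = make_client(pool_size=10)
```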
Idempotency
If an operation might be retried, make sure handling it twice doesn’t cause problems.
This is especially important with message queues. Most queues offer “at-least-once” delivery — your message will definitely arrive, but it might arrive more than once. If your processing is idempotent, this is fine. If it’s not, you might charge a customer twice or create duplicate records.
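One common way to get there is to record each message ID you’ve processed in the same transaction as the work itself, so a redelivered message becomes a no-op. A rough sketch in Python using SQLite; the table names and fields are made up for illustration.

```
import sqlite3

db = sqlite3.connect("orders.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS charges (order_id TEXT, amount_cents INTEGER)")

def handle_payment_message(message_id, order_id, amount_cents):
    try:
        with db:  # one transaction: record the message and apply its effect together
            db.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
            db.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        # Duplicate delivery: this message was already processed, so do nothing.
        pass
```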
I covered this in detail in my idempotency article.
Redundancy
Having more than one of something. Multiple service instances, multiple database replicas, multiple availability zones.
The trade-off is cost and complexity. More redundancy means more infrastructure to manage and pay for. Work out how much redundancy you actually need based on the criticality of each component.
Example: E-commerce Checkout
Let’s make this concrete with an example. Imagine an e-commerce checkout flow:
User → Checkout Service → Payment Service → Inventory Service → Notification Service
The payment service is having a bad day. It’s not completely down — just slow. Requests that normally take 200ms are now taking 15 seconds.
Without robustness patterns:
Checkout Service calls Payment Service and waits. And waits. The thread is blocked. More users try to check out. More threads get blocked waiting on Payment Service. Checkout Service’s thread pool fills up. Now it can’t handle any requests at all — even ones that don’t need Payment Service.
Meanwhile, users are staring at loading spinners. They refresh. Now there are twice as many requests. The whole system grinds to a halt.
With robustness patterns:
Checkout Service has a 2-second timeout on Payment Service calls. After that, it fails fast. The circuit breaker notices the failures and trips. Now calls to Payment Service fail immediately — no waiting.
Users see an error quickly: “Payment processing is temporarily unavailable. Please try again in a few minutes.” Not ideal, but much better than a hung page. And critically, the rest of the system stays healthy. Users can still browse products, add items to their cart, or complete checkouts with saved payment methods.
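Putting those pieces together, the Payment Service call from the Checkout Service might look roughly like this in Python. It reuses the CircuitBreaker sketch from earlier; the URL, thresholds, and error handling are simplified for illustration.

```
import requests

payment_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)

def charge_card(order):
    # 2-second timeout: fail fast instead of tying up a thread for 15 seconds.
    response = requests.post("https://payments.internal/charge", json=order, timeout=2.0)
    response.raise_for_status()
    return response.json()

def checkout(order):
    try:
        return payment_breaker.call(charge_card, order)
    except Exception:
        # Breaker open or payment failing: degrade gracefully instead of hanging.
        return {
            "status": "payment_unavailable",
            "message": "Payment processing is temporarily unavailable. "
                       "Please try again in a few minutes.",
        }
```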
The Human Element
Technical patterns are necessary but not sufficient.
Graceful extensibility requires people who can think on their feet when something unexpected happens. This means:
- Not automating away all human judgment
- Giving people authority to make decisions during incidents
- Having experienced engineers who’ve seen enough failure modes to recognize patterns
Sustained adaptability requires a culture of learning:
- Blameless post-mortems that focus on system improvements, not finger-pointing
- Chaos engineering to discover weaknesses before they become outages
- Actually implementing the lessons learned, not just writing them down
On chaos engineering: tools like Chaos Monkey (which randomly kills production instances) can help you verify that your robustness measures actually work. But the real value is in building a culture that constantly questions “what happens if X fails?” and then tests those assumptions.
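You don’t need production-grade chaos tooling to start testing those assumptions. Even a plain unit test that forces a dependency to fail and checks how your code reacts is a small step in that direction. A sketch, reusing the CircuitBreaker from earlier:

```
def test_circuit_opens_when_payment_service_is_down():
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=60.0)

    def always_fails():
        raise ConnectionError("payment service unreachable")

    # Drive the breaker past its failure threshold...
    for _ in range(3):
        try:
            breaker.call(always_fails)
        except ConnectionError:
            pass

    # ...then verify it fails fast without calling the dependency again.
    try:
        breaker.call(always_fails)
        assert False, "expected the open breaker to fail fast"
    except RuntimeError as error:
        assert "circuit open" in str(error)
```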
On blame: when something goes wrong, there’s often pressure to find someone to blame. This is counterproductive. If one person making a mistake can bring down your entire system, the problem is the system, not the person. More importantly, a culture of blame discourages people from reporting issues early, which makes future incidents worse.