Last winter, my city, Richmond, VA, suffered water distribution outages for days after a blizzard. Not because of one big failure, but because backup pumps failed, sensors misread, alerts got buried, and then another pump died during recovery. The whole city ended up under a boil‑water advisory. Sound familiar? Replace “water pumps” with “microservices” and you’ve got every cascading outage I’ve debugged in over 15 years.
The timeline mapped perfectly to Dr. Richard Cook’s observations on complex systems: failures are multi‑factor, systems constantly run in degraded mode (not everything is perfect all the time), and single‑cause incident findings often mislead. Modern outages aren’t “one bad deploy.” They’re three or four small issues that line up at exactly the wrong moment: a timing quirk, plus load, plus a dependency wobble, plus an alert nobody saw because they were busy fixing something else.
Here’s what I’ve learned: if you design systems expecting a single root cause to explain everything, you’ll ship brittle architectures and have the same arguments after every major incident. The lesson isn’t “never fail.” It’s to shape your systems so small failures can’t cascade, and to practice recovery when things are calm, not during the next crisis.
The Wildfire Prevention Model
I once watched a single bad regex cascade through three services and found myself wondering whether we had any circuit breakers in place. Now I think about resilience like wildfire management. You don’t build a fortress; you cut firebreaks, clear underbrush, and run controlled burns so that when sparks fly (and they will), they hit dirt instead of the rest of the forest.
In production, firebreaks are:
- **SLO gates** (service level objectives—your promise to users; if you blow the budget, stop the rollout)
- **Concurrency caps** (don’t let one thing use all the threads or connections)
- **Circuit breakers** (fail fast if an upstream is sick, rather than queuing forever)
- **Per‑tenant bulkheads** (separate pools so one noisy neighbor can’t sink the ship)
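To make a couple of these concrete, here’s a minimal sketch in Python of a concurrency cap paired with a simple circuit breaker. The `Firebreak` class and its parameters are my own illustration, not a production library; in real systems you’d likely reach for something like resilience4j, Polly, or your service mesh’s policies, but the shape of the idea is the same.

```python
import threading
import time


class CircuitOpenError(Exception):
    """Raised when we fail fast instead of calling a sick dependency."""


class Firebreak:
    """Hypothetical wrapper: a concurrency cap plus a simple circuit breaker."""

    def __init__(self, max_concurrent=10, failure_threshold=5, cooldown_s=30):
        self._slots = threading.BoundedSemaphore(max_concurrent)  # concurrency cap
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at = None
        self._lock = threading.Lock()

    def call(self, fn, *args, **kwargs):
        with self._lock:
            if self._opened_at is not None:
                if time.monotonic() - self._opened_at < self._cooldown_s:
                    raise CircuitOpenError("failing fast: dependency marked unhealthy")
                # Cooldown elapsed: allow a trial call (half-open state).
                self._opened_at = None
                self._failures = 0

        if not self._slots.acquire(blocking=False):  # cap reached: shed load now
            raise CircuitOpenError("concurrency cap reached: shedding load")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            with self._lock:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            with self._lock:
                self._failures = 0
            return result
        finally:
            self._slots.release()


# Usage: wrap every call to one dependency behind one firebreak.
recommendations = Firebreak(max_concurrent=20, failure_threshold=5, cooldown_s=30)
# rows = recommendations.call(fetch_recommendations, user_id)
```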
Prescribed burns, meanwhile, are the short game days that test and surface dangerous assumptions before a real event does. You’re not preventing the incident; you’re preparing so that when it truly strikes, it’s a non‑event.
Below are four practices that consistently reduce blast radius and speed up learning.
1) Design for Degraded Mode From Day One
When Richmond’s water system wobbled, crews kept pressure flowing for fire safety while they fixed the underlying problems. That’s the model: define “good enough for now” before you need it.
For every critical user journey, write down your minimum viable service. What can you shed (nice‑to‑haves)? What can you delay (batch jobs)? What can you simplify (a lighter model or a cached view)? Make degraded mode visible in both dashboards and the product; users handle “we’re slower than usual” better than a mysterious timeout or a site that’s simply down.
Netflix is a great example, and has written publicly about designing for degraded modes: if the personalized recommendations system is down or slow, they fall back to simpler options (e.g., showing popular titles instead of personalized picks) so the page still loads and streaming continues. That’s “good enough for now” in action—protect the core journey while a dependency recovers.
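As a rough sketch of that pattern (the function and flag names below are invented for illustration, not Netflix’s actual code), the core journey gets a cheap fallback and the degraded state is logged so dashboards and the product can surface it:

```python
import logging

log = logging.getLogger("homepage")


def get_homepage_rows(user_id, recommender, popular_titles_cache, degraded=False):
    """Return homepage rows, shedding personalization when the journey is degraded."""
    if not degraded:
        try:
            return recommender.personalized_rows(user_id)
        except Exception:
            # Surface the degraded state so dashboards and the product can show it.
            log.warning("serving popular titles instead of personalized picks",
                        extra={"journey": "homepage", "degraded": True})
    # "Good enough for now": the page still loads, streaming still works.
    return popular_titles_cache.top_rows()
```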
Try this week: Write a one‑page Degraded Operation Plan for one critical user journey, add the toggles and the dashboard banner, and rehearse it for 20 minutes.
2) Keep Problems Contained
Big outages happen when small problems line up perfectly. Your job is to stop the lineup from becoming a cascade. Think of it like Richmond’s water valves—when one section failed, they isolated it instead of draining the whole city.
- Bulkhead isolation: separate thread pools/queues by tenant and dependency.
- Backpressure: slow callers down so retries don’t make an overloaded dependency worse.
- Circuit breakers: trip early on toxic inputs; fail fast and recover quickly.
The aim isn’t zero faults; it’s faults that stay local. That’s why valves and overflow paths helped Richmond keep water moving while crews fixed the fault, instead of letting one failure drain the whole city.
Netflix’s Hystrix library isolates dependencies with per‑service thread pools and circuit breakers. When a backend slows or fails, only that pool saturates; other features stay healthy, and fallback responses keep the user‑facing experience usable. That’s bulkheads plus backpressure in practice.
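Here’s roughly what the same idea looks like without Hystrix; the class and exception names below are my own placeholders, assuming a thread‑per‑request service. Each tenant gets its own bounded pool, and when the pool is full the call is rejected immediately instead of queuing forever.

```python
import threading


class Rejected(Exception):
    """Backpressure signal: the caller should back off or degrade, not retry hot."""


class TenantBulkheads:
    """One bounded slot pool per tenant so a noisy neighbor can't take every slot."""

    def __init__(self, per_tenant_limit=8):
        self._limit = per_tenant_limit
        self._pools = {}
        self._lock = threading.Lock()

    def _pool(self, tenant):
        with self._lock:
            if tenant not in self._pools:
                self._pools[tenant] = threading.BoundedSemaphore(self._limit)
            return self._pools[tenant]

    def run(self, tenant, fn, *args, **kwargs):
        pool = self._pool(tenant)
        if not pool.acquire(blocking=False):  # pool full: reject instead of queueing
            raise Rejected(f"tenant {tenant} is at its concurrency limit")
        try:
            return fn(*args, **kwargs)
        finally:
            pool.release()
```

Mapping `Rejected` to an HTTP 429 (or your protocol’s equivalent) gives well‑behaved callers the signal to slow down, which is the backpressure half of the story.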
Try this week: Add per‑dependency concurrency caps to one critical service. Start at ~80% of what you think you need. Publish the limits so other teams plan around reality, not hope. Watch how many near‑misses you catch in week one.
3) Ship in Thin Slices With Automatic Rollbacks
I’ve watched this play out several times over the years: a team makes a “trivial” caching change and ends up with a two‑hour outage. That’s why I’m religious about progressive delivery: no more cliff‑edge releases, just small, observable experiments.
- Canary to 1-5% of traffic.
- Watch the SLOs (error rate, latency, business outcomes).
- Promote only if the numbers hold.
- Auto‑rollback when error budget thresholds are crossed.
- Treat infrastructure toggles like feature flags (yes, that cache TTL needs a guard).
Google Chrome ships changes as staged rollouts with guardrails and kill-switches. If crash or performance rates cross a threshold in the first few percent, promotion is halted and the release is flipped off. Most users never notice; the canary metrics caught it and impact stays tiny.
Try this week: Add a promotion guard that compares canary SLOs to a budget threshold and flips a rollback flag automatically when the budget is breached.
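A minimal version of that guard might look like the sketch below; the metric names, thresholds, and the place you’d call your rollback hook are placeholders for whatever your delivery pipeline actually exposes.

```python
def promotion_decision(canary, baseline,
                       max_error_rate_delta=0.005, max_p99_latency_ratio=1.2):
    """Return 'promote' or 'rollback' by comparing canary SLOs to the baseline.

    `canary` and `baseline` are dicts like
    {"error_rate": 0.004, "p99_latency_ms": 310} sampled over the same window.
    """
    error_regression = canary["error_rate"] - baseline["error_rate"]
    latency_ratio = canary["p99_latency_ms"] / max(baseline["p99_latency_ms"], 1)

    if error_regression > max_error_rate_delta:
        return "rollback"
    if latency_ratio > max_p99_latency_ratio:
        return "rollback"
    return "promote"


# Example: wired into a pipeline step that flips the rollback flag automatically.
decision = promotion_decision(
    canary={"error_rate": 0.012, "p99_latency_ms": 420},
    baseline={"error_rate": 0.003, "p99_latency_ms": 300},
)
if decision == "rollback":
    print("canary breached the budget; flipping the rollback flag")  # call your rollback hook here
```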
4) Learn From Near‑Misses, Not Just Disasters
Most teams only do postmortems after user impact. That’s like studying plane crashes while ignoring the times pilots report weird engine noises. Real learning happens in near‑misses: those times your error budget burned for 12 minutes and then the system self‑healed.
On a cadence, review your near‑misses. Keep it lightweight: what triggered it, what was the first signal, where did people get confused, and what’s the smallest change that prevents a repeat? No blame, no heroes, just learning. The City of Richmond’s after‑action review showed that most of the pain was preventable; that’s exactly what near‑miss loops catch while the sun’s shining.
Google’s SRE teams document “non‑outage postmortems” (near‑misses) when SLOs burn briefly and self‑heal. Capturing the trigger, the first signal, and the smallest fix (often retry/backoff tuning or better dashboard correlation) turns almost‑incidents into permanent improvements and makes the next hiccup invisible to users.
Try this week: Create a Near‑Miss template: 300 words max, one diagram, one smallest fix. Review entries during the next operational excellence stand‑up. Keep it lighter than a postmortem so people actually file them.
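If it helps to keep the template honest, a near‑miss record only needs a handful of fields. This structure is my own suggestion, not a standard:

```python
from dataclasses import dataclass


@dataclass
class NearMiss:
    """Lightweight near-miss record: small enough that people actually file it."""
    title: str
    trigger: str            # what set it off
    first_signal: str       # where it showed up first (alert, dashboard, user report)
    confusion_points: str   # where responders lost time
    smallest_fix: str       # the cheapest change that prevents a repeat
    budget_minutes_burned: int
```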
Make Observability Actually Useful
None of this works if your logs, metrics, and traces don’t line up. Emit small, machine‑readable events at every key decision: build started, deploy finished, feature flag toggled, circuit breaker tripped, degraded mode activated. Use the same correlation ID everywhere.
One screen should show SLOs next to deploy events next to business metrics. When you roll back, log it as an event you can query later. This turns war rooms into timelines:
“deploy #4821 finished 3:14:22 → error rate up 3:16:45 → rollback 3:18:00.”
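One cheap way to get there, sketched below with illustrative field names: emit every key decision as a small JSON event carrying the same correlation ID, so the timeline above becomes a query instead of an archaeology project.

```python
import json
import sys
import time
import uuid


def emit_event(event_type, correlation_id, **fields):
    """Write one machine-readable event per key decision to your log pipeline."""
    record = {
        "ts": time.time(),
        "event": event_type,
        "correlation_id": correlation_id,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")


# The same correlation ID ties the deploy, the SLO breach, and the rollback together.
release_id = str(uuid.uuid4())
emit_event("deploy_finished", release_id, service="checkout", version="4821")
emit_event("slo_breach", release_id, slo="error_rate", observed=0.031, budget=0.010)
emit_event("rollback_started", release_id, reason="error budget exceeded")
```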
Anti‑Patterns That Kill Resilience (and What To Do Instead)
- If your “safe” path is slower than cowboy mode, people route around it. Make the paved road the fastest road by bundling CI, SLOs, flags, and rollback into one template.
- If checks always pass or always fail, they become noise. Tier controls by risk.
- If heroics become the plan, resilience turns into a sport. Ship safety in defaults, not documentation; nobody reads the wiki at 3 a.m.
How You’ll Know It’s Working
Within one quarter, you should measure:
- Fewer customer‑impacting incidents (count them, not the drama)
- Faster recovery when things do break (actual MTTR)
- Calmer on‑call rotations (ask the humans)
- Boring postmortems (“the circuit breaker caught it”)
Expect a steadier user experience under stress, fewer emergency rollbacks, faster restoration, and calmer on‑call rotations. Not because incidents vanish, but because faults combine less catastrophically and recovery paths are paved.
If you don’t see improvement within a quarter, revisit your degraded plans, isolation limits, promotion guards, and visibility. Are the firebreaks in the right places? Can everyone see them, and are the limits published where other teams can find them? Did anyone actually run the game day, or is it still on the backlog?
Start This Week
Pick one critical journey. Publish its Degraded Operation Plan. Add a visible degraded‑mode indicator. Schedule a 20‑minute controlled burn next sprint—on purpose—before reality does it for you.
