The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations (opens in new tab)
On April 23, 2021, a Fastly CDN configuration change triggered a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. The time required for a human operator to identify the cause and initiate the...
Read the original article