Summary of key incident response metrics concepts
The metrics below are SLO-aligned incident response (IR) metrics, designed to capture real-world impact on users rather than just infrastructure health.
| Metric | Description |
| --- | --- |
| Error budget burn | An IR metric that measures the portion of the service’s error budget consumed during an incident. It matters because it signals urgency and the risk of breaching SLOs. Teams compare actual errors or downtime against the budget to guide escalation or rollback decisions. |
| SLI degradation | A measurable drop in success rate, latency, or availability compared to the target SLO. This matters because it reflects user impact in real time, often surfacing issues earlier than infrastructure alerts. Teams calculate success rates and act by rolling back deployments or rerouting traffic. |
| Time to budget recovery (TTBR) | The total time a service remains out of compliance with its SLO before recovery. It matters because it captures how long users experienced degraded performance, not just how fast engineers responded. TTBR is measured from the first SLO breach until metrics return above the target threshold. |
| Error budget burn rate (short vs long) | The pace at which a service consumes its error budget. It’s important because it highlights how quickly reliability deteriorates, even when the full budget hasn’t been spent yet. Teams track burn rate across short and long windows to catch both sudden spikes and sustained failures. |
| Burn rate trends | The long-term pattern of how quickly error budgets are consumed. This matters because it exposes systemic risks and shows when reliability may drift toward instability. Teams review these trends over weeks or months to inform reliability improvements. |
| Incident recurrence impact | A measure of how often incidents repeat and how severely they affect users. It matters because recurring issues waste error budgets and erode trust. Teams track recurrence frequency and impact to prioritize permanent fixes. |
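
To make the arithmetic behind the table concrete, here is a minimal Python sketch of error budget burn, burn rate over short and long windows, and TTBR. It assumes a request-based SLI (good requests divided by total requests) and the common convention that burn rate is the observed error rate divided by the error rate the SLO allows. All function names, parameters, and the paging threshold are hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def error_budget_burn(incident_failures: int, window_requests: int, slo_target: float) -> float:
    """Fraction of the SLO window's error budget consumed by one incident.

    Assumption: the budget is the number of failures the SLO tolerates
    across the whole window, i.e. (1 - slo_target) * window_requests.
    """
    allowed_failures = (1.0 - slo_target) * window_requests
    return incident_failures / allowed_failures


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    return (failed / total) / (1.0 - slo_target)


def should_escalate(short_rate: float, long_rate: float, threshold: float) -> bool:
    """Escalate only when both windows exceed the (illustrative) threshold.

    The short window catches sudden spikes; requiring the long window too
    filters out brief blips that have already ended.
    """
    return short_rate >= threshold and long_rate >= threshold


def time_to_budget_recovery(
    samples: Sequence[Tuple[datetime, float]], slo_target: float
) -> Optional[timedelta]:
    """Time from the first sample below target to the next sample at or above it.

    Returns None if the SLI never breached or has not yet recovered.
    """
    breach_start = None
    for timestamp, sli in samples:
        if breach_start is None and sli < slo_target:
            breach_start = timestamp
        elif breach_start is not None and sli >= slo_target:
            return timestamp - breach_start
    return None


if __name__ == "__main__":
    slo = 0.999  # hypothetical availability target

    # An incident that failed 500 of the 1,000,000 requests in the SLO window.
    print(f"budget consumed: {error_budget_burn(500, 1_000_000, slo):.1%}")

    # Burn rate over a short and a long lookback window (made-up counts).
    short = burn_rate(failed=72, total=5_000, slo_target=slo)
    long_ = burn_rate(failed=300, total=60_000, slo_target=slo)
    print(f"short={short:.1f} long={long_:.1f} escalate={should_escalate(short, long_, 4.0)}")

    # SLI samples around an incident; timestamps are illustrative.
    t0 = datetime(2024, 1, 1, 10, 0)
    series = [
        (t0, 0.9995),
        (t0 + timedelta(minutes=5), 0.9900),
        (t0 + timedelta(minutes=10), 0.9950),
        (t0 + timedelta(minutes=20), 0.9992),
    ]
    print("TTBR:", time_to_budget_recovery(series, slo))
```

Requiring both windows to exceed the threshold is one way to act on the short-versus-long comparison in the table. Teams that want earlier warnings on slow, sustained burns could instead evaluate the long window alone against a lower threshold.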