We run everything on Kubernetes, and we’ve got solid observability. OpenTelemetry collectors, Prometheus scraping everything, Grafana dashboards. We can see pod metrics, request traces, error rates. All the data.
But during incidents, we still ended up guessing. “Should we restart pods? Scale horizontally? Check the database? Roll back?” No one knew. Every incident was a research project.
The issue wasn’t Kubernetes or observability tooling. It was the lack of frameworks for acting on the data. Here’s what we did:
Availability SLI from OpenTelemetry: We use the spanmetrics connector in our OpenTelemetry Collector with a namespace like traces.spanmetrics. This generates metrics in Prometheus that we use for SLOs. For availability, we calculate percentage of successful requests by comparing successful calls (2xx/3xx status codes) against total calls for each service. We set 99.5% as our SLO. The OpenTelemetry Collector’s spanmetrics connector automatically generates these metrics from traces, so we instrument once and get both detailed traces for debugging and aggregated metrics for SLOs.
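Roughly what the wiring looks like, simplified into a sketch (exact metric and label names depend on the collector version, the namespace, and the dimensions you configure; here I’m assuming the calls counter lands in Prometheus as traces_spanmetrics_calls_total with service_name and http_status_code labels):

```yaml
# --- otel-collector.yaml (excerpt) ---
connectors:
  spanmetrics:
    namespace: traces.spanmetrics
    dimensions:
      - name: http.status_code      # exported as a label so we can filter on 2xx/3xx below
                                    # (newer semconv uses http.response.status_code instead)
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, spanmetrics]   # traces still go to the trace backend; spanmetrics feeds metrics
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
---
# --- Prometheus recording rule: availability SLI = successful calls / total calls, per service ---
groups:
  - name: slo-availability
    rules:
      - record: service:availability:ratio_rate5m
        expr: |
          sum by (service_name) (
            rate(traces_spanmetrics_calls_total{http_status_code=~"2..|3.."}[5m])
          )
          /
          sum by (service_name) (
            rate(traces_spanmetrics_calls_total[5m])
          )
```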
Latency SLI: We use histogram quantiles from the duration metrics that spanmetrics generates. We track 99th percentile response time for successful requests (2xx status codes). That gives us the response time 99% of successful requests stay under, which captures the tail latency an average would hide.
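Same caveat on names: depending on the connector version the histogram may be called duration or duration_milliseconds (and report in ms), so adjust the metric name and units to whatever actually shows up in your Prometheus. A sketch of the p99 rule:

```yaml
# Prometheus recording rule: p99 latency for successful requests, per service.
# Assumes the spanmetrics duration histogram is exposed as traces_spanmetrics_duration_bucket.
groups:
  - name: slo-latency
    rules:
      - record: service:latency:p99_rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (service_name, le) (
              rate(traces_spanmetrics_duration_bucket{http_status_code=~"2.."}[5m])
            )
          )
```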
Runbooks: We connect them to Prometheus alerts via annotations. When an alert fires for high error rate, the PagerDuty notification includes: service name, current error rate vs SLO threshold (e.g., current is 2.3%, SLO allows 0.5%), dashboard link for the service overview, trace query for Tempo to investigate failing requests, and runbook link with remediation steps. The runbook tells us exactly what to check and do. We structure runbooks with sections for symptoms (what you see in Grafana), verification (how to confirm the issue), remediation (step-by-step actions like restart pods, scale horizontally, check database), escalation (when to involve others), and rollback (if remediation fails).
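The alert side looks roughly like this. The service name, URLs and the TraceQL query are placeholders, and the annotation keys are just conventions that our notification templates render into the PagerDuty page, not anything Prometheus treats specially:

```yaml
# Prometheus alerting rule wired to the runbook via annotations.
# "checkout" and all URLs below are illustrative placeholders.
groups:
  - name: slo-alerts
    rules:
      - alert: CheckoutHighErrorRate
        # error rate = 1 - availability; fire when it exceeds the 0.5% allowed by the 99.5% SLO
        expr: (1 - service:availability:ratio_rate5m{service_name="checkout"}) > 0.005
        for: 5m
        labels:
          severity: page
        annotations:
          summary: >-
            checkout error rate {{ $value | humanizePercentage }} is above the 0.5% allowed by the SLO
          dashboard: https://grafana.example.com/d/checkout-overview
          trace_query: '{ resource.service.name = "checkout" && status = error }'  # TraceQL for Tempo
          runbook_url: https://runbooks.example.com/checkout/high-error-rate
```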
Post-mortems: We do them within 48 hours of incident resolution while details are fresh. Template includes Impact (users affected, SLO impact showing error budget consumed), Timeline (key events from alert fired through resolution), Root Cause (what changed, why it caused the problem, why safeguards didn’t prevent it), What Went Well/Poorly, and Action Items with owners, priorities, and due dates. We prioritize action items in sprint planning. This is critical, otherwise post-mortems become theater where everyone nods and changes nothing.
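If it helps, here’s the shape of the template as a skeleton (a paraphrase of the sections above, not a copy of our actual doc; field names are illustrative):

```yaml
# Post-mortem skeleton mirroring the sections described above.
incident: ""              # short title
date: ""
impact:
  users_affected: ""
  slo_impact: ""          # error budget consumed
timeline:                 # key events, from alert fired through resolution
  - ""
root_cause:
  what_changed: ""
  why_it_caused_the_problem: ""
  why_safeguards_did_not_prevent_it: ""
went_well: []
went_poorly: []
action_items:
  - description: ""
    owner: ""
    priority: ""
    due: ""
```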
The post covers how to build SLIs from your existing OpenTelemetry span-metrics (those traces you’re already collecting), set SLOs that create error budgets, connect runbooks to alerts (so notifications include remediation steps), and structure post-mortems that drive real improvements.
It includes practical templates and examples: From Signals to Reliability: SLOs, Runbooks and Post-Mortems
What’s your incident response workflow in Kubernetes environments? How do you decide when to scale vs restart vs rollback?