- SRE is not about hope-based reliability.
- It is about numbers, thresholds, and consequences.
- At the core of SRE are SLIs, SLOs, and SLAs.
- If you cannot measure it, you cannot make it reliable.
🎯 Big Picture (Mental Model)
- 🟢 SLIs → What we measure
- 🟡 SLOs → What we aim for
- 🔴 SLAs → What we promise legally
SLIs feed both SLAs and SLOs. The SLA sets the contractual floor, and the SLO is set stricter than that floor.
📏 Service Level Indicators (SLIs)
✔ Raw measurements of service behavior
SLIs are quantitative metrics that describe how a service behaves from the user’s perspective.
🔍 Common SLI Types
| SLI Type | What It Measures |
|---|---|
| 🟢 Availability | Whether the service was reachable |
| ⚡ Latency | How fast responses are |
| ❌ Error Rate | How many requests failed |
| 📦 Throughput | Requests per second |
| 🔄 Freshness | Data staleness |
✅ Example SLIs (API Service)
Availability SLI = Successful requests / Total requests
Latency SLI = % of requests under 300ms
Error Rate SLI = 5xx responses / Total requests
📌 SLIs do not define targets. They only provide truthful signals.
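The three example SLIs above can be computed from a list of request records. This is a minimal sketch, not a production pipeline: the `Request` class, the `compute_slis` function, and the sample data are hypothetical, and it assumes that only 5xx responses count as failures (so a 404 still counts as a successful request).

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def compute_slis(requests, latency_threshold_ms=300.0):
    """Return the three example SLIs as ratios between 0 and 1."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the measurement window")
    # Assumption: only 5xx responses count as failures.
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    # Requests served strictly faster than the latency threshold.
    fast = sum(1 for r in requests if r.duration_ms < latency_threshold_ms)
    return {
        "availability": (total - errors) / total,
        "latency": fast / total,
        "error_rate": errors / total,
    }


# Hypothetical sample: four requests, one of them a 5xx, one of them slow.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(404, 30.0),
]
slis = compute_slis(sample)
print(slis)
```

Note that the function only reports ratios. Whether 0.75 availability is acceptable is a question for the SLO, not the SLI.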
🎯 Service Level Objectives (SLOs)
✔ Target reliability goals
SLOs define how reliable the service must be.
They are engineering targets, not legal contracts.
🔢 Example SLOs
Availability SLO: 99.9% monthly uptime
Latency SLO: 95% of requests under 300ms
Error Rate SLO: Less than 0.1% failed requests
📌 SLOs are based on user expectations, not perfection.
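To make the distinction concrete, the example SLOs above can be written as thresholds and compared against measured SLIs. The following is a sketch under stated assumptions: the `EXAMPLE_SLOS` table, the `evaluate_slos` function, and the measured values are all hypothetical, and SLIs are expressed as ratios between 0 and 1.

```python
import operator

# Example SLOs from the text, expressed as (comparison, threshold) pairs.
EXAMPLE_SLOS = {
    "availability": (operator.ge, 0.999),  # at least 99.9%
    "latency": (operator.ge, 0.95),        # at least 95% of requests under 300ms
    "error_rate": (operator.lt, 0.001),    # less than 0.1% failed requests
}


def evaluate_slos(measured, slos=EXAMPLE_SLOS):
    """Return a dict mapping each SLO name to True (met) or False (missed)."""
    return {
        name: compare(measured[name], threshold)
        for name, (compare, threshold) in slos.items()
    }


# Hypothetical measurements for one window.
measured = {"availability": 0.9995, "latency": 0.93, "error_rate": 0.0005}
slo_status = evaluate_slos(measured)
print(slo_status)
```

In this made-up window the availability and error rate targets are met while the latency target is missed, which is the kind of signal that should drive engineering priorities.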
🔥 Error Budget (Why SLOs Matter)
99.9% uptime over a 30-day month = 43.2 minutes of allowed downtime (0.1% of 43,200 minutes)
That downtime is your error budget.
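The arithmetic behind that figure is easy to reproduce for other targets. A small sketch, assuming a 30-day window by default; the function name `error_budget_minutes` is made up for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO given as a ratio."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a ratio in (0, 1]")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


for target in (0.995, 0.999, 0.9999):
    # Rounded for display because of floating-point noise.
    print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes")
```

Each extra nine shrinks the budget by a factor of ten, which is why a 100% target leaves no room for any change at all.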
| If error budget exists | If error budget is exhausted |
|---|---|
| 🚀 Ship features | 🛑 Freeze releases |
| 🧪 Experiment | 🔧 Focus on stability |
This is SRE discipline in action.
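The policy in the table above can be expressed as a simple check on the remaining budget. This is an illustrative sketch only: `release_decision` is a hypothetical helper, it assumes a 30-day window, and real policies usually add thresholds and exceptions (for example, still allowing security fixes during a freeze).

```python
def release_decision(slo, downtime_minutes, window_days=30):
    """Return (remaining_budget_minutes, action) for an availability SLO."""
    budget = (1.0 - slo) * window_days * 24 * 60
    remaining = budget - downtime_minutes
    # Budget left: keep shipping. Budget spent: stop and stabilize.
    action = "ship features" if remaining > 0 else "freeze releases"
    return remaining, action


# Hypothetical month with 10 minutes of downtime against a 99.9% SLO.
remaining, action = release_decision(0.999, downtime_minutes=10)
print(f"{remaining:.1f} minutes left: {action}")

# Hypothetical month with 50 minutes of downtime against the same SLO.
remaining_bad, action_bad = release_decision(0.999, downtime_minutes=50)
print(f"{remaining_bad:.1f} minutes left: {action_bad}")
```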
📜 Service Level Agreements (SLAs)
✔ Legal and business commitments
SLAs are contracts with customers.
They reference SLIs and define:
- Acceptable performance
- Measurement windows
- Penalties or credits
🧾 Example SLA Clause
The service will maintain 99.5% monthly availability.
If availability falls below 99.5%, customers receive a 10% service credit.
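The example clause can be read as a small rule. A sketch, assuming the credit is a flat percentage of a monthly fee; the `service_credit` function and the fee amounts are hypothetical, and real contracts often use tiered credits.

```python
def service_credit(monthly_availability, monthly_fee,
                   sla_target=0.995, credit_rate=0.10):
    """Credit owed for one month under the example clause."""
    # The clause applies only when availability falls below the target.
    if monthly_availability < sla_target:
        return monthly_fee * credit_rate
    return 0.0


# Hypothetical customer paying 1000 per month.
credit_breach = service_credit(0.993, monthly_fee=1000.0)  # below 99.5%
credit_ok = service_credit(0.998, monthly_fee=1000.0)      # above 99.5%
print(credit_breach, credit_ok)
```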
📌 SLAs are intentionally less strict than SLOs.
Why? Because breaking an SLA costs money and trust.
🔗 How SLIs, SLOs, and SLAs Connect
SLI → Measured data
SLA → Contractual minimums using SLIs
SLO → Internal reliability target set above SLA
🧠 Visual Flow
📊 SLIs (metrics)
↓
📜 SLAs (legal promises)
↓
🎯 SLOs (engineering goals)
🏗️ Real-World Example (E-commerce App)
📊 SLIs
- Availability: Successful HTTP responses
- Latency: Request duration
- Error Rate: 5xx responses
📜 SLA (Customer-Facing)
99.5% monthly availability
🎯 SLO (Engineering Target)
99.9% monthly availability
95% of requests < 250ms
Error rate < 0.1%
Why higher than SLA?
- ✔ Buffer for incidents
- ✔ Protect customer trust
- ✔ Avoid financial penalties
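The size of that buffer can be quantified by comparing the downtime each target allows. A sketch for the e-commerce numbers above, assuming a 30-day month; `sla_buffer_minutes` is a hypothetical helper that also rejects an SLO that is not stricter than the SLA.

```python
def sla_buffer_minutes(slo, sla, window_days=30):
    """Extra downtime (minutes) between missing the SLO and breaching the SLA."""
    if slo <= sla:
        raise ValueError("the SLO should be stricter than the SLA")
    window_minutes = window_days * 24 * 60
    slo_budget = (1.0 - slo) * window_minutes  # internal error budget
    sla_budget = (1.0 - sla) * window_minutes  # downtime before credits are owed
    return sla_budget - slo_budget


# SLO 99.9% vs SLA 99.5% from the example above.
buffer_minutes = sla_buffer_minutes(slo=0.999, sla=0.995)
print(f"{buffer_minutes:.1f} minutes of buffer")
```

With these numbers the team has roughly 43 minutes of internal budget, and nearly three further hours of downtime before the contractual threshold is crossed.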
❌ Common Mistakes (Callout)
- 🚫 Setting SLOs without SLIs
- 🚫 100% uptime targets
- 🚫 SLAs tighter than SLOs
- 🚫 Measuring system metrics instead of user experience
✅ SRE Golden Rules
- Measure what users feel
- Target less than perfect
- Use error budgets to guide decisions
- Protect engineers from endless firefighting
🏁 Final Takeaway
- SLIs tell the truth
- SLOs define reliability goals
- SLAs define consequences
This trio is what turns reliability from wishful thinking into engineering discipline 💪