Introduction
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are fundamental concepts in Site Reliability Engineering. Understanding and implementing them correctly is crucial for maintaining reliable services.
Core Concepts
SLI (Service Level Indicator)
Definition: A quantitative measure of service reliability from the user’s perspective.
Common SLIs:
- Availability: Percentage of successful requests
- Latency: Proportion of requests served faster than threshold
- Throughput: Requests processed per second
- Error Rate: Percentage of failed requests
Example SLI Definitions:
# Availability SLI
availability_sli = (successful_requests / total_requests) * 100
# Latency SLI
latency_sli = (requests_under_300ms / total_requests) * 100
# Error Rate SLI
error_rate_sli = (failed_requests / total_requests) * 100
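To make these formulas concrete, here is a small worked example with hypothetical request counts for one day (the numbers are illustrative only):
# Hypothetical counts for one day of traffic
total_requests = 250_000
successful_requests = 249_700
failed_requests = total_requests - successful_requests
requests_under_300ms = 241_000

availability_sli = (successful_requests / total_requests) * 100  # 99.88%
latency_sli = (requests_under_300ms / total_requests) * 100      # 96.40%
error_rate_sli = (failed_requests / total_requests) * 100        # 0.12%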
SLO (Service Level Objective)
Definition: A target value or range for an SLI over a time window.
Example SLOs:
# 99.9% of requests succeed (availability)
availability_slo: 99.9%
time_window: 30 days
# 95% of requests complete under 300ms (latency)
latency_slo: 95%
threshold: 300ms
time_window: 7 days
# Error rate stays below 0.1%
error_rate_slo: 0.1%
time_window: 30 days
SLO Best Practices:
- Start with achievable targets and iterate upward (see the sketch after this list)
- Base SLOs on user expectations, not technical limits
- Define multiple SLOs for different aspects (availability, latency, throughput)
- Choose meaningful time windows (typically 7, 28, or 90 days)
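As a rough illustration of the first two practices, the sketch below uses a hypothetical helper (not part of any library) that proposes a conservative initial target from historical daily availability: it takes the worst observed day and snaps down to a common SLO value.
def suggest_initial_slo(daily_availability, margin=0.05):
    """Suggest a conservative starting SLO target (percent) from history.

    Takes the worst observed day, subtracts a small margin, and snaps down
    to a common SLO value so the target is achievable most of the time.
    """
    candidate = min(daily_availability) - margin
    common_targets = [99.99, 99.95, 99.9, 99.5, 99.0, 98.0, 95.0]
    for target in common_targets:
        if target <= candidate:
            return target
    return min(common_targets)

# Example: measured daily availability (percent), truncated for brevity
history = [99.97, 99.92, 99.88, 99.95, 99.99]
print(suggest_initial_slo(history))  # -> 99.5 with this sample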
SLA (Service Level Agreement)
Definition: A formal commitment to customers, typically with financial consequences if violated.
SLA vs SLO:
- SLA: External commitment with penalties
- SLO: Internal target, typically stricter than SLA
Example:
# External SLA (customer-facing)
availability_sla: 99.5%
penalty: Credits for downtime beyond 0.5%
# Internal SLO (engineering target)
availability_slo: 99.9%
buffer: 0.4% (SLO - SLA)
Error Budgets
Definition: The acceptable amount of unreliability, calculated as 100% - SLO.
Error Budget Calculation
# Example: 99.9% availability SLO over 30 days
slo_target = 99.9 # percent
time_window = 30 * 24 * 60 # minutes in 30 days = 43,200
# Error budget in minutes
error_budget = time_window * (1 - slo_target / 100)
# Result: 43.2 minutes of downtime allowed per 30 days
# Error budget in requests (for 1M requests/month)
total_requests = 1_000_000
error_budget_requests = total_requests * (1 - slo_target / 100)
# Result: 1,000 failed requests allowed
Error Budget Policy
When error budget is healthy (>25% remaining):
- Deploy new features aggressively
- Experiment with new technologies
- Take calculated risks
When error budget is low (<25% remaining):
- Freeze feature launches
- Focus on reliability improvements
- Slow down deployment velocity
- Investigate recent changes
When error budget is exhausted:
- Stop all risky changes
- Emergency reliability work only
- Root cause analysis mandatory
- Postmortem all incidents
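The tiers above can also be encoded directly so that deployment tooling or dashboards can report the current policy state. A minimal sketch (the thresholds mirror the tiers above; the function name is illustrative):
def error_budget_policy(remaining_pct):
    """Map remaining error budget (percent) to the policy tiers above."""
    if remaining_pct <= 0:
        return "exhausted: stop risky changes, emergency reliability work only"
    if remaining_pct < 25:
        return "low: freeze feature launches, focus on reliability"
    return "healthy: normal feature velocity, calculated risks allowed"

print(error_budget_policy(60))  # healthy
print(error_budget_policy(10))  # low
print(error_budget_policy(0))   # exhausted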
Burn Rate
Definition: The rate at which error budget is being consumed.
Burn Rate Calculation
# Example: Monitoring burn rate
time_window = 30 # days
observed_availability = 99.8 # percent (measured)
slo_target = 99.9 # percent
# Error budget consumption
budget_consumed = (slo_target - observed_availability) / (100 - slo_target)
# Result: (99.9 - 99.8) / (100 - 99.9) = 0.1 / 0.1 = 1.0 (100% of budget consumed)
# Burn rate (relative to time window)
elapsed_days = 5
burn_rate = (budget_consumed * time_window) / elapsed_days
# Result: 1.0 * 30 / 5 = 6x (budget consumed six times faster than sustainable)
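A burn rate translates directly into time to exhaustion: at 1x the budget lasts exactly the window; at Nx it lasts window / N days if the rate holds. This is where the thresholds in the alerts below come from:
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until a full error budget is gone at a constant burn rate."""
    return window_days / burn_rate

print(days_to_exhaustion(14.4))  # ~2.1 days -> critical
print(days_to_exhaustion(6))     # 5 days    -> warning
print(days_to_exhaustion(3))     # 10 days   -> info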
Burn Rate Alerting
Multi-window, multi-burn-rate alerts:
# availability_sli is assumed to be a recording rule that yields a 0-1 success ratio
# Fast burn (critical)
- alert: HighBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 14.4
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget will exhaust in 2 days at current rate"

# Moderate burn (warning)
- alert: ModerateBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 6
  for: 6h
  labels:
    severity: warning
  annotations:
    summary: "Error budget will exhaust in 5 days at current rate"

# Slow burn (info)
- alert: SlowBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 3
  for: 24h
  labels:
    severity: info
  annotations:
    summary: "Error budget will exhaust in 10 days at current rate"
Practical Implementation
Step 1: Define SLIs
Prometheus Example:
# Availability SLI: share of requests that succeeded
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: 1 if the 95th percentile is under 300ms, else 0
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < bool 0.3
Step 2: Set SLO Targets
services:
  - name: user-api
    slos:
      - type: availability
        target: 99.9
        window: 30d
      - type: latency
        target: 99.0
        threshold: 300ms
        window: 7d
      - type: error_rate
        target: 0.1   # maximum, in percent
        window: 30d
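A sketch of how such a config could be consumed, assuming PyYAML is available and using the field names shown above (only the availability SLO is handled here, since the error-rate target is a maximum rather than a success ratio):
import yaml

CONFIG = """
services:
  - name: user-api
    slos:
      - type: availability
        target: 99.9
        window: 30d
"""

def window_days(window):
    """Parse a day-based window like '30d'; other units are out of scope here."""
    assert window.endswith("d")
    return int(window[:-1])

for service in yaml.safe_load(CONFIG)["services"]:
    for slo in service["slos"]:
        days = window_days(slo["window"])
        budget_minutes = days * 24 * 60 * (1 - slo["target"] / 100)
        print(f'{service["name"]}/{slo["type"]}: '
              f'{budget_minutes:.1f} min of error budget per {days}d window')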
Step 3: Calculate Error Budgets
Python Script:
class ErrorBudget:
    def __init__(self, slo_target, time_window_days):
        self.slo_target = slo_target
        self.time_window = time_window_days
        self.allowed_downtime = self._calculate_downtime()

    def _calculate_downtime(self):
        """Calculate allowed downtime in seconds"""
        total_seconds = self.time_window * 24 * 60 * 60
        return total_seconds * (1 - self.slo_target / 100)

    def remaining_budget(self, actual_downtime_seconds):
        """Calculate remaining error budget"""
        remaining = self.allowed_downtime - actual_downtime_seconds
        percentage = (remaining / self.allowed_downtime) * 100
        return {
            'remaining_seconds': remaining,
            'remaining_percentage': percentage,
            'consumed_percentage': 100 - percentage
        }

    def burn_rate(self, downtime_in_period, period_duration_days):
        """Calculate current burn rate"""
        expected_budget_use = period_duration_days / self.time_window * 100
        actual_budget_use = (downtime_in_period / self.allowed_downtime) * 100
        return actual_budget_use / expected_budget_use


# Usage
budget = ErrorBudget(slo_target=99.9, time_window_days=30)
print(f"Allowed downtime: {budget.allowed_downtime / 60:.2f} minutes")

# Check remaining budget
actual_downtime = 20 * 60  # 20 minutes
status = budget.remaining_budget(actual_downtime)
print(f"Budget consumed: {status['consumed_percentage']:.1f}%")
print(f"Budget remaining: {status['remaining_percentage']:.1f}%")

# Calculate burn rate (10 min downtime in 5 days)
rate = budget.burn_rate(downtime_in_period=10*60, period_duration_days=5)
print(f"Burn rate: {rate:.1f}x")
Step 4: Dashboard and Monitoring
Grafana Dashboard Panels:
{
  "dashboard": {
    "title": "SLO Dashboard - User API",
    "panels": [
      {
        "title": "Current Availability SLI",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d])) * 100"
        }],
        "thresholds": [
          {"value": 99.9, "color": "green"},
          {"value": 99.5, "color": "yellow"},
          {"value": 99.0, "color": "red"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [{
          "expr": "(1 - (1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)) * 100"
        }]
      },
      {
        "title": "Burn Rate (1h window, 30d SLO)",
        "targets": [{
          "expr": "(1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)"
        }]
      }
    ]
  }
}
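A quick sanity check of the "Error Budget Remaining" expression above, evaluated in Python with a few hypothetical availability values against the 0.999 target:
slo = 0.999
for availability in (1.0, 0.9995, 0.9992, 0.998):
    remaining = (1 - (1 - availability) / (1 - slo)) * 100
    print(f"availability={availability:.4f} -> {remaining:.0f}% budget remaining")
# -> 100%, 50%, 20%, and -100% (over budget) for these values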
Real-World Examples
Example 1: E-commerce Platform
service: checkout-api
slis:
  - name: availability
    measurement: successful_checkouts / total_checkout_attempts
  - name: latency_p95
    measurement: 95th_percentile_checkout_time
  - name: payment_success
    measurement: successful_payments / total_payment_attempts
slos:
  - sli: availability
    target: 99.95%
    window: 30d
    rationale: "Losing checkout means losing revenue"
  - sli: latency_p95
    target: 500ms
    window: 7d
    rationale: "Fast checkout improves conversion"
  - sli: payment_success
    target: 99.5%
    window: 30d
    rationale: "Payment failures acceptable only for fraud/insufficient funds"
error_budget_policy:
  - remaining: ">50%"
    action: "Normal development pace"
  - remaining: "25-50%"
    action: "Review reliability risks, defer non-critical features"
  - remaining: "<25%"
    action: "Feature freeze, focus on stability"
  - remaining: "0%"
    action: "Emergency reliability work only, incident retrospectives"
Example 2: Data Pipeline
service: analytics-pipeline
slis:
  - name: freshness
    measurement: time_since_last_successful_run
  - name: completeness
    measurement: records_processed / records_expected
  - name: accuracy
    measurement: records_passing_validation / total_records
slos:
  - sli: freshness
    target: "95% of runs complete within 2 hours of schedule"
    window: 7d
  - sli: completeness
    target: "99.9% of expected records processed"
    window: 30d
  - sli: accuracy
    target: "99.5% of records pass validation"
    window: 30d
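As an illustration, the freshness SLI above could be computed from run records like this (the record structure and field names are hypothetical):
from datetime import datetime, timedelta

runs = [
    {"scheduled": datetime(2024, 1, 1), "finished": datetime(2024, 1, 1, 1, 30)},
    {"scheduled": datetime(2024, 1, 2), "finished": datetime(2024, 1, 2, 3, 10)},
    {"scheduled": datetime(2024, 1, 3), "finished": datetime(2024, 1, 3, 0, 45)},
]

deadline = timedelta(hours=2)
on_time = sum(1 for r in runs if r["finished"] - r["scheduled"] <= deadline)
freshness_sli = on_time / len(runs) * 100
print(f"freshness SLI: {freshness_sli:.1f}%  (target: 95% within 2h)")
# -> 66.7% with this sample: the Jan 2 run missed the 2-hour deadline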
Common Pitfalls
Pitfall 1: Too Many SLOs
Problem: Tracking 20+ SLOs per service dilutes focus.
Solution: Start with 3-5 critical user-facing SLOs.
Pitfall 2: Unrealistic Targets
Problem: Setting a 99.99% SLO when the system currently achieves 99.5%.
Solution: Base SLOs on current performance plus incremental improvement.
Pitfall 3: Ignoring Error Budget
Problem: Treating SLO breaches as “nice to have” metrics.
Solution: Enforce error budget policy with stakeholder buy-in.
Pitfall 4: SLOs Without SLIs
Problem: “Our service should be fast” (no measurement).
Solution: Define measurable SLIs first, then set SLO targets.
Pitfall 5: Identical SLA and SLO
Problem: No buffer between internal target and customer commitment.
Solution: SLO should be stricter (e.g., 99.9% SLO, 99.5% SLA).
Migration Path
For Teams Starting Fresh
Week 1-2: Instrument SLIs
- Add metrics for availability, latency, errors
- Validate data accuracy
Week 3-4: Analyze baseline
- Review 90 days of historical data
- Identify current performance levels
Week 5-6: Define initial SLOs
- Set conservative targets (achievable 90% of time)
- Document rationale
Week 7-8: Implement monitoring
- Create dashboards
- Set up burn rate alerts
Month 3+: Iterate and refine
- Adjust targets based on learnings
- Add error budget policy
For Teams With Existing Monitoring
- Map existing metrics to SLI framework
- Identify gaps (missing user-facing metrics)
- Define SLOs based on historical performance
- Implement error budget tracking
- Establish error budget policy
Tools and Resources
Monitoring Tools
- Prometheus + Grafana: Open-source, flexible
- Datadog: Commercial, easy setup
- New Relic: APM with SLO tracking
- Google Cloud Monitoring: Native GCP integration
SLO Management Tools
- Nobl9: Dedicated SLO platform
- Sloth: SLO generator for Prometheus
- Grafana SLO: Built-in Grafana SLO tracking
Useful Commands
# Check current availability (from Prometheus)
curl -g 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))'
# Calculate error budget consumption
echo "scale=2; (99.9 - 99.85) / (100 - 99.9) * 100" | bc
# Output: 50.00 (50% budget consumed)
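The same two checks can be scripted. A sketch assuming the requests library and the prometheus:9090 endpoint used in the curl command above (the response parsing assumes the query returns at least one series):
import requests

SLO = 99.9
QUERY = ('sum(rate(http_requests_total{status=~"2..|3.."}[30d]))'
         ' / sum(rate(http_requests_total[30d]))')

resp = requests.get("http://prometheus:9090/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
availability = float(resp.json()["data"]["result"][0]["value"][1]) * 100

consumed = (SLO - availability) / (100 - SLO) * 100
print(f"availability: {availability:.3f}%  budget consumed: {consumed:.1f}%")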
Conclusion
SLOs, SLIs, and error budgets provide a framework for balancing reliability and innovation. Key takeaways:
- Start small: 3-5 critical SLOs per service
- Measure user experience: SLIs should reflect what users care about
- Use error budgets: Make reliability vs velocity tradeoffs objective
- Iterate: Refine SLOs based on learnings and changing requirements
- Enforce policy: Error budget must have teeth to be effective
Remember: The goal isn’t perfect reliability—it’s appropriate reliability that allows sustainable innovation.