Introduction
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are fundamental concepts in Site Reliability Engineering. Understanding and implementing them correctly is crucial for maintaining reliable services.
Core Concepts
SLI (Service Level Indicator)
Definition: A quantitative measure of service reliability from the user’s perspective.
Common SLIs:
- Availability: Percentage of successful requests
- Latency: Proportion of requests served faster than threshold
- Throughput: Requests processed per second
- Error Rate: Percentage of failed requests
Example SLI Definitions:
# Availability SLI
availability_sli = (successful_requests / total_requests) * 100
# Latency SLI
latency_sli = (requests_under_300ms / total_requests) * 100
# Error Rate SLI
error_rate_sli = (failed_requests / total_requests) * 100
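To make these formulas concrete, here is a small worked example with hypothetical request counts for one day (the numbers are illustrative only):
# Hypothetical counts for one day of traffic
total_requests = 250_000
successful_requests = 249_700
failed_requests = total_requests - successful_requests
requests_under_300ms = 241_000

availability_sli = (successful_requests / total_requests) * 100  # 99.88%
latency_sli = (requests_under_300ms / total_requests) * 100      # 96.40%
error_rate_sli = (failed_requests / total_requests) * 100        # 0.12%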
SLO (Service Level Objective)
Definition: A target value or range for an SLI over a time window.
Example SLOs:
# 99.9% of requests succeed (availability)
availability_slo: 99.9%
time_window: 30 days
# 95% of requests complete under 300ms (latency)
latency_slo: 95%
threshold: 300ms
time_window: 7 days
# Error rate stays below 0.1%
error_rate_slo: 0.1%
time_window: 30 days
SLO Best Practices:
- Start with achievable targets and iterate upward (see the sketch after this list)
- Base SLOs on user expectations, not technical limits
- Define multiple SLOs for different aspects (availability, latency, throughput)
- Choose meaningful time windows (typically 7, 28, or 90 days)
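As a rough illustration of the first two practices, the sketch below uses a hypothetical helper (not part of any library) that proposes a conservative initial target from historical daily availability: it takes the worst observed day and snaps down to a common SLO value.
def suggest_initial_slo(daily_availability, margin=0.05):
    """Suggest a conservative starting SLO target (percent) from history.

    Takes the worst observed day, subtracts a small margin, and snaps down
    to a common SLO value so the target is achievable most of the time.
    """
    candidate = min(daily_availability) - margin
    common_targets = [99.99, 99.95, 99.9, 99.5, 99.0, 98.0, 95.0]
    for target in common_targets:
        if target <= candidate:
            return target
    return min(common_targets)

# Example: measured daily availability (percent), truncated for brevity
history = [99.97, 99.92, 99.88, 99.95, 99.99]
print(suggest_initial_slo(history))  # -> 99.5 with this sample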
SLA (Service Level Agreement)
Definition: A formal commitment to customers, typically with financial consequences if violated.
SLA vs SLO:
- SLA: External commitment with penalties
- SLO: Internal target, typically stricter than SLA
Example:
# External SLA (customer-facing)
availability_sla: 99.5%
penalty: Credits for downtime beyond 0.5%
# Internal SLO (engineering target)
availability_slo: 99.9%
buffer: 0.4% (SLO - SLA)
Error Budgets
Definition: The acceptable amount of unreliability, calculated as 100% - SLO.
Error Budget Calculation
# Example: 99.9% availability SLO over 30 days
slo_target = 99.9 # percent
time_window = 30 * 24 * 60 # minutes in 30 days = 43,200
# Error budget in minutes
error_budget = time_window * (1 - slo_target / 100)
# Result: 43.2 minutes of downtime allowed per 30 days
# Error budget in requests (for 1M requests/month)
total_requests = 1_000_000
error_budget_requests = total_requests * (1 - slo_target / 100)
# Result: 1,000 failed requests allowed
Error Budget Policy
When error budget is healthy (>25% remaining):
- Deploy new features aggressively
- Experiment with new technologies
- Take calculated risks
When error budget is low (<25% remaining):
- Freeze feature launches
- Focus on reliability improvements
- Slow down deployment velocity
- Investigate recent changes
When error budget is exhausted:
- Stop all risky changes
- Emergency reliability work only
- Root cause analysis mandatory
- Postmortem all incidents
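The tiers above can also be encoded directly so that deployment tooling or dashboards can report the current policy state. A minimal sketch (the thresholds mirror the tiers above; the function name is illustrative):
def error_budget_policy(remaining_pct):
    """Map remaining error budget (percent) to the policy tiers above."""
    if remaining_pct <= 0:
        return "exhausted: stop risky changes, emergency reliability work only"
    if remaining_pct < 25:
        return "low: freeze feature launches, focus on reliability"
    return "healthy: normal feature velocity, calculated risks allowed"

print(error_budget_policy(60))  # healthy
print(error_budget_policy(10))  # low
print(error_budget_policy(0))   # exhausted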
Burn Rate
Definition: The rate at which error budget is being consumed.
Burn Rate Calculation
# Example: Monitoring burn rate
time_window = 30 # days
observed_availability = 99.8 # percent (measured)
slo_target = 99.9 # percent
# Error budget consumption
budget_consumed = (slo_target - observed_availability) / (100 - slo_target)
# Result: (99.9 - 99.8) / (100 - 99.9) = 0.1 / 0.1 = 1.0 (100% of budget consumed)
# Burn rate (relative to time window)
elapsed_days = 5
burn_rate = (budget_consumed * time_window) / elapsed_days
# Result: 1.0 * 30 / 5 = 6x (budget consumed six times faster than sustainable)
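A burn rate translates directly into time to exhaustion: at 1x the budget lasts exactly the window; at Nx it lasts window / N days if the rate holds. This is where the thresholds in the alerts below come from:
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until a full error budget is gone at a constant burn rate."""
    return window_days / burn_rate

print(days_to_exhaustion(14.4))  # ~2.1 days -> critical
print(days_to_exhaustion(6))     # 5 days    -> warning
print(days_to_exhaustion(3))     # 10 days   -> info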
Burn Rate Alerting
Multi-window, multi-burn-rate alerts:
# availability_sli is assumed to be a recording rule that yields a 0-1 success ratio
# Fast burn (critical)
- alert: HighBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 14.4
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget will exhaust in 2 days at current rate"

# Moderate burn (warning)
- alert: ModerateBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 6
  for: 6h
  labels:
    severity: warning
  annotations:
    summary: "Error budget will exhaust in 5 days at current rate"

# Slow burn (info)
- alert: SlowBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 3
  for: 24h
  labels:
    severity: info
  annotations:
    summary: "Error budget will exhaust in 10 days at current rate"
Practical Implementation
Step 1: Define SLIs
Prometheus Example:
# Availability SLI: share of requests that succeeded
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: 1 if the 95th percentile is under 300ms, else 0
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < bool 0.3
Step 2: Set SLO Targets
services:
  - name: user-api
    slos:
      - type: availability
        target: 99.9
        window: 30d
      - type: latency
        target: 99.0
        threshold: 300ms
        window: 7d
      - type: error_rate
        target: 0.1   # maximum, in percent
        window: 30d
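A sketch of how such a config could be consumed, assuming PyYAML is available and using the field names shown above (only the availability SLO is handled here, since the error-rate target is a maximum rather than a success ratio):
import yaml

CONFIG = """
services:
  - name: user-api
    slos:
      - type: availability
        target: 99.9
        window: 30d
"""

def window_days(window):
    """Parse a day-based window like '30d'; other units are out of scope here."""
    assert window.endswith("d")
    return int(window[:-1])

for service in yaml.safe_load(CONFIG)["services"]:
    for slo in service["slos"]:
        days = window_days(slo["window"])
        budget_minutes = days * 24 * 60 * (1 - slo["target"] / 100)
        print(f'{service["name"]}/{slo["type"]}: '
              f'{budget_minutes:.1f} min of error budget per {days}d window')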
Step 3: Calculate Error Budgets
Python Script:
class ErrorBudget:
    def __init__(self, slo_target, time_window_days):
        self.slo_target = slo_target
        self.time_window = time_window_days
        self.allowed_downtime = self._calculate_downtime()

    def _calculate_downtime(self):
        """Calculate allowed downtime in seconds"""
        total_seconds = self.time_window * 24 * 60 * 60
        return total_seconds * (1 - self.slo_target / 100)

    def remaining_budget(self, actual_downtime_seconds):
        """Calculate remaining error budget"""
        remaining = self.allowed_downtime - actual_downtime_seconds
        percentage = (remaining / self.allowed_downtime) * 100
        return {
            'remaining_seconds': remaining,
            'remaining_percentage': percentage,
            'consumed_percentage': 100 - percentage
        }

    def burn_rate(self, downtime_in_period, period_duration_days):
        """Calculate current burn rate"""
        expected_budget_use = period_duration_days / self.time_window * 100
        actual_budget_use = (downtime_in_period / self.allowed_downtime) * 100
        return actual_budget_use / expected_budget_use


# Usage
budget = ErrorBudget(slo_target=99.9, time_window_days=30)
print(f"Allowed downtime: {budget.allowed_downtime / 60:.2f} minutes")

# Check remaining budget
actual_downtime = 20 * 60  # 20 minutes
status = budget.remaining_budget(actual_downtime)
print(f"Budget consumed: {status['consumed_percentage']:.1f}%")
print(f"Budget remaining: {status['remaining_percentage']:.1f}%")

# Calculate burn rate (10 min downtime in 5 days)
rate = budget.burn_rate(downtime_in_period=10*60, period_duration_days=5)
print(f"Burn rate: {rate:.1f}x")
Step 4: Dashboard and Monitoring
Grafana Dashboard Panels:
{
  "dashboard": {
    "title": "SLO Dashboard - User API",
    "panels": [
      {
        "title": "Current Availability SLI",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d])) * 100"
        }],
        "thresholds": [
          {"value": 99.9, "color": "green"},
          {"value": 99.5, "color": "yellow"},
          {"value": 99.0, "color": "red"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [{
          "expr": "(1 - (1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)) * 100"
        }]
      },
      {
        "title": "Burn Rate (1h window, 30d SLO)",
        "targets": [{
          "expr": "(1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)"
        }]
      }
    ]
  }
}
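A quick sanity check of the "Error Budget Remaining" expression above, evaluated in Python with a few hypothetical availability values against the 0.999 target:
slo = 0.999
for availability in (1.0, 0.9995, 0.9992, 0.998):
    remaining = (1 - (1 - availability) / (1 - slo)) * 100
    print(f"availability={availability:.4f} -> {remaining:.0f}% budget remaining")
# -> 100%, 50%, 20%, and -100% (over budget) for these values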
Real-World Examples
Example 1: E-commerce Platform
service: checkout-api
slis:
  - name: availability
    measurement: successful_checkouts / total_checkout_attempts
  - name: latency_p95
    measurement: 95th_percentile_checkout_time
  - name: payment_success
    measurement: successful_payments / total_payment_attempts
slos:
  - sli: availability
    target: 99.95%
    window: 30d
    rationale: "Losing checkout means losing revenue"
  - sli: latency_p95
    target: 500ms
    window: 7d
    rationale: "Fast checkout improves conversion"
  - sli: payment_success
    target: 99.5%
    window: 30d
    rationale: "Payment failures acceptable only for fraud/insufficient funds"
error_budget_policy:
  - remaining: ">50%"
    action: "Normal development pace"
  - remaining: "25-50%"
    action: "Review reliability risks, defer non-critical features"
  - remaining: "<25%"
    action: "Feature freeze, focus on stability"
  - remaining: "0%"
    action: "Emergency reliability work only, incident retrospectives"
Example 2: Data Pipeline
service: analytics-pipeline
slis:
  - name: freshness
    measurement: time_since_last_successful_run
  - name: completeness
    measurement: records_processed / records_expected
  - name: accuracy
    measurement: records_passing_validation / total_records
slos:
  - sli: freshness
    target: "95% of runs complete within 2 hours of schedule"
    window: 7d
  - sli: completeness
    target: "99.9% of expected records processed"
    window: 30d
  - sli: accuracy
    target: "99.5% of records pass validation"
    window: 30d
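As an illustration, the freshness SLI above could be computed from run records like this (the record structure and field names are hypothetical):
from datetime import datetime, timedelta

runs = [
    {"scheduled": datetime(2024, 1, 1), "finished": datetime(2024, 1, 1, 1, 30)},
    {"scheduled": datetime(2024, 1, 2), "finished": datetime(2024, 1, 2, 3, 10)},
    {"scheduled": datetime(2024, 1, 3), "finished": datetime(2024, 1, 3, 0, 45)},
]

deadline = timedelta(hours=2)
on_time = sum(1 for r in runs if r["finished"] - r["scheduled"] <= deadline)
freshness_sli = on_time / len(runs) * 100
print(f"freshness SLI: {freshness_sli:.1f}%  (target: 95% within 2h)")
# -> 66.7% with this sample: the Jan 2 run missed the 2-hour deadline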
Common Pitfalls
Pitfall 1: Too Many SLOs
Problem: Tracking 20+ SLOs per service dilutes focus.
Solution: Start with 3-5 critical user-facing SLOs.
Pitfall 2: Unrealistic Targets
Problem: Setting a 99.99% SLO when the system currently achieves 99.5%.
Solution: Base SLOs on current performance plus incremental improvement.
Pitfall 3: Ignoring Error Budget
Problem: Treating SLO breaches as “nice to have” metrics.
Solution: Enforce error budget policy with stakeholder buy-in.
Pitfall 4: SLOs Without SLIs
Problem: “Our service should be fast” (no measurement).
Solution: Define measurable SLIs first, then set SLO targets.
Pitfall 5: Identical SLA and SLO
Problem: No buffer between internal target and customer commitment.
Solution: SLO should be stricter (e.g., 99.9% SLO, 99.5% SLA).
Migration Path
For Teams Starting Fresh
Week 1-2: Instrument SLIs
- Add metrics for availability, latency, errors
- Validate data accuracy
Week 3-4: Analyze baseline
- Review 90 days of historical data
- Identify current performance levels
Week 5-6: Define initial SLOs
- Set conservative targets (achievable 90% of time)
- Document rationale
Week 7-8: Implement monitoring
- Create dashboards
- Set up burn rate alerts
Month 3+: Iterate and refine
- Adjust targets based on learnings
- Add error budget policy
For Teams With Existing Monitoring
- Map existing metrics to SLI framework
- Identify gaps (missing user-facing metrics)
- Define SLOs based on historical performance
- Implement error budget tracking
- Establish error budget policy
Tools and Resources
Monitoring Tools
- Prometheus + Grafana: Open-source, flexible
- Datadog: Commercial, easy setup
- New Relic: APM with SLO tracking
- Google Cloud Monitoring: Native GCP integration
SLO Management Tools
- Nobl9: Dedicated SLO platform
- Sloth: SLO generator for Prometheus
- Grafana SLO: Built-in Grafana SLO tracking
Useful Commands
# Check current availability (from Prometheus)
curl -g 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))'
# Calculate error budget consumption
echo "scale=2; (99.9 - 99.85) / (100 - 99.9) * 100" | bc
# Output: 50.00 (50% budget consumed)
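The same two checks can be scripted. A sketch assuming the requests library and the prometheus:9090 endpoint used in the curl command above (the response parsing assumes the query returns at least one series):
import requests

SLO = 99.9
QUERY = ('sum(rate(http_requests_total{status=~"2..|3.."}[30d]))'
         ' / sum(rate(http_requests_total[30d]))')

resp = requests.get("http://prometheus:9090/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
availability = float(resp.json()["data"]["result"][0]["value"][1]) * 100

consumed = (SLO - availability) / (100 - SLO) * 100
print(f"availability: {availability:.3f}%  budget consumed: {consumed:.1f}%")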
Conclusion
SLOs, SLIs, and error budgets provide a framework for balancing reliability and innovation. Key takeaways:
- Start small: 3-5 critical SLOs per service
- Measure user experience: SLIs should reflect what users care about
- Use error budgets: Make reliability vs velocity tradeoffs objective
- Iterate: Refine SLOs based on learnings and changing requirements
- Enforce policy: Error budget must have teeth to be effective
Remember: The goal isn’t perfect reliability—it’s appropriate reliability that allows sustainable innovation.