Introduction

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are fundamental concepts in Site Reliability Engineering. Understanding and implementing them correctly is crucial for maintaining reliable services.

Core Concepts

SLI (Service Level Indicator)

Definition: A quantitative measure of service reliability from the user’s perspective.

Common SLIs:

  • Availability: Percentage of successful requests
  • Latency: Proportion of requests served faster than threshold
  • Throughput: Requests processed per second
  • Error Rate: Percentage of failed requests

Example SLI Definitions:

# Availability SLI
availability_sli = (successful_requests / total_requests) * 100

# Latency SLI
latency_sli = (requests_under_300ms / total_requests) * 100

# Error Rate SLI
error_rate_sli = (failed_requests / total_requests) * 100

SLO (Service Level Objective)

Definition: A target value or range for an SLI over a time window.

Example SLOs:

# 99.9% of requests succeed (availability)
availability_slo: 99.9%
time_window: 30 days

# 95% of requests complete under 300ms (latency)
latency_slo: 95%
threshold: 300ms
time_window: 7 days

# Error rate stays below 0.1%
error_rate_slo: 0.1%
time_window: 30 days

SLO Best Practices:

  1. Start with achievable targets, iterate upward
  2. Base SLOs on user expectations, not technical limits
  3. Define multiple SLOs for different aspects (availability, latency, throughput)
  4. Choose meaningful time windows (typically 7, 28, or 90 days)

SLA (Service Level Agreement)

Definition: A formal commitment to customers, typically with financial consequences if violated.

SLA vs SLO:

  • SLA: External commitment with penalties
  • SLO: Internal target, typically stricter than SLA

Example:

# External SLA (customer-facing)
availability_sla: 99.5%
penalty: Credits for downtime beyond 0.5%

# Internal SLO (engineering target)
availability_slo: 99.9%
buffer: 0.4% (SLO - SLA)

Error Budgets

Definition: The acceptable amount of unreliability, calculated as 100% - SLO.

Error Budget Calculation

# Example: 99.9% availability SLO over 30 days
slo_target = 99.9  # percent
time_window = 30 * 24 * 60  # minutes in 30 days = 43,200

# Error budget in minutes
error_budget = time_window * (1 - slo_target / 100)
# Result: 43.2 minutes of downtime allowed per 30 days

# Error budget in requests (for 1M requests/month)
total_requests = 1_000_000
error_budget_requests = total_requests * (1 - slo_target / 100)
# Result: 1,000 failed requests allowed

Error Budget Policy

When error budget is healthy (>25% remaining):

  • Deploy new features aggressively
  • Experiment with new technologies
  • Take calculated risks

When error budget is low (<25% remaining):

  • Freeze feature launches
  • Focus on reliability improvements
  • Slow down deployment velocity
  • Investigate recent changes

When error budget is exhausted:

  • Stop all risky changes
  • Emergency reliability work only
  • Root cause analysis mandatory
  • Postmortem all incidents
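
Such a policy is easiest to enforce when it is encoded in tooling rather than left to judgment during an incident. A minimal sketch of the decision logic above in Python (the function name and return strings are illustrative):

def error_budget_policy(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the policy tier above."""
    if remaining_pct <= 0:
        return "exhausted: stop risky changes, emergency reliability work only"
    if remaining_pct < 25:
        return "low: freeze feature launches, prioritize reliability"
    return "healthy: deploy features at normal velocity"

# Example: 20 minutes of downtime against a 43.2-minute monthly budget
print(error_budget_policy(100 - (20 / 43.2) * 100))  # healthy (~53.7% remaining)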

Burn Rate

Definition: The rate at which error budget is being consumed.

Burn Rate Calculation

# Example: Monitoring burn rate
time_window = 30  # days
observed_availability = 99.8  # percent (measured so far)
slo_target = 99.9  # percent

# Fraction of the total error budget consumed
budget_consumed = (slo_target - observed_availability) / (100 - slo_target)
# Result: (99.9 - 99.8) / (100 - 99.9) = 0.1 / 0.1 = 1.0 (100% consumed)

# Burn rate: consumption relative to the elapsed fraction of the window
elapsed_days = 5
burn_rate = (budget_consumed * time_window) / elapsed_days
# Result: (1.0 * 30) / 5 = 6x; the budget is burning six times faster
# than the sustainable 1x pace

Burn Rate Alerting

Multi-window, multi-burn-rate alerts:

# Fast burn (critical)
- alert: HighBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 14.4
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget will exhaust in 2 days at current rate"

# Moderate burn (warning)
- alert: ModerateBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 6
  for: 6h
  labels:
    severity: warning
  annotations:
    summary: "Error budget will exhaust in 5 days at current rate"

# Slow burn (info)
- alert: SlowBurnRate
  expr: |
    (1 - availability_sli{service="api"}) / (1 - 0.999) > 3
  for: 24h
  labels:
    severity: info
  annotations:
    summary: "Error budget will exhaust in 10 days at current rate"

Practical Implementation

Step 1: Define SLIs

Prometheus Example:

# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI (proportion of requests under 300ms;
# assumes the histogram has a bucket boundary at 0.3s)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

Step 2: Set SLO Targets

services:
  - name: user-api
    slos:
      - type: availability
        target: 99.9
        window: 30d

      - type: latency
        target: 99.0
        threshold: 300ms
        window: 7d

      - type: error_rate
        target: 0.1
        window: 30d
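
A config like this can be loaded and turned into concrete budgets. A minimal sketch, assuming the YAML above is saved as slo_config.yaml and PyYAML is installed (only the availability case is handled here):

import yaml  # pip install pyyaml

with open("slo_config.yaml") as f:
    config = yaml.safe_load(f)

for service in config["services"]:
    for slo in service["slos"]:
        if slo["type"] != "availability":
            continue  # latency/error-rate budgets need their own formulas
        window_days = int(slo["window"].rstrip("d"))
        budget_min = window_days * 24 * 60 * (1 - slo["target"] / 100)
        print(f"{service['name']}: {budget_min:.1f} min downtime budget per {slo['window']}")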

Step 3: Calculate Error Budgets

Python Script:

class ErrorBudget:
    def __init__(self, slo_target, time_window_days):
        self.slo_target = slo_target
        self.time_window = time_window_days
        self.allowed_downtime = self._calculate_downtime()

    def _calculate_downtime(self):
        """Calculate allowed downtime in seconds"""
        total_seconds = self.time_window * 24 * 60 * 60
        return total_seconds * (1 - self.slo_target / 100)

    def remaining_budget(self, actual_downtime_seconds):
        """Calculate remaining error budget"""
        remaining = self.allowed_downtime - actual_downtime_seconds
        percentage = (remaining / self.allowed_downtime) * 100
        return {
            'remaining_seconds': remaining,
            'remaining_percentage': percentage,
            'consumed_percentage': 100 - percentage
        }

    def burn_rate(self, downtime_in_period, period_duration_days):
        """Calculate current burn rate"""
        expected_budget_use = period_duration_days / self.time_window * 100
        actual_budget_use = (downtime_in_period / self.allowed_downtime) * 100
        return actual_budget_use / expected_budget_use

# Usage
budget = ErrorBudget(slo_target=99.9, time_window_days=30)
print(f"Allowed downtime: {budget.allowed_downtime / 60:.2f} minutes")

# Check remaining budget
actual_downtime = 20 * 60  # 20 minutes
status = budget.remaining_budget(actual_downtime)
print(f"Budget consumed: {status['consumed_percentage']:.1f}%")
print(f"Budget remaining: {status['remaining_percentage']:.1f}%")

# Calculate burn rate (10 min downtime in 5 days)
rate = budget.burn_rate(downtime_in_period=10*60, period_duration_days=5)
print(f"Burn rate: {rate:.1f}x")

Step 4: Dashboard and Monitoring

Grafana Dashboard Panels:

{
  "dashboard": {
    "title": "SLO Dashboard - User API",
    "panels": [
      {
        "title": "Current Availability SLI",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d])) * 100"
        }],
        "thresholds": [
          {"value": 99.9, "color": "green"},
          {"value": 99.5, "color": "yellow"},
          {"value": 99.0, "color": "red"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [{
          "expr": "((0.999 - (1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999)) * 100"
        }]
      },
      {
        "title": "Burn Rate (30d window)",
        "targets": [{
          "expr": "(1 - sum(rate(http_requests_total{status=~\"2..|3..\"}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)"
        }]
      }
    ]
  }
}

Real-World Examples

Example 1: E-commerce Platform

service: checkout-api
slis:
  - name: availability
    measurement: successful_checkouts / total_checkout_attempts

  - name: latency_p95
    measurement: 95th_percentile_checkout_time

  - name: payment_success
    measurement: successful_payments / total_payment_attempts

slos:
  - sli: availability
    target: 99.95%
    window: 30d
    rationale: "Losing checkout means losing revenue"

  - sli: latency_p95
    target: 500ms
    window: 7d
    rationale: "Fast checkout improves conversion"

  - sli: payment_success
    target: 99.5%
    window: 30d
    rationale: "Payment failures acceptable only for fraud/insufficient funds"

error_budget_policy:
  - remaining: ">50%"
    action: "Normal development pace"

  - remaining: "25-50%"
    action: "Review reliability risks, defer non-critical features"

  - remaining: "<25%"
    action: "Feature freeze, focus on stability"

  - remaining: "0%"
    action: "Emergency reliability work only, incident retrospectives"

Example 2: Data Pipeline

service: analytics-pipeline
slis:
  - name: freshness
    measurement: time_since_last_successful_run

  - name: completeness
    measurement: records_processed / records_expected

  - name: accuracy
    measurement: records_passing_validation / total_records

slos:
  - sli: freshness
    target: "95% of runs complete within 2 hours of schedule"
    window: 7d

  - sli: completeness
    target: "99.9% of expected records processed"
    window: 30d

  - sli: accuracy
    target: "99.5% of records pass validation"
    window: 30d
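
Unlike request-driven SLIs, these are computed from per-run metadata rather than a request stream. A hedged sketch of the three measurements above (the run-record structure is hypothetical):

from datetime import datetime, timedelta

# Hypothetical per-run metadata collected by the pipeline scheduler
runs = [
    {"scheduled": datetime(2024, 1, 1), "finished": datetime(2024, 1, 1, 1, 30),
     "processed": 998_500, "expected": 1_000_000, "valid": 995_200},
    {"scheduled": datetime(2024, 1, 2), "finished": datetime(2024, 1, 2, 2, 45),
     "processed": 1_000_000, "expected": 1_000_000, "valid": 999_100},
]

on_time = sum(r["finished"] - r["scheduled"] <= timedelta(hours=2) for r in runs)
freshness_sli = on_time / len(runs) * 100                      # % of runs on schedule
completeness_sli = (sum(r["processed"] for r in runs)
                    / sum(r["expected"] for r in runs) * 100)  # % of expected records
accuracy_sli = (sum(r["valid"] for r in runs)
                / sum(r["processed"] for r in runs) * 100)     # % passing validation
print(f"freshness={freshness_sli:.0f}% completeness={completeness_sli:.2f}% "
      f"accuracy={accuracy_sli:.2f}%")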

Common Pitfalls

Pitfall 1: Too Many SLOs

Problem: Tracking 20+ SLOs per service dilutes focus
Solution: Start with 3-5 critical user-facing SLOs

Pitfall 2: Unrealistic Targets

Problem: Setting a 99.99% SLO when the system achieves 99.5%
Solution: Base SLOs on current performance plus incremental improvement

Pitfall 3: Ignoring Error Budget

Problem: Treating SLO breaches as “nice to have” metrics
Solution: Enforce the error budget policy with stakeholder buy-in

Pitfall 4: SLOs Without SLIs

Problem: “Our service should be fast” (no measurement)
Solution: Define measurable SLIs first, then set SLO targets

Pitfall 5: Identical SLA and SLO

Problem: No buffer between internal target and customer commitment
Solution: The SLO should be stricter (e.g., 99.9% SLO, 99.5% SLA)

Migration Path

For Teams Starting Fresh

  1. Week 1-2: Instrument SLIs

    • Add metrics for availability, latency, errors
    • Validate data accuracy
  2. Week 3-4: Analyze baseline

    • Review 90 days of historical data
    • Identify current performance levels
  3. Week 5-6: Define initial SLOs

    • Set conservative targets (achievable 90% of time)
    • Document rationale
  4. Week 7-8: Implement monitoring

    • Create dashboards
    • Set up burn rate alerts
  5. Month 3+: Iterate and refine

    • Adjust targets based on learnings
    • Add error budget policy

For Teams With Existing Monitoring

  1. Map existing metrics to SLI framework
  2. Identify gaps (missing user-facing metrics)
  3. Define SLOs based on historical performance (see the sketch after this list)
  4. Implement error budget tracking
  5. Establish error budget policy
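
For step 3 (and the baseline analysis in the fresh-start path), one pragmatic approach is to set the first target slightly below what the system already achieves, so it is met roughly 90% of the time. A minimal sketch with hypothetical daily availability measurements:

# Hypothetical 90 days of measured daily availability (percent)
daily_availability = sorted([99.95, 99.99, 99.80, 99.97, 99.92] * 18)

# Take the 10th-percentile day as a conservative initial SLO:
# the service already met this level on ~90% of observed days.
initial_slo = daily_availability[len(daily_availability) // 10]
print(f"Suggested initial availability SLO: {initial_slo:.2f}%")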

Tools and Resources

Monitoring Tools

  • Prometheus + Grafana: Open-source, flexible
  • Datadog: Commercial, easy setup
  • New Relic: APM with SLO tracking
  • Google Cloud Monitoring: Native GCP integration

SLO Management Tools

  • Nobl9: Dedicated SLO platform
  • Sloth: SLO generator for Prometheus
  • Grafana SLO: Built-in Grafana SLO tracking

Useful Commands

# Check current availability (from Prometheus)
curl -g 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))'

# Calculate error budget consumption
echo "scale=2; (99.9 - 99.85) / (100 - 99.9) * 100" | bc
# Output: 50.00 (50% budget consumed)

Conclusion

SLOs, SLIs, and error budgets provide a framework for balancing reliability and innovation. Key takeaways:

  1. Start small: 3-5 critical SLOs per service
  2. Measure user experience: SLIs should reflect what users care about
  3. Use error budgets: Make reliability vs velocity tradeoffs objective
  4. Iterate: Refine SLOs based on learnings and changing requirements
  5. Enforce policy: Error budget must have teeth to be effective

Remember: The goal isn’t perfect reliability—it’s appropriate reliability that allows sustainable innovation.