SRE Practices

A collection of SRE practices, methodologies, and approaches for ensuring system reliability.

Comprehensive guides on Site Reliability Engineering practices including SLOs, incident management, on-call procedures, and toil reduction strategies.

Browse all SRE practices below or explore by topic using the categories and tags.

📊 SRE Practice 18 min

Disaster Recovery Planning: RTO, RPO, and Building Resilient Systems

Introduction

Disaster Recovery (DR) is the set of processes, policies, and procedures for recovering and continuing technology infrastructure after a disaster. A disaster can be natural (earthquake, flood), technical (data center failure, ransomware), or human-caused (accidental deletion, security breach).

Core Principle: “Hope is not a strategy. Plan for failure before it happens.”

Key Concepts

RTO vs RPO

[Timeline diagram: Disaster Occurs → Detection Time → Recovery Begins → Normal Operations, with one span labeled Recovery Time Objective (RTO) and a shorter span labeled Data Loss (Recovery Point Objective, RPO)]

Recovery Time Objective (RTO)

Definition: Maximum acceptable time that a system can be down after a disaster. …
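As a rough illustration of how the two objectives can be checked after a DR drill, here is a minimal Python sketch. It uses the RTO definition quoted above and, because the excerpt is cut off before the RPO definition, assumes the common reading of RPO as the maximum tolerable window of lost data, measured from the last recoverable copy to the disaster. The names `RecoveryTargets` and `evaluate_recovery`, and all timestamps and targets, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # assumed: maximum acceptable window of lost data


def evaluate_recovery(targets, last_backup, disaster, restored):
    """Compare one (real or simulated) recovery against its targets."""
    downtime = restored - disaster
    data_loss_window = disaster - last_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Hypothetical targets and timestamps for a DR drill.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = evaluate_recovery(
    targets,
    last_backup=datetime(2025, 1, 1, 2, 30),
    disaster=datetime(2025, 1, 1, 3, 15),
    restored=datetime(2025, 1, 1, 6, 0),
)
print(result["rto_met"], result["rpo_met"])
```

Measuring downtime from the moment of the disaster, rather than from detection, is a choice made for this sketch; a team may prefer to start the clock elsewhere.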

October 16, 2025 · 18 min · DevOps Engineer

📊 SRE Practice 13 min

Observability: The Three Pillars of Metrics, Logs, and Traces

Introduction

Observability is the ability to understand the internal state of a system based on its external outputs. Unlike traditional monitoring, which tells you what is broken, observability helps you understand why it’s broken, even for issues you’ve never encountered before.

Core Principle: “You can’t fix what you can’t see. You can’t see what you don’t measure.”

The Three Pillars

Overview

┌─────────────────────────────────────────┐
│              OBSERVABILITY              │
├─────────────┬──────────────┬────────────┤
│   METRICS   │     LOGS     │   TRACES   │
├─────────────┼──────────────┼────────────┤
│ What/When   │ Why/Details  │ Where      │
│ Aggregated  │ Individual   │ Causal     │
│ Time-series │ Events       │ Flows      │
│ Dashboards  │ Search       │ Waterfall  │
└─────────────┴──────────────┴────────────┘

When to Use Each: …

October 16, 2025 · 13 min · DevOps Engineer

observability metrics logging tracing prometheus

📊 SRE Practice 23 min

Chaos Engineering: Building Resilient Systems Through Controlled Experiments

Introduction

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Rather than waiting for failures to happen, chaos engineering proactively injects failures to identify weaknesses before they impact users.

Why does this matter?

In modern distributed systems (microservices, cloud infrastructure, containers), failures are inevitable. A network can partition, a server can crash, a database can slow down. Traditional testing can’t predict all the ways these components interact when things go wrong. Chaos engineering fills this gap by deliberately causing failures in a controlled way. …
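To make "deliberately causing failures in a controlled way" concrete, here is a small Python sketch of in-process fault injection. It is an illustration only, not the tooling the article goes on to describe: `with_fault_injection`, `InjectedFailure`, `fetch_price`, and the 30% failure rate are all hypothetical, and real experiments usually inject faults at the network or infrastructure layer.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a downstream service.
    return {"item": item, "price": 10}


def fetch_price_with_fallback(fetch, item):
    # The behaviour under test: degrade gracefully instead of crashing.
    try:
        return fetch(item)
    except InjectedFailure:
        return {"item": item, "price": None}


rng = random.Random(42)  # fixed seed keeps the experiment repeatable
flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=rng)
results = [fetch_price_with_fallback(flaky_fetch, "widget") for _ in range(1000)]
degraded = sum(1 for r in results if r["price"] is None)
print(f"{degraded} of {len(results)} calls served the fallback")
```

The hypothesis being tested is that every call still returns a well-formed response while the dependency is failing; the seeded generator and the explicit failure rate are what keep the experiment controlled.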

October 16, 2025 · 23 min · DevOps Engineer

chaos-engineering resilience testing observability incident-response

📊 SRE Practice 15 min

Toil Reduction: Strategies and Automation Priorities

Introduction

Toil is manual, repetitive, automatable work that scales linearly with service growth. It’s the operational burden that keeps engineers from doing valuable engineering work. Reducing toil is essential for scaling both systems and teams effectively.

What is Toil?

Google’s SRE Definition

Toil has the following characteristics:

Manual - Requires human action
Repetitive - Done over and over
Automatable - Could be automated
Tactical - Reactive, interrupt-driven
No enduring value - Doesn’t improve the system
Scales linearly - Grows with service growth

Toil vs Engineering Work

Toil (eliminate this): …

October 15, 2025 · 15 min · DevOps Engineer

toil automation efficiency devops

📊 SRE Practice 13 min

On-Call Runbook: Template and Best Practices

Introduction

An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.

Why Runbooks Matter

Without Runbooks

Every incident requires figuring things out from scratch
Tribal knowledge lost when team members leave
New team members struggle with on-call
Inconsistent incident response
Higher MTTR and more user impact

With Runbooks

Standardized, tested response procedures
Knowledge preserved and shared
Faster onboarding for new team members
Consistent, reliable incident response
Lower MTTR and reduced stress

Runbook Structure

Essential Sections

Service Overview - What the service does
Architecture - Key components and dependencies
Common Alerts - What triggers pages and how to respond
Troubleshooting Guide - Diagnostic steps and solutions
Escalation Procedures - When and how to escalate
Emergency Contacts - Who to reach for help
Rollback Procedures - How to revert changes
Useful Commands - Quick reference for common tasks

Complete Runbook Template

# On-Call Runbook: [Service Name]

**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]

---

## Service Overview

### Purpose

[What does this service do? Why does it matter?]

Example: "The Payment Service processes all customer payments including credit cards, PayPal, and gift cards. It handles ~500 transactions/minute and is critical for revenue generation. Any downtime directly impacts sales."

### Key Metrics

- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]

### Business Impact

- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]

---

## Architecture

### System Components

┌─────────────┐      ┌─────────────────┐      ┌─────────────┐
│   API GW    │─────▶│ Payment Service │─────▶│  Database   │
└─────────────┘      └─────────────────┘      └─────────────┘
                              │
                              ├─────▶ Stripe API
                              ├─────▶ PayPal API
                              └─────▶ Fraud Detection

…

October 15, 2025 · 13 min · DevOps Engineer

oncall runbook incident-response operations

📊 SRE Practice 12 min

Blameless Postmortem: Process and Template

Introduction

A blameless postmortem is a structured review process conducted after an incident to understand what happened, why it happened, and how to prevent similar incidents in the future, without placing blame on individuals.

Why Blameless?

The Problem with Blame

Traditional approach:

Engineers fear being blamed
Information is hidden or sanitized
Root causes remain undiscovered
Same incidents repeat

Blameless approach:

Psychological safety encourages honesty
Full context emerges
Systemic issues are identified
Organization learns and improves

Core Principle

“People don’t cause incidents; systems do.” …

October 15, 2025 · 12 min · DevOps Engineer

postmortem incident-response blameless retrospective

📊 SRE Practice 7 min

Understanding SLOs, SLIs, and SLAs: A Practical Guide

Introduction

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are fundamental concepts in Site Reliability Engineering. Understanding and implementing them correctly is crucial for maintaining reliable services.

Core Concepts

SLI (Service Level Indicator)

Definition: A quantitative measure of service reliability from the user’s perspective.

Common SLIs:

Availability: Percentage of successful requests
Latency: Proportion of requests served faster than threshold
Throughput: Requests processed per second
Error Rate: Percentage of failed requests

Example SLI Definitions: …
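Separately from the article's own (elided) definitions, the availability and latency SLIs listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a request counts as successful when its HTTP status is below 500, the 300 ms threshold is arbitrary, and the `Request` type and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Percentage of requests that did not fail with a server error (assumed: status < 500)."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests, threshold_ms):
    """Percentage of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


# Hypothetical window: 1000 requests, 3 server errors, 20 slow responses.
sample = (
    [Request(200, 120.0)] * 977
    + [Request(200, 900.0)] * 20
    + [Request(503, 50.0)] * 3
)
print(f"availability: {availability_sli(sample):.1f}%")
print(f"latency < 300 ms: {latency_sli(sample, 300):.1f}%")
```

Both functions return a ratio of good events to total events, which keeps every SLI on the same 0 to 100 scale and makes it straightforward to compare against a percentage target.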

October 15, 2025 · 7 min · DevOps Engineer

slo sli sla error-budget observability

📊 SRE Practice 5 min

SSL Certificate Renewal Runbook

Quick Reference:

Duration: 5-15 minutes
Impact: Minimal (usually zero downtime)
Requires: Sudo access to load balancer or cert-manager admin
Severity: HIGH (expired certs = full outage)

Prerequisites

What You Need

SSH access to certificate servers or kubectl access to cluster
Let’s Encrypt API credentials (if manual renewal)
Backup certificate location documented
30+ days lead time before expiry (not 5 minutes!)

Check Current Status

    # Verify the certificate chain the server presents
    # (</dev/null closes stdin so s_client exits instead of waiting for input)
    openssl s_client -connect example.com:443 -showcerts </dev/null 2>/dev/null | grep -A5 "Verify return code"

    # Or via Kubernetes
    kubectl get certificate -n ingress-nginx
    kubectl describe certificate prod-cert -n ingress-nginx

    # Check expiry date specifically
    openssl x509 -in /etc/ssl/certs/server.crt -noout -enddate

Automatic Renewal (Preferred)

Using cert-manager (Kubernetes)

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: prod-cert
      namespace: ingress-nginx
    spec:
      secretName: prod-tls-secret
      issuerRef:
        name: letsencrypt-prod
        kind: ClusterIssuer
      dnsNames:
        - example.com
        - "*.example.com"

Cert-manager automatically: …
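The "30+ days lead time" prerequisite can be turned into an automated check. The Python sketch below is one possible approach, not part of the runbook itself: it parses the `notAfter=...` line printed by `openssl x509 -noout -enddate` using the standard library's `ssl.cert_time_to_seconds`. The function names and the sample date are hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(enddate_line, now):
    """Whole days left before the date in an `openssl x509 -enddate` output line."""
    # The line looks like "notAfter=Jan  5 09:34:43 2026 GMT".
    cert_time = enddate_line.strip().split("=", 1)[1]
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert_time), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(enddate_line, now, lead_days=30):
    """True when fewer than lead_days remain before expiry."""
    return days_until_expiry(enddate_line, now) < lead_days


# Hypothetical output captured from the openssl command.
line = "notAfter=Jan  5 09:34:43 2026 GMT"
now = datetime(2025, 12, 20, 0, 0, 0, tzinfo=timezone.utc)
print(days_until_expiry(line, now), needs_renewal(line, now))
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test; a scheduled job would pass `datetime.now(timezone.utc)`.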

August 15, 2025 · 5 min · DevOps Engineer

ssl certificate renewal runbook tls