SRE Practice
13 min
On-Call Runbook: Template and Best Practices
Introduction

An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.
Why Runbooks Matter

Without Runbooks
- Every incident requires figuring things out from scratch
- Tribal knowledge lost when team members leave
- New team members struggle with on-call
- Inconsistent incident response
- Higher MTTR and more user impact

With Runbooks
- Standardized, tested response procedures
- Knowledge preserved and shared
- Faster onboarding for new team members
- Consistent, reliable incident response
- Lower MTTR and reduced stress

Runbook Structure

Essential Sections
- Service Overview - What the service does
- Architecture - Key components and dependencies
- Common Alerts - What triggers pages and how to respond
- Troubleshooting Guide - Diagnostic steps and solutions
- Escalation Procedures - When and how to escalate
- Emergency Contacts - Who to reach for help
- Rollback Procedures - How to revert changes
- Useful Commands - Quick reference for common tasks

Complete Runbook Template

# On-Call Runbook: [Service Name]

**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]

---

## Service Overview

### Purpose
[What does this service do? Why does it matter?]

Example: "The Payment Service processes all customer payments including credit cards, PayPal, and gift cards. It handles ~500 transactions/minute and is critical for revenue generation. Any downtime directly impacts sales."

### Key Metrics
- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]

### Business Impact
- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]

---

## Architecture

### System Components

┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   API GW    │────▶│ Payment Service │────▶│  Database   │
└─────────────┘     └─────────────────┘     └─────────────┘
                            │
                            ├────▶ Stripe API
                            ├────▶ PayPal API
                            └────▶ Fraud Detection
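To make the Useful Commands section concrete, here is a minimal sketch of what it might contain for the example Payment Service, assuming a Kubernetes deployment named payment-service in a payments namespace behind an API gateway at api.example.com (all names and URLs are illustrative, not part of the template itself):

```bash
# Quick status: pods and rollout state (deployment/namespace names are illustrative)
kubectl -n payments get pods -l app=payment-service
kubectl -n payments rollout status deployment/payment-service

# Tail recent logs across all pods of the service
kubectl -n payments logs -l app=payment-service --tail=100

# Probe the public health endpoint through the API gateway (URL is illustrative)
curl -s -o /dev/null -w "%{http_code}\n" https://api.example.com/payments/healthz

# Roll back to the previous deployment revision if a release caused the incident
kubectl -n payments rollout undo deployment/payment-service
```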
…
October 15, 2025 · 13 min · DevOps Engineer
SRE Practice
12 min
Blameless Postmortem: Process and Template
Introduction

A blameless postmortem is a structured review process conducted after an incident to understand what happened, why it happened, and how to prevent similar incidents in the future, without placing blame on individuals.

Why Blameless?

The Problem with Blame

Traditional approach:
- Engineers fear being blamed
- Information is hidden or sanitized
- Root causes remain undiscovered
- Same incidents repeat

Blameless approach:
- Psychological safety encourages honesty
- Full context emerges
- Systemic issues are identified
- Organization learns and improves

Core Principle

"People don't cause incidents; systems do."
…
October 15, 2025 · 12 min · DevOps Engineer
Incident
7 min
Incident: Database Connection Pool Exhaustion
Incident Summary

Date: 2025-10-14
Time: 03:15 UTC
Duration: 23 minutes
Severity: SEV-1 (Critical)
Impact: Complete API unavailability affecting 100% of users

Quick Facts

Users Affected: ~2,000 active users
Services Affected: API, Admin Dashboard, Mobile App
Revenue Impact: ~$4,500 in lost transactions
SLO Impact: Consumed 45% of monthly error budget

Timeline

03:15:00 - PagerDuty alert fired: API health check failures
03:15:30 - On-call engineer (Alice) acknowledged alert
03:16:00 - Initial investigation: All API pods showing healthy status
03:17:00 - Checked application logs: "connection timeout" errors appearing
03:18:00 - Senior engineer (Bob) joined incident response
03:19:00 - Identified pattern: All database connection attempts timing out
03:20:00 - Checked database status: PostgreSQL running normally
03:22:00 - Checked connection pool metrics: 100/100 connections in use
03:23:00 - Root cause identified: Background job leaking connections
03:25:00 - Decision made to restart API pods to release connections
03:27:00 - Rolling restart initiated for API deployment
03:30:00 - First pods restarted, connection pool draining
03:33:00 - 50% of pods restarted, API partially operational
03:35:00 - All pods restarted, connection pool normalized
03:36:00 - Smoke tests passed, API fully operational
03:38:00 - Incident marked as resolved
03:45:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization was deployed on October 12th (2 days prior to the incident).
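The 03:22 observation (100/100 pool connections in use while PostgreSQL itself was healthy) is the classic signature of a connection leak, and it can be confirmed from the database side. A hedged sketch of the checks, assuming direct psql access; the host name and the example pid are illustrative:

```bash
# Connections grouped by state: a pile of "idle" or "idle in transaction"
# sessions usually points at connections never returned to the pool
psql -h db.internal -U postgres -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"

# Oldest backends first: leaked connections tend to be long-lived
psql -h db.internal -U postgres -c \
  "SELECT pid, application_name, state, now() - backend_start AS age
     FROM pg_stat_activity ORDER BY age DESC LIMIT 20;"

# Last resort while a fix ships: terminate a specific leaked backend by pid
psql -h db.internal -U postgres -c "SELECT pg_terminate_backend(12345);"
```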
…
October 14, 2025 · 7 min · DevOps Engineer
Incident
8 min
Incident: Kubernetes OOMKilled - Memory Leak in Production
Incident Summary

Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates

Quick Facts

Users Affected: ~30% of users experiencing slow responses
Services Affected: User API service
Error Rate: Spiked from 0.5% to 8%
SLO Impact: 25% of monthly error budget consumed

Timeline

14:30 - Prometheus alert: High pod restart rate detected
14:31 - On-call engineer (Dave) acknowledged, investigating
14:33 - Observed pattern: Pods restarting every 15-20 minutes
14:35 - Checked pod status: OOMKilled (exit code 137)
14:37 - Senior SRE (Emma) joined investigation
14:40 - Checked resource limits: 512MB memory limit per pod
14:42 - Reviewed recent deployments: New caching feature deployed yesterday
14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
14:50 - Hypothesis: Memory leak in new caching code
14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
14:55 - Memory limit increased, pods restarted with new limits
15:00 - Pod restart frequency decreased (now every ~30 minutes)
15:05 - Confirmed leak still present, just slower with more memory
15:10 - Development team engaged to investigate caching code
15:25 - Memory leak identified: Event listeners not being removed
15:35 - Fix developed and tested locally
15:45 - Hotfix deployed to production
16:00 - Memory usage stabilized at ~180MB
16:15 - Monitoring shows no growth, pods stable
16:30 - Error rate returned to baseline
16:45 - Incident marked as resolved

Root Cause Analysis

What Happened

On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.
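The 14:35 OOMKilled finding and the temporary mitigation at 14:52 both map to standard kubectl workflows. A minimal sketch, assuming a deployment named user-api in a production namespace (names are illustrative):

```bash
# OOMKilled pods show exit code 137 and "Reason: OOMKilled" in the last container state
kubectl -n production get pods -l app=user-api
kubectl -n production describe pod <pod-name> | grep -A 5 "Last State"

# Watch live memory usage against the limit (requires metrics-server)
kubectl -n production top pods -l app=user-api

# Temporary mitigation from the timeline: raise the memory limit to 1Gi
kubectl -n production set resources deployment/user-api --limits=memory=1Gi
```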
…
September 28, 2025 · 8 min · DevOps Engineer
Incident
11 min
Incident: Redis Cache Failure Causes Cascading Database Load
Incident Summary

Date: 2025-09-05
Time: 09:45 UTC
Duration: 1 hour 32 minutes
Severity: SEV-1 (Critical)
Impact: Severe performance degradation affecting 85% of users

Quick Facts

Users Affected: ~8,500 active users (85%)
Services Affected: Web Application, Mobile API, Admin Dashboard
Response Time: P95 latency increased from 200ms to 45 seconds
Revenue Impact: ~$18,000 in lost sales and abandoned carts
SLO Impact: 70% of monthly error budget consumed

Timeline

09:45:00 - Redis cluster health check alert: Node down
09:45:15 - Application latency spiked dramatically
09:45:30 - PagerDuty alert: P95 latency > 10 seconds
09:46:00 - On-call engineer (Sarah) acknowledged alert
09:47:00 - Database CPU spiked to 95% utilization
09:48:00 - Database connection pool approaching limits (180/200)
09:49:00 - User complaints started flooding support channels
09:50:00 - Senior SRE (Marcus) joined incident response
09:52:00 - Checked Redis status: Master node unresponsive
09:54:00 - Identified: Redis master failure, failover not working
09:56:00 - Incident escalated to SEV-1, incident commander assigned
09:58:00 - Attempted automatic failover: Failed
10:00:00 - Decision: Manual promotion of Redis replica to master
10:03:00 - Promoted replica-1 to master manually
10:05:00 - Updated application config to point to new master
10:08:00 - Rolling restart of application pods initiated
10:15:00 - 50% of pods restarted with new Redis endpoint
10:18:00 - Cache warming started for critical keys
10:22:00 - Database load starting to decrease (CPU: 65%)
10:25:00 - P95 latency improved to 3 seconds
10:30:00 - All pods restarted, cache rebuild in progress
10:40:00 - P95 latency down to 800ms
10:50:00 - Cache fully populated, metrics returning to normal
11:05:00 - P95 latency at 220ms (near baseline)
11:17:00 - Incident marked as resolved
11:30:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The production Redis cluster consisted of 1 master and 2 replicas running Redis Sentinel for high availability. On September 5th at 09:45 UTC, the Redis master node experienced a kernel panic due to an underlying infrastructure issue.
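The failed automatic failover and the manual promotion at 10:03 correspond to a handful of Redis and Sentinel commands. A hedged sketch, assuming Sentinel listens on its default port 26379 and the master group is named mymaster (host names and the group name are illustrative; replica-1 is taken from the timeline):

```bash
# What does Sentinel currently consider the master?
redis-cli -h sentinel-1 -p 26379 SENTINEL get-master-addr-by-name mymaster

# Preferred path: ask Sentinel to perform the failover itself
redis-cli -h sentinel-1 -p 26379 SENTINEL failover mymaster

# If Sentinel cannot fail over (as in this incident), promote a replica by hand
redis-cli -h replica-1 REPLICAOF NO ONE

# Verify the promoted node now reports role:master
redis-cli -h replica-1 INFO replication | grep role:
```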
…
September 5, 2025 · 11 min · DevOps Engineer
Incident
2 min
Incident: Missing DAGs in Apache Airflow
Incident Description

Time: 2025-08-17 02:00 UTC
Duration: 45 minutes
Impact: Critical - all scheduled tasks stopped
Symptoms

- DAGs disappeared from Airflow UI
- Scheduler logs showing import errors
- Tasks not running on schedule

Timeline

02:00 - Issue Detection

# Monitoring showed no tasks
airflow dags list | wc -l
# Result: 0 (should be ~50)

02:05 - Initial Diagnosis

# Check scheduler status
systemctl status airflow-scheduler

# Check logs
tail -f /var/log/airflow/scheduler.log

02:10 - Root Cause Found

# Error found in logs:
ImportError: No module named 'pandas'
# DAG file imports pandas, but library is missing

Root Cause Analysis

Cause

A virtual environment update removed the pandas dependency used in one of the DAG files. Airflow stops loading ALL DAGs when any single DAG file has import errors.
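Two checks would have shortened (or prevented) this incident: asking Airflow for its import errors directly, and verifying the dependency inside the scheduler's environment. A sketch, assuming Airflow 2.x and a virtualenv at /opt/airflow/venv (the path is illustrative):

```bash
# Show which DAG files failed to import and why
airflow dags list-import-errors

# Confirm whether pandas is actually present in the scheduler's environment
/opt/airflow/venv/bin/python -c "import pandas; print(pandas.__version__)"

# Reinstall the missing dependency and restart the scheduler
/opt/airflow/venv/bin/pip install pandas
systemctl restart airflow-scheduler
```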
…
August 17, 2025 · 2 min · DevOps Engineer
Incident
9 min
Incident: SSL Certificate Expiry Causes Complete Outage
Incident Summary

Date: 2025-08-15
Time: 08:00 UTC
Duration: 1 hour 45 minutes
Severity: SEV-1 (Critical)
Impact: Complete service unavailability for all users

Quick Facts

Users Affected: 100% - all external traffic
Services Affected: All public-facing services
Revenue Impact: ~$12,000 in lost sales
SLO Impact: 80% of monthly error budget consumed in single incident

Timeline

08:00:00 - SSL certificate expired (not detected)
08:00:30 - User reports started coming in: "Your connection is not private"
08:02:00 - PagerDuty alert: Health check failures from external monitoring
08:02:30 - On-call engineer (Sarah) acknowledged alert
08:03:00 - Opened website, saw SSL certificate error
08:03:30 - Checked certificate expiry: Expired at 08:00 UTC
08:04:00 - Root cause identified: SSL certificate expired
08:04:30 - Incident escalated to SEV-1, incident commander assigned
08:05:00 - Senior SRE (Mike) joined as incident commander
08:06:00 - Attempted automatic renewal with certbot: Failed - rate limit exceeded
08:08:00 - Checked Let's Encrypt rate limits: Hit weekly renewal limit
08:10:00 - Decision: Use backup certificate from 6 months ago (still valid)
08:12:00 - Located backup certificate in secure storage
08:15:00 - Deployed backup certificate to load balancer
08:18:00 - Certificate updated, but services still showing errors
08:20:00 - Discovered cached certificate in CDN (Cloudflare)
08:22:00 - Purged Cloudflare cache
08:25:00 - Still seeing errors from some users
08:27:00 - Realized nginx not reloaded after certificate update
08:30:00 - Reloaded nginx on all load balancers
08:33:00 - Service partially restored, some users still affected
08:35:00 - Identified browser certificate caching
08:38:00 - Communicated workaround to users (clear browser cache)
08:45:00 - Traffic gradually recovering
09:00:00 - 90% of users able to access site
09:30:00 - 98% recovery, remaining issues browser caching
09:45:00 - Incident marked as resolved

Root Cause Analysis

What Happened

Primary cause: SSL certificate for *.example.com expired at 08:00 UTC on August 15th, 2025.
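Two lessons from the timeline are easy to automate: checking the certificate that is actually being served, and remembering that nginx only picks up a new certificate after a reload. A sketch using openssl and nginx (the domain follows the *.example.com name from the incident):

```bash
# Inspect the expiry date of the certificate currently served for the domain
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -enddate

# Alert-friendly variant: exits non-zero if the certificate expires within 30 days
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -checkend $((30*24*3600)) \
  || echo "certificate expires within 30 days"

# After swapping certificates, validate the config and reload nginx
nginx -t && systemctl reload nginx
```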
…
August 15, 2025 · 9 min · DevOps Engineer
Incident
12 min
Incident: Disk Space Exhaustion Causes Node Failures
Incident Summary

Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures

Quick Facts

Users Affected: ~40% experiencing intermittent errors
Services Affected: Multiple microservices across 3 Kubernetes nodes
Nodes Failed: 3 out of 8 worker nodes
Pods Evicted: 47 pods due to disk pressure
SLO Impact: 35% of monthly error budget consumed

Timeline

11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
11:22:00 - On-call engineer (Tom) acknowledged alert
11:25:00 - Checked node: 92% disk usage, mostly logs
11:28:00 - Second alert: node-worker-5 also >85%
11:30:00 - Third alert: node-worker-7 >85%
11:32:00 - Senior SRE (Rachel) joined investigation
11:35:00 - Pattern identified: All nodes running logging-agent pod
11:38:00 - First node reached 98% disk usage
11:40:00 - Kubelet started evicting pods due to disk pressure
11:42:00 - 12 pods evicted from node-worker-3
11:45:00 - User reports: Intermittent 503 errors
11:47:00 - Incident escalated to SEV-2
11:50:00 - Identified root cause: Log rotation not working for logging-agent
11:52:00 - Emergency: Manual log cleanup on affected nodes
11:58:00 - First node cleaned: 92% → 45% disk usage
12:05:00 - Second node cleaned: 88% → 40% disk usage
12:10:00 - Third node cleaned: 95% → 42% disk usage
12:15:00 - All evicted pods rescheduled and running
12:30:00 - Deployed fix for log rotation issue
12:45:00 - Monitoring shows disk usage stabilizing
13:00:00 - Implemented automated log cleanup job
13:30:00 - Added improved monitoring and alerts
14:15:00 - Verified all nodes healthy, services normal
15:05:00 - Incident marked as resolved

Root Cause Analysis

What Happened

A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.
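The manual cleanup at 11:52 and the disk-pressure evictions can both be investigated with standard tooling. A hedged sketch, assuming SSH access to the worker nodes and that the agent's logs live under /var/log/fluentd (the path and the logrotate config name are illustrative):

```bash
# Confirm the DiskPressure condition the kubelet is reacting to
kubectl describe node node-worker-3 | grep -A 10 "Conditions:"

# On the node: see what is consuming the log volume
df -h /var/log
du -xh --max-depth=2 /var/log | sort -rh | head -20

# Reclaim space from files the agent still holds open (truncate, don't rm)
find /var/log/fluentd -name "*.log" -mtime +1 -exec truncate -s 0 {} \;

# Dry-run logrotate with debug output to spot the broken rotation config
logrotate -d /etc/logrotate.d/fluentd
```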
…
July 22, 2025 · 12 min · DevOps Engineer