## Introduction
An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.
## Why Runbooks Matter

### Without Runbooks
- Every incident requires figuring things out from scratch
- Tribal knowledge lost when team members leave
- New team members struggle with on-call
- Inconsistent incident response
- Higher MTTR and more user impact
### With Runbooks
- Standardized, tested response procedures
- Knowledge preserved and shared
- Faster onboarding for new team members
- Consistent, reliable incident response
- Lower MTTR and reduced stress
## Runbook Structure

### Essential Sections
- **Service Overview** - What the service does
- **Architecture** - Key components and dependencies
- **Common Alerts** - What triggers pages and how to respond
- **Troubleshooting Guide** - Diagnostic steps and solutions
- **Escalation Procedures** - When and how to escalate
- **Emergency Contacts** - Who to reach for help
- **Rollback Procedures** - How to revert changes
- **Useful Commands** - Quick reference for common tasks
## Complete Runbook Template
# On-Call Runbook: [Service Name]
**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]
---
## Service Overview
### Purpose
[What does this service do? Why does it matter?]
Example:
"The Payment Service processes all customer payments including credit cards,
PayPal, and gift cards. It handles ~500 transactions/minute and is critical
for revenue generation. Any downtime directly impacts sales."
### Key Metrics
- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]
### Business Impact
- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]
---
## Architecture
### System Components
```
┌─────────┐     ┌─────────────────┐     ┌──────────┐
│  API GW │────▶│ Payment Service │────▶│ Database │
└─────────┘     └─────────────────┘     └──────────┘
                        │
                        ├──▶ Stripe API
                        ├──▶ PayPal API
                        └──▶ Fraud Detection
```
### Dependencies
| Service | Purpose | Impact if Down | Contact |
|---------|---------|----------------|---------|
| PostgreSQL | Transaction storage | Cannot process payments | DBA team |
| Redis | Session cache | Degraded performance | Platform team |
| Stripe | Credit card processing | Card payments fail | External (status.stripe.com) |
| Fraud Service | Transaction screening | Must disable checks | Security team |
### Infrastructure
- **Platform:** Kubernetes (AWS EKS)
- **Namespace:** `production/payments`
- **Replicas:** 10 pods
- **Resources:** 2 CPU, 4GB RAM per pod
- **Database:** PostgreSQL 14, multi-AZ RDS
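
A quick way to confirm the cluster still matches the values documented above is a sketch like the following, using the same deployment and namespace names as the commands elsewhere in this runbook:

```bash
# Compare the live deployment against the documented replica count and resources
kubectl get deployment payment-service -n production \
  -o jsonpath='{.spec.replicas}{"\n"}'
kubectl get deployment payment-service -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
```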
---
## Dashboards and Monitoring
### Primary Dashboard
[Link to Grafana/Datadog dashboard]
**Key Panels:**
- Request rate (should be ~500/min during business hours)
- Error rate (should be <0.5%)
- Latency percentiles (P95 should be <300ms)
- Database connection pool usage (should be <80%)
### Logs
- **Kibana:** [Link to logs]
- **Query:** `service:payment-service AND level:error`
### Metrics
- **Prometheus:** [Link to Prometheus]
- **Key Metrics:**
- `payment_requests_total`
- `payment_errors_total`
- `payment_duration_seconds`
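
To turn these counters into the error-rate number the dashboard shows, a query along the following lines can be run directly against Prometheus. This is a sketch: it assumes `payment_requests_total` and `payment_errors_total` are plain counters and that Prometheus is reachable at `http://prometheus:9090` (as in the Monitoring commands later in this runbook); adjust the host and any label matchers for your setup.

```bash
# Error ratio over the last 5 minutes (errors / requests)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
```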
---
## Common Alerts
### 🔴 CRITICAL: High Error Rate
**Alert:** `PaymentServiceHighErrorRate`
**Description:**
Error rate exceeds 5% for 5 minutes.
**Impact:**
- Payments failing for customers
- Revenue loss
- Customer support tickets increasing
**Diagnostic Steps:**
1. **Check the dashboard** to confirm the error rate spike
   ```bash
   # Open: [Dashboard Link]
   # Look at: Error rate panel (should show spike)
   ```
2. **Check recent deployments**
   ```bash
   kubectl rollout history deployment/payment-service -n production
   ```
3. **Review error logs**
   ```bash
   kubectl logs -n production deployment/payment-service --tail=100 | grep ERROR
   ```
4. **Check external dependencies**
   - Stripe: https://status.stripe.com
   - PayPal: https://status.paypal.com
   - Database: Check RDS metrics in the AWS console
**Common Causes & Solutions:**

| Cause | Symptoms | Solution |
|-------|----------|----------|
| Bad deployment | Errors started after deploy | Rollback (see section below) |
| External API down | Specific payment method failing | Enable fallback mode |
| Database issues | Connection timeouts | Check connection pool, restart if needed |
| Traffic spike | All metrics elevated | Scale up pods |
**Resolution Steps:**

If recent deployment:

```bash
# Rollback to previous version
kubectl rollout undo deployment/payment-service -n production
kubectl rollout status deployment/payment-service -n production
```

If external API down:

```bash
# Enable fallback mode (skip that payment method)
kubectl set env deployment/payment-service -n production \
  STRIPE_ENABLED=false
```

If database connection issues:

```bash
# Check connection pool
kubectl exec -n production deployment/payment-service -- \
  curl localhost:8080/metrics | grep db_connection

# Restart deployment if pool exhausted
kubectl rollout restart deployment/payment-service -n production
```
**Escalation:**
- If not resolved in 15 minutes, escalate to senior engineer
- If payment partner issue, contact their support
### 🟡 WARNING: High Latency

**Alert:** `PaymentServiceHighLatency`

**Description:**
P95 latency exceeds 500ms for 10 minutes.

**Impact:**
- Slow checkout experience
- Potential timeout errors
- Customer frustration
**Diagnostic Steps:**

1. **Check the latency dashboard**
   - Which percentiles are affected? (P50, P95, P99)
   - Is it all requests or specific endpoints?
2. **Check database performance**
   ```bash
   # Slow query log
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/admin/slow-queries
   ```
3. **Check external API latency**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "external_api_duration" | tail -50
   ```
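
If the dashboard is unavailable, the P95 can also be pulled straight from Prometheus. This is a sketch that assumes `payment_duration_seconds` is exported as a Prometheus histogram (i.e., a `_bucket` series exists); adjust the metric name and labels if yours differ.

```bash
# Current P95 latency over the last 5 minutes
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(payment_duration_seconds_bucket[5m])) by (le))'
```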
**Common Causes & Solutions:**

| Cause | Solution |
|-------|----------|
| Database slow queries | Identify and optimize queries, add indexes |
| External API slow | Contact partner, enable caching |
| High CPU usage | Scale up pods |
| Memory pressure | Check for memory leaks, restart if needed |
**Resolution Steps:**

Scale up if resource constrained:

```bash
kubectl scale deployment/payment-service -n production --replicas=15
```
**Escalation:**
- If latency affects the SLO, escalate immediately
- If database related, engage the DBA team
### 🟠 WARNING: Pod Crashes

**Alert:** `PaymentServicePodCrashing`

**Description:**
Pods restarting more than 3 times in 10 minutes.

**Diagnostic Steps:**

1. **Check pod status**
   ```bash
   kubectl get pods -n production -l app=payment-service
   ```
2. **Check crash logs**
   ```bash
   # Get crashed pod name
   POD=$(kubectl get pods -n production -l app=payment-service \
     --field-selector=status.phase=Running \
     -o jsonpath='{.items[0].metadata.name}')

   # Check previous logs
   kubectl logs -n production $POD --previous
   ```
3. **Check resource limits**
   ```bash
   kubectl top pods -n production -l app=payment-service
   ```
**Common Causes:**
- **OOMKilled:** Memory limit too low, increase memory
- **CrashLoopBackOff:** Application bug, check logs for stack trace
- **Failed health checks:** Health endpoint timing out

**Resolution Steps:**

If OOMKilled:

```bash
# Increase memory limit
kubectl set resources deployment/payment-service -n production \
  --limits=memory=6Gi --requests=memory=4Gi
```

If application crash:

```bash
# Rollback recent deployment
kubectl rollout undo deployment/payment-service -n production
```
---

## Troubleshooting Guide
### Issue: Individual Payment Failing

**Symptoms:**
- One transaction failing
- Others succeeding
- No alerts firing

**Steps:**

1. **Find the transaction in logs**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "transaction_id:ABC123"
   ```
2. **Check the payment method**
   - Is Stripe/PayPal returning errors?
   - Card declined vs. system error?
3. **Validate with the user**
   - Confirm the payment method is valid
   - Try a different payment method

**Note:** Single failures are often user error (expired card, insufficient funds). If a pattern emerges (multiple similar failures), investigate deeper; one quick check is sketched below.
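
One rough way to check for a pattern is to count recent errors per payment provider. This is a sketch only: it assumes error log lines mention the provider name, so adjust the grep patterns (and provider list) to your actual log format.

```bash
# Count error lines mentioning each provider over the last hour
for provider in stripe paypal giftcard; do
  count=$(kubectl logs -n production deployment/payment-service --since=1h \
    | grep ERROR | grep -ci "$provider")
  echo "$provider errors in last hour: $count"
done
```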
### Issue: Database Connection Errors

**Symptoms:**
```
ERROR: could not connect to database
FATAL: remaining connection slots are reserved
```

**Steps:**

1. **Check connection pool metrics**
   ```bash
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/metrics | grep -A 5 db_pool
   ```
2. **Check active connections in the database**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE datname = 'payments';
   ```
3. **Identify connection leaks**
   ```bash
   # Check for long-running queries
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/admin/active-connections
   ```
**Resolution:**

Immediate:

```bash
# Restart pods to clear connection pool
kubectl rollout restart deployment/payment-service -n production
```

Long-term:
- Fix connection leaks in application code
- Increase connection pool size (see the sketch below)
- Scale the database if needed
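
If the pool size is driven by configuration, it can sometimes be raised without a code change. The variable name below (`DB_POOL_MAX_SIZE`) is hypothetical; check how the service actually configures its pool before relying on this.

```bash
# Hypothetical: raise the connection pool ceiling via an env var and roll out
kubectl set env deployment/payment-service -n production DB_POOL_MAX_SIZE=50
kubectl rollout status deployment/payment-service -n production
```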
### Issue: Stuck Transactions

**Symptoms:**
- Transactions in "pending" state for >10 minutes
- User charged but order not completed

**Steps:**

1. **Find stuck transactions**
   ```sql
   SELECT * FROM transactions
   WHERE status = 'pending'
     AND created_at < NOW() - INTERVAL '10 minutes';
   ```
2. **Check transaction logs**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "transaction_id:ABC123"
   ```
3. **Verify with the payment provider**
   - Check the Stripe/PayPal dashboard
   - Confirm the charge status
**Resolution:**

Manual intervention required:

```sql
-- If provider confirms success
UPDATE transactions
SET status = 'completed', updated_at = NOW()
WHERE transaction_id = 'ABC123';

-- If provider confirms failure
UPDATE transactions
SET status = 'failed', updated_at = NOW()
WHERE transaction_id = 'ABC123';
```

**Follow-up:**
- File a bug to fix automatic transaction resolution
- Implement timeout handling
---

## Rollback Procedures

### Standard Deployment Rollback
**When to rollback:**
- High error rate after deployment
- Unexpected behavior in production
- Performance degradation

**Steps:**

1. **Verify the current version**
   ```bash
   kubectl get deployment payment-service -n production \
     -o jsonpath='{.spec.template.spec.containers[0].image}'
   ```
2. **Check rollout history**
   ```bash
   kubectl rollout history deployment/payment-service -n production
   ```
3. **Rollback to the previous version**
   ```bash
   kubectl rollout undo deployment/payment-service -n production
   ```
4. **Monitor rollback progress**
   ```bash
   kubectl rollout status deployment/payment-service -n production
   ```
5. **Verify recovery**
   - Check the dashboard for an error rate decrease (a polling sketch follows below)
   - Monitor logs for any new errors
   - Confirm with a test transaction
**Estimated time:** 2-3 minutes
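
To confirm the error rate is actually dropping after the rollback, a small polling loop against Prometheus can help. A sketch, assuming the Prometheus endpoint and metric names used in the Monitoring commands of this runbook:

```bash
# Poll the error ratio every 30 seconds for ~5 minutes
for i in $(seq 1 10); do
  curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
  echo
  sleep 30
done
```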
### Database Migration Rollback
**When to rollback:**
- Migration caused errors
- Performance severely degraded
- Data corruption detected

**Steps:**

1. **Stop application deployments**
   ```bash
   kubectl scale deployment/payment-service -n production --replicas=0
   ```
2. **Connect to the database**
   ```bash
   psql -h payments-db.abc123.us-east-1.rds.amazonaws.com \
     -U admin -d payments
   ```
3. **Run the rollback migration**
   ```sql
   -- Check current migration version
   SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;

   -- Run down migration
   \i migrations/20251015_rollback.sql
   ```
4. **Verify the database state**
   ```sql
   -- Check critical tables
   SELECT COUNT(*) FROM transactions;
   SELECT COUNT(*) FROM users;
   ```
5. **Restart the application**
   ```bash
   kubectl scale deployment/payment-service -n production --replicas=10
   ```

**Estimated time:** 10-15 minutes

**Escalation:** Engage the DBA team immediately for database rollbacks.
---

## Escalation Procedures

### When to Escalate

**Immediate escalation:**
- Unable to restore service within 15 minutes
- Data loss or corruption suspected
- Security incident detected
- Multiple services affected
- Business-critical hours (Black Friday, etc.)

**Standard escalation:**
- Problem outside your expertise
- Requires access you don't have
- External dependency issue
- Need additional resources
### Escalation Chain

**Level 1: On-call engineer (you)**
- Response time: Immediate
- Responsibilities: Initial diagnosis, basic troubleshooting

**Level 2: Senior engineer**
- Contact: [Slack: @senior-oncall] [Phone: +1-555-0100]
- Response time: 15 minutes
- Escalate if: Not resolved in 15 min, or unfamiliar issue

**Level 3: Team lead / Manager**
- Contact: [Slack: @team-lead] [Phone: +1-555-0200]
- Response time: 30 minutes
- Escalate if: Major incident, multi-service impact

**Level 4: Engineering director**
- Contact: [Phone: +1-555-0300]
- Response time: 1 hour
- Escalate if: Exec-level decision needed, major customer impact
### External Teams

| Team | Contact | When to Engage |
|------|---------|----------------|
| DBA Team | #dba-oncall | Database issues, slow queries |
| Platform Team | #platform-oncall | Kubernetes, networking, infra |
| Security Team | #security-oncall | Security incidents, suspicious activity |
| Customer Support | #support-escalations | Customer-facing communication |
### Escalation Template

**Slack message:**

```
🚨 ESCALATION: Payment Service High Error Rate

Severity: SEV-2
Duration: 20 minutes
Impact: 10% of payment requests failing

What I've tried:
- Checked recent deployments (none in last 2 hours)
- Reviewed logs (seeing Stripe API timeouts)
- Checked Stripe status page (all systems operational)

Current status:
- Error rate: 10%
- Traffic: Normal (~500/min)
- External APIs: Stripe intermittent timeouts

Need help with:
- Determining if this is our issue or Stripe's
- Decision on enabling fallback payment methods

Dashboard: [link]
Incident channel: #incident-20251015
```
---

## Emergency Contacts

### Internal Team

| Role | Name | Slack | Phone | Timezone |
|------|------|-------|-------|----------|
| On-Call (L2) | [Auto-rotates] | @senior-oncall | [PagerDuty] | Various |
| Team Lead | Jane Smith | @jane | +1-555-0200 | PST |
| DBA On-Call | [Auto-rotates] | @dba-oncall | [PagerDuty] | Various |
| Platform On-Call | [Auto-rotates] | @platform-oncall | [PagerDuty] | Various |
### External Partners

| Partner | Support Contact | Status Page | SLA |
|---------|-----------------|-------------|-----|
| Stripe | [email protected] | status.stripe.com | 24/7 |
| PayPal | [email protected] | status.paypal.com | Business hours |
| AWS | AWS Console (Premium Support) | health.aws.amazon.com | 24/7 |
---

## Useful Commands

### Kubernetes

```bash
# Get pod status
kubectl get pods -n production -l app=payment-service

# View recent logs
kubectl logs -n production deployment/payment-service --tail=100

# Follow logs in real-time
kubectl logs -n production deployment/payment-service -f

# Get pod details
kubectl describe pod -n production [POD_NAME]

# Execute command in pod
kubectl exec -n production [POD_NAME] -- [COMMAND]

# Port forward to local machine
kubectl port-forward -n production deployment/payment-service 8080:8080

# Scale replicas
kubectl scale deployment/payment-service -n production --replicas=15

# Restart deployment
kubectl rollout restart deployment/payment-service -n production

# Check deployment status
kubectl rollout status deployment/payment-service -n production

# View deployment history
kubectl rollout history deployment/payment-service -n production

# Rollback deployment
kubectl rollout undo deployment/payment-service -n production
```
### Database

```bash
# Connect to database
psql -h payments-db.abc123.us-east-1.rds.amazonaws.com -U admin -d payments

# Run SQL query from a pod
kubectl exec -n production deployment/payment-service -- \
  psql -h [DB_HOST] -U [USER] -d payments -c "SELECT COUNT(*) FROM transactions;"
```

```sql
-- Check active connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

-- Find slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';

-- Kill long-running query
SELECT pg_terminate_backend([PID]);
```
### Monitoring

```bash
# Query Prometheus
curl -g 'http://prometheus:9090/api/v1/query?query=payment_requests_total'

# Get recent error rate
curl -g 'http://prometheus:9090/api/v1/query?query=rate(payment_errors_total[5m])'

# Check service health
curl http://payment-service.production.svc.cluster.local:8080/health
```
### Traffic Control

```bash
# Drain traffic from service
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts="8080"

# Restore traffic
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts-
```
---

## Maintenance Procedures

### Planned Maintenance

**Before maintenance:**
- Announce in #engineering and #support channels
- Update status page if user-facing
- Schedule during low-traffic window (typically 2-4 AM PST)

**During maintenance:**
- Follow change management process
- Monitor metrics dashboard continuously
- Keep incident channel updated
- Have rollback plan ready

**After maintenance:**
- Verify all metrics returned to baseline
- Run smoke tests (see the sketch below)
- Monitor for 30 minutes
- Update status page
- Close maintenance ticket
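
A minimal smoke test can be as simple as hitting the documented health endpoint and confirming all pods are Ready. A sketch, reusing the endpoints from the Monitoring commands above; extend it with service-specific checks:

```bash
# Health endpoint should return success
curl -fsS http://payment-service.production.svc.cluster.local:8080/health && echo "health OK"

# All pods should be Running and Ready
kubectl get pods -n production -l app=payment-service
```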
### Database Maintenance

```bash
# Run vacuum (during low traffic)
psql -h [DB_HOST] -U admin -d payments -c "VACUUM ANALYZE transactions;"

# Reindex (if query performance degraded)
psql -h [DB_HOST] -U admin -d payments -c "REINDEX TABLE transactions;"

# Check table sizes
psql -h [DB_HOST] -U admin -d payments -c "
SELECT
  schemaname, tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
"
```
---

## Known Issues

### Issue: Intermittent Stripe Timeouts

**Status:** Ongoing (tracking in JIRA-1234)

**Symptoms:**
- Occasional Stripe API timeouts (<1% of requests)
- No pattern to timing
- Stripe status page shows green

**Workaround:**
- Application automatically retries
- Fallback to manual payment option shown to user

**Mitigation:**
- Increased timeout from 5s to 10s
- Added circuit breaker

**Next steps:**
- Working with Stripe support to investigate
- Considering a secondary payment processor
### Issue: Memory Leak on High Traffic

**Status:** Under investigation (JIRA-2345)

**Symptoms:**
- Memory usage slowly climbs during sustained high traffic
- Reaches the limit after ~6 hours
- Pods restart and recover

**Workaround:**
- Scheduled pod restarts every 4 hours during high-traffic events (a CronJob sketch follows below)
- Increased memory limit from 4GB to 6GB
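
One way to implement the scheduled-restart workaround is a Kubernetes CronJob. This is a sketch only: it assumes the `bitnami/kubectl` image is acceptable in your cluster and that the job's service account has RBAC permission to restart the deployment, neither of which is part of this runbook's verified tooling.

```bash
# Restart the deployment every 4 hours (RBAC for the job's service account is a separate, required step)
kubectl create cronjob payment-service-restart -n production \
  --image=bitnami/kubectl \
  --schedule="0 */4 * * *" \
  -- kubectl rollout restart deployment/payment-service -n production
```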
**Next steps:**
- Heap dump analysis in progress
- Suspected object pooling issue
---

## Runbook Maintenance

### Update Schedule
- **Weekly:** Review for accuracy after incidents
- **Monthly:** Update contacts and on-call schedule
- **Quarterly:** Full runbook review and testing

### How to Update
1. Create a PR in [repo link]
2. Get a review from a team member
3. Update the "Last Updated" date
4. Announce changes in #team-chat

### Feedback
Found an issue or have suggestions?
- File an issue: [GitHub link]
- Slack: #runbook-feedback
---

## Additional Resources
- **Architecture Diagrams:** [Confluence link]
- **API Documentation:** [Link]
- **Monitoring Dashboard:** [Grafana link]
- **Incident History:** [Postmortem repository]
- **Team Wiki:** [Wiki link]
## Best Practices for Runbooks
### 1. Test Your Runbooks
**Regular testing:**
```bash
#!/bin/bash
# Runbook test script: verify the access this runbook assumes actually works

echo "Testing Payment Service Runbook"

echo "1. Testing dashboard access..."
curl -s [DASHBOARD_URL] > /dev/null && echo "✅ Dashboard accessible"

echo "2. Testing kubectl access..."
kubectl get pods -n production -l app=payment-service > /dev/null && echo "✅ kubectl working"

echo "3. Testing database access..."
psql -h [DB_HOST] -U admin -d payments -c "SELECT 1;" > /dev/null && echo "✅ Database accessible"

echo "Runbook test complete!"
```
**Game days:**
- Schedule quarterly incident simulation
- Have engineers follow runbook procedures
- Identify gaps and update the runbook
### 2. Keep It Updated

**After every incident:**
- Add new troubleshooting steps discovered
- Update estimated resolution times
- Document new failure modes

**Version control:**
- Store runbooks in Git (this also enables automated freshness checks; see the sketch below)
- Review changes in PRs
- Tag releases
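
As a lightweight guard against drift, a CI or pre-commit check can fail when the runbook has not been touched in a while. A sketch, assuming a hypothetical repo path and the `**Last Updated:** YYYY-MM-DD` line from the template above; it uses GNU `date` (typical on Linux CI runners):

```bash
#!/bin/bash
# Hypothetical path; point this at wherever the runbook lives in your repo
RUNBOOK=runbooks/payment-service.md

# Pull the date from the "Last Updated" line in the template header
last=$(grep -m1 'Last Updated' "$RUNBOOK" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')

# GNU date assumed; use gdate on macOS
age_days=$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 86400 ))

if [ "$age_days" -gt 90 ]; then
  echo "Runbook last updated $age_days days ago -- please review"
  exit 1
fi
echo "Runbook freshness OK ($age_days days old)"
```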
### 3. Make It Accessible

**Where to store:**
- ✅ Git repository (searchable, version controlled)
- ✅ Wiki (easily accessible, collaborative)
- ✅ PagerDuty runbook feature (embedded in alerts)

**Where NOT to store:**
- ❌ Someone's laptop
- ❌ Private documents
- ❌ Outdated formats
### 4. Use Clear Language

**Good:**

If error rate exceeds 5%, rollback the deployment:

```bash
kubectl rollout undo deployment/payment-service -n production
```

**Bad:**

When things are broken, you should probably rollback unless there's a good reason not to. Use kubectl to undo it.
### 5. Include Context

**Why this matters:**
- Helps engineers make decisions
- Reduces "just following orders" mentality
- Enables improvisation when needed

**Example:**

> Scale to 15 pods (from 10) if CPU > 80%.
>
> Why: The service can handle increased traffic with more pods. 15 pods have
> been tested and work well. Don't go above 20 pods without consulting the
> platform team, as that may cause database connection pool exhaustion.
## Runbook for Runbooks

### Creating a New Runbook

1. Copy the template (provided above)
2. Fill in service-specific details:
   - Architecture diagram
   - Common alerts
   - Key commands
3. Test all commands yourself
4. Have a teammate review
5. Publish and announce
### Measuring Runbook Effectiveness

```yaml
metrics:
  - name: runbook_usage
    measure: "Times referenced during incidents"
    target: ">80% of incidents"
  - name: mttr_improvement
    measure: "MTTR with vs without runbook"
    target: "30% reduction"
  - name: runbook_completeness
    measure: "% of incidents fully covered by runbook"
    target: ">70%"
  - name: runbook_accuracy
    measure: "% of runbook steps that work as documented"
    target: "95%"
```
## Conclusion

Effective runbooks are living documents that:
- Reduce cognitive load during stressful incidents
- Preserve knowledge across team changes
- Enable anyone to respond to incidents
- Improve MTTR through standardized procedures
- Build confidence for on-call engineers

Remember: the best runbook is one that's actually used and kept up to date. Start simple, iterate based on real incidents, and maintain it regularly.

> "A runbook used is worth ten runbooks written."