Introduction

An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.

Why Runbooks Matter

Without Runbooks

  • Every incident requires figuring things out from scratch
  • Tribal knowledge lost when team members leave
  • New team members struggle with on-call
  • Inconsistent incident response
  • Higher MTTR and more user impact

With Runbooks

  • Standardized, tested response procedures
  • Knowledge preserved and shared
  • Faster onboarding for new team members
  • Consistent, reliable incident response
  • Lower MTTR and reduced stress

Runbook Structure

Essential Sections

  1. Service Overview - What the service does
  2. Architecture - Key components and dependencies
  3. Common Alerts - What triggers pages and how to respond
  4. Troubleshooting Guide - Diagnostic steps and solutions
  5. Escalation Procedures - When and how to escalate
  6. Emergency Contacts - Who to reach for help
  7. Rollback Procedures - How to revert changes
  8. Useful Commands - Quick reference for common tasks

Complete Runbook Template

# On-Call Runbook: [Service Name]

**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]

---

## Service Overview

### Purpose
[What does this service do? Why does it matter?]

Example:
"The Payment Service processes all customer payments including credit cards,
PayPal, and gift cards. It handles ~500 transactions/minute and is critical
for revenue generation. Any downtime directly impacts sales."

### Key Metrics
- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]

### Business Impact
- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]

---

## Architecture

### System Components

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   API GW    │─────▢│Payment Serviceβ”‚β”€β”€β”€β”€β”€β–Άβ”‚  Database   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β”œβ”€β”€β”€β”€β”€β–Ά Stripe API
                              β”œβ”€β”€β”€β”€β”€β–Ά PayPal API
                              └─────▢ Fraud Detection
```


### Dependencies

| Service | Purpose | Impact if Down | Contact |
|---------|---------|----------------|---------|
| PostgreSQL | Transaction storage | Cannot process payments | DBA team |
| Redis | Session cache | Degraded performance | Platform team |
| Stripe | Credit card processing | Card payments fail | External (status.stripe.com) |
| Fraud Service | Transaction screening | Must disable checks | Security team |

### Infrastructure

- **Platform:** Kubernetes (AWS EKS)
- **Namespace:** `production/payments`
- **Replicas:** 10 pods
- **Resources:** 2 CPU, 4GB RAM per pod
- **Database:** PostgreSQL 14, multi-AZ RDS

---

## Dashboards and Monitoring

### Primary Dashboard
[Link to Grafana/Datadog dashboard]

**Key Panels:**
- Request rate (should be ~500/min during business hours)
- Error rate (should be <0.5%)
- Latency percentiles (P95 should be <300ms)
- Database connection pool usage (should be <80%)

### Logs
- **Kibana:** [Link to logs]
- **Query:** `service:payment-service AND level:error`

### Metrics
- **Prometheus:** [Link to Prometheus]
- **Key Metrics:**
  - `payment_requests_total`
  - `payment_errors_total`
  - `payment_duration_seconds`
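
A hedged sketch of how these metrics might back the dashboard panels above, assuming the first two are counters and `payment_duration_seconds` is a histogram (label selectors omitted):

```promql
# Request rate per minute (dashboard expects ~500/min during business hours)
rate(payment_requests_total[5m]) * 60

# Error rate as a percentage of requests (alert threshold is 5%)
100 * rate(payment_errors_total[5m]) / rate(payment_requests_total[5m])

# P95 latency in seconds (dashboard expects < 0.3s)
histogram_quantile(0.95, sum(rate(payment_duration_seconds_bucket[5m])) by (le))
```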

---

## Common Alerts

### πŸ”΄ CRITICAL: High Error Rate

**Alert:** `PaymentServiceHighErrorRate`

**Description:**
Error rate exceeds 5% for 5 minutes.
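
For reference, a minimal sketch of how this alert might be defined as a Prometheus alerting rule, assuming the metric names from the Metrics section (your actual rule may live elsewhere and use different labels):

```yaml
groups:
  - name: payment-service-alerts
    rules:
      - alert: PaymentServiceHighErrorRate
        # Fire when errors exceed 5% of requests for 5 minutes
        expr: |
          rate(payment_errors_total[5m]) / rate(payment_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate above 5%"
```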

**Impact:**
- Payments failing for customers
- Revenue loss
- Customer support tickets increasing

**Diagnostic Steps:**

1. **Check the dashboard** to confirm error rate spike
   ```bash
   # Open: [Dashboard Link]
   # Look at: Error rate panel (should show spike)
   ```

2. **Check recent deployments**

   ```bash
   kubectl rollout history deployment/payment-service -n production
   ```

3. **Review error logs**

   ```bash
   kubectl logs -n production deployment/payment-service --tail=100 | grep ERROR
   ```

4. **Check external dependencies**

Common Causes & Solutions:

| Cause | Symptoms | Solution |
|-------|----------|----------|
| Bad deployment | Errors started after deploy | Rollback (see section below) |
| External API down | Specific payment method failing | Enable fallback mode |
| Database issues | Connection timeouts | Check connection pool, restart if needed |
| Traffic spike | All metrics elevated | Scale up pods |

Resolution Steps:

If recent deployment:

# Rollback to previous version
kubectl rollout undo deployment/payment-service -n production
kubectl rollout status deployment/payment-service -n production

If external API down:

# Enable fallback mode (skip that payment method)
kubectl set env deployment/payment-service -n production \
  STRIPE_ENABLED=false

If database connection issues:

# Check connection pool
kubectl exec -n production deployment/payment-service -- \
  curl localhost:8080/metrics | grep db_connection

# Restart deployment if pool exhausted
kubectl rollout restart deployment/payment-service -n production

Escalation:

  • If not resolved in 15 minutes, escalate to senior engineer
  • If payment partner issue, contact their support

### 🟑 WARNING: High Latency

Alert: PaymentServiceHighLatency

Description: P95 latency exceeds 500ms for 10 minutes.
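
A hedged sketch of the corresponding alert expression, assuming `payment_duration_seconds` is exported as a Prometheus histogram:

```yaml
- alert: PaymentServiceHighLatency
  # Fire when P95 latency stays above 500ms for 10 minutes
  expr: |
    histogram_quantile(0.95, sum(rate(payment_duration_seconds_bucket[10m])) by (le)) > 0.5
  for: 10m
  labels:
    severity: warning
```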

Impact:

  • Slow checkout experience
  • Potential timeout errors
  • Customer frustration

Diagnostic Steps:

  1. Check latency dashboard

    • Which percentiles are affected? (P50, P95, P99)
    • Is it all requests or specific endpoints?
  2. Check database performance

    # Slow query log
    kubectl exec -n production deployment/payment-service -- \
      curl localhost:8080/admin/slow-queries
    
  3. Check external API latency

    kubectl logs -n production deployment/payment-service \
      | grep "external_api_duration" | tail -50
    

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Database slow queries | Identify and optimize queries, add indexes |
| External API slow | Contact partner, enable caching |
| High CPU usage | Scale up pods |
| Memory pressure | Check for memory leaks, restart if needed |

Resolution Steps:

Scale up if resource constrained:

kubectl scale deployment/payment-service -n production --replicas=15

Escalation:

  • If latency affects SLO, escalate immediately
  • If database related, engage DBA team

### 🟠 WARNING: Pod Crashes

Alert: PaymentServicePodCrashing

Description: Pods restarting more than 3 times in 10 minutes.
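
A hedged sketch of the underlying alert expression, assuming kube-state-metrics is scraped by the same Prometheus:

```yaml
- alert: PaymentServicePodCrashing
  # Fire when containers restart more than 3 times within 10 minutes
  expr: |
    increase(kube_pod_container_status_restarts_total{namespace="production", pod=~"payment-service-.*"}[10m]) > 3
  labels:
    severity: warning
```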

Diagnostic Steps:

  1. Check pod status

    kubectl get pods -n production -l app=payment-service
    
  2. Check crash logs

    # Get crashed pod name
    POD=$(kubectl get pods -n production -l app=payment-service \
      --field-selector=status.phase=Running \
      -o jsonpath='{.items[0].metadata.name}')
    
    # Check previous logs
    kubectl logs -n production $POD --previous
    
  3. Check resource limits

    kubectl top pods -n production -l app=payment-service
    

Common Causes:

  • OOMKilled: Memory limit too low, increase memory
  • CrashLoopBackOff: Application bug, check logs for stack trace
  • Failed health checks: Health endpoint timing out

Resolution Steps:

If OOMKilled:

# Increase memory limit
kubectl set resources deployment/payment-service -n production \
  --limits=memory=6Gi --requests=memory=4Gi

If application crash:

# Rollback recent deployment
kubectl rollout undo deployment/payment-service -n production

## Troubleshooting Guide

Issue: Individual Payment Failing

Symptoms:

  • One transaction failing
  • Others succeeding
  • No alerts firing

Steps:

  1. Find transaction in logs

    kubectl logs -n production deployment/payment-service \
      | grep "transaction_id:ABC123"
    
  2. Check payment method

    • Is Stripe/PayPal returning errors?
    • Card declined vs system error?
  3. Validate with user

    • Confirm payment method is valid
    • Try different payment method

Note: Single failures are often user error (expired card, insufficient funds). If pattern emerges (multiple similar failures), investigate deeper.


Issue: Database Connection Errors

Symptoms:

ERROR: could not connect to database
FATAL: remaining connection slots are reserved

Steps:

  1. Check connection pool metrics

    kubectl exec -n production deployment/payment-service -- \
      curl localhost:8080/metrics | grep -A 5 db_pool
    
  2. Check active connections in database

    SELECT count(*) FROM pg_stat_activity
    WHERE datname = 'payments';
    
  3. Identify connection leaks

    # Check for long-running queries
    kubectl exec -n production deployment/payment-service -- \
      curl localhost:8080/admin/active-connections
    

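A related check on the database side: sessions stuck "idle in transaction" are a common sign of a connection leak. A hedged query to run via psql against the payments database:

```sql
-- Long-idle transactions usually point at application code holding connections open
SELECT pid, usename, now() - state_change AS idle_for, left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_for DESC;
```
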
Resolution:

Immediate:

# Restart pods to clear connection pool
kubectl rollout restart deployment/payment-service -n production

Long-term:

  • Fix connection leaks in application code
  • Increase connection pool size
  • Scale database if needed

Issue: Stuck Transactions

Symptoms:

  • Transactions in “pending” state for >10 minutes
  • User charged but order not completed

Steps:

  1. Find stuck transactions

    SELECT * FROM transactions
    WHERE status = 'pending'
    AND created_at < NOW() - INTERVAL '10 minutes';
    
  2. Check transaction logs

    kubectl logs -n production deployment/payment-service \
      | grep "transaction_id:ABC123"
    
  3. Verify with payment provider

    • Check Stripe/PayPal dashboard
    • Confirm charge status

Resolution:

Manual intervention required:

-- If provider confirms success
UPDATE transactions
SET status = 'completed', updated_at = NOW()
WHERE transaction_id = 'ABC123';

-- If provider confirms failure
UPDATE transactions
SET status = 'failed', updated_at = NOW()
WHERE transaction_id = 'ABC123';

Follow-up:

  • File bug to fix automatic transaction resolution
  • Implement timeout handling

## Rollback Procedures

Standard Deployment Rollback

When to rollback:

  • High error rate after deployment
  • Unexpected behavior in production
  • Performance degradation

Steps:

  1. Verify current version

    kubectl get deployment payment-service -n production \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    
  2. Check rollout history

    kubectl rollout history deployment/payment-service -n production
    
  3. Rollback to previous version

    kubectl rollout undo deployment/payment-service -n production
    
  4. Monitor rollback progress

    kubectl rollout status deployment/payment-service -n production
    
  5. Verify recovery

    • Check dashboard for error rate decrease
    • Monitor logs for any new errors
    • Confirm with test transaction

Estimated time: 2-3 minutes
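
A minimal post-rollback smoke check, reusing the health endpoint and Prometheus query from the Useful Commands section:

```bash
# Service should report healthy
curl -fsS http://payment-service.production.svc.cluster.local:8080/health

# Error rate should be trending back down
curl -g 'http://prometheus:9090/api/v1/query?query=rate(payment_errors_total[5m])'
```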


Database Migration Rollback

When to rollback:

  • Migration caused errors
  • Performance severely degraded
  • Data corruption detected

Steps:

  1. Stop application deployments

    kubectl scale deployment/payment-service -n production --replicas=0
    
  2. Connect to database

    psql -h payments-db.abc123.us-east-1.rds.amazonaws.com \
      -U admin -d payments
    
  3. Run rollback migration

    -- Check current migration version
    SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;
    
    -- Run down migration
    \i migrations/20251015_rollback.sql
    
  4. Verify database state

    -- Check critical tables
    SELECT COUNT(*) FROM transactions;
    SELECT COUNT(*) FROM users;
    
  5. Restart application

    kubectl scale deployment/payment-service -n production --replicas=10
    

Estimated time: 10-15 minutes
Escalation: Engage DBA team immediately for database rollbacks


## Escalation Procedures

When to Escalate

Immediate escalation:

  • Unable to restore service within 15 minutes
  • Data loss or corruption suspected
  • Security incident detected
  • Multiple services affected
  • Business-critical hours (Black Friday, etc.)

Standard escalation:

  • Problem outside your expertise
  • Requires access you don’t have
  • External dependency issue
  • Need additional resources

Escalation Chain

Level 1: On-call engineer (you)

  • Response time: Immediate
  • Responsibilities: Initial diagnosis, basic troubleshooting

Level 2: Senior engineer

  • Contact: [Slack: @senior-oncall] [Phone: +1-555-0100]
  • Response time: 15 minutes
  • Escalate if: Not resolved in 15 min, or unfamiliar issue

Level 3: Team lead / Manager

  • Contact: [Slack: @team-lead] [Phone: +1-555-0200]
  • Response time: 30 minutes
  • Escalate if: Major incident, multi-service impact

Level 4: Engineering director

  • Contact: [Phone: +1-555-0300]
  • Response time: 1 hour
  • Escalate if: Exec-level decision needed, major customer impact

External Teams

| Team | Contact | When to Engage |
|------|---------|----------------|
| DBA Team | #dba-oncall | Database issues, slow queries |
| Platform Team | #platform-oncall | Kubernetes, networking, infra |
| Security Team | #security-oncall | Security incidents, suspicious activity |
| Customer Support | #support-escalations | Customer-facing communication |

Escalation Template

Slack message:

🚨 ESCALATION: Payment Service High Error Rate

Severity: SEV-2
Duration: 20 minutes
Impact: 10% of payment requests failing

What I've tried:
- Checked recent deployments (none in last 2 hours)
- Reviewed logs (seeing Stripe API timeouts)
- Checked Stripe status page (all systems operational)

Current status:
- Error rate: 10%
- Traffic: Normal (~500/min)
- External APIs: Stripe intermittent timeouts

Need help with:
- Determining if this is our issue or Stripe's
- Decision on enabling fallback payment methods

Dashboard: [link]
Incident channel: #incident-20251015

## Emergency Contacts

Internal Team

| Role | Name | Slack | Phone | Timezone |
|------|------|-------|-------|----------|
| On-Call (L2) | [Auto-rotates] | @senior-oncall | [PagerDuty] | Various |
| Team Lead | Jane Smith | @jane | +1-555-0200 | PST |
| DBA On-Call | [Auto-rotates] | @dba-oncall | [PagerDuty] | Various |
| Platform On-Call | [Auto-rotates] | @platform-oncall | [PagerDuty] | Various |

External Partners

| Partner | Support Contact | Status Page | SLA |
|---------|-----------------|-------------|-----|
| Stripe | [email protected] | status.stripe.com | 24/7 |
| PayPal | [email protected] | status.paypal.com | Business hours |
| AWS | AWS Console (Premium Support) | health.aws.amazon.com | 24/7 |

## Useful Commands

Kubernetes

# Get pod status
kubectl get pods -n production -l app=payment-service

# View recent logs
kubectl logs -n production deployment/payment-service --tail=100

# Follow logs in real-time
kubectl logs -n production deployment/payment-service -f

# Get pod details
kubectl describe pod -n production [POD_NAME]

# Execute command in pod
kubectl exec -n production [POD_NAME] -- [COMMAND]

# Port forward to local machine
kubectl port-forward -n production deployment/payment-service 8080:8080

# Scale replicas
kubectl scale deployment/payment-service -n production --replicas=15

# Restart deployment
kubectl rollout restart deployment/payment-service -n production

# Check deployment status
kubectl rollout status deployment/payment-service -n production

# View deployment history
kubectl rollout history deployment/payment-service -n production

# Rollback deployment
kubectl rollout undo deployment/payment-service -n production

Database

# Connect to database
psql -h payments-db.abc123.us-east-1.rds.amazonaws.com -U admin -d payments

# Run SQL query
kubectl exec -n production deployment/payment-service -- \
  psql -h [DB_HOST] -U [USER] -d payments -c "SELECT COUNT(*) FROM transactions;"

# Check active connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

# Find slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';

# Kill long-running query
SELECT pg_terminate_backend([PID]);

Monitoring

# Query Prometheus
curl -g 'http://prometheus:9090/api/v1/query?query=payment_requests_total'

# Get recent error rate
curl -g 'http://prometheus:9090/api/v1/query?query=rate(payment_errors_total[5m])'

# Check service health
curl http://payment-service.production.svc.cluster.local:8080/health

Traffic Control

# Drain traffic from service
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts="8080"

# Restore traffic
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts-

## Maintenance Procedures

Planned Maintenance

Before maintenance:

  1. Announce in #engineering and #support channels
  2. Update status page if user-facing
  3. Schedule during low-traffic window (typically 2-4 AM PST)

During maintenance:

  1. Follow change management process
  2. Monitor metrics dashboard continuously
  3. Keep incident channel updated
  4. Have rollback plan ready

After maintenance:

  1. Verify all metrics returned to baseline
  2. Run smoke tests
  3. Monitor for 30 minutes
  4. Update status page
  5. Close maintenance ticket

Database Maintenance

# Run vacuum (during low traffic)
psql -h [DB_HOST] -U admin -d payments -c "VACUUM ANALYZE transactions;"

# Reindex (if query performance degraded)
psql -h [DB_HOST] -U admin -d payments -c "REINDEX TABLE transactions;"

# Check table sizes
psql -h [DB_HOST] -U admin -d payments -c "
  SELECT
    schemaname, tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
  FROM pg_tables
  WHERE schemaname = 'public'
  ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
"

## Known Issues

Issue: Intermittent Stripe Timeouts

Status: Ongoing (tracking in JIRA-1234)

Symptoms:

  • Occasional Stripe API timeouts (< 1% of requests)
  • No pattern to timing
  • Stripe status page shows green

Workaround:

  • Application automatically retries
  • Fallback to manual payment option shown to user

Mitigation:

  • Increased timeout from 5s to 10s
  • Added circuit breaker

Next steps:

  • Working with Stripe support to investigate
  • Considering secondary payment processor

Issue: Memory Leak on High Traffic

Status: Under investigation (JIRA-2345)

Symptoms:

  • Memory usage slowly climbs during sustained high traffic
  • Reaches limit after ~6 hours
  • Pods restart and recover

Workaround:

  • Scheduled pod restarts every 4 hours during high-traffic events
  • Increased memory limit from 4GB to 6GB

Next steps:

  • Heap dump analysis in progress
  • Suspected object pooling issue

## Runbook Maintenance

Update Schedule

  • Weekly: Review for accuracy after incidents
  • Monthly: Update contacts and on-call schedule
  • Quarterly: Full runbook review and testing

How to Update

  1. Create PR in [repo link]
  2. Get review from team member
  3. Update “Last Updated” date
  4. Announce changes in #team-chat

## Feedback

Found an issue or have suggestions?

  • File issue: [GitHub link]
  • Slack: #runbook-feedback

## Additional Resources

  • Architecture Diagrams: [Confluence link]
  • API Documentation: [Link]
  • Monitoring Dashboard: [Grafana link]
  • Incident History: [Postmortem repository]
  • Team Wiki: [Wiki link]

## Best Practices for Runbooks

### 1. Test Your Runbooks

**Regular testing:**
```bash
#!/bin/bash
# Payment Service runbook prerequisite checks
echo "Testing Payment Service Runbook"
echo "1. Testing dashboard access..."
curl -s [DASHBOARD_URL] > /dev/null && echo "βœ“ Dashboard accessible"

echo "2. Testing kubectl access..."
kubectl get pods -n production -l app=payment-service > /dev/null && echo "βœ“ Kubectl working"

echo "3. Testing database access..."
psql -h [DB_HOST] -U admin -d payments -c "SELECT 1;" > /dev/null && echo "βœ“ Database accessible"

echo "Runbook test complete!"
```

Game days:

  • Schedule quarterly incident simulation
  • Have engineers follow runbook procedures
  • Identify gaps and update runbook

### 2. Keep It Updated

After every incident:

  • Add new troubleshooting steps discovered
  • Update estimated resolution times
  • Document new failure modes

Version control:

  • Store runbooks in Git
  • Review changes in PRs
  • Tag releases

### 3. Make It Accessible

Where to store:

  • βœ… Git repository (searchable, version controlled)
  • βœ… Wiki (easily accessible, collaborative)
  • βœ… PagerDuty runbook feature (embedded in alerts)

Where NOT to store:

  • ❌ Someone’s laptop
  • ❌ Private documents
  • ❌ Outdated format

### 4. Use Clear Language

Good:

If error rate exceeds 5%, rollback the deployment:
kubectl rollout undo deployment/payment-service -n production

Bad:

When things are broken, you should probably rollback unless there's
a good reason not to. Use kubectl to undo it.

### 5. Include Context

Why this matters:

  • Helps engineers make decisions
  • Reduces “just following orders” mentality
  • Enables improvisation when needed

Example:

Scale to 15 pods (from 10) if CPU > 80%.

Why: The service can handle increased traffic with more pods.
15 pods has been tested and works well. Don't go above 20 pods
without consulting the platform team, as that may cause database
connection pool exhaustion.

## Runbook for Runbooks

### Creating a New Runbook

  1. Copy template (provided above)
  2. Fill in service-specific details
    • Architecture diagram
    • Common alerts
    • Key commands
  3. Test all commands yourself
  4. Have teammate review
  5. Publish and announce

### Measuring Runbook Effectiveness

```yaml
metrics:
  - name: runbook_usage
    measure: "Times referenced during incidents"
    target: ">80% of incidents"

  - name: mttr_improvement
    measure: "MTTR with vs without runbook"
    target: "30% reduction"

  - name: runbook_completeness
    measure: "% of incidents fully covered by runbook"
    target: ">70%"

  - name: runbook_accuracy
    measure: "% of runbook steps that work as documented"
    target: "95%"
```

## Conclusion

Effective runbooks are living documents that:

  1. Reduce cognitive load during stressful incidents
  2. Preserve knowledge across team changes
  3. Enable anyone to respond to incidents
  4. Improve MTTR through standardized procedures
  5. Build confidence for on-call engineers

Remember: The best runbook is one that’s actually used and kept up to date. Start simple, iterate based on real incidents, and maintain regularly.

“A runbook used is worth ten runbooks written.”