## Introduction
An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.
## Why Runbooks Matter

### Without Runbooks
- Every incident requires figuring things out from scratch
- Tribal knowledge lost when team members leave
- New team members struggle with on-call
- Inconsistent incident response
- Higher MTTR and more user impact
### With Runbooks
- Standardized, tested response procedures
- Knowledge preserved and shared
- Faster onboarding for new team members
- Consistent, reliable incident response
- Lower MTTR and reduced stress
## Runbook Structure

### Essential Sections
- **Service Overview** - What the service does
- **Architecture** - Key components and dependencies
- **Common Alerts** - What triggers pages and how to respond
- **Troubleshooting Guide** - Diagnostic steps and solutions
- **Escalation Procedures** - When and how to escalate
- **Emergency Contacts** - Who to reach for help
- **Rollback Procedures** - How to revert changes
- **Useful Commands** - Quick reference for common tasks
## Complete Runbook Template
# On-Call Runbook: [Service Name]
**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]
---
## Service Overview
### Purpose
[What does this service do? Why does it matter?]
Example:
"The Payment Service processes all customer payments including credit cards,
PayPal, and gift cards. It handles ~500 transactions/minute and is critical
for revenue generation. Any downtime directly impacts sales."
### Key Metrics
- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]
### Business Impact
- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]
---
## Architecture
### System Components
```
┌─────────┐     ┌─────────────────┐     ┌──────────┐
│  API GW │────▶│ Payment Service │────▶│ Database │
└─────────┘     └─────────────────┘     └──────────┘
                        │
                        ├──▶ Stripe API
                        ├──▶ PayPal API
                        └──▶ Fraud Detection
```
### Dependencies
| Service | Purpose | Impact if Down | Contact |
|---------|---------|----------------|---------|
| PostgreSQL | Transaction storage | Cannot process payments | DBA team |
| Redis | Session cache | Degraded performance | Platform team |
| Stripe | Credit card processing | Card payments fail | External (status.stripe.com) |
| Fraud Service | Transaction screening | Must disable checks | Security team |
### Infrastructure
- **Platform:** Kubernetes (AWS EKS)
- **Namespace:** `production/payments`
- **Replicas:** 10 pods
- **Resources:** 2 CPU, 4GB RAM per pod
- **Database:** PostgreSQL 14, multi-AZ RDS
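
A quick way to confirm the cluster still matches the values documented above is a sketch like the following, using the same deployment and namespace names as the commands elsewhere in this runbook:

```bash
# Compare the live deployment against the documented replica count and resources
kubectl get deployment payment-service -n production \
  -o jsonpath='{.spec.replicas}{"\n"}'
kubectl get deployment payment-service -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
```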
---
## Dashboards and Monitoring
### Primary Dashboard
[Link to Grafana/Datadog dashboard]
**Key Panels:**
- Request rate (should be ~500/min during business hours)
- Error rate (should be <0.5%)
- Latency percentiles (P95 should be <300ms)
- Database connection pool usage (should be <80%)
### Logs
- **Kibana:** [Link to logs]
- **Query:** `service:payment-service AND level:error`
### Metrics
- **Prometheus:** [Link to Prometheus]
- **Key Metrics:**
- `payment_requests_total`
- `payment_errors_total`
- `payment_duration_seconds`
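
To turn these counters into the error-rate number the dashboard shows, a query along the following lines can be run directly against Prometheus. This is a sketch: it assumes `payment_requests_total` and `payment_errors_total` are plain counters and that Prometheus is reachable at `http://prometheus:9090` (as in the Monitoring commands later in this runbook); adjust the host and any label matchers for your setup.

```bash
# Error ratio over the last 5 minutes (errors / requests)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
```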
---
## Common Alerts
### 🔴 CRITICAL: High Error Rate
**Alert:** `PaymentServiceHighErrorRate`
**Description:**
Error rate exceeds 5% for 5 minutes.
**Impact:**
- Payments failing for customers
- Revenue loss
- Customer support tickets increasing
**Diagnostic Steps:**
1. **Check the dashboard** to confirm the error rate spike
   ```bash
   # Open: [Dashboard Link]
   # Look at: Error rate panel (should show spike)
   ```
2. **Check recent deployments**
   ```bash
   kubectl rollout history deployment/payment-service -n production
   ```
3. **Review error logs**
   ```bash
   kubectl logs -n production deployment/payment-service --tail=100 | grep ERROR
   ```
4. **Check external dependencies**
   - Stripe: https://status.stripe.com
   - PayPal: https://status.paypal.com
   - Database: Check RDS metrics in the AWS console
**Common Causes & Solutions:**

| Cause | Symptoms | Solution |
|-------|----------|----------|
| Bad deployment | Errors started after deploy | Rollback (see section below) |
| External API down | Specific payment method failing | Enable fallback mode |
| Database issues | Connection timeouts | Check connection pool, restart if needed |
| Traffic spike | All metrics elevated | Scale up pods |
**Resolution Steps:**

If recent deployment:

```bash
# Rollback to previous version
kubectl rollout undo deployment/payment-service -n production
kubectl rollout status deployment/payment-service -n production
```

If external API down:

```bash
# Enable fallback mode (skip that payment method)
kubectl set env deployment/payment-service -n production \
  STRIPE_ENABLED=false
```

If database connection issues:

```bash
# Check connection pool
kubectl exec -n production deployment/payment-service -- \
  curl localhost:8080/metrics | grep db_connection

# Restart deployment if pool exhausted
kubectl rollout restart deployment/payment-service -n production
```
**Escalation:**
- If not resolved in 15 minutes, escalate to senior engineer
- If payment partner issue, contact their support
### 🟡 WARNING: High Latency

**Alert:** `PaymentServiceHighLatency`

**Description:**
P95 latency exceeds 500ms for 10 minutes.

**Impact:**
- Slow checkout experience
- Potential timeout errors
- Customer frustration
**Diagnostic Steps:**

1. **Check the latency dashboard**
   - Which percentiles are affected? (P50, P95, P99)
   - Is it all requests or specific endpoints?
2. **Check database performance**
   ```bash
   # Slow query log
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/admin/slow-queries
   ```
3. **Check external API latency**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "external_api_duration" | tail -50
   ```
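
If the dashboard is unavailable, the P95 can also be pulled straight from Prometheus. This is a sketch that assumes `payment_duration_seconds` is exported as a Prometheus histogram (i.e., a `_bucket` series exists); adjust the metric name and labels if yours differ.

```bash
# Current P95 latency over the last 5 minutes
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(payment_duration_seconds_bucket[5m])) by (le))'
```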
**Common Causes & Solutions:**

| Cause | Solution |
|-------|----------|
| Database slow queries | Identify and optimize queries, add indexes |
| External API slow | Contact partner, enable caching |
| High CPU usage | Scale up pods |
| Memory pressure | Check for memory leaks, restart if needed |
**Resolution Steps:**

Scale up if resource constrained:

```bash
kubectl scale deployment/payment-service -n production --replicas=15
```
**Escalation:**
- If latency affects the SLO, escalate immediately
- If database related, engage the DBA team
### 🟠 WARNING: Pod Crashes

**Alert:** `PaymentServicePodCrashing`

**Description:**
Pods restarting more than 3 times in 10 minutes.

**Diagnostic Steps:**

1. **Check pod status**
   ```bash
   kubectl get pods -n production -l app=payment-service
   ```
2. **Check crash logs**
   ```bash
   # Get crashed pod name
   POD=$(kubectl get pods -n production -l app=payment-service \
     --field-selector=status.phase=Running \
     -o jsonpath='{.items[0].metadata.name}')

   # Check previous logs
   kubectl logs -n production $POD --previous
   ```
3. **Check resource limits**
   ```bash
   kubectl top pods -n production -l app=payment-service
   ```
**Common Causes:**
- **OOMKilled:** Memory limit too low, increase memory
- **CrashLoopBackOff:** Application bug, check logs for stack trace
- **Failed health checks:** Health endpoint timing out

**Resolution Steps:**

If OOMKilled:

```bash
# Increase memory limit
kubectl set resources deployment/payment-service -n production \
  --limits=memory=6Gi --requests=memory=4Gi
```

If application crash:

```bash
# Rollback recent deployment
kubectl rollout undo deployment/payment-service -n production
```
---

## Troubleshooting Guide
### Issue: Individual Payment Failing

**Symptoms:**
- One transaction failing
- Others succeeding
- No alerts firing

**Steps:**

1. **Find the transaction in logs**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "transaction_id:ABC123"
   ```
2. **Check the payment method**
   - Is Stripe/PayPal returning errors?
   - Card declined vs. system error?
3. **Validate with the user**
   - Confirm the payment method is valid
   - Try a different payment method

**Note:** Single failures are often user error (expired card, insufficient funds). If a pattern emerges (multiple similar failures), investigate deeper; one quick check is sketched below.
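
One rough way to check for a pattern is to count recent errors per payment provider. This is a sketch only: it assumes error log lines mention the provider name, so adjust the grep patterns (and provider list) to your actual log format.

```bash
# Count error lines mentioning each provider over the last hour
for provider in stripe paypal giftcard; do
  count=$(kubectl logs -n production deployment/payment-service --since=1h \
    | grep ERROR | grep -ci "$provider")
  echo "$provider errors in last hour: $count"
done
```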
### Issue: Database Connection Errors

**Symptoms:**
```
ERROR: could not connect to database
FATAL: remaining connection slots are reserved
```

**Steps:**

1. **Check connection pool metrics**
   ```bash
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/metrics | grep -A 5 db_pool
   ```
2. **Check active connections in the database**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE datname = 'payments';
   ```
3. **Identify connection leaks**
   ```bash
   # Check for long-running queries
   kubectl exec -n production deployment/payment-service -- \
     curl localhost:8080/admin/active-connections
   ```
**Resolution:**

Immediate:

```bash
# Restart pods to clear connection pool
kubectl rollout restart deployment/payment-service -n production
```

Long-term:
- Fix connection leaks in application code
- Increase connection pool size (see the sketch below)
- Scale the database if needed
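
If the pool size is driven by configuration, it can sometimes be raised without a code change. The variable name below (`DB_POOL_MAX_SIZE`) is hypothetical; check how the service actually configures its pool before relying on this.

```bash
# Hypothetical: raise the connection pool ceiling via an env var and roll out
kubectl set env deployment/payment-service -n production DB_POOL_MAX_SIZE=50
kubectl rollout status deployment/payment-service -n production
```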
### Issue: Stuck Transactions

**Symptoms:**
- Transactions in "pending" state for >10 minutes
- User charged but order not completed

**Steps:**

1. **Find stuck transactions**
   ```sql
   SELECT * FROM transactions
   WHERE status = 'pending'
     AND created_at < NOW() - INTERVAL '10 minutes';
   ```
2. **Check transaction logs**
   ```bash
   kubectl logs -n production deployment/payment-service \
     | grep "transaction_id:ABC123"
   ```
3. **Verify with the payment provider**
   - Check the Stripe/PayPal dashboard
   - Confirm the charge status
**Resolution:**

Manual intervention required:

```sql
-- If provider confirms success
UPDATE transactions
SET status = 'completed', updated_at = NOW()
WHERE transaction_id = 'ABC123';

-- If provider confirms failure
UPDATE transactions
SET status = 'failed', updated_at = NOW()
WHERE transaction_id = 'ABC123';
```

**Follow-up:**
- File a bug to fix automatic transaction resolution
- Implement timeout handling
---

## Rollback Procedures

### Standard Deployment Rollback
**When to rollback:**
- High error rate after deployment
- Unexpected behavior in production
- Performance degradation

**Steps:**

1. **Verify the current version**
   ```bash
   kubectl get deployment payment-service -n production \
     -o jsonpath='{.spec.template.spec.containers[0].image}'
   ```
2. **Check rollout history**
   ```bash
   kubectl rollout history deployment/payment-service -n production
   ```
3. **Rollback to the previous version**
   ```bash
   kubectl rollout undo deployment/payment-service -n production
   ```
4. **Monitor rollback progress**
   ```bash
   kubectl rollout status deployment/payment-service -n production
   ```
5. **Verify recovery**
   - Check the dashboard for an error rate decrease (a polling sketch follows below)
   - Monitor logs for any new errors
   - Confirm with a test transaction
**Estimated time:** 2-3 minutes
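
To confirm the error rate is actually dropping after the rollback, a small polling loop against Prometheus can help. A sketch, assuming the Prometheus endpoint and metric names used in the Monitoring commands of this runbook:

```bash
# Poll the error ratio every 30 seconds for ~5 minutes
for i in $(seq 1 10); do
  curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
  echo
  sleep 30
done
```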
### Database Migration Rollback
**When to rollback:**
- Migration caused errors
- Performance severely degraded
- Data corruption detected

**Steps:**

1. **Stop application deployments**
   ```bash
   kubectl scale deployment/payment-service -n production --replicas=0
   ```
2. **Connect to the database**
   ```bash
   psql -h payments-db.abc123.us-east-1.rds.amazonaws.com \
     -U admin -d payments
   ```
3. **Run the rollback migration**
   ```sql
   -- Check current migration version
   SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;

   -- Run down migration
   \i migrations/20251015_rollback.sql
   ```
4. **Verify the database state**
   ```sql
   -- Check critical tables
   SELECT COUNT(*) FROM transactions;
   SELECT COUNT(*) FROM users;
   ```
5. **Restart the application**
   ```bash
   kubectl scale deployment/payment-service -n production --replicas=10
   ```

**Estimated time:** 10-15 minutes

**Escalation:** Engage the DBA team immediately for database rollbacks.
---

## Escalation Procedures

### When to Escalate

**Immediate escalation:**
- Unable to restore service within 15 minutes
- Data loss or corruption suspected
- Security incident detected
- Multiple services affected
- Business-critical hours (Black Friday, etc.)

**Standard escalation:**
- Problem outside your expertise
- Requires access you don't have
- External dependency issue
- Need additional resources
### Escalation Chain

**Level 1: On-call engineer (you)**
- Response time: Immediate
- Responsibilities: Initial diagnosis, basic troubleshooting

**Level 2: Senior engineer**
- Contact: [Slack: @senior-oncall] [Phone: +1-555-0100]
- Response time: 15 minutes
- Escalate if: Not resolved in 15 min, or unfamiliar issue

**Level 3: Team lead / Manager**
- Contact: [Slack: @team-lead] [Phone: +1-555-0200]
- Response time: 30 minutes
- Escalate if: Major incident, multi-service impact

**Level 4: Engineering director**
- Contact: [Phone: +1-555-0300]
- Response time: 1 hour
- Escalate if: Exec-level decision needed, major customer impact
### External Teams

| Team | Contact | When to Engage |
|------|---------|----------------|
| DBA Team | #dba-oncall | Database issues, slow queries |
| Platform Team | #platform-oncall | Kubernetes, networking, infra |
| Security Team | #security-oncall | Security incidents, suspicious activity |
| Customer Support | #support-escalations | Customer-facing communication |
### Escalation Template

**Slack message:**

```
🚨 ESCALATION: Payment Service High Error Rate

Severity: SEV-2
Duration: 20 minutes
Impact: 10% of payment requests failing

What I've tried:
- Checked recent deployments (none in last 2 hours)
- Reviewed logs (seeing Stripe API timeouts)
- Checked Stripe status page (all systems operational)

Current status:
- Error rate: 10%
- Traffic: Normal (~500/min)
- External APIs: Stripe intermittent timeouts

Need help with:
- Determining if this is our issue or Stripe's
- Decision on enabling fallback payment methods

Dashboard: [link]
Incident channel: #incident-20251015
```
---

## Emergency Contacts

### Internal Team

| Role | Name | Slack | Phone | Timezone |
|------|------|-------|-------|----------|
| On-Call (L2) | [Auto-rotates] | @senior-oncall | [PagerDuty] | Various |
| Team Lead | Jane Smith | @jane | +1-555-0200 | PST |
| DBA On-Call | [Auto-rotates] | @dba-oncall | [PagerDuty] | Various |
| Platform On-Call | [Auto-rotates] | @platform-oncall | [PagerDuty] | Various |
### External Partners

| Partner | Support Contact | Status Page | SLA |
|---------|-----------------|-------------|-----|
| Stripe | [email protected] | status.stripe.com | 24/7 |
| PayPal | [email protected] | status.paypal.com | Business hours |
| AWS | AWS Console (Premium Support) | health.aws.amazon.com | 24/7 |
---

## Useful Commands

### Kubernetes

```bash
# Get pod status
kubectl get pods -n production -l app=payment-service

# View recent logs
kubectl logs -n production deployment/payment-service --tail=100

# Follow logs in real-time
kubectl logs -n production deployment/payment-service -f

# Get pod details
kubectl describe pod -n production [POD_NAME]

# Execute command in pod
kubectl exec -n production [POD_NAME] -- [COMMAND]

# Port forward to local machine
kubectl port-forward -n production deployment/payment-service 8080:8080

# Scale replicas
kubectl scale deployment/payment-service -n production --replicas=15

# Restart deployment
kubectl rollout restart deployment/payment-service -n production

# Check deployment status
kubectl rollout status deployment/payment-service -n production

# View deployment history
kubectl rollout history deployment/payment-service -n production

# Rollback deployment
kubectl rollout undo deployment/payment-service -n production
```
### Database

```bash
# Connect to database
psql -h payments-db.abc123.us-east-1.rds.amazonaws.com -U admin -d payments

# Run SQL query from a pod
kubectl exec -n production deployment/payment-service -- \
  psql -h [DB_HOST] -U [USER] -d payments -c "SELECT COUNT(*) FROM transactions;"
```

```sql
-- Check active connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

-- Find slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';

-- Kill long-running query
SELECT pg_terminate_backend([PID]);
```
### Monitoring

```bash
# Query Prometheus
curl -g 'http://prometheus:9090/api/v1/query?query=payment_requests_total'

# Get recent error rate
curl -g 'http://prometheus:9090/api/v1/query?query=rate(payment_errors_total[5m])'

# Check service health
curl http://payment-service.production.svc.cluster.local:8080/health
```
### Traffic Control

```bash
# Drain traffic from service
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts="8080"

# Restore traffic
kubectl annotate service payment-service -n production \
  traffic.sidecar.istio.io/excludeInboundPorts-
```
---

## Maintenance Procedures

### Planned Maintenance

**Before maintenance:**
- Announce in #engineering and #support channels
- Update status page if user-facing
- Schedule during low-traffic window (typically 2-4 AM PST)

**During maintenance:**
- Follow change management process
- Monitor metrics dashboard continuously
- Keep incident channel updated
- Have rollback plan ready

**After maintenance:**
- Verify all metrics returned to baseline
- Run smoke tests (see the sketch below)
- Monitor for 30 minutes
- Update status page
- Close maintenance ticket
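
A minimal smoke test can be as simple as hitting the documented health endpoint and confirming all pods are Ready. A sketch, reusing the endpoints from the Monitoring commands above; extend it with service-specific checks:

```bash
# Health endpoint should return success
curl -fsS http://payment-service.production.svc.cluster.local:8080/health && echo "health OK"

# All pods should be Running and Ready
kubectl get pods -n production -l app=payment-service
```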
### Database Maintenance

```bash
# Run vacuum (during low traffic)
psql -h [DB_HOST] -U admin -d payments -c "VACUUM ANALYZE transactions;"

# Reindex (if query performance degraded)
psql -h [DB_HOST] -U admin -d payments -c "REINDEX TABLE transactions;"

# Check table sizes
psql -h [DB_HOST] -U admin -d payments -c "
SELECT
  schemaname, tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
"
```
---

## Known Issues

### Issue: Intermittent Stripe Timeouts

**Status:** Ongoing (tracking in JIRA-1234)

**Symptoms:**
- Occasional Stripe API timeouts (<1% of requests)
- No pattern to timing
- Stripe status page shows green

**Workaround:**
- Application automatically retries
- Fallback to manual payment option shown to user

**Mitigation:**
- Increased timeout from 5s to 10s
- Added circuit breaker

**Next steps:**
- Working with Stripe support to investigate
- Considering a secondary payment processor
### Issue: Memory Leak on High Traffic

**Status:** Under investigation (JIRA-2345)

**Symptoms:**
- Memory usage slowly climbs during sustained high traffic
- Reaches the limit after ~6 hours
- Pods restart and recover

**Workaround:**
- Scheduled pod restarts every 4 hours during high-traffic events (a CronJob sketch follows below)
- Increased memory limit from 4GB to 6GB
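
One way to implement the scheduled-restart workaround is a Kubernetes CronJob. This is a sketch only: it assumes the `bitnami/kubectl` image is acceptable in your cluster and that the job's service account has RBAC permission to restart the deployment, neither of which is part of this runbook's verified tooling.

```bash
# Restart the deployment every 4 hours (RBAC for the job's service account is a separate, required step)
kubectl create cronjob payment-service-restart -n production \
  --image=bitnami/kubectl \
  --schedule="0 */4 * * *" \
  -- kubectl rollout restart deployment/payment-service -n production
```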
**Next steps:**
- Heap dump analysis in progress
- Suspected object pooling issue
---

## Runbook Maintenance

### Update Schedule
- **Weekly:** Review for accuracy after incidents
- **Monthly:** Update contacts and on-call schedule
- **Quarterly:** Full runbook review and testing

### How to Update
1. Create a PR in [repo link]
2. Get a review from a team member
3. Update the "Last Updated" date
4. Announce changes in #team-chat

### Feedback
Found an issue or have suggestions?
- File an issue: [GitHub link]
- Slack: #runbook-feedback
---

## Additional Resources
- **Architecture Diagrams:** [Confluence link]
- **API Documentation:** [Link]
- **Monitoring Dashboard:** [Grafana link]
- **Incident History:** [Postmortem repository]
- **Team Wiki:** [Wiki link]
## Best Practices for Runbooks
### 1. Test Your Runbooks
**Regular testing:**
```bash
#!/bin/bash
# Runbook test script: verify the access this runbook assumes actually works

echo "Testing Payment Service Runbook"

echo "1. Testing dashboard access..."
curl -s [DASHBOARD_URL] > /dev/null && echo "✅ Dashboard accessible"

echo "2. Testing kubectl access..."
kubectl get pods -n production -l app=payment-service > /dev/null && echo "✅ kubectl working"

echo "3. Testing database access..."
psql -h [DB_HOST] -U admin -d payments -c "SELECT 1;" > /dev/null && echo "✅ Database accessible"

echo "Runbook test complete!"
```
**Game days:**
- Schedule quarterly incident simulation
- Have engineers follow runbook procedures
- Identify gaps and update the runbook
### 2. Keep It Updated

**After every incident:**
- Add new troubleshooting steps discovered
- Update estimated resolution times
- Document new failure modes

**Version control:**
- Store runbooks in Git (this also enables automated freshness checks; see the sketch below)
- Review changes in PRs
- Tag releases
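
As a lightweight guard against drift, a CI or pre-commit check can fail when the runbook has not been touched in a while. A sketch, assuming a hypothetical repo path and the `**Last Updated:** YYYY-MM-DD` line from the template above; it uses GNU `date` (typical on Linux CI runners):

```bash
#!/bin/bash
# Hypothetical path; point this at wherever the runbook lives in your repo
RUNBOOK=runbooks/payment-service.md

# Pull the date from the "Last Updated" line in the template header
last=$(grep -m1 'Last Updated' "$RUNBOOK" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')

# GNU date assumed; use gdate on macOS
age_days=$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 86400 ))

if [ "$age_days" -gt 90 ]; then
  echo "Runbook last updated $age_days days ago -- please review"
  exit 1
fi
echo "Runbook freshness OK ($age_days days old)"
```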
### 3. Make It Accessible

**Where to store:**
- ✅ Git repository (searchable, version controlled)
- ✅ Wiki (easily accessible, collaborative)
- ✅ PagerDuty runbook feature (embedded in alerts)

**Where NOT to store:**
- ❌ Someone's laptop
- ❌ Private documents
- ❌ Outdated formats
### 4. Use Clear Language

**Good:**

If error rate exceeds 5%, rollback the deployment:

```bash
kubectl rollout undo deployment/payment-service -n production
```

**Bad:**

When things are broken, you should probably rollback unless there's a good reason not to. Use kubectl to undo it.
### 5. Include Context

**Why this matters:**
- Helps engineers make decisions
- Reduces "just following orders" mentality
- Enables improvisation when needed

**Example:**

> Scale to 15 pods (from 10) if CPU > 80%.
>
> Why: The service can handle increased traffic with more pods. 15 pods have
> been tested and work well. Don't go above 20 pods without consulting the
> platform team, as that may cause database connection pool exhaustion.
## Runbook for Runbooks

### Creating a New Runbook

1. Copy the template (provided above)
2. Fill in service-specific details:
   - Architecture diagram
   - Common alerts
   - Key commands
3. Test all commands yourself
4. Have a teammate review
5. Publish and announce
### Measuring Runbook Effectiveness

```yaml
metrics:
  - name: runbook_usage
    measure: "Times referenced during incidents"
    target: ">80% of incidents"
  - name: mttr_improvement
    measure: "MTTR with vs without runbook"
    target: "30% reduction"
  - name: runbook_completeness
    measure: "% of incidents fully covered by runbook"
    target: ">70%"
  - name: runbook_accuracy
    measure: "% of runbook steps that work as documented"
    target: "95%"
```
## Conclusion

Effective runbooks are living documents that:
- Reduce cognitive load during stressful incidents
- Preserve knowledge across team changes
- Enable anyone to respond to incidents
- Improve MTTR through standardized procedures
- Build confidence for on-call engineers

Remember: the best runbook is one that's actually used and kept up to date. Start simple, iterate based on real incidents, and maintain it regularly.

> "A runbook used is worth ten runbooks written."