Introduction

Disaster Recovery (DR) is the process, policies, and procedures for recovering and continuing technology infrastructure after a disaster. A disaster can be natural (earthquake, flood), technical (data center failure, ransomware), or human-caused (accidental deletion, security breach).

Core Principle: “Hope is not a strategy. Plan for failure before it happens.”

Key Concepts

RTO vs RPO

Time ──────────────────────────────────────────────────────────────>
        β”‚                  β”‚                  β”‚                   β”‚
   Last Backup         Disaster           Recovery             Normal
                        Occurs             Begins            Operations
        │◄────────────────►│
             Data Loss
        (Recovery Point
         Objective - RPO)
                           │◄────────────────────────────────────►│
                                        Downtime
                           (Recovery Time Objective - RTO)

Recovery Time Objective (RTO)

Definition: Maximum acceptable time that a system can be down after a disaster.

Examples:

  • RTO = 0 minutes: High-availability systems (financial trading)
  • RTO = 4 hours: Critical business applications
  • RTO = 24 hours: Standard applications
  • RTO = 7 days: Archival systems

Cost implications:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RTO vs Cost Relationship                β”‚
β”‚                                         β”‚
β”‚  Cost                                   β”‚
β”‚    β–²                                    β”‚
β”‚    β”‚                            β•±       β”‚
β”‚    β”‚                        β•±           β”‚
β”‚    β”‚                    β•±               β”‚
β”‚    β”‚                β•±                   β”‚
β”‚    β”‚            β•±                       β”‚
β”‚    β”‚        β•±                           β”‚
β”‚    β”‚    β•±                               β”‚
β”‚    β”‚β•±                                   β”‚
β”‚    └────────────────────────────> RTO  β”‚
β”‚     0min   1hr    4hr   24hr    7d      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
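
To make the tradeoff behind this curve concrete, a rough break-even sketch can compare the expected annual cost of downtime at a candidate RTO against the annual cost of the DR setup needed to achieve it. All dollar figures below are illustrative assumptions, not benchmarks:

# Rough break-even sketch: expected downtime cost vs. DR spend per candidate RTO.
# Every figure here is an illustrative assumption; substitute your own numbers.

REVENUE_PER_HOUR = 5_000      # assumed revenue lost per hour of downtime
INCIDENTS_PER_YEAR = 2        # assumed number of DR-worthy incidents per year

# Candidate RTOs mapped to an assumed annual cost of the DR setup that achieves them
dr_options = {
    "RTO 24h (backup/restore)": {"rto_hours": 24,  "annual_dr_cost": 12_000},
    "RTO 4h  (pilot light)":    {"rto_hours": 4,   "annual_dr_cost": 24_000},
    "RTO 1h  (warm standby)":   {"rto_hours": 1,   "annual_dr_cost": 60_000},
    "RTO ~0  (active-active)":  {"rto_hours": 0.1, "annual_dr_cost": 120_000},
}

for name, opt in dr_options.items():
    downtime_cost = opt["rto_hours"] * REVENUE_PER_HOUR * INCIDENTS_PER_YEAR
    total = downtime_cost + opt["annual_dr_cost"]
    print(f"{name}: downtime ${downtime_cost:,.0f} + DR ${opt['annual_dr_cost']:,} "
          f"= ${total:,.0f}/year")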

Recovery Point Objective (RPO)

Definition: Maximum acceptable amount of data loss measured in time.

Examples:

  • RPO = 0 minutes: No data loss (synchronous replication)
  • RPO = 15 minutes: Near-real-time backup
  • RPO = 1 hour: Frequent backups
  • RPO = 24 hours: Daily backups

Calculation:

# Data loss in a disaster
from datetime import datetime

def calculate_data_loss(rpo_minutes, disaster_time, last_backup_time):
    """
    Calculate potential data loss based on RPO
    """
    minutes_since_backup = (disaster_time - last_backup_time).total_seconds() / 60

    if minutes_since_backup <= rpo_minutes:
        return "Within RPO - Acceptable"
    else:
        return f"Data loss: {minutes_since_backup:.0f} minutes (exceeds RPO of {rpo_minutes} min)"

# Example
disaster_time = datetime(2025, 10, 16, 14, 30)  # 2:30 PM
last_backup = datetime(2025, 10, 16, 14, 0)     # 2:00 PM
rpo = 15  # minutes

result = calculate_data_loss(rpo, disaster_time, last_backup)
# Output: "Data loss: 30 minutes (exceeds RPO of 15 min)"

DR Tiers

Tier 0: No DR

  • RTO: Days/weeks
  • RPO: Days
  • Strategy: Rebuild from scratch
  • Cost: $
  • Use case: Non-critical systems

Tier 1: Backup and Restore

  • RTO: 12-24 hours
  • RPO: 1-24 hours
  • Strategy: Regular backups to offsite storage
  • Cost: $$
  • Use case: Standard applications
backup_strategy:
  frequency: "Daily at 2 AM"
  retention: "30 days"
  storage: "S3 Glacier"
  restore_time: "4-8 hours"

Tier 2: Pilot Light

  • RTO: 1-4 hours
  • RPO: Minutes to hours
  • Strategy: Minimal infrastructure always running, scale up on disaster
  • Cost: $$$
  • Use case: Important applications
pilot_light:
  always_running:
    - database_replica
    - minimal_app_servers
  disaster_action:
    - scale_up_app_servers
    - update_dns
    - activate_load_balancer

Tier 3: Warm Standby

  • RTO: Minutes to 1 hour
  • RPO: Seconds to minutes
  • Strategy: Scaled-down version always running
  • Cost: $$$$
  • Use case: Business-critical applications
warm_standby:
  always_running:
    - replicated_database
    - reduced_capacity_app_servers
    - load_balancer_configured
  disaster_action:
    - scale_up_to_full_capacity
    - switch_dns_to_dr_site

Tier 4: Hot Site / Multi-Site Active-Active

  • RTO: 0 minutes (automatic failover)
  • RPO: 0 seconds (no data loss)
  • Strategy: Fully redundant infrastructure
  • Cost: $$$$$
  • Use case: Mission-critical systems
active_active:
  configuration:
    - multiple_regions
    - synchronous_replication
    - auto_failover
    - global_load_balancing
  disaster_action:
    - automatic_traffic_rerouting
    - no_manual_intervention
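
The tier can usually be picked mechanically from a service's required RTO/RPO. A minimal sketch of that mapping follows; the numeric thresholds are assumptions derived from the ranges listed above, expressed in minutes:

# Map a service's required RTO/RPO (in minutes) to the cheapest DR tier that satisfies both.
# The capability thresholds below are assumptions based on the tier descriptions above.

TIERS = [
    # (name, delivered RTO in minutes, delivered RPO in minutes)
    ("Tier 4: Hot site / active-active", 1,            1),
    ("Tier 3: Warm standby",             60,           5),
    ("Tier 2: Pilot light",              4 * 60,       60),
    ("Tier 1: Backup and restore",       24 * 60,      24 * 60),
    ("Tier 0: No DR",                    float("inf"), float("inf")),
]

def select_tier(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the cheapest tier whose delivered RTO/RPO meets the requirement."""
    # Walk from cheapest (Tier 0) to most expensive and keep the first tier that is good enough.
    for name, rto_cap, rpo_cap in reversed(TIERS):
        if rto_cap <= required_rto_min and rpo_cap <= required_rpo_min:
            return name
    return TIERS[0][0]  # nothing cheaper works; fall back to the most capable tier

print(select_tier(required_rto_min=60, required_rpo_min=5))         # -> Tier 3: Warm standby
print(select_tier(required_rto_min=12 * 60, required_rpo_min=240))  # -> Tier 2: Pilot light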

Disaster Scenarios

1. Data Center / Availability Zone Failure

Scenario: Entire AWS availability zone goes down

Impact:

  • All services in that AZ unavailable
  • Database read replicas lost
  • Application instances terminated

DR Strategy:

# Multi-AZ deployment
disaster: az_failure

architecture:
  deployment:
    availability_zones: [us-east-1a, us-east-1b, us-east-1c]
    min_healthy_azs: 2

  application:
    instances_per_az: 3
    total_instances: 9
    acceptable_loss: 33%  # Can lose 1 AZ

  database:
    primary: us-east-1a
    replicas:
      - us-east-1b
      - us-east-1c
    auto_failover: true

recovery_steps:
  1. detect:
      - aws_health_dashboard_alerts
      - instance_health_checks_failing

  2. automatic_actions:
      - remove_unhealthy_instances_from_lb
      - scale_up_healthy_az_instances
      - database_auto_failover_to_replica

  3. manual_verification:
      - check_service_availability
      - verify_data_consistency
      - monitor_error_rates

expected_rto: "5 minutes"
expected_rpo: "0 seconds (synchronous replication)"

Kubernetes Implementation:

# Multi-AZ pod distribution
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 9
  template:
    spec:
      # Spread pods across AZs
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-server

      # Prefer different AZs
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: api-server
              topologyKey: topology.kubernetes.io/zone
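
To confirm the spread constraint is doing its job, a small check can count running pods per zone. This is a sketch using the official kubernetes Python client; the `production` namespace is an assumption (the manifest above does not set one) and the label matches the manifest:

# Count api-server pods per availability zone to verify the topology spread constraint.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# Map node name -> zone label once, then tally pods by the zone of the node they run on.
zone_by_node = {
    node.metadata.name: (node.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
    for node in v1.list_node().items
}

pods = v1.list_namespaced_pod("production", label_selector="app=api-server").items
zones = Counter(
    zone_by_node.get(pod.spec.node_name, "unscheduled")
    for pod in pods
    if pod.status.phase == "Running"
)

print(dict(zones))  # e.g. {'us-east-1a': 3, 'us-east-1b': 3, 'us-east-1c': 3}
skew = max(zones.values()) - min(zones.values()) if zones else 0
print(f"max skew: {skew} (constraint allows 1)")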

2. Region Failure

Scenario: Entire AWS region becomes unavailable

Impact:

  • All services in region down
  • Regional data unavailable
  • Cross-region dependencies break

DR Strategy:

disaster: region_failure

architecture:
  primary_region: us-east-1
  dr_region: us-west-2

  configuration:
    data_replication: "Cross-region asynchronous"
    dns: "Route53 health checks + automatic failover"
    rto: "1 hour"
    rpo: "15 minutes"

recovery_steps:
  1. detection:
      - route53_health_checks_failing
      - cloudwatch_alarms_triggered
      - manual_verification

  2. failover_database:
      - promote_dr_region_replica_to_primary
      - verify_data_consistency
      - update_connection_strings

  3. failover_application:
      - update_route53_to_point_to_dr_region
      - scale_up_dr_instances
      - verify_service_health

  4. communication:
      - notify_stakeholders
      - update_status_page
      - monitor_recovery_progress

  5. post_recovery:
      - analyze_data_loss
      - review_recovery_time
      - document_learnings

Terraform Multi-Region Setup:

# Primary region
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

# DR region
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Application in primary region
module "app_primary" {
  source = "./modules/application"
  providers = {
    aws = aws.primary
  }

  environment = "production-primary"
  capacity    = "full"
}

# Application in DR region (reduced capacity)
module "app_dr" {
  source = "./modules/application"
  providers = {
    aws = aws.dr
  }

  environment = "production-dr"
  capacity    = "minimal"  # Scale up on disaster
}

# Cross-region database replication
resource "aws_rds_cluster" "primary" {
  provider                = aws.primary
  cluster_identifier      = "prod-db-primary"
  engine                  = "aurora-postgresql"
  database_name           = "production"
  master_username         = var.db_username
  master_password         = var.db_password

  # Enable backup for cross-region replication
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"
}

# DR region replica
resource "aws_rds_cluster" "dr" {
  provider               = aws.dr
  cluster_identifier     = "prod-db-dr"
  replication_source_arn = aws_rds_cluster.primary.arn
}

# DNS failover
resource "aws_route53_health_check" "primary" {
  fqdn              = module.app_primary.endpoint
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "api" {
  zone_id = var.route53_zone_id
  name    = "api.example.com"
  type    = "A"

  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = module.app_primary.load_balancer_dns
    zone_id                = module.app_primary.load_balancer_zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "api_failover" {
  zone_id = var.route53_zone_id
  name    = "api.example.com"
  type    = "A"

  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = module.app_dr.load_balancer_dns
    zone_id                = module.app_dr.load_balancer_zone_id
    evaluate_target_health = true
  }
}
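
The recovery steps above still call for promoting the DR replica, which is not something Route53 can do for you. A sketch of that single step with boto3 follows; the cluster identifier and region match the Terraform above, promotion uses the PromoteReadReplicaDBCluster API, and error handling plus the application and DNS steps are deliberately left out:

# Outline of the database half of a region failover: promote the Aurora replica in the
# DR region, then wait for it to become a writable primary.
import time
import boto3

DR_REGION = "us-west-2"
DR_CLUSTER_ID = "prod-db-dr"   # matches the Terraform resource above

rds = boto3.client("rds", region_name=DR_REGION)

# 1. Promote the cross-region replica cluster to a standalone, writable primary.
rds.promote_read_replica_db_cluster(DBClusterIdentifier=DR_CLUSTER_ID)

# 2. Poll until the cluster reports "available" again (promotion typically takes minutes).
while True:
    cluster = rds.describe_db_clusters(DBClusterIdentifier=DR_CLUSTER_ID)["DBClusters"][0]
    print(f"cluster status: {cluster['Status']}")
    if cluster["Status"] == "available":
        break
    time.sleep(30)

print("DR cluster promoted; proceed with application scale-up and DNS cutover.")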

3. Data Corruption / Ransomware

Scenario: Production database corrupted by bug or ransomware attack

Impact:

  • Data integrity compromised
  • Need point-in-time recovery
  • Potential data loss

DR Strategy:

disaster: data_corruption

detection:
  - automated_data_validation_checks
  - unusual_data_modification_patterns
  - ransomware_detection_alerts

recovery_steps:
  1. isolate:
      - stop_application_writes
      - identify_corruption_start_time
      - preserve_current_state_for_forensics

  2. assess:
      - determine_last_known_good_backup
      - calculate_data_loss_window
      - identify_affected_records

  3. restore:
      - restore_from_point_in_time_backup
      - verify_data_integrity
      - replay_valid_transactions_if_possible

  4. validate:
      - run_data_consistency_checks
      - compare_with_audit_logs
      - user_acceptance_testing

  5. resume:
      - gradual_traffic_increase
      - monitor_for_recurring_issues

backup_strategy:
  continuous_backup:
    enabled: true
    retention: "35 days"
    point_in_time_restore: true

  snapshot_backup:
    frequency: "Every 6 hours"
    retention: "90 days"

  immutable_backup:
    enabled: true
    description: "Ransomware protection"
    retention: "1 year"

PostgreSQL Point-in-Time Recovery:

#!/bin/bash
# Restore database to point in time

RESTORE_TIME="2025-10-16 14:00:00 UTC"
BACKUP_LOCATION="s3://backups/postgres/base-backup"
WAL_LOCATION="s3://backups/postgres/wal-archive"

# 1. Stop PostgreSQL
systemctl stop postgresql

# 2. Remove corrupted data
rm -rf /var/lib/postgresql/14/main/*

# 3. Restore base backup
aws s3 sync $BACKUP_LOCATION /var/lib/postgresql/14/main/

# 4. Configure point-in-time recovery
#    (PostgreSQL 12+ no longer reads recovery.conf; use postgresql.auto.conf
#     plus a recovery.signal file)
cat >> /var/lib/postgresql/14/main/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp $WAL_LOCATION/%f %p'
recovery_target_time = '$RESTORE_TIME'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/14/main/recovery.signal

# 5. Start PostgreSQL in recovery mode
systemctl start postgresql

# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-14-main.log

# 7. Verify data integrity
psql -U postgres -c "SELECT count(*) FROM critical_table;"

4. Accidental Deletion

Scenario: Engineer accidentally deletes production Kubernetes namespace

Impact:

  • All services in namespace destroyed
  • Persistent volumes deleted
  • ConfigMaps and secrets lost

DR Strategy:

disaster: accidental_deletion

prevention:
  - rbac_least_privilege
  - require_approval_for_prod_changes
  - backup_before_destructive_operations
  - terraform_prevent_destroy_flags

recovery_with_velero:
  backup_schedule: "Every 4 hours"
  retention: "30 days"
  include:
    - namespaces
    - persistent_volumes
    - cluster_resources

  restore_steps:
    1. list_available_backups
    2. select_most_recent_backup
    3. restore_namespace
    4. verify_resources
    5. test_application

Velero Backup and Restore:

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --backup-location-config region=us-east-1

# Create scheduled backup
velero schedule create production-backup \
  --schedule="0 */4 * * *" \
  --include-namespaces production \
  --ttl 720h

# Disaster happens: namespace deleted
kubectl delete namespace production  # Oh no!

# List available backups
velero backup get

# Restore from backup (give the restore an explicit name)
velero restore create production-restore --from-backup production-backup-20251016140000

# Monitor restore
velero restore describe production-restore

# Verify restoration
kubectl get all -n production
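
A backup schedule only helps if it keeps running. A small freshness check can alert when the newest completed backup is older than the 4-hour schedule allows; this sketch shells out to kubectl to read the Backup custom resources in the velero namespace, and the 5-hour threshold is an assumption (schedule interval plus grace):

# Alert if the newest completed Velero backup is older than the schedule allows.
import json
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=5)  # 4h schedule + 1h grace (assumption)

raw = subprocess.run(
    ["kubectl", "get", "backups.velero.io", "-n", "velero", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

completed = [
    b for b in json.loads(raw)["items"]
    if b.get("status", {}).get("phase") == "Completed"
]
if not completed:
    raise SystemExit("ALERT: no completed Velero backups found")

newest = max(
    datetime.fromisoformat(b["status"]["completionTimestamp"].replace("Z", "+00:00"))
    for b in completed
)
age = datetime.now(timezone.utc) - newest
print(f"newest completed backup is {age} old")
if age > MAX_AGE:
    raise SystemExit("ALERT: Velero backups are stale")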

Backup Strategies

3-2-1 Backup Rule

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         3-2-1 Backup Rule               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  3 copies of data:                      β”‚
β”‚    1. Production                        β”‚
β”‚    2. Local backup                      β”‚
β”‚    3. Offsite backup                    β”‚
β”‚                                         β”‚
β”‚  2 different media types:               β”‚
β”‚    - Disk (fast recovery)               β”‚
β”‚    - Cloud object storage (cheap)       β”‚
β”‚                                         β”‚
β”‚  1 offsite/offline copy:                β”‚
β”‚    - Different region                   β”‚
β”‚    - Air-gapped (ransomware protection)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Backup Types

1. Full Backup

# Complete copy of all data
pg_dump -h localhost -U postgres -F c -f /backups/full_$(date +%Y%m%d).dump production_db

# Pros: Simple restore, complete copy
# Cons: Large size, slow, expensive storage

2. Incremental Backup

# Only data changed since last backup (any type)
rsync -a --link-dest=/backups/previous /data /backups/$(date +%Y%m%d)

# Pros: Fast, small size
# Cons: Complex restore (need all incrementals)

3. Differential Backup

# Data changed since last full backup
# Restoring requires: Last full + last differential

# Pros: Faster than full, simpler restore than incremental
# Cons: Grows over time until next full backup

4. Continuous Backup (WAL/Binlog Shipping)

# PostgreSQL continuous archiving (postgresql.conf)
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'
wal_level = replica

# MySQL binary log shipping (my.cnf)
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 7   # MySQL 8.0: binlog_expire_logs_seconds = 604800

Benefits:

  • RPO: Seconds to minutes
  • Point-in-time recovery
  • Minimal data loss

Backup Validation

#!/bin/bash
# Automated backup validation script

BACKUP_FILE="/backups/production_$(date +%Y%m%d).dump"

# 1. Verify backup exists
if [ ! -f "$BACKUP_FILE" ]; then
    echo "ERROR: Backup file not found!"
    exit 1
fi

# 2. Verify backup is not corrupted
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
    echo "ERROR: Backup file is corrupted!"
    exit 1
fi

# 3. Test restore to a scratch database
dropdb --if-exists test_restore
createdb test_restore
pg_restore -d test_restore "$BACKUP_FILE"

# 4. Verify row counts match
PROD_COUNT=$(psql -t -c "SELECT count(*) FROM users" production_db)
TEST_COUNT=$(psql -t -c "SELECT count(*) FROM users" test_restore)

if [ "$PROD_COUNT" != "$TEST_COUNT" ]; then
    echo "ERROR: Row count mismatch!"
    exit 1
fi

# 5. Cleanup
dropdb test_restore

echo "βœ… Backup validated successfully"

DR Testing

Testing Frequency

dr_testing_schedule:
  desktop_walkthroughs:
    frequency: "Monthly"
    participants: ["SRE", "Platform"]
    duration: "1 hour"
    scope: "Review DR procedures"

  tabletop_exercises:
    frequency: "Quarterly"
    participants: ["SRE", "Engineering", "Management"]
    duration: "2-3 hours"
    scope: "Simulated disaster scenario"

  partial_failover_tests:
    frequency: "Bi-annually"
    participants: ["All teams"]
    duration: "4 hours"
    scope: "Failover single service"

  full_dr_drill:
    frequency: "Annually"
    participants: ["All teams", "Executive"]
    duration: "1 day"
    scope: "Complete region failover"

DR Drill Template

# DR Drill: Region Failover Test

**Date:** 2025-11-15
**Duration:** 4 hours (9 AM - 1 PM PST)
**Scenario:** AWS us-east-1 region complete failure

## Objectives
1. Validate RTO of 1 hour
2. Validate RPO of 15 minutes
3. Test cross-team communication
4. Identify gaps in runbooks

## Participants
- **Incident Commander:** Alice (SRE Lead)
- **SRE Team:** Bob, Carol, Dave
- **Platform Team:** Eve, Frank
- **Database Team:** Grace
- **Communications:** Henry (Engineering Manager)
- **Observers:** CTO, VP Engineering

## Pre-Drill Checklist
- [ ] DR environment provisioned
- [ ] Recent backups verified
- [ ] Runbooks reviewed
- [ ] Stakeholders notified
- [ ] Rollback plan documented

## Timeline

### 09:00 - Kick-off
- Review scenario
- Assign roles
- Establish communication channels

### 09:15 - Inject Failure
- Simulate us-east-1 outage
- Stop Route53 health checks
- Wait for detection

### 09:20 - Detection & Response
- Teams detect outage
- Incident commander coordinates response
- Follow DR runbook

### 10:15 - Expected: Services Recovered (RTO: 1 hour)
- Database promoted in us-west-2
- Application traffic switched
- Verify functionality

### 10:30 - Verification
- Check all services healthy
- Validate data consistency
- Measure actual RTO/RPO

### 11:00 - Recovery Complete
- Return to normal operations OR
- Continue running on DR site

### 11:30 - Hot Wash / Retrospective
- What went well?
- What went wrong?
- Action items

## Success Criteria
- [ ] RTO < 1 hour
- [ ] RPO < 15 minutes
- [ ] No data loss
- [ ] All critical services operational
- [ ] Runbook followed successfully
- [ ] Communication protocol effective

## Metrics to Capture
- Time to detect: _______
- Time to decide: _______
- Time to execute: _______
- Total RTO: _______
- Data loss (RPO): _______

## Post-Drill Actions
- [ ] Update runbooks based on learnings
- [ ] Address identified gaps
- [ ] Share report with stakeholders
- [ ] Schedule follow-up drill

Drill Results Analysis

dr_drill_results:
  drill_id: "DR-2025-11-15"
  scenario: "Region failure"

  targets:
    rto: "1 hour"
    rpo: "15 minutes"

  actuals:
    rto: "1 hour 23 minutes"  # ❌ Exceeded target
    rpo: "8 minutes"          # βœ… Met target
    data_loss: "0 records"    # βœ… No loss

  timeline:
    - time: "09:15"
      event: "Failure injected"
    - time: "09:18"
      event: "Monitoring alerts fired"
    - time: "09:25"
      event: "Incident declared"
    - time: "09:30"
      event: "Database failover initiated"
    - time: "09:45"
      event: "Database promoted"
    - time: "10:10"
      event: "Application traffic switched"
    - time: "10:38"
      event: "All services operational"

  what_went_well:
    - "Monitoring detected outage quickly (3 min)"
    - "Team communication was clear"
    - "No data loss"
    - "Runbook was accurate"

  what_went_wrong:
    - "Database failover took longer than expected (15 min vs expected 10 min)"
    - "DNS propagation delayed recovery (15 min)"
    - "One team member couldn't access DR console (permissions issue)"

  action_items:
    - id: "DR-001"
      description: "Optimize database failover automation"
      owner: "Grace (Database Team)"
      due_date: "2025-11-30"
      priority: "High"

    - id: "DR-002"
      description: "Pre-lower DNS TTL before drills"
      owner: "Bob (SRE)"
      due_date: "2025-11-22"
      priority: "Medium"

    - id: "DR-003"
      description: "Audit and fix DR environment permissions"
      owner: "Eve (Platform)"
      due_date: "2025-11-25"
      priority: "High"

  lessons_learned:
    - "RTO exceeded by 23 minutes due to manual steps"
    - "Need to automate database promotion"
    - "DNS TTL should be lowered before planned failovers"

DR Documentation

DR Runbook Template

# Disaster Recovery Runbook: Region Failover

**Last Updated:** 2025-10-16
**Owner:** SRE Team
**RTO:** 1 hour
**RPO:** 15 minutes

## Prerequisites
- [ ] Verify DR environment is healthy
- [ ] Recent backup available (< 4 hours old)
- [ ] Incident commander assigned
- [ ] Communication channels established

## Roles and Responsibilities

| Role | Person | Responsibilities |
|------|--------|------------------|
| Incident Commander | @alice | Coordinate recovery, make decisions |
| Database Lead | @grace | Database failover |
| Platform Lead | @eve | Infrastructure & DNS |
| SRE On-Call | @bob | Execute runbook steps |
| Communications | @henry | Stakeholder updates |

## Decision Tree

Disaster Detected
    β”‚
    β”œβ”€> Can primary region recover in < 1 hour?
    β”‚     β”œβ”€> YES: Wait and monitor
    β”‚     └─> NO: Proceed with failover
    β–Ό
    β”œβ”€> Is this a drill or real disaster?
    β”‚     β”œβ”€> DRILL: Notify stakeholders, proceed
    β”‚     └─> REAL: Declare incident, proceed
    β–Ό
    β”œβ”€> Data corruption or infrastructure failure?
          β”œβ”€> DATA: Point-in-time restore (see DR-002)
          └─> INFRA: Region failover (continue below)


## Failover Steps

### Phase 1: Preparation (5 minutes)

# 1. Verify DR site is healthy
./scripts/check-dr-health.sh

# Expected output:
# βœ… us-west-2 VPC reachable
# βœ… Database replica lag < 10s
# βœ… Application instances running
# βœ… Load balancer healthy

# 2. Create situation snapshot
./scripts/snapshot-current-state.sh > /tmp/pre-failover-state.json

# 3. Notify stakeholders
./scripts/send-notification.sh \
  --channel "#incidents" \
  --message "Initiating DR failover to us-west-2. ETA: 1 hour"

### Phase 2: Database Failover (15 minutes)

# 4. Promote DR database to primary
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-us-west-2

# 5. Wait for promotion (10-15 min)
aws rds wait db-instance-available \
  --db-instance-identifier prod-db-dr-us-west-2

# 6. Verify database is writable
psql -h prod-db-dr-us-west-2.amazonaws.com -U postgres -c \
  "CREATE TABLE dr_test (id int); DROP TABLE dr_test;"

# Expected: Table created and dropped successfully

# 7. Calculate actual RPO
./scripts/calculate-rpo.sh \
  --primary prod-db-primary-us-east-1 \
  --dr prod-db-dr-us-west-2

# Expected output:
# Replication lag at failure: 8 seconds
# Data loss: 0 transactions
# RPO: 8 seconds βœ… (target: 15 minutes)

### Phase 3: Application Failover (20 minutes)

# 8. Scale up DR application instances
terraform -chdir=terraform/dr-region apply \
  -var="dr_instance_count=20" \
  -var="dr_instance_type=m5.large"

# 9. Update application config to point to new database
kubectl set env deployment/api-server \
  -n production \
  DATABASE_HOST=prod-db-dr-us-west-2.amazonaws.com

# 10. Verify application health
kubectl get pods -n production
./scripts/smoke-test.sh --region us-west-2

# Expected: All pods running, smoke tests passing

### Phase 4: Traffic Cutover (15 minutes)

# 11. Lower DNS TTL (if not already low)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234 \
  --change-batch file://lower-ttl.json

# Wait 5 minutes for TTL to expire

# 12. Switch Route53 to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234 \
  --change-batch file://failover-to-dr.json

# 13. Monitor traffic shift
watch -n 5 './scripts/check-traffic-distribution.sh'

# Expected: Traffic shifting from 100% us-east-1 to 100% us-west-2

### Phase 5: Verification (10 minutes)

# 14. End-to-end smoke tests
./scripts/e2e-tests.sh --environment production-dr

# 15. Verify critical user journeys
./scripts/synthetic-tests.sh \
  --tests checkout,login,search \
  --region us-west-2

# 16. Check error rates and latency
open https://grafana.example.com/d/service-health

# Expected:
# - Error rate < 0.1%
# - P95 latency < 500ms
# - All services green

### Phase 6: Communication (Ongoing)

# 17. Update status page
./scripts/update-status.sh \
  --status "operational" \
  --message "Services restored in DR region"

# 18. Send update to stakeholders
./scripts/send-notification.sh \
  --channel "#incidents" \
  --message "βœ… Failover complete. Services operational in us-west-2. RTO: XX minutes"

## Rollback Plan

# If failover fails, rollback:

# 1. Revert DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234 \
  --change-batch file://rollback-to-primary.json

# 2. Stop writes to DR database
kubectl scale deployment/api-server --replicas=0 -n production

# 3. Investigate and retry

## Post-Recovery Tasks

  • Schedule postmortem within 24 hours
  • Analyze actual RTO/RPO
  • Review costs (DR environment scaled up)
  • Decide when to fail back to primary region
  • Update runbook with learnings

## Contact Information

  • Incident Commander: Alice - @alice (Slack), +1-555-0001
  • Database DRI: Grace - @grace (Slack), +1-555-0002
  • On-Call SRE: PagerDuty rotation
  • Emergency Escalation: VP Eng - +1-555-0099

Cost Optimization

DR Cost Calculator

def calculate_dr_costs(tier, monthly_primary_cost):
    """
    Estimate DR costs based on tier
    """
    cost_multipliers = {
        "backup_restore": 0.1,      # 10% of primary (storage only)
        "pilot_light": 0.2,         # 20% (minimal compute + storage)
        "warm_standby": 0.5,        # 50% (reduced capacity)
        "hot_site": 1.0,            # 100% (full duplicate)
        "active_active": 1.2,       # 120% (over-provisioned)
    }

    dr_cost = monthly_primary_cost * cost_multipliers[tier]

    return {
        "tier": tier,
        "primary_monthly_cost": monthly_primary_cost,
        "dr_monthly_cost": dr_cost,
        "total_monthly_cost": monthly_primary_cost + dr_cost,
        "overhead_percentage": (dr_cost / monthly_primary_cost) * 100
    }

# Example
primary_cost = 10000  # $10k/month

for tier in ["backup_restore", "pilot_light", "warm_standby", "hot_site"]:
    result = calculate_dr_costs(tier, primary_cost)
    print(f"{tier}: ${result['dr_monthly_cost']}/mo ({result['overhead_percentage']}% overhead)")

# Output:
# backup_restore: $1000/mo (10% overhead)
# pilot_light: $2000/mo (20% overhead)
# warm_standby: $5000/mo (50% overhead)
# hot_site: $10000/mo (100% overhead)

Cost-Saving Strategies

cost_optimization:
  # 1. Right-size DR capacity
  dr_scaling:
    normal: "20% of primary capacity"
    disaster: "Scale to 100% within 15 minutes"
    savings: "80% compute cost"

  # 2. Use cheaper storage tiers
  backup_storage:
    hot_backups: "S3 Standard (7 days)"
    warm_backups: "S3 IA (30 days)"
    cold_backups: "S3 Glacier (1 year)"
    savings: "60% storage cost"

  # 3. Leverage spot instances for DR
  compute:
    dr_instances: "Spot instances (non-critical)"
    primary_instances: "On-demand (critical)"
    savings: "70% DR compute cost"

  # 4. Schedule DR environment
  scheduled_shutdown:
    weekdays: "Keep DR minimal capacity"
    weekends: "Shut down non-essential DR services"
    savings: "30% overall"

Compliance and Auditing

Compliance Requirements

compliance_framework:
  soc2:
    requirements:
      - backup_encryption: "AES-256"
      - backup_retention: "Minimum 90 days"
      - dr_testing: "Annually"
      - access_controls: "Role-based"

  gdpr:
    requirements:
      - data_sovereignty: "EU data stays in EU"
      - right_to_deletion: "Backup retention <= policy"
      - breach_notification: "72 hours"

  pci_dss:
    requirements:
      - backup_encryption: "Required"
      - offsite_backup: "Required"
      - backup_testing: "Annually"
      - access_logging: "All backup access logged"

  hipaa:
    requirements:
      - backup_encryption: "At rest and in transit"
      - backup_access_controls: "Audit trail"
      - backup_retention: "6 years"
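
Parts of this matrix can be spot-checked automatically rather than by interview. A sketch using boto3 that verifies encryption at rest and automated backup retention on every RDS instance; the 7-day threshold matches the Terraform earlier in this document (it is an assumption, and long-term 90-day or 6-year retention would have to come from exported or copied snapshots, since RDS automated backups cap at 35 days):

# Spot-check two auditable controls on every RDS instance in a region:
# storage encryption and automated backup retention.
import boto3

MIN_RETENTION_DAYS = 7   # assumption; align with your policy

rds = boto3.client("rds", region_name="us-east-1")
failures = []

for db in rds.describe_db_instances()["DBInstances"]:
    name = db["DBInstanceIdentifier"]
    if not db.get("StorageEncrypted", False):
        failures.append(f"{name}: storage is not encrypted")
    if db.get("BackupRetentionPeriod", 0) < MIN_RETENTION_DAYS:
        failures.append(f"{name}: backup retention below {MIN_RETENTION_DAYS} days")

if failures:
    print("Audit findings:")
    for finding in failures:
        print(f"  - {finding}")
else:
    print("All RDS instances pass the encryption and retention checks")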

Audit Checklist

# Quarterly DR Audit

## Backups
- [ ] All critical systems have backups
- [ ] Backup frequency meets RPO
- [ ] Backups are encrypted
- [ ] Backups are immutable (ransomware protection)
- [ ] Offsite backups exist
- [ ] Backup restoration tested in last 90 days

## Documentation
- [ ] DR runbooks up to date
- [ ] RTO/RPO documented for all services
- [ ] Contact information current
- [ ] Roles and responsibilities assigned

## Testing
- [ ] Tabletop exercise in last quarter
- [ ] Backup restoration test in last quarter
- [ ] Full DR drill in last year

## Infrastructure
- [ ] DR environment exists and is accessible
- [ ] DR environment capacity is adequate
- [ ] DNS failover configured
- [ ] Database replication working

## Compliance
- [ ] Backup retention meets policy
- [ ] Access controls audited
- [ ] Encryption verified
- [ ] Logs collected and retained

Common Pitfalls

Pitfall 1: Untested Backups

Problem: “We have backups” but never tested restore Impact: Backups are corrupted, incomplete, or unrestorable Solution: Regular restore testing, automated validation

Pitfall 2: Stale Runbooks

Problem: Runbook written 2 years ago, infrastructure changed
Impact: Failover fails because steps are wrong
Solution: Update runbooks with each infrastructure change, test regularly

Pitfall 3: Insufficient RTO/RPO

Problem: Business expects 15-minute RTO, DR plan delivers 24 hours
Impact: Lost revenue, customer churn
Solution: Align DR tier with business requirements

Pitfall 4: Single Point of Failure

Problem: All backups in same region as primary
Impact: Regional disaster destroys primary AND backups
Solution: Offsite, multi-region backups

Pitfall 5: No DR for Stateful Components

Problem: Application has DR, database doesn’t
Impact: Can failover app, but no data
Solution: DR plan for entire stack, especially data

Conclusion

Disaster recovery is not optionalβ€”it’s a business requirement. Key takeaways:

  1. Define RTO/RPO: Understand business requirements
  2. Choose Right Tier: Balance cost and risk
  3. Test Regularly: Untested DR plans don’t work
  4. Automate: Manual recovery is slow and error-prone
  5. Document Everything: Runbooks save time during disasters
  6. Practice: DR drills build muscle memory
  7. Learn: Every drill improves the next one

Remember: “It’s not a question of if disaster will strike, but when. Be prepared.”

“The best DR plan is the one you’ve tested and know works. The worst is the one that looks good on paper but has never been validated.”