## Introduction
Disaster Recovery (DR) is the process, policies, and procedures for recovering and continuing technology infrastructure after a disaster. A disaster can be natural (earthquake, flood), technical (data center failure, ransomware), or human-caused (accidental deletion, security breach).
Core Principle: “Hope is not a strategy. Plan for failure before it happens.”
## Key Concepts
### RTO vs RPO
Time ─────────────────────────────────────────────────────────────►
         │                │                 │                 │
     Disaster         Detection         Recovery            Normal
      Occurs            Time             Begins           Operations
         │◄──────────── Recovery Time Objective (RTO) ──────────►│
◄────────┤
Data Loss
(Recovery Point
 Objective - RPO)
### Recovery Time Objective (RTO)
Definition: Maximum acceptable time that a system can be down after a disaster.
Examples:
- RTO = 0 minutes: High-availability systems (financial trading)
- RTO = 4 hours: Critical business applications
- RTO = 24 hours: Standard applications
- RTO = 7 days: Archival systems
Cost implications:
┌─────────────────────────────────────────────┐
│          RTO vs Cost Relationship           │
│                                             │
│  Cost                                       │
│   ▲                                         │
│   │╲                                        │
│   │ ╲                                       │
│   │  ╲                                      │
│   │   ╲                                     │
│   │    ╲____                                │
│   │         ╲________                       │
│   │                  ╲___________           │
│   └──────────────────────────────► RTO      │
│     0min    1hr    4hr    24hr    7d        │
└─────────────────────────────────────────────┘
### Recovery Point Objective (RPO)
Definition: Maximum acceptable amount of data loss measured in time.
Examples:
- RPO = 0 minutes: No data loss (synchronous replication)
- RPO = 15 minutes: Near-real-time backup
- RPO = 1 hour: Frequent backups
- RPO = 24 hours: Daily backups
Calculation:
# Data loss in a disaster
from datetime import datetime

def calculate_data_loss(rpo_minutes, disaster_time, last_backup_time):
    """
    Calculate potential data loss based on RPO
    """
    minutes_since_backup = (disaster_time - last_backup_time).total_seconds() / 60
    if minutes_since_backup <= rpo_minutes:
        return "Within RPO - Acceptable"
    return f"Data loss: {minutes_since_backup:.0f} minutes (exceeds RPO of {rpo_minutes} min)"

# Example
disaster_time = datetime(2025, 10, 16, 14, 30)  # 2:30 PM
last_backup = datetime(2025, 10, 16, 14, 0)     # 2:00 PM
rpo = 15  # minutes

result = calculate_data_loss(rpo, disaster_time, last_backup)
# Output: "Data loss: 30 minutes (exceeds RPO of 15 min)"
## DR Tiers
### Tier 0: No DR
- RTO: Days/weeks
- RPO: Days
- Strategy: Rebuild from scratch
- Cost: $
- Use case: Non-critical systems
### Tier 1: Backup and Restore
- RTO: 12-24 hours
- RPO: 1-24 hours
- Strategy: Regular backups to offsite storage
- Cost: $$
- Use case: Standard applications
backup_strategy:
frequency: "Daily at 2 AM"
retention: "30 days"
storage: "S3 Glacier"
restore_time: "4-8 hours"
### Tier 2: Pilot Light
- RTO: 1-4 hours
- RPO: Minutes to hours
- Strategy: Minimal infrastructure always running, scale up on disaster
- Cost: $$$
- Use case: Important applications
pilot_light:
always_running:
- database_replica
- minimal_app_servers
disaster_action:
- scale_up_app_servers
- update_dns
- activate_load_balancer
### Tier 3: Warm Standby
- RTO: Minutes to 1 hour
- RPO: Seconds to minutes
- Strategy: Scaled-down version always running
- Cost: $$$$
- Use case: Business-critical applications
warm_standby:
always_running:
- replicated_database
- reduced_capacity_app_servers
- load_balancer_configured
disaster_action:
- scale_up_to_full_capacity
- switch_dns_to_dr_site
### Tier 4: Hot Site / Multi-Site Active-Active
- RTO: 0 minutes (automatic failover)
- RPO: 0 seconds (no data loss)
- Strategy: Fully redundant infrastructure
- Cost: $$$$$
- Use case: Mission-critical systems
active_active:
configuration:
- multiple_regions
- synchronous_replication
- auto_failover
- global_load_balancing
disaster_action:
- automatic_traffic_rerouting
- no_manual_intervention
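Choosing a tier is mostly a matter of finding the cheapest option whose typical RTO/RPO still meets the business target. A minimal Python sketch of that mapping follows; the threshold values are illustrative figures taken from the ranges above, not authoritative numbers.

```python
# Sketch: pick the cheapest DR tier whose typical RTO/RPO satisfies the target.
# Thresholds are illustrative values taken from the tier descriptions above.
DR_TIERS = [
    # (name, typical worst-case RTO in minutes, typical worst-case RPO in minutes, relative cost)
    ("Tier 1: Backup and Restore",       24 * 60, 24 * 60, "$$"),
    ("Tier 2: Pilot Light",               4 * 60,      60, "$$$"),
    ("Tier 3: Warm Standby",                   60,       5, "$$$$"),
    ("Tier 4: Hot Site / Active-Active",        0,       0, "$$$$$"),
]

def select_tier(rto_target_minutes: float, rpo_target_minutes: float) -> str:
    """Return the cheapest tier whose typical RTO and RPO meet the targets."""
    for name, tier_rto, tier_rpo, cost in DR_TIERS:
        if tier_rto <= rto_target_minutes and tier_rpo <= rpo_target_minutes:
            return f"{name} (cost: {cost})"
    return DR_TIERS[-1][0]  # nothing cheaper fits: fall back to active-active

print(select_tier(rto_target_minutes=60, rpo_target_minutes=15))
# Tier 3: Warm Standby (cost: $$$$)
```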
## Disaster Scenarios
### 1. Data Center / Availability Zone Failure
Scenario: Entire AWS availability zone goes down
Impact:
- All services in that AZ unavailable
- Database read replicas lost
- Application instances terminated
DR Strategy:
# Multi-AZ deployment
disaster: az_failure
architecture:
deployment:
availability_zones: [us-east-1a, us-east-1b, us-east-1c]
min_healthy_azs: 2
application:
instances_per_az: 3
total_instances: 9
acceptable_loss: 33% # Can lose 1 AZ
database:
primary: us-east-1a
replicas:
- us-east-1b
- us-east-1c
auto_failover: true
recovery_steps:
1. detect:
- aws_health_dashboard_alerts
- instance_health_checks_failing
2. automatic_actions:
- remove_unhealthy_instances_from_lb
- scale_up_healthy_az_instances
- database_auto_failover_to_replica
3. manual_verification:
- check_service_availability
- verify_data_consistency
- monitor_error_rates
expected_rto: "5 minutes"
expected_rpo: "0 seconds (synchronous replication)"
Kubernetes Implementation:
# Multi-AZ pod distribution
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
replicas: 9
template:
spec:
# Spread pods across AZs
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
# Prefer different AZs
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: api-server
topologyKey: topology.kubernetes.io/zone
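The strategy above budgets an expected RTO of roughly five minutes for an AZ failure. One way to verify that during a game day is to measure downtime from the client's point of view. A hedged sketch follows; the health-check URL is a placeholder, and the polling interval bounds the measurement precision.

```python
# Sketch: measure observed downtime (client-side RTO) by polling a health endpoint.
# The endpoint URL is a placeholder; tune the interval/timeout for your service.
import time
import urllib.request

def measure_downtime(url: str, interval_s: float = 5.0, timeout_s: float = 2.0) -> float:
    """Block until the endpoint fails, then return seconds until it is healthy again."""
    def healthy() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status == 200
        except Exception:
            return False

    while healthy():            # wait for the outage to start
        time.sleep(interval_s)
    outage_start = time.monotonic()
    while not healthy():        # wait for recovery
        time.sleep(interval_s)
    return time.monotonic() - outage_start

# downtime = measure_downtime("https://api.example.com/health")
# print(f"Observed RTO: {downtime / 60:.1f} minutes")
```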
### 2. Region Failure
Scenario: Entire AWS region becomes unavailable
Impact:
- All services in region down
- Regional data unavailable
- Cross-region dependencies break
DR Strategy:
disaster: region_failure
architecture:
primary_region: us-east-1
dr_region: us-west-2
configuration:
data_replication: "Cross-region asynchronous"
dns: "Route53 health checks + automatic failover"
rto: "1 hour"
rpo: "15 minutes"
recovery_steps:
1. detection:
- route53_health_checks_failing
- cloudwatch_alarms_triggered
- manual_verification
2. failover_database:
- promote_dr_region_replica_to_primary
- verify_data_consistency
- update_connection_strings
3. failover_application:
- update_route53_to_point_to_dr_region
- scale_up_dr_instances
- verify_service_health
4. communication:
- notify_stakeholders
- update_status_page
- monitor_recovery_progress
5. post_recovery:
- analyze_data_loss
- review_recovery_time
- document_learnings
Terraform Multi-Region Setup:
# Primary region
provider "aws" {
alias = "primary"
region = "us-east-1"
}
# DR region
provider "aws" {
alias = "dr"
region = "us-west-2"
}
# Application in primary region
module "app_primary" {
source = "./modules/application"
providers = {
aws = aws.primary
}
environment = "production-primary"
capacity = "full"
}
# Application in DR region (reduced capacity)
module "app_dr" {
source = "./modules/application"
providers = {
aws = aws.dr
}
environment = "production-dr"
capacity = "minimal" # Scale up on disaster
}
# Cross-region database replication
resource "aws_rds_cluster" "primary" {
provider = aws.primary
cluster_identifier = "prod-db-primary"
engine = "aurora-postgresql"
database_name = "production"
master_username = var.db_username
master_password = var.db_password
# Enable backup for cross-region replication
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
}
# DR region replica (cross-region replica of the primary cluster)
resource "aws_rds_cluster" "dr" {
  provider               = aws.dr
  cluster_identifier     = "prod-db-dr"
  engine                 = "aurora-postgresql"  # must match the source cluster
  replication_source_arn = aws_rds_cluster.primary.arn
}
# DNS failover
resource "aws_route53_health_check" "primary" {
fqdn = module.app_primary.endpoint
type = "HTTPS"
port = 443
resource_path = "/health"
failure_threshold = 3
request_interval = 30
}
resource "aws_route53_record" "api" {
zone_id = var.route53_zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
alias {
name = module.app_primary.load_balancer_dns
zone_id = module.app_primary.load_balancer_zone_id
evaluate_target_health = true
}
health_check_id = aws_route53_health_check.primary.id
}
resource "aws_route53_record" "api_failover" {
zone_id = var.route53_zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
alias {
name = module.app_dr.load_balancer_dns
zone_id = module.app_dr.load_balancer_zone_id
evaluate_target_health = true
}
}
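The failover_database step above (promote_dr_region_replica_to_primary) is worth scripting rather than clicking through the console. A hedged boto3 sketch follows; the instance identifier is a placeholder, and this applies to RDS instance read replicas (Aurora clusters use different promotion APIs).

```python
# Sketch: promote the DR read replica during a region failover.
# The instance identifier is a placeholder; applies to RDS instance read replicas.
import boto3

def promote_dr_replica(replica_id: str, region: str = "us-west-2") -> str:
    rds = boto3.client("rds", region_name=region)

    # Kick off promotion; the replica becomes a standalone writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Block until the promoted instance is available again.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    info = rds.describe_db_instances(DBInstanceIdentifier=replica_id)
    return info["DBInstances"][0]["Endpoint"]["Address"]

# new_primary = promote_dr_replica("prod-db-dr")
# print(f"Update connection strings to point at {new_primary}")
```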
### 3. Data Corruption / Ransomware
Scenario: Production database corrupted by bug or ransomware attack
Impact:
- Data integrity compromised
- Need point-in-time recovery
- Potential data loss
DR Strategy:
disaster: data_corruption
detection:
- automated_data_validation_checks
- unusual_data_modification_patterns
- ransomware_detection_alerts
recovery_steps:
1. isolate:
- stop_application_writes
- identify_corruption_start_time
- preserve_current_state_for_forensics
2. assess:
- determine_last_known_good_backup
- calculate_data_loss_window
- identify_affected_records
3. restore:
- restore_from_point_in_time_backup
- verify_data_integrity
- replay_valid_transactions_if_possible
4. validate:
- run_data_consistency_checks
- compare_with_audit_logs
- user_acceptance_testing
5. resume:
- gradual_traffic_increase
- monitor_for_recurring_issues
backup_strategy:
continuous_backup:
enabled: true
retention: "35 days"
point_in_time_restore: true
snapshot_backup:
frequency: "Every 6 hours"
retention: "90 days"
immutable_backup:
enabled: true
description: "Ransomware protection"
retention: "1 year"
PostgreSQL Point-in-Time Recovery:
#!/bin/bash
# Restore database to point in time
RESTORE_TIME="2025-10-16 14:00:00 UTC"
BACKUP_LOCATION="s3://backups/postgres/base-backup"
WAL_LOCATION="s3://backups/postgres/wal-archive"
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Remove corrupted data
rm -rf /var/lib/postgresql/14/main/*
# 3. Restore base backup
aws s3 sync $BACKUP_LOCATION /var/lib/postgresql/14/main/
# 4. Configure the recovery target (PostgreSQL 12+ uses postgresql.auto.conf
#    plus a recovery.signal file; recovery.conf no longer exists)
cat >> /var/lib/postgresql/14/main/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp $WAL_LOCATION/%f %p'
recovery_target_time = '$RESTORE_TIME'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/14/main/recovery.signal
# 5. Start PostgreSQL in recovery mode
systemctl start postgresql
# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-14-main.log
# 7. Verify data integrity
psql -U postgres -c "SELECT count(*) FROM critical_table;"
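Step 4 of the recovery plan above (run_data_consistency_checks) is worth scripting so it is repeatable under pressure. A hedged Python sketch follows; the table list, DSN, and created_at column are illustrative assumptions.

```python
# Sketch: post-restore consistency check. Table names, the DSN, and the
# created_at column are illustrative assumptions.
from datetime import datetime
import psycopg2

CRITICAL_TABLES = ["users", "orders", "payments"]  # hypothetical tables

def validate_restore(dsn: str, recovery_target_time: str) -> bool:
    target = datetime.fromisoformat(recovery_target_time)
    ok = True
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in CRITICAL_TABLES:
            cur.execute(f"SELECT count(*), max(created_at) FROM {table}")
            count, newest = cur.fetchone()
            print(f"{table}: {count} rows, newest record {newest}")
            # Nothing should be newer than the recovery target.
            if newest is not None and newest.replace(tzinfo=None) > target:
                print(f"  WARNING: {table} has rows after the recovery target time")
                ok = False
    return ok

# validate_restore("dbname=production user=postgres", "2025-10-16 14:00:00")
```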
### 4. Accidental Deletion
Scenario: Engineer accidentally deletes production Kubernetes namespace
Impact:
- All services in namespace destroyed
- Persistent volumes deleted
- ConfigMaps and secrets lost
DR Strategy:
disaster: accidental_deletion
prevention:
- rbac_least_privilege
- require_approval_for_prod_changes
- backup_before_destructive_operations
- terraform_prevent_destroy_flags
recovery_with_velero:
backup_schedule: "Every 4 hours"
retention: "30 days"
include:
- namespaces
- persistent_volumes
- cluster_resources
restore_steps:
1. list_available_backups
2. select_most_recent_backup
3. restore_namespace
4. verify_resources
5. test_application
Velero Backup and Restore:
# Install Velero
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--backup-location-config region=us-east-1
# Create scheduled backup
velero schedule create production-backup \
--schedule="0 */4 * * *" \
--include-namespaces production \
--ttl 720h
# Disaster happens: namespace deleted
kubectl delete namespace production # Oh no!
# List available backups
velero backup get
# Restore from backup (name the restore explicitly so it is easy to track)
velero restore create production-restore-20251016 --from-backup production-backup-20251016140000
# Monitor restore
velero restore describe production-restore-20251016
# Verify restoration
kubectl get all -n production
## Backup Strategies
### 3-2-1 Backup Rule
┌───────────────────────────────────────────┐
│             3-2-1 Backup Rule             │
├───────────────────────────────────────────┤
│  3 copies of data:                        │
│    1. Production                          │
│    2. Local backup                        │
│    3. Offsite backup                      │
│                                           │
│  2 different media types:                 │
│    - Disk (fast recovery)                 │
│    - Cloud object storage (cheap)         │
│                                           │
│  1 offsite/offline copy:                  │
│    - Different region                     │
│    - Air-gapped (ransomware protection)   │
└───────────────────────────────────────────┘
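The rule is easy to state and easy to drift away from, so it is worth checking mechanically. A minimal sketch follows; the inventory format is an assumption made for illustration.

```python
# Sketch: check a backup inventory against the 3-2-1 rule.
# The inventory format is an assumption made for illustration.
def check_321(copies: list[dict]) -> dict:
    """Each copy: {"location": ..., "media": "disk"|"object_storage"|"tape", "offsite": bool}."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media_types": len({c["media"] for c in copies}) >= 2,
        "1_offsite": any(c["offsite"] for c in copies),
    }

inventory = [
    {"location": "primary db (us-east-1)", "media": "disk",           "offsite": False},
    {"location": "local snapshots",        "media": "disk",           "offsite": False},
    {"location": "S3 Glacier (us-west-2)", "media": "object_storage", "offsite": True},
]
print(check_321(inventory))
# {'3_copies': True, '2_media_types': True, '1_offsite': True}
```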
### Backup Types
1. Full Backup
# Complete copy of all data
pg_dump -h localhost -U postgres -F c -f /backups/full_$(date +%Y%m%d).dump production_db
# Pros: Simple restore, complete copy
# Cons: Large size, slow, expensive storage
2. Incremental Backup
# Only data changed since last backup (any type)
rsync -a --link-dest=/backups/previous /data /backups/$(date +%Y%m%d)
# Pros: Fast, small size
# Cons: Complex restore (need all incrementals)
3. Differential Backup
# Data changed since last full backup
# Restoring requires: Last full + last differential
# Pros: Faster than full, simpler restore than incremental
# Cons: Grows over time until next full backup
4. Continuous Backup (WAL/Binlog Shipping)
# PostgreSQL continuous archiving (postgresql.conf)
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'
wal_level = replica
# MySQL binary log shipping (my.cnf)
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 7
Benefits:
- RPO: Seconds to minutes
- Point-in-time recovery
- Minimal data loss
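The practical difference between these backup types is how much you store and how many pieces a restore has to stitch together. A back-of-the-envelope sketch follows; the data size and daily change rate are illustrative assumptions.

```python
# Sketch: storage footprint and restore-chain length for one week of backups.
# Data size and daily change rate are illustrative assumptions.
FULL_SIZE_GB = 500      # size of one full backup
DAILY_CHANGE_GB = 25    # data changed per day
DAYS = 7

def weekly_storage_and_chain(strategy: str):
    if strategy == "full":          # a full copy every day
        return FULL_SIZE_GB * DAYS, 1
    if strategy == "incremental":   # full on day 1, then changes since the previous backup
        return FULL_SIZE_GB + DAILY_CHANGE_GB * (DAYS - 1), DAYS
    if strategy == "differential":  # full on day 1, then changes since that full
        return FULL_SIZE_GB + sum(DAILY_CHANGE_GB * d for d in range(1, DAYS)), 2
    raise ValueError(strategy)

for s in ("full", "incremental", "differential"):
    gb, chain = weekly_storage_and_chain(s)
    print(f"{s:>12}: {gb:>5} GB stored, worst-case restore needs {chain} backup file(s)")
```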
### Backup Validation
#!/bin/bash
# Automated backup validation script
BACKUP_FILE="/backups/production_$(date +%Y%m%d).dump"
# 1. Verify backup exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "ERROR: Backup file not found!"
exit 1
fi
# 2. Verify backup is not corrupted
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  echo "ERROR: Backup file is corrupted!"
  exit 1
fi
# 3. Test restore to staging database
dropdb --if-exists test_restore
createdb test_restore
pg_restore -d test_restore "$BACKUP_FILE"
# 4. Verify row counts match
PROD_COUNT=$(psql -t -c "SELECT count(*) FROM users" production_db)
TEST_COUNT=$(psql -t -c "SELECT count(*) FROM users" test_restore)
if [ "$PROD_COUNT" != "$TEST_COUNT" ]; then
echo "ERROR: Row count mismatch!"
exit 1
fi
# 5. Cleanup
dropdb test_restore
echo "β
Backup validated successfully"
## DR Testing
### Testing Frequency
dr_testing_schedule:
desktop_walkthroughs:
frequency: "Monthly"
participants: ["SRE", "Platform"]
duration: "1 hour"
scope: "Review DR procedures"
tabletop_exercises:
frequency: "Quarterly"
participants: ["SRE", "Engineering", "Management"]
duration: "2-3 hours"
scope: "Simulated disaster scenario"
partial_failover_tests:
frequency: "Bi-annually"
participants: ["All teams"]
duration: "4 hours"
scope: "Failover single service"
full_dr_drill:
frequency: "Annually"
participants: ["All teams", "Executive"]
duration: "1 day"
scope: "Complete region failover"
### DR Drill Template
# DR Drill: Region Failover Test
**Date:** 2025-11-15
**Duration:** 4 hours (9 AM - 1 PM PST)
**Scenario:** AWS us-east-1 region complete failure
## Objectives
1. Validate RTO of 1 hour
2. Validate RPO of 15 minutes
3. Test cross-team communication
4. Identify gaps in runbooks
## Participants
- **Incident Commander:** Alice (SRE Lead)
- **SRE Team:** Bob, Carol, Dave
- **Platform Team:** Eve, Frank
- **Database Team:** Grace
- **Communications:** Henry (Engineering Manager)
- **Observers:** CTO, VP Engineering
## Pre-Drill Checklist
- [ ] DR environment provisioned
- [ ] Recent backups verified
- [ ] Runbooks reviewed
- [ ] Stakeholders notified
- [ ] Rollback plan documented
## Timeline
### 09:00 - Kick-off
- Review scenario
- Assign roles
- Establish communication channels
### 09:15 - Inject Failure
- Simulate us-east-1 outage
- Stop Route53 health checks
- Wait for detection
### 09:20 - Detection & Response
- Teams detect outage
- Incident commander coordinates response
- Follow DR runbook
### 10:15 - Expected: Services Recovered (RTO: 1 hour)
- Database promoted in us-west-2
- Application traffic switched
- Verify functionality
### 10:30 - Verification
- Check all services healthy
- Validate data consistency
- Measure actual RTO/RPO
### 11:00 - Recovery Complete
- Return to normal operations OR
- Continue running on DR site
### 11:30 - Hot Wash / Retrospective
- What went well?
- What went wrong?
- Action items
## Success Criteria
- [ ] RTO < 1 hour
- [ ] RPO < 15 minutes
- [ ] No data loss
- [ ] All critical services operational
- [ ] Runbook followed successfully
- [ ] Communication protocol effective
## Metrics to Capture
- Time to detect: _______
- Time to decide: _______
- Time to execute: _______
- Total RTO: _______
- Data loss (RPO): _______
## Post-Drill Actions
- [ ] Update runbooks based on learnings
- [ ] Address identified gaps
- [ ] Share report with stakeholders
- [ ] Schedule follow-up drill
### Drill Results Analysis
dr_drill_results:
drill_id: "DR-2025-11-15"
scenario: "Region failure"
targets:
rto: "1 hour"
rpo: "15 minutes"
actuals:
rto: "1 hour 23 minutes" # β Exceeded target
rpo: "8 minutes" # β
Met target
data_loss: "0 records" # β
No loss
timeline:
- time: "09:15"
event: "Failure injected"
- time: "09:18"
event: "Monitoring alerts fired"
- time: "09:25"
event: "Incident declared"
- time: "09:30"
event: "Database failover initiated"
- time: "09:45"
event: "Database promoted"
- time: "10:10"
event: "Application traffic switched"
- time: "10:38"
event: "All services operational"
what_went_well:
- "Monitoring detected outage quickly (3 min)"
- "Team communication was clear"
- "No data loss"
- "Runbook was accurate"
what_went_wrong:
- "Database failover took longer than expected (15 min vs expected 10 min)"
- "DNS propagation delayed recovery (15 min)"
- "One team member couldn't access DR console (permissions issue)"
action_items:
- id: "DR-001"
description: "Optimize database failover automation"
owner: "Grace (Database Team)"
due_date: "2025-11-30"
priority: "High"
- id: "DR-002"
description: "Pre-lower DNS TTL before drills"
owner: "Bob (SRE)"
due_date: "2025-11-22"
priority: "Medium"
- id: "DR-003"
description: "Audit and fix DR environment permissions"
owner: "Eve (Platform)"
due_date: "2025-11-25"
priority: "High"
lessons_learned:
- "RTO exceeded by 23 minutes due to manual steps"
- "Need to automate database promotion"
- "DNS TTL should be lowered before planned failovers"
## DR Documentation
### DR Runbook Template
# Disaster Recovery Runbook: Region Failover
**Last Updated:** 2025-10-16
**Owner:** SRE Team
**RTO:** 1 hour
**RPO:** 15 minutes
## Prerequisites
- [ ] Verify DR environment is healthy
- [ ] Recent backup available (< 4 hours old)
- [ ] Incident commander assigned
- [ ] Communication channels established
## Roles and Responsibilities
| Role | Person | Responsibilities |
|------|--------|------------------|
| Incident Commander | @alice | Coordinate recovery, make decisions |
| Database Lead | @grace | Database failover |
| Platform Lead | @eve | Infrastructure & DNS |
| SRE On-Call | @bob | Execute runbook steps |
| Communications | @henry | Stakeholder updates |
## Decision Tree
Disaster Detected
  │
  ├─► Can primary region recover in < 1 hour?
  │     ├─► YES: Wait and monitor
  │     └─► NO: Proceed with failover
  ▼
  ├─► Is this a drill or a real disaster?
  │     ├─► DRILL: Notify stakeholders, proceed
  │     └─► REAL: Declare incident, proceed
  ▼
  └─► Data corruption or infrastructure failure?
        ├─► DATA: Point-in-time restore (see DR-001)
        └─► INFRA: Region failover (continue below)
## Failover Steps
### Phase 1: Preparation (5 minutes)
```bash
# 1. Verify DR site is healthy
./scripts/check-dr-health.sh
# Expected output:
# ✅ us-west-2 VPC reachable
# ✅ Database replica lag < 10s
# ✅ Application instances running
# ✅ Load balancer healthy
# 2. Create situation snapshot
./scripts/snapshot-current-state.sh > /tmp/pre-failover-state.json
# 3. Notify stakeholders
./scripts/send-notification.sh \
--channel "#incidents" \
--message "Initiating DR failover to us-west-2. ETA: 1 hour"
### Phase 2: Database Failover (15 minutes)
# 4. Promote DR database to primary
aws rds promote-read-replica \
--db-instance-identifier prod-db-dr-us-west-2
# 5. Wait for promotion (10-15 min)
aws rds wait db-instance-available \
--db-instance-identifier prod-db-dr-us-west-2
# 6. Verify database is writable
psql -h prod-db-dr-us-west-2.amazonaws.com -U postgres -c \
"CREATE TABLE dr_test (id int); DROP TABLE dr_test;"
# Expected: Table created and dropped successfully
# 7. Calculate actual RPO
./scripts/calculate-rpo.sh \
--primary prod-db-primary-us-east-1 \
--dr prod-db-dr-us-west-2
# Expected output:
# Replication lag at failure: 8 seconds
# Data loss: 0 transactions
# RPO: 8 seconds ✅ (target: 15 minutes)
### Phase 3: Application Failover (20 minutes)
# 8. Scale up DR application instances
terraform apply \
-var="dr_instance_count=20" \
-var="dr_instance_type=m5.large" \
terraform/dr-region/
# 9. Update application config to point to new database
kubectl set env deployment/api-server \
-n production \
DATABASE_HOST=prod-db-dr-us-west-2.amazonaws.com
# 10. Verify application health
kubectl get pods -n production
./scripts/smoke-test.sh --region us-west-2
# Expected: All pods running, smoke tests passing
### Phase 4: Traffic Cutover (15 minutes)
# 11. Lower DNS TTL (if not already low)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234 \
--change-batch file://lower-ttl.json
# Wait 5 minutes for TTL to expire
# 12. Switch Route53 to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234 \
--change-batch file://failover-to-dr.json
# 13. Monitor traffic shift
watch -n 5 './scripts/check-traffic-distribution.sh'
# Expected: Traffic shifting from 100% us-east-1 to 100% us-west-2
### Phase 5: Verification (10 minutes)
# 14. End-to-end smoke tests
./scripts/e2e-tests.sh --environment production-dr
# 15. Verify critical user journeys
./scripts/synthetic-tests.sh \
--tests checkout,login,search \
--region us-west-2
# 16. Check error rates and latency
open https://grafana.example.com/d/service-health
# Expected:
# - Error rate < 0.1%
# - P95 latency < 500ms
# - All services green
### Phase 6: Communication (Ongoing)
# 17. Update status page
./scripts/update-status.sh \
--status "operational" \
--message "Services restored in DR region"
# 18. Send update to stakeholders
./scripts/send-notification.sh \
--channel "#incidents" \
--message "✅ Failover complete. Services operational in us-west-2. RTO: XX minutes"
## Rollback Plan
# If failover fails, rollback:
# 1. Revert DNS
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234 \
--change-batch file://rollback-to-primary.json
# 2. Stop writes to DR database
kubectl scale deployment/api-server --replicas=0 -n production
# 3. Investigate and retry
## Post-Recovery Tasks
- Schedule postmortem within 24 hours
- Analyze actual RTO/RPO
- Review costs (DR environment scaled up)
- Decide when to fail back to primary region
- Update runbook with learnings
## Contact Information
- Incident Commander: Alice - @alice (Slack), +1-555-0001
- Database DRI: Grace - @grace (Slack), +1-555-0002
- On-Call SRE: PagerDuty rotation
- Emergency Escalation: VP Eng - +1-555-0099
## Related Runbooks
- DR-001: Database Point-in-Time Restore
- DR-002: Kubernetes Namespace Recovery
- DR-003: Fail Back to Primary Region
## Cost Optimization
### DR Cost Calculator
```python
def calculate_dr_costs(tier, monthly_primary_cost):
"""
Estimate DR costs based on tier
"""
cost_multipliers = {
"backup_restore": 0.1, # 10% of primary (storage only)
"pilot_light": 0.2, # 20% (minimal compute + storage)
"warm_standby": 0.5, # 50% (reduced capacity)
"hot_site": 1.0, # 100% (full duplicate)
"active_active": 1.2, # 120% (over-provisioned)
}
dr_cost = monthly_primary_cost * cost_multipliers[tier]
return {
"tier": tier,
"primary_monthly_cost": monthly_primary_cost,
"dr_monthly_cost": dr_cost,
"total_monthly_cost": monthly_primary_cost + dr_cost,
"overhead_percentage": (dr_cost / monthly_primary_cost) * 100
}
# Example
primary_cost = 10000 # $10k/month
for tier in ["backup_restore", "pilot_light", "warm_standby", "hot_site"]:
    result = calculate_dr_costs(tier, primary_cost)
    print(f"{tier}: ${result['dr_monthly_cost']:.0f}/mo ({result['overhead_percentage']:.0f}% overhead)")
# Output:
# backup_restore: $1000/mo (10% overhead)
# pilot_light: $2000/mo (20% overhead)
# warm_standby: $5000/mo (50% overhead)
# hot_site: $10000/mo (100% overhead)
### Cost-Saving Strategies
cost_optimization:
# 1. Right-size DR capacity
dr_scaling:
normal: "20% of primary capacity"
disaster: "Scale to 100% within 15 minutes"
savings: "80% compute cost"
# 2. Use cheaper storage tiers
backup_storage:
hot_backups: "S3 Standard (7 days)"
warm_backups: "S3 IA (30 days)"
cold_backups: "S3 Glacier (1 year)"
savings: "60% storage cost"
# 3. Leverage spot instances for DR
compute:
dr_instances: "Spot instances (non-critical)"
primary_instances: "On-demand (critical)"
savings: "70% DR compute cost"
# 4. Schedule DR environment
scheduled_shutdown:
weekdays: "Keep DR minimal capacity"
weekends: "Shut down non-essential DR services"
savings: "30% overall"
## Compliance and Auditing
### Compliance Requirements
compliance_framework:
soc2:
requirements:
- backup_encryption: "AES-256"
- backup_retention: "Minimum 90 days"
- dr_testing: "Annually"
- access_controls: "Role-based"
gdpr:
requirements:
- data_sovereignty: "EU data stays in EU"
- right_to_deletion: "Backup retention <= policy"
- breach_notification: "72 hours"
pci_dss:
requirements:
- backup_encryption: "Required"
- offsite_backup: "Required"
- backup_testing: "Annually"
- access_logging: "All backup access logged"
hipaa:
requirements:
- backup_encryption: "At rest and in transit"
- backup_access_controls: "Audit trail"
- backup_retention: "6 years"
### Audit Checklist
# Quarterly DR Audit
## Backups
- [ ] All critical systems have backups
- [ ] Backup frequency meets RPO
- [ ] Backups are encrypted
- [ ] Backups are immutable (ransomware protection)
- [ ] Offsite backups exist
- [ ] Backup restoration tested in last 90 days
## Documentation
- [ ] DR runbooks up to date
- [ ] RTO/RPO documented for all services
- [ ] Contact information current
- [ ] Roles and responsibilities assigned
## Testing
- [ ] Tabletop exercise in last quarter
- [ ] Backup restoration test in last quarter
- [ ] Full DR drill in last year
## Infrastructure
- [ ] DR environment exists and is accessible
- [ ] DR environment capacity is adequate
- [ ] DNS failover configured
- [ ] Database replication working
## Compliance
- [ ] Backup retention meets policy
- [ ] Access controls audited
- [ ] Encryption verified
- [ ] Logs collected and retained
## Common Pitfalls
### Pitfall 1: Untested Backups
Problem: “We have backups,” but a restore has never been tested.
Impact: Backups turn out to be corrupted, incomplete, or unrestorable.
Solution: Regular restore testing and automated validation.
### Pitfall 2: Stale Runbooks
Problem: The runbook was written two years ago and the infrastructure has since changed.
Impact: Failover fails because the steps are wrong.
Solution: Update runbooks with every infrastructure change and test them regularly.
### Pitfall 3: Insufficient RTO/RPO
Problem: The business expects a 15-minute RTO, but the DR plan delivers 24 hours.
Impact: Lost revenue and customer churn.
Solution: Align the DR tier with business requirements.
### Pitfall 4: Single Point of Failure
Problem: All backups live in the same region as the primary.
Impact: A regional disaster destroys the primary AND the backups.
Solution: Offsite, multi-region backups.
### Pitfall 5: No DR for Stateful Components
Problem: The application has DR, but the database doesn’t.
Impact: You can fail over the app, but there is no data.
Solution: A DR plan for the entire stack, especially the data.
## Conclusion
Disaster recovery is not optional; it is a business requirement. Key takeaways:
- Define RTO/RPO: Understand business requirements
- Choose Right Tier: Balance cost and risk
- Test Regularly: Untested DR plans don’t work
- Automate: Manual recovery is slow and error-prone
- Document Everything: Runbooks save time during disasters
- Practice: DR drills build muscle memory
- Learn: Every drill improves the next one
Remember: “It’s not a question of if disaster will strike, but when. Be prepared.”
“The best DR plan is the one you’ve tested and know works. The worst is the one that looks good on paper but has never been validated.”