SRE Practice
Disaster Recovery Planning: RTO, RPO, and Building Resilient Systems
Introduction
Disaster Recovery (DR) is the set of processes, policies, and procedures for recovering and continuing technology infrastructure after a disaster. A disaster can be natural (earthquake, flood), technical (data center failure, ransomware), or human-caused (accidental deletion, security breach).
Core Principle: "Hope is not a strategy. Plan for failure before it happens."
Key Concepts
RTO vs RPO
[Timeline diagram: Disaster Occurs, Detection Time, Recovery Begins, Normal Operations, annotated with the Recovery Time Objective (RTO) and Data Loss (Recovery Point Objective, RPO)]
Recovery Time Objective (RTO)
Definition: Maximum acceptable time that a system can be down after a disaster.
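As a rough illustration of how these two objectives can be checked against an incident timeline, here is a minimal Python sketch. It assumes RPO is measured as the time between the last good backup and the disaster (the usual reading of "data loss" on the timeline), and RTO as the time from the disaster until normal operations resume. The function names and the sample timestamps are hypothetical and do not come from the article.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """Return True if the data-loss window (disaster minus last good backup) fits the RPO."""
    return disaster - last_backup <= rpo


def meets_rto(disaster: datetime, restored: datetime, rto: timedelta) -> bool:
    """Return True if the downtime (restore time minus disaster time) fits the RTO."""
    return restored - disaster <= rto


# Hypothetical incident: backup at midnight, disaster at 03:00, service back at 05:30.
last_backup = datetime(2025, 10, 16, 0, 0)
disaster = datetime(2025, 10, 16, 3, 0)
restored = datetime(2025, 10, 16, 5, 30)

# 3 hours of lost data against a 4-hour RPO: within the objective.
print(meets_rpo(last_backup, disaster, timedelta(hours=4)))
# 2.5 hours of downtime against a 2-hour RTO: objective missed.
print(meets_rto(disaster, restored, timedelta(hours=2)))
```

In this made-up incident the backup schedule satisfies the RPO, but the recovery took too long for the RTO, which shows that the two objectives have to be evaluated separately.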
…
October 16, 2025 · 18 min · DevOps Engineer
Linux Storage: Partitions, LVM, ZFS/Btrfs, Filesystems, Snapshots, and Backups
Executive Summary
A storage strategy means reliable data access with recovery guarantees. Choose one based on your workload (traditional vs. modern, single-server vs. multi-server).
Why storage matters: Most production disasters involve storage: a disk fills up and crashes your application, database corruption loses customer data, or backups fail during a restore attempt. Proper storage management prevents these scenarios.
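The disk-full scenario is usually caught by a simple usage check long before the application crashes. The following is a minimal Python sketch of such a check using the standard library's `shutil.disk_usage`. The helper names and the 85% threshold are illustrative assumptions, not values from the article, and the percentage can differ slightly from what `df` reports because of filesystem-reserved blocks.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str, threshold: float = 85.0) -> bool:
    """Return True if usage is at or above the (assumed) alert threshold."""
    return usage_percent(path) >= threshold


# Check the filesystem that holds the current directory; on a server this
# would typically be a mount point such as /var/log.
mount = "."
status = "ALERT" if over_threshold(mount) else "ok"
print(f"{mount}: {usage_percent(mount):.1f}% used [{status}]")
```

A check like this would normally run from a scheduler or monitoring agent and feed an alerting system rather than print to the console.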
Real-world disasters prevented by good storage:
Disk full at 3 AM: Application logs fill /var → server crashes → customers can't access site. Solution: Separate /var/log partition with monitoring.
Failed database restore: Backup runs nightly for 2 years, never tested. Server dies, restore fails (corrupt backup). Solution: Monthly test restores.
Database grows unexpectedly: 50GB database grows to 500GB in 3 months. Traditional partitions = downtime to resize. Solution: LVM allows online growth.
Ransomware encrypts production data: No snapshots available, last backup is 24 hours old. Solution: ZFS/Btrfs snapshots provide instant point-in-time recovery.
This guide covers:
…
October 16, 2025 · 18 min · DevOps Engineer