Executive Summary
Storage strategy = reliable data access with recovery guarantees. Choose based on workload (traditional vs. modern, single vs. multi-server).
Why storage matters: Most production disasters involve storage—disk fills up and crashes your application, a database corruption loses customer data, or backups fail during a restore attempt. Proper storage management prevents these scenarios.
Real-world disasters prevented by good storage:
- Disk full at 3 AM: Application logs fill /var → server crashes → customers can’t access site. Solution: Separate /var/log partition with monitoring.
- Failed database restore: Backup runs nightly for 2 years, never tested. Server dies, restore fails (corrupt backup). Solution: Monthly test restores.
- Database grows unexpectedly: 50GB database grows to 500GB in 3 months. Traditional partitions = downtime to resize. Solution: LVM allows online growth.
- Ransomware encrypts production data: No snapshots available, last backup is 24 hours old. Solution: ZFS/Btrfs snapshots provide instant point-in-time recovery.
This guide covers:
- Partitioning: LVM (flexible, standard) vs. ZFS/Btrfs (modern, copy-on-write)
- Filesystems: ext4 (safe default) vs. XFS (high-performance)
- Mount Options: Performance tuning (noatime, discard, nobarrier)
- Snapshots: Point-in-time copies (instant restore)
- Backups: Database-specific + filesystem strategies
- Test Restores: Verify backups work (critical!)
1. Storage Architecture: Partitioning & Volume Management
Comparison Table
| Feature | Traditional (fdisk/MBR) | LVM (Logical) | ZFS/Btrfs (Modern CoW) |
|---|---|---|---|
| Flexibility | Fixed partitions | Add disk → grow volume | Add disk → grow pool |
| Snapshots | Manual (image) | LVM snapshots (thin) | Native (instant) |
| Redundancy | External RAID | External RAID | Built-in RAID (data+metadata) |
| Complexity | Low | Medium | High |
| Overhead | None | Small | 5-10% |
| Use case | Servers (single disk) | Servers (growing storage) | Big data, NAS, databases |
| Learning curve | Easy | Medium | Steep |
Recommendation
- Single-disk servers: Traditional partitions (simple)
- Growing storage: LVM (flexible, proven)
- Mission-critical: ZFS (redundancy built-in, data protection)
- High-performance: Btrfs (modern, Linux-native, RAID)
2. Partitioning: LVM (Logical Volume Management)
Why LVM
Problem: Disk full → can’t add more space without downtime.
Solution: LVM abstracts disks → logical volumes → grow on-the-fly.
Benefit: Non-disruptive capacity expansion.
Real-world scenario - What happens WITHOUT LVM:
Timeline of a traditional partition disaster:
Day 1 (Setup): You provision a server with a 100GB disk, create traditional partitions:
/dev/sda1 → / (50GB)
/dev/sda2 → /var (50GB)
Month 3: Your database in /var/lib/postgresql grows from 10GB to 45GB. Disk at 90% full.
Month 4 (Crisis): Database hits 50GB. Disk 100% full. Database crashes. Application down. Customers affected.
Your options (all bad):
- Delete data - Risky, might break application
- Add new disk + migrate - Requires downtime:
  - Stop database (downtime starts)
  - Mount new 100GB disk at /mnt/new
  - Copy 50GB of data: rsync -av /var /mnt/new (takes 30+ minutes)
  - Update /etc/fstab to mount the new disk at /var
  - Reboot
  - Pray it works
  - Total downtime: 1-2 hours minimum
With LVM - Same scenario:
Month 4 (Crisis): Database hits limit. Disk 90% full.
Your action (5 minutes, zero downtime):
# Add new 100GB disk
sudo pvcreate /dev/sdb
sudo vgextend vg0 /dev/sdb
# Grow /var volume by 50GB (online!)
sudo lvextend -L +50G /dev/vg0/lv_var
sudo resize2fs /dev/vg0/lv_var
# Done - no reboot, no downtime
Total downtime: 0 seconds
This is why every production server should use LVM.
LVM Concepts
Physical Volume (PV): /dev/sda, /dev/sdb (physical disks)
↓
Volume Group (VG): vg0 (pool)
↓
Logical Volume (LV): lv_root, lv_var, lv_db (partitions)
↓
Filesystem: ext4, xfs (mount /data, etc.)
Create LVM Stack
Step 1: Create Physical Volume
# Prepare disk (wipe if needed)
sudo wipefs -a /dev/sdb
sudo parted -s /dev/sdb mklabel gpt
sudo parted -s /dev/sdb mkpart primary 0% 100%
# Create PV
sudo pvcreate /dev/sdb1
sudo pvdisplay # Verify
Step 2: Create Volume Group
# Create VG "vg0" with /dev/sdb1
sudo vgcreate vg0 /dev/sdb1
# Verify
sudo vgdisplay vg0
# Output: Total PE: 2560 (10GB), Allocated PE: 0
Step 3: Create Logical Volumes
# Create 5GB volume for root
sudo lvcreate -L 5G -n lv_root vg0
# Create 4GB volume for database (leave headroom)
sudo lvcreate -L 4G -n lv_db vg0
# Create 1GB for logs
sudo lvcreate -L 1G -n lv_var vg0
# Verify
sudo lvdisplay vg0
Step 4: Create Filesystems
# Format volumes
sudo mkfs.ext4 /dev/vg0/lv_root
sudo mkfs.ext4 /dev/vg0/lv_db
sudo mkfs.ext4 /dev/vg0/lv_var
# Verify
sudo blkid /dev/vg0/*
Step 5: Mount
# Mount permanently (add to /etc/fstab)
sudo mkdir -p /mnt/data /var/lib/db /var/log
echo "/dev/vg0/lv_root / ext4 defaults,nofail 0 1" | sudo tee -a /etc/fstab
echo "/dev/vg0/lv_db /var/lib/db ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
echo "/dev/vg0/lv_var /var/log ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount -a
df -h
Grow LVM Volumes (Non-Disruptive)
Add new disk to VG:
# Create PV on /dev/sdc1 (partition the new disk first, as in Step 1)
sudo pvcreate /dev/sdc1
# Add to VG
sudo vgextend vg0 /dev/sdc1
# Verify
sudo vgdisplay vg0 # Total PE should increase
Grow LV (mounted, no downtime):
# Increase lv_db from 4GB to 8GB
sudo lvextend -L +4G /dev/vg0/lv_db
# Resize filesystem (online)
sudo resize2fs /dev/vg0/lv_db
# Verify
df -h /var/lib/db
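If the volume had been formatted with XFS instead of ext4, only the filesystem step changes: XFS grows online with xfs_growfs, which takes the mount point rather than the device (and on recent lvm2, lvextend -r can combine the LV and filesystem resize). A sketch, assuming lv_db is mounted at /var/lib/db as in the fstab example above:
# Grow the LV, then grow the mounted XFS filesystem online
sudo lvextend -L +4G /dev/vg0/lv_db
sudo xfs_growfs /var/lib/db
df -h /var/lib/db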
3. ZFS: Modern, Copy-on-Write
Why ZFS
Strengths:
- Built-in RAID (no external controller needed)
- Data integrity (checksums on all blocks)
- Compression (often 2-3x)
- Snapshots (instant, atomic)
- Pooled storage (grow easily)
Drawbacks:
- High memory usage (min 1GB per TB)
- Licensing (CDDL, not GPL)
- Steep learning curve
- Less portable than LVM + ext4
ZFS Quick Setup
Create pool:
# Single-disk pool
sudo zpool create data /dev/sdb
# RAID-1 (mirrored)
sudo zpool create data mirror /dev/sdb /dev/sdc
# RAID-Z1 (3 disks, tolerates 1 disk failure, like RAID-5)
sudo zpool create data raidz /dev/sdb /dev/sdc /dev/sdd
# Verify
zpool list
zpool status
Create datasets (like LVM volumes):
# Create dataset
sudo zfs create data/db
sudo zfs create data/var_log
# Enable compression
sudo zfs set compression=lz4 data/db
# Set quota (5GB max)
sudo zfs set quota=5G data/db
# Verify
zfs list -o name,used,available,quota
Mount automatically:
# ZFS mounts automatically at /data/db, /data/var_log
df -h | grep data
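By default each dataset mounts under the pool name (/data/db here). If you’d rather have it at an application path, such as the /var/lib/db location used in the LVM examples, you can change the mountpoint property; a minimal sketch:
# Point the dataset at a custom path (ZFS remounts it automatically)
sudo zfs set mountpoint=/var/lib/db data/db
zfs get mountpoint data/db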
4. Btrfs: Linux-Native CoW
Why Btrfs
Strengths:
- Linux-native (part of kernel)
- RAID support (1, 10, 5, 6)
- Snapshots (instant, writable)
- Compression
- Growing pool
Drawbacks:
- Still considered “experimental” (but production-ready)
- RAID-5/6 has known data loss issues (use RAID-1 or RAID-10)
- Less mature than ZFS
Btrfs Quick Setup
Create filesystem with RAID-1:
# RAID-1 across 2 disks
sudo mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# Mount
sudo mkdir /mnt/btrfs
sudo mount /dev/sdb /mnt/btrfs
df -h /mnt/btrfs
Create subvolumes:
# Like LVM volumes
sudo btrfs subvolume create /mnt/btrfs/db
sudo btrfs subvolume create /mnt/btrfs/var_log
# Mount subvolumes
echo "/dev/sdb /mnt/btrfs/db btrfs subvol=db,defaults 0 0" | sudo tee -a /etc/fstab
echo "/dev/sdb /mnt/btrfs/var_log btrfs subvol=var_log,defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
5. Filesystem Selection: ext4 vs. XFS
Comparison
| Feature | ext4 | XFS |
|---|---|---|
| Maturity | Stable (since 2008) | Mature (since 1994) |
| Max file size | 16TB | 9EB (exabytes!) |
| Journaling | Yes (metadata, optional data) | Yes (metadata) |
| Speed | Good | Excellent (high-I/O) |
| Repair | fsck (slow) | xfs_repair (faster) |
| Use case | General purpose | Databases, big files |
| Learning curve | Easy | Easy |
Recommendation
ext4: Default for most servers (safe, proven)
XFS: High-I/O workloads (databases, data processing, video)
Create Filesystems
ext4:
sudo mkfs.ext4 -F /dev/vg0/lv_db
XFS:
sudo mkfs.xfs -f /dev/vg0/lv_db
6. Mount Options: Performance & Reliability
Safe Production Defaults
ext4:
/dev/vg0/lv_root / ext4 defaults,noatime,nodiratime,errors=remount-ro 0 1
/dev/vg0/lv_db /var/lib/db ext4 defaults,noatime,nodiratime,nofail 0 2
XFS:
/dev/vg0/lv_root / xfs defaults,noatime,nodiratime 0 1
/dev/vg0/lv_db /var/lib/db xfs defaults,noatime,nodiratime 0 2
Mount Options Explained
Why mount options matter: A single option can improve performance by 20-30% or prevent boot failures. These are not optional tweaks—they’re production best practices.
| Option | Effect | Use Case | Performance Impact |
|---|---|---|---|
| defaults | rw, suid, dev, exec, auto, nouser, async | Base | Baseline |
| noatime | Don’t update file access time (big I/O savings) | All servers | +20-30% read performance |
| nodiratime | Don’t update dir access time | All servers | +5-10% directory ops |
| nofail | Don’t fail boot if disk missing (NAS, multi-disk) | External/optional disks | Prevents boot hang |
| discard | Continuous TRIM on delete (can add latency) | SSDs only | Maintains SSD speed; prefer periodic fstrim |
| nobarrier | Skip write barriers (risky, faster, needs UPS) | Databases with UPS | +15-25% write speed (risky!) |
| errors=remount-ro | Remount read-only on error (data safety) | Data disks | Prevents further corruption |
| relatime | Update atime only if older than ctime/mtime (compromise) | When atime needed | Better than default atime |
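After changing /etc/fstab it’s worth confirming the options actually took effect on the live mount; one way, using findmnt from util-linux (the /var/lib/db path matches the fstab examples above):
# Show the options currently in effect for a mount point
findmnt -no TARGET,OPTIONS /var/lib/db
# Apply changed fstab options without a reboot
sudo mount -o remount /var/lib/db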
Real-world impact examples:
1. noatime - The 20% performance boost you’re missing:
What it does: By default, Linux updates file access time on every read. This means reading a file triggers a write operation (updating metadata).
Without noatime:
# Reading a log file = 1 read + 1 write (access time update)
cat /var/log/app.log # Disk: read file + write metadata
With noatime:
# Reading a log file = 1 read only
cat /var/log/app.log # Disk: read file (no metadata write)
Measured impact:
- Web server with 1000 req/sec: 20% reduction in disk I/O
- Database server: 15-25% faster SELECT queries
- Log aggregation: 30% faster file reads
When NOT to use: Email servers (need atime for mailbox cleanup), backup tools that use atime
2. nobarrier - Fast but dangerous:
What it does: Skips filesystem write barriers (forced flushes to disk). Normally, Linux forces critical writes to disk before continuing; nobarrier trusts your disk’s write cache instead.
Risk: If power loss occurs during write, filesystem can corrupt.
When it’s safe:
- Server has UPS (uninterruptible power supply)
- Battery-backed RAID controller
- Cloud VMs with persistent disks (AWS EBS, GCP Persistent Disk)
Performance gain: 15-25% faster writes
Disaster scenario without UPS: Power outage → PostgreSQL WAL corruption → database won’t start → restore from backup (hours of downtime)
3. nofail - Prevents boot disasters:
Scenario: You mount a network share (NFS, CIFS) in /etc/fstab. Network is down at boot.
Without nofail:
Boot process → mounts /mnt/network → network unreachable → boot HANGS
Server stuck at "Mounting /mnt/network..." forever
Emergency maintenance required
With nofail:
Boot process → tries /mnt/network → fails gracefully → continues boot
Server boots successfully, you mount network share manually later
Always use nofail for (an example fstab entry follows this list):
- Network mounts (NFS, CIFS)
- External USB drives
- Optional data disks
- Any mount that might not be available at boot
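For example, an NFS entry with nofail might look like this (server name and export path are placeholders); _netdev additionally tells the boot process not to attempt the mount before networking is up:
# /etc/fstab - network share that must not block boot
nas.example.com:/export/backups /mnt/backups nfs defaults,nofail,_netdev 0 0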
SSD-Specific Mount Options
Warning: The discard mount option can cause latency spikes on some SSDs. Prefer periodic fstrim instead.
# Option A: Continuous TRIM via the discard mount option (convenient, can cause latency spikes)
/dev/nvme0n1p1 / ext4 defaults,noatime,discard 0 1
# Option B: Periodic TRIM (recommended)
/dev/nvme0n1p1 / ext4 defaults,noatime 0 1
# Then add weekly TRIM job:
# sudo fstrim -v /
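On most systemd-based distributions, util-linux already ships a weekly fstrim timer, so instead of a custom cron job you can usually just enable it; a sketch:
# Enable the weekly TRIM timer (systemd systems)
sudo systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer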
Verify TRIM:
sudo fstrim -v /
# Output: /: 50 GiB (53687091200 bytes) trimmed on /dev/sda1
7. Snapshots: Point-in-Time Recovery
What snapshots do: Create instant point-in-time copies of your filesystem. Think of it like “save game” before a risky operation—if something goes wrong, you restore the snapshot in seconds.
Critical use cases:
1. Before risky changes (the “undo” button):
10:00 AM - Snapshot database before major migration
10:05 AM - Run migration script
10:15 AM - Migration fails, data corrupted
10:16 AM - Restore snapshot (30 seconds)
10:17 AM - Back to 10:00 AM state, no data loss
2. Consistent backups (no application downtime):
Problem: Can't backup live database (data inconsistent)
Traditional: Stop database → backup (30 min downtime) → start
With snapshots: Snapshot (instant) → backup snapshot → delete snapshot
Downtime: 0 seconds
3. Ransomware/accidental deletion recovery:
2:00 PM - Automated hourly snapshot created
2:30 PM - Ransomware encrypts /var/lib/db
2:35 PM - Restore 2:00 PM snapshot
Data loss: 30 minutes (acceptable for most workloads)
4. Testing in production (clone environment):
Create snapshot → mount snapshot as /mnt/test → run tests → delete
Production unaffected, testing on real data
When to use each snapshot type:
| Scenario | LVM | Btrfs | ZFS | Why |
|---|---|---|---|---|
| Quick pre-upgrade backup | ✓ | ✓ | ✓ | All work well |
| Hourly snapshots (many) | ✗ | ✓ | ✓ | LVM snapshots have overhead |
| Incremental remote backups | ✗ | △ | ✓ | ZFS send/recv is best |
| Writable snapshots (testing) | ✗ | ✓ | ✓ | Btrfs/ZFS support writable snapshots |
| Existing ext4/XFS setup | ✓ | ✗ | ✗ | LVM works with any filesystem |
| Database servers (PostgreSQL/MySQL) | ✓ | ✓ | ✓ | All provide crash-consistent copies |
LVM Snapshots (Thin)
Create snapshot (2GB size):
# Snapshot of /dev/vg0/lv_db for backup
sudo lvcreate -L 2G -s -n lv_db_snap /dev/vg0/lv_db
# Verify
sudo lvdisplay vg0
Mount & backup:
# Mount snapshot (read-only)
sudo mkdir /mnt/snap_db
sudo mount -o ro /dev/vg0/lv_db_snap /mnt/snap_db
# Backup (no lock needed!)
sudo tar -czf /backups/db.tar.gz -C /mnt/snap_db .
# Unmount & remove
sudo umount /mnt/snap_db
sudo lvremove -f /dev/vg0/lv_db_snap
Pitfall: If the snapshot’s allocated space fills up, the snapshot is invalidated and the backup fails. Size it appropriately (at least 20% of the origin LV).
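To stay ahead of that, watch the snapshot’s fill level while the backup runs and extend it if needed; on current lvm2 the Data% column of lvs shows how full a snapshot is. A minimal sketch:
# Data% column shows snapshot usage; extend before it reaches 100%
sudo lvs vg0
sudo lvextend -L +1G /dev/vg0/lv_db_snap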
Btrfs Snapshots (Instant)
Create snapshot:
# Snapshot of subvolume /mnt/btrfs/db
sudo btrfs subvolume snapshot /mnt/btrfs/db /mnt/btrfs/db_snap
# Verify
sudo btrfs subvolume list /mnt/btrfs
Backup & delete:
# Backup
sudo tar -czf /backups/db.tar.gz -C /mnt/btrfs db_snap
# Delete
sudo btrfs subvolume delete /mnt/btrfs/db_snap
Advantage: Instant (CoW), no extra space needed until modified.
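Btrfs can also stream snapshots with btrfs send (this is the partial incremental-backup support noted in the table above); send requires a read-only snapshot, created with -r. A sketch using the same subvolume:
# Read-only snapshot (required for btrfs send)
sudo btrfs subvolume snapshot -r /mnt/btrfs/db /mnt/btrfs/db_snap_ro
# Stream it to a file, or pipe to 'btrfs receive' on another Btrfs filesystem
sudo btrfs send /mnt/btrfs/db_snap_ro | gzip > /backups/db.btrfs.gz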
ZFS Snapshots (Atomic)
Create snapshot:
# Snapshot named "backup-2025-10-16"
sudo zfs snapshot data/db@backup-2025-10-16
# List
sudo zfs list -t snapshot
Backup (incremental):
# Send to file
sudo zfs send data/db@backup-2025-10-16 | gzip > /backups/db.zfs.gz
# Restore (decompress, then pipe into zfs recv)
gunzip -c /backups/db.zfs.gz | sudo zfs recv -F data/db_restored
Advantage: Atomic (consistent), incremental send/recv efficient.
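Incremental sends transfer only the blocks that changed between two snapshots, which is what makes frequent remote backups cheap; a sketch with a hypothetical follow-up snapshot name:
# Take a newer snapshot, then send only the delta since the previous one
sudo zfs snapshot data/db@backup-2025-10-17
sudo zfs send -i data/db@backup-2025-10-16 data/db@backup-2025-10-17 | gzip > /backups/db-incr.zfs.gz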
8. Backups: Database-Specific + Filesystem
Database Backups (Examples)
PostgreSQL (pg_dump):
# Full backup
sudo -u postgres pg_dump mydb | gzip > /backups/mydb-$(date +%Y%m%d).sql.gz
# Point-in-time recovery (with WAL archiving)
# See PostgreSQL docs for setup
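For reference, the core of WAL archiving is a few postgresql.conf settings plus a directory for the archived segments; a minimal sketch (the /backups/wal path is an example, and the full setup including base backups is in the PostgreSQL docs):
# postgresql.conf (excerpt)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'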
MySQL (mysqldump):
# Full backup (all databases)
sudo mysqldump -u root -p'password' --all-databases | gzip > /backups/all-$(date +%Y%m%d).sql.gz
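Two common refinements to the above, shown as a sketch: --single-transaction takes a consistent dump of InnoDB tables without locking them, and reading credentials from an option file (path here is an example) keeps the password out of the process list:
# Consistent InnoDB dump; credentials come from /root/.my.cnf instead of the command line
sudo mysqldump --defaults-extra-file=/root/.my.cnf --single-transaction --all-databases | gzip > /backups/all-$(date +%Y%m%d).sql.gz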
Neo4j (neo4j-admin dump):
# Dump database (offline)
sudo neo4j-admin dump --database=neo4j --to=/backups/neo4j-$(date +%Y%m%d).dump
MongoDB (mongodump):
# Dump all databases
sudo mongodump --out /backups/mongo-$(date +%Y%m%d)
Filesystem Backups
rsync (incremental):
# Backup /var/lib/db to /backups
sudo rsync -av --delete /var/lib/db/ /backups/db/
# Or to remote server
sudo rsync -av --delete /var/lib/db/ user@backup-server:/backups/db/
tar (full + incremental):
# Full backup (day 1 - the -g snapshot file records what was backed up)
sudo tar -czf /backups/db-full-$(date +%Y%m%d).tar.gz -g /backups/db.snar /var/lib/db
# Incremental (daily - only changes since the last run against the same snapshot file)
sudo tar -czf /backups/db-incr-$(date +%Y%m%d).tar.gz -g /backups/db.snar /var/lib/db
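Restoring that chain means extracting the full archive first, then each incremental in order; GNU tar expects -g /dev/null during extraction so it replays the recorded changes without updating any snapshot file. A sketch with hypothetical file names:
# Restore order matters: full first, then incrementals oldest to newest
sudo mkdir -p /restore
sudo tar -xzf /backups/db-full-20251001.tar.gz -g /dev/null -C /restore
sudo tar -xzf /backups/db-incr-20251002.tar.gz -g /dev/null -C /restore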
With snapshot (safe, no locks):
# Create snapshot
sudo lvcreate -L 2G -s -n db_backup /dev/vg0/lv_db
# Mount & backup (application doesn't know)
sudo mount -o ro /dev/vg0/db_backup /mnt/backup
sudo tar -czf /backups/db-$(date +%Y%m%d).tar.gz -C /mnt/backup .
# Cleanup
sudo umount /mnt/backup
sudo lvremove -f /dev/vg0/db_backup
9. Test Restores: Critical (But Often Skipped)
Why Test?
Reality: Backups that haven’t been tested usually fail when needed.
Horror stories from production (all real):
1. The GitLab.com disaster (2017):
- Incident: Database replication lag → engineer deletes wrong database directory → 300GB of production data gone
- Backup status: 5 different backup methods configured
- Restore attempt: ALL 5 backup methods failed:
- LVM snapshots: disabled accidentally weeks earlier
- Regular backups: corrupted, wouldn’t restore
- Disk snapshots: accidentally disabled
- Backup to S3: sync had been failing for months (nobody checked)
- PostgreSQL WAL-E: restore failed (configuration error)
- Data loss: 6 hours of customer data (issues, merge requests, comments)
- Root cause: Backups never tested
- Lesson: “We verified backups exist” ≠ “We verified backups restore”
2. The MySQL backup that never was:
- Company: E-commerce startup (name withheld)
- Setup: Nightly mysqldump running for 2 years, cron job shows success
- Disaster: Database server dies, hardware failure
- Restore attempt: Backup file is 0 bytes
- Root cause: MySQL password changed 2 years ago, cron job failing silently (output redirected to /dev/null)
- Result: Company went out of business (no data = no customer orders)
3. The backup on the same disk:
- Setup: PostgreSQL database at /var/lib/postgresql, backup script saves to /backups (same disk)
- Disaster: Disk failure (physical damage)
- Result: Both production DB and all backups lost
- Lesson: Backups on the same disk = not a backup
4. The untested compression:
- Setup: Tar backups with compression: tar -czf backup.tar.gz /data
- Backup size: 500GB compressed to 50GB (90% compression - suspicious)
- Disaster: Need to restore after ransomware attack
- Restore attempt: tar -xzf backup.tar.gz → “gzip: invalid compressed data--format violated”
- Root cause: Disk was 95% full during backup, tar silently truncated the archive
- Result: Backup corrupted, unrecoverable
5. The permissions disaster:
- Setup: Backups created as root, restored as regular user
- Disaster: Restore successful, but all files owned by wrong user
- Result: Application can’t read config files, database won’t start (wrong permissions on data directory)
- Time to fix: 4 hours of manually fixing permissions on 2 million files
Key lesson: The only backup that matters is the one you’ve successfully restored.
How often should you test?
| Data criticality | Test frequency | Why |
|---|---|---|
| Critical (customer data, financial) | Weekly | Data loss = business loss |
| Important (internal tools, logs) | Monthly | Downtime acceptable, but painful |
| Nice-to-have (dev environments) | Quarterly | Can rebuild if needed |
Minimum test: Extract backup, verify basic structure (file count, database schema).
Better test: Full restore to staging environment, run smoke tests.
Best test: Full disaster recovery drill (restore to new server, start application, verify functionality).
Restore Checklist
Monthly test:
#!/bin/bash
echo "=== Backup Restore Test ==="
# Step 1: Verify backup exists & is accessible
BACKUP=/backups/db-$(date -d '7 days ago' +%Y%m%d).tar.gz
if [ ! -f "$BACKUP" ]; then
echo "ERROR: Backup not found: $BACKUP"
exit 1
fi
echo "âś“ Backup found: $BACKUP ($(du -h $BACKUP | cut -f1))"
# Step 2: Extract to staging
STAGING=/tmp/restore_test
mkdir -p $STAGING
tar -xzf $BACKUP -C $STAGING
echo "âś“ Backup extracted to $STAGING"
# Step 3: Verify database (PostgreSQL example)
echo "Verifying database..."
if [ -f "$STAGING/PG_VERSION" ]; then
echo "âś“ PostgreSQL database structure valid"
else
echo "ERROR: Database structure missing"
rm -rf $STAGING
exit 1
fi
# Step 4: Check file count (sanity check)
FILE_COUNT=$(find $STAGING -type f | wc -l)
echo "âś“ Database contains $FILE_COUNT files"
# Step 5: Cleanup
rm -rf $STAGING
echo "âś“ Test restore completed successfully"
# Step 6: Report
echo ""
echo "Restore test: PASSED âś“"
echo "Backup: $BACKUP"
echo "Date: $(date)"
Run monthly:
sudo /usr/local/bin/backup-restore-test.sh
# Log results
sudo /usr/local/bin/backup-restore-test.sh >> /var/log/backup-restore-test.log
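One way to schedule it is a root cron entry; the path matches the script above, and the schedule (03:00 on the 1st of each month) is just an example:
# /etc/cron.d/backup-restore-test
0 3 1 * * root /usr/local/bin/backup-restore-test.sh >> /var/log/backup-restore-test.log 2>&1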
RPO / RTO Definition
RPO (Recovery Point Objective): How much data can we afford to lose?
- Example: “Daily backups” = RPO 24 hours
RTO (Recovery Time Objective): How long to recover?
- Example: “Restore from backup + start services” = RTO 2 hours
Real-world RPO/RTO requirements by business type:
1. E-commerce site (high revenue per minute):
Business impact: $10,000/minute revenue loss during downtime
Customer tolerance: Very low (customers switch to competitors)
RPO: 5 minutes (continuous replication + point-in-time recovery)
RTO: 15 minutes (hot standby database ready to failover)
Backup strategy:
- Continuous: Database streaming replication (PostgreSQL WAL, MySQL binlog)
- Hourly: Snapshots (ZFS/Btrfs) for rollback
- Daily: Full backup to S3 (disaster recovery)
Cost: High (redundant infrastructure, hot standby)
Justification: 15-minute outage = $150k revenue loss
2. Internal analytics platform (data warehouse):
Business impact: Analytics reports delayed, no revenue impact
Customer tolerance: High (internal users can wait)
RPO: 24 hours (daily batch jobs, losing 1 day acceptable)
RTO: 4 hours (restore during business hours)
Backup strategy:
- Daily: Full database dump at 2 AM (low usage)
- Weekly: Filesystem backup to NAS
- Monthly: Offsite backup to cold storage
Cost: Low (single server, standard backups)
Justification: 4-hour outage = minor inconvenience
3. SaaS application (small startup):
Business impact: Customer churn risk, support tickets
Customer tolerance: Medium (understanding during beta, critical in production)
RPO: 1 hour (hourly snapshots)
RTO: 2 hours (restore from snapshot + restart)
Backup strategy:
- Hourly: LVM snapshots (quick rollback)
- Daily: Database dump to S3
- Weekly: Full system backup (disaster recovery)
Cost: Medium (snapshots cheap, S3 storage minimal)
Justification: Balance between cost and data safety
4. Financial services (regulatory requirements):
Business impact: Regulatory fines, audit failures, legal liability
Customer tolerance: Zero (trust is everything)
RPO: 0 seconds (synchronous replication, no data loss allowed)
RTO: 30 seconds (automatic failover)
Backup strategy:
- Real-time: Synchronous replication to 3+ servers (quorum)
- Hourly: Point-in-time snapshots (ZFS) for rollback
- Daily: Encrypted backup to geographically separate datacenter
- Weekly: Tape backup to offline vault (compliance)
Cost: Very high (multi-datacenter, compliance overhead)
Justification: Regulatory requirement (no choice)
5. Personal blog / portfolio site:
Business impact: Mild embarrassment, lost article drafts
Customer tolerance: Infinite (it's free)
RPO: 1 week (whenever you remember to backup)
RTO: "Whenever I get around to it"
Backup strategy:
- Weekly: `tar -czf backup.tar.gz /var/www` to USB drive
- Monthly: Copy to cloud storage (optional)
Cost: Near zero
Justification: It's a hobby project
How to calculate your RPO/RTO:
Step 1: Calculate hourly revenue/business impact
E-commerce: $600k/hour revenue
→ 1 hour downtime = $600k loss
→ RTO must be < 15 minutes (minimize loss)
→ RPO must be < 5 minutes (minimize lost orders)
Blog: $0/hour revenue
→ 24 hour downtime = mild inconvenience
→ RTO can be 24-48 hours (restore when convenient)
→ RPO can be 7 days (weekly backups fine)
Step 2: Calculate cost of backup infrastructure
Hot standby (RTO 1 min): $5000/month infrastructure
Daily backups (RTO 4 hours): $50/month S3 storage
Decision: If downtime costs > $5000/hour → hot standby makes sense
If downtime costs < $100/hour → daily backups sufficient
Step 3: Document and test
# Your RPO/RTO should be written down and tested quarterly
documented_rpo: 1 hour
actual_tested_rpo: 2 hours (test restore was from 2-hour-old backup)
→ Fix: Increase backup frequency to 30 minutes
documented_rto: 2 hours
actual_tested_rto: 4 hours (restore took longer than expected)
→ Fix: Practice restore procedure, optimize steps, update documentation
Document in runbook:
# Database Recovery Runbook
backup_location: /backups/postgres/
backup_frequency: daily (02:00 UTC)
retention: 30 days
RPO: 24 hours (acceptable data loss)
RTO: 2 hours (acceptable downtime)
restore_steps:
1. Stop application
2. Restore backup: pg_restore --clean --create -d postgres backup.dump
3. Verify data integrity
4. Start application
5. Test connectivity
contacts:
- DBA: [email protected]
- on-call: #database-incidents (Slack)
Storage Checklist
Pre-Deployment
- Storage architecture chosen (traditional/LVM/ZFS/Btrfs)
- Partition layout planned (/, /var, /var/log, /data separate)
- Filesystem chosen (ext4 or XFS)
- Mount options tuned (noatime, discard for SSD)
- LVM/ZFS pools created (if using)
- Filesystem formatted & mounted
- /etc/fstab updated with persistent mounts
- Backup strategy defined (database + filesystem)
- Backup script tested
- Restore procedure documented
Post-Deployment
- Filesystems mounted correctly (df -h)
- Disk usage reasonable (no >80% full)
- Mount options applied (cat /proc/mounts)
- Backup running daily (cron job verified)
- Snapshots working (if LVM/Btrfs/ZFS)
- Test restore completed successfully
- RPO/RTO documented
- On-call team trained on restore procedure
Ongoing Monitoring
- Weekly: Check disk usage (df -h)
- Weekly: Verify backup completed (ls -lt /backups)
- Monthly: Test restore from backup
- Quarterly: Full disaster recovery drill
- Quarterly: Review & adjust retention policy
Quick Reference Commands
# ===== PARTITION / DISK =====
lsblk # Show block devices
parted -l # Show partitions
sudo fdisk -l # Show MBR partitions
# ===== LVM =====
sudo pvcreate /dev/sdb1 # Create PV
sudo vgcreate vg0 /dev/sdb1 # Create VG
sudo lvcreate -L 5G -n lv_db vg0 # Create LV
sudo lvextend -L +5G /dev/vg0/lv_db # Grow LV
sudo resize2fs /dev/vg0/lv_db # Resize ext4
sudo xfs_growfs /var/lib/db # Resize XFS
# ===== FILESYSTEM =====
sudo mkfs.ext4 /dev/vg0/lv_db # Create ext4
sudo mkfs.xfs /dev/vg0/lv_db # Create XFS
sudo fsck -n /dev/vg0/lv_db # Check (read-only)
df -h # Disk usage
du -sh /var/lib/db # Directory size
# ===== MOUNT =====
sudo mount /dev/vg0/lv_db /mnt/db # Mount
sudo umount /mnt/db # Unmount
sudo fstrim -v / # TRIM SSD
# ===== SNAPSHOTS =====
sudo lvcreate -L 2G -s -n snap /dev/vg0/lv_db # LVM snapshot
sudo btrfs subvolume snapshot /mnt/data snap # Btrfs snapshot
sudo zfs snapshot data/db@snap # ZFS snapshot
# ===== BACKUP & RESTORE =====
sudo tar -czf /backups/db.tar.gz /var/lib/db # Full backup
sudo rsync -av /var/lib/db /backups/ # Incremental
sudo tar -xzf /backups/db.tar.gz -C /mnt/ # Restore