Executive Summary
Storage strategy = reliable data access with recovery guarantees. Choose based on workload (traditional vs. modern, single vs. multi-server).
Why storage matters: Most production disasters involve storage—disk fills up and crashes your application, a database corruption loses customer data, or backups fail during a restore attempt. Proper storage management prevents these scenarios.
Real-world disasters prevented by good storage:
- Disk full at 3 AM: Application logs fill /var → server crashes → customers can’t access site. Solution: Separate /var/log partition with monitoring.
- Failed database restore: Backup runs nightly for 2 years, never tested. Server dies, restore fails (corrupt backup). Solution: Monthly test restores.
- Database grows unexpectedly: 50GB database grows to 500GB in 3 months. Traditional partitions = downtime to resize. Solution: LVM allows online growth.
- Ransomware encrypts production data: No snapshots available, last backup is 24 hours old. Solution: ZFS/Btrfs snapshots provide instant point-in-time recovery.
This guide covers:
- Partitioning: LVM (flexible, standard) vs. ZFS/Btrfs (modern, copy-on-write)
- Filesystems: ext4 (safe default) vs. XFS (high-performance)
- Mount Options: Performance tuning (noatime, discard, nobarrier)
- Snapshots: Point-in-time copies (instant restore)
- Backups: Database-specific + filesystem strategies
- Test Restores: Verify backups work (critical!)
1. Storage Architecture: Partitioning & Volume Management
Comparison Table
| Feature | Traditional (fdisk/MBR) | LVM (Logical) | ZFS/Btrfs (Modern CoW) |
|---|---|---|---|
| Flexibility | Fixed partitions | Add disk → grow volume | Add disk → grow pool |
| Snapshots | Manual (image) | LVM snapshots (thin) | Native (instant) |
| Redundancy | External RAID | External RAID | Built-in RAID (data+metadata) |
| Complexity | Low | Medium | High |
| Overhead | None | Small | 5-10% |
| Use case | Servers (single disk) | Servers (growing storage) | Big data, NAS, databases |
| Learning curve | Easy | Medium | Steep |
Recommendation
- Single-disk servers: Traditional partitions (simple)
- Growing storage: LVM (flexible, proven)
- Mission-critical: ZFS (redundancy built-in, data protection)
- High-performance: Btrfs (modern, Linux-native, RAID)
2. Partitioning: LVM (Logical Volume Management)
Why LVM
Problem: Disk full → can’t add more space without downtime.
Solution: LVM abstracts disks → logical volumes → grow on-the-fly.
Benefit: Non-disruptive capacity expansion.
Real-world scenario - What happens WITHOUT LVM:
Timeline of a traditional partition disaster:
Day 1 (Setup): You provision a server with a 100GB disk, create traditional partitions:
/dev/sda1 → / (50GB)
/dev/sda2 → /var (50GB)
Month 3: Your database in /var/lib/postgresql grows from 10GB to 45GB. Disk at 90% full.
Month 4 (Crisis): Database hits 50GB. Disk 100% full. Database crashes. Application down. Customers affected.
Your options (all bad):
- Delete data - Risky, might break application
- Add new disk + migrate - Requires downtime:
  - Stop database (downtime starts)
  - Mount new 100GB disk at /mnt/new
  - Copy 50GB of data: rsync -av /var /mnt/new (takes 30+ minutes)
  - Update /etc/fstab to mount the new disk at /var
  - Reboot
  - Pray it works
  - Total downtime: 1-2 hours minimum
With LVM - Same scenario:
Month 4 (Crisis): Database hits limit. Disk 90% full.
Your action (5 minutes, zero downtime):
# Add new 100GB disk
sudo pvcreate /dev/sdb
sudo vgextend vg0 /dev/sdb
# Grow /var volume by 50GB (online!)
sudo lvextend -L +50G /dev/vg0/lv_var
sudo resize2fs /dev/vg0/lv_var
# Done - no reboot, no downtime
Total downtime: 0 seconds
This is why every production server should use LVM.
LVM Concepts
Physical Volume (PV): /dev/sda, /dev/sdb (physical disks)
↓
Volume Group (VG): vg0 (pool)
↓
Logical Volume (LV): lv_root, lv_var, lv_db (partitions)
↓
Filesystem: ext4, xfs (mount /data, etc.)
Create LVM Stack
Step 1: Create Physical Volume
# Prepare disk (wipe if needed)
sudo wipefs -a /dev/sdb
sudo parted -s /dev/sdb mklabel gpt
sudo parted -s /dev/sdb mkpart primary 0% 100%
# Create PV
sudo pvcreate /dev/sdb1
sudo pvdisplay # Verify
Step 2: Create Volume Group
# Create VG "vg0" with /dev/sdb1
sudo vgcreate vg0 /dev/sdb1
# Verify
sudo vgdisplay vg0
# Output: Total PE: 2560 (10GB), Allocated PE: 0
Step 3: Create Logical Volumes
# Create 5GB volume for root
sudo lvcreate -L 5G -n lv_root vg0
# Create 4GB volume for database (leave headroom)
sudo lvcreate -L 4G -n lv_db vg0
# Create 1GB for logs
sudo lvcreate -L 1G -n lv_var vg0
# Verify
sudo lvdisplay vg0
Step 4: Create Filesystems
# Format volumes
sudo mkfs.ext4 /dev/vg0/lv_root
sudo mkfs.ext4 /dev/vg0/lv_db
sudo mkfs.ext4 /dev/vg0/lv_var
# Verify
sudo blkid /dev/vg0/*
Step 5: Mount
# Mount permanently (add to /etc/fstab)
sudo mkdir -p /mnt/data /var/lib/db /var/log
echo "/dev/vg0/lv_root / ext4 defaults,nofail 0 1" | sudo tee -a /etc/fstab
echo "/dev/vg0/lv_db /var/lib/db ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
echo "/dev/vg0/lv_var /var/log ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount -a
df -h
Grow LVM Volumes (Non-Disruptive)
Add new disk to VG:
# Create PV on /dev/sdc1 (partition the new disk first, as in Step 1)
sudo pvcreate /dev/sdc1
# Add to VG
sudo vgextend vg0 /dev/sdc1
# Verify
sudo vgdisplay vg0 # Total PE should increase
Grow LV (mounted, no downtime):
# Increase lv_db from 4GB to 8GB
sudo lvextend -L +4G /dev/vg0/lv_db
# Resize filesystem (online)
sudo resize2fs /dev/vg0/lv_db
# Verify
df -h /var/lib/db
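If the volume had been formatted with XFS instead of ext4, only the filesystem step changes: XFS grows online with xfs_growfs, which takes the mount point rather than the device (and on recent lvm2, lvextend -r can combine the LV and filesystem resize). A sketch, assuming lv_db is mounted at /var/lib/db as in the fstab example above:
# Grow the LV, then grow the mounted XFS filesystem online
sudo lvextend -L +4G /dev/vg0/lv_db
sudo xfs_growfs /var/lib/db
df -h /var/lib/db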
3. ZFS: Modern, Copy-on-Write
Why ZFS
Strengths:
- Built-in RAID (no external controller needed)
- Data integrity (checksums on all blocks)
- Compression (often 2-3x)
- Snapshots (instant, atomic)
- Pooled storage (grow easily)
Drawbacks:
- High memory usage (min 1GB per TB)
- Licensing (CDDL, not GPL)
- Steep learning curve
- Less portable than LVM + ext4
ZFS Quick Setup
Create pool:
# Single-disk pool
sudo zpool create data /dev/sdb
# RAID-1 (mirrored)
sudo zpool create data mirror /dev/sdb /dev/sdc
# RAID-Z1 (3 disks, tolerates 1 disk failure, like RAID-5)
sudo zpool create data raidz /dev/sdb /dev/sdc /dev/sdd
# Verify
zpool list
zpool status
Create datasets (like LVM volumes):
# Create dataset
sudo zfs create data/db
sudo zfs create data/var_log
# Enable compression
sudo zfs set compression=lz4 data/db
# Set quota (5GB max)
sudo zfs set quota=5G data/db
# Verify
zfs list -o name,used,available,quota
Mount automatically:
# ZFS mounts automatically at /data/db, /data/var_log
df -h | grep data
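By default each dataset mounts under the pool name (/data/db here). If you’d rather have it at an application path, such as the /var/lib/db location used in the LVM examples, you can change the mountpoint property; a minimal sketch:
# Point the dataset at a custom path (ZFS remounts it automatically)
sudo zfs set mountpoint=/var/lib/db data/db
zfs get mountpoint data/db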
4. Btrfs: Linux-Native CoW
Why Btrfs
Strengths:
- Linux-native (part of kernel)
- RAID support (1, 10, 5, 6)
- Snapshots (instant, writable)
- Compression
- Growing pool
Drawbacks:
- Still considered “experimental” (but production-ready)
- RAID-5/6 has known data loss issues (use RAID-1 or RAID-10)
- Less mature than ZFS
Btrfs Quick Setup
Create filesystem with RAID-1:
# RAID-1 across 2 disks
sudo mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# Mount
sudo mkdir /mnt/btrfs
sudo mount /dev/sdb /mnt/btrfs
df -h /mnt/btrfs
Create subvolumes:
# Like LVM volumes
sudo btrfs subvolume create /mnt/btrfs/db
sudo btrfs subvolume create /mnt/btrfs/var_log
# Mount subvolumes
echo "/dev/sdb /mnt/btrfs/db btrfs subvol=db,defaults 0 0" | sudo tee -a /etc/fstab
echo "/dev/sdb /mnt/btrfs/var_log btrfs subvol=var_log,defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
5. Filesystem Selection: ext4 vs. XFS
Comparison
| Feature | ext4 | XFS |
|---|---|---|
| Maturity | Stable (since 2008) | Mature (since 1994) |
| Max file size | 16TB | 9EB (exabytes!) |
| Journaling | Yes (metadata, optional data) | Yes (metadata) |
| Speed | Good | Excellent (high-I/O) |
| Repair | fsck (slow) | xfs_repair (faster) |
| Use case | General purpose | Databases, big files |
| Learning curve | Easy | Easy |
Recommendation
ext4: Default for most servers (safe, proven)
XFS: High-I/O workloads (databases, data processing, video)
Create Filesystems
ext4:
sudo mkfs.ext4 -F /dev/vg0/lv_db
XFS:
sudo mkfs.xfs -f /dev/vg0/lv_db
6. Mount Options: Performance & Reliability
Safe Production Defaults
ext4:
/dev/vg0/lv_root / ext4 defaults,noatime,nodiratime,errors=remount-ro 0 1
/dev/vg0/lv_db /var/lib/db ext4 defaults,noatime,nodiratime,nofail 0 2
XFS:
/dev/vg0/lv_root / xfs defaults,noatime,nodiratime 0 1
/dev/vg0/lv_db /var/lib/db xfs defaults,noatime,nodiratime 0 2
Mount Options Explained
Why mount options matter: A single option can improve performance by 20-30% or prevent boot failures. These are not optional tweaks—they’re production best practices.
| Option | Effect | Use Case | Performance Impact |
|---|---|---|---|
| defaults | rw, suid, dev, exec, auto, nouser, async | Base | Baseline |
| noatime | Don’t update file access time (big I/O savings) | All servers | +20-30% read performance |
| nodiratime | Don’t update dir access time | All servers | +5-10% directory ops |
| nofail | Don’t fail boot if disk missing (NAS, multi-disk) | External/optional disks | Prevents boot hang |
| discard | Continuous TRIM on delete (can add latency) | SSDs only | Maintains SSD speed; prefer periodic fstrim |
| nobarrier | Skip write barriers (risky, faster, needs UPS) | Databases with UPS | +15-25% write speed (risky!) |
| errors=remount-ro | Remount read-only on error (data safety) | Data disks | Prevents further corruption |
| relatime | Update atime only if older than ctime/mtime (compromise) | When atime needed | Better than default atime |
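After changing /etc/fstab it’s worth confirming the options actually took effect on the live mount; one way, using findmnt from util-linux (the /var/lib/db path matches the fstab examples above):
# Show the options currently in effect for a mount point
findmnt -no TARGET,OPTIONS /var/lib/db
# Apply changed fstab options without a reboot
sudo mount -o remount /var/lib/db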
Real-world impact examples:
1. noatime - The 20% performance boost you’re missing:
What it does: By default, Linux updates file access time on every read. This means reading a file triggers a write operation (updating metadata).
Without noatime:
# Reading a log file = 1 read + 1 write (access time update)
cat /var/log/app.log # Disk: read file + write metadata
With noatime:
# Reading a log file = 1 read only
cat /var/log/app.log # Disk: read file (no metadata write)
Measured impact:
- Web server with 1000 req/sec: 20% reduction in disk I/O
- Database server: 15-25% faster SELECT queries
- Log aggregation: 30% faster file reads
When NOT to use: Email servers (need atime for mailbox cleanup), backup tools that use atime
2. nobarrier - Fast but dangerous:
What it does: Skips filesystem write barriers (forced flushes to disk). Normally, Linux forces critical writes to disk before continuing; nobarrier trusts your disk’s write cache instead.
Risk: If power loss occurs during write, filesystem can corrupt.
When it’s safe:
- Server has UPS (uninterruptible power supply)
- Battery-backed RAID controller
- Cloud VMs with persistent disks (AWS EBS, GCP Persistent Disk)
Performance gain: 15-25% faster writes
Disaster scenario without UPS: Power outage → PostgreSQL WAL corruption → database won’t start → restore from backup (hours of downtime)
3. nofail - Prevents boot disasters:
Scenario: You mount a network share (NFS, CIFS) in /etc/fstab. Network is down at boot.
Without nofail:
Boot process → mounts /mnt/network → network unreachable → boot HANGS
Server stuck at "Mounting /mnt/network..." forever
Emergency maintenance required
With nofail:
Boot process → tries /mnt/network → fails gracefully → continues boot
Server boots successfully, you mount network share manually later
Always use nofail for (an example fstab entry follows this list):
- Network mounts (NFS, CIFS)
- External USB drives
- Optional data disks
- Any mount that might not be available at boot
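For example, an NFS entry with nofail might look like this (server name and export path are placeholders); _netdev additionally tells the boot process not to attempt the mount before networking is up:
# /etc/fstab - network share that must not block boot
nas.example.com:/export/backups /mnt/backups nfs defaults,nofail,_netdev 0 0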
SSD-Specific Mount Options
Warning: The discard mount option can cause latency spikes on some SSDs. Prefer periodic fstrim instead.
# Option A: Continuous TRIM via the discard mount option (convenient, can cause latency spikes)
/dev/nvme0n1p1 / ext4 defaults,noatime,discard 0 1
# Option B: Periodic TRIM (recommended)
/dev/nvme0n1p1 / ext4 defaults,noatime 0 1
# Then add weekly TRIM job:
# sudo fstrim -v /
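On most systemd-based distributions, util-linux already ships a weekly fstrim timer, so instead of a custom cron job you can usually just enable it; a sketch:
# Enable the weekly TRIM timer (systemd systems)
sudo systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer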
Verify TRIM:
sudo fstrim -v /
# Output: /: 50 GiB (53687091200 bytes) trimmed on /dev/sda1
7. Snapshots: Point-in-Time Recovery
What snapshots do: Create instant point-in-time copies of your filesystem. Think of it like “save game” before a risky operation—if something goes wrong, you restore the snapshot in seconds.
Critical use cases:
1. Before risky changes (the “undo” button):
10:00 AM - Snapshot database before major migration
10:05 AM - Run migration script
10:15 AM - Migration fails, data corrupted
10:16 AM - Restore snapshot (30 seconds)
10:17 AM - Back to 10:00 AM state, no data loss
2. Consistent backups (no application downtime):
Problem: Can't backup live database (data inconsistent)
Traditional: Stop database → backup (30 min downtime) → start
With snapshots: Snapshot (instant) → backup snapshot → delete snapshot
Downtime: 0 seconds
3. Ransomware/accidental deletion recovery:
2:00 PM - Automated hourly snapshot created
2:30 PM - Ransomware encrypts /var/lib/db
2:35 PM - Restore 2:00 PM snapshot
Data loss: 30 minutes (acceptable for most workloads)
4. Testing in production (clone environment):
Create snapshot → mount snapshot as /mnt/test → run tests → delete
Production unaffected, testing on real data
When to use each snapshot type:
| Scenario | LVM | Btrfs | ZFS | Why |
|---|---|---|---|---|
| Quick pre-upgrade backup | ✓ | ✓ | ✓ | All work well |
| Hourly snapshots (many) | ✗ | ✓ | ✓ | LVM snapshots have overhead |
| Incremental remote backups | ✗ | △ | ✓ | ZFS send/recv is best |
| Writable snapshots (testing) | ✗ | ✓ | ✓ | Btrfs/ZFS support writable snapshots |
| Existing ext4/XFS setup | ✓ | ✗ | ✗ | LVM works with any filesystem |
| Database servers (PostgreSQL/MySQL) | ✓ | ✓ | ✓ | All provide crash-consistent copies |
LVM Snapshots (Thin)
Create snapshot (2GB size):
# Snapshot of /dev/vg0/lv_db for backup
sudo lvcreate -L 2G -s -n lv_db_snap /dev/vg0/lv_db
# Verify
sudo lvdisplay vg0
Mount & backup:
# Mount snapshot (read-only)
sudo mkdir /mnt/snap_db
sudo mount -o ro /dev/vg0/lv_db_snap /mnt/snap_db
# Backup (no lock needed!)
sudo tar -czf /backups/db.tar.gz -C /mnt/snap_db .
# Unmount & remove
sudo umount /mnt/snap_db
sudo lvremove -f /dev/vg0/lv_db_snap
Pitfall: If the snapshot’s allocated space fills up, the snapshot is invalidated and the backup fails. Size it appropriately (at least 20% of the origin LV).
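To stay ahead of that, watch the snapshot’s fill level while the backup runs and extend it if needed; on current lvm2 the Data% column of lvs shows how full a snapshot is. A minimal sketch:
# Data% column shows snapshot usage; extend before it reaches 100%
sudo lvs vg0
sudo lvextend -L +1G /dev/vg0/lv_db_snap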
Btrfs Snapshots (Instant)
Create snapshot:
# Snapshot of subvolume /mnt/btrfs/db
sudo btrfs subvolume snapshot /mnt/btrfs/db /mnt/btrfs/db_snap
# Verify
sudo btrfs subvolume list /mnt/btrfs
Backup & delete:
# Backup
sudo tar -czf /backups/db.tar.gz -C /mnt/btrfs db_snap
# Delete
sudo btrfs subvolume delete /mnt/btrfs/db_snap
Advantage: Instant (CoW), no extra space needed until modified.
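Btrfs can also stream snapshots with btrfs send (this is the partial incremental-backup support noted in the table above); send requires a read-only snapshot, created with -r. A sketch using the same subvolume:
# Read-only snapshot (required for btrfs send)
sudo btrfs subvolume snapshot -r /mnt/btrfs/db /mnt/btrfs/db_snap_ro
# Stream it to a file, or pipe to 'btrfs receive' on another Btrfs filesystem
sudo btrfs send /mnt/btrfs/db_snap_ro | gzip > /backups/db.btrfs.gz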
ZFS Snapshots (Atomic)
Create snapshot:
# Snapshot named "backup-2025-10-16"
sudo zfs snapshot data/db@backup-2025-10-16
# List
sudo zfs list -t snapshot
Backup (incremental):
# Send to file
sudo zfs send data/db@backup-2025-10-16 | gzip > /backups/db.zfs.gz
# Restore (decompress, then pipe into zfs recv)
gunzip -c /backups/db.zfs.gz | sudo zfs recv -F data/db_restored
Advantage: Atomic (consistent), incremental send/recv efficient.
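Incremental sends transfer only the blocks that changed between two snapshots, which is what makes frequent remote backups cheap; a sketch with a hypothetical follow-up snapshot name:
# Take a newer snapshot, then send only the delta since the previous one
sudo zfs snapshot data/db@backup-2025-10-17
sudo zfs send -i data/db@backup-2025-10-16 data/db@backup-2025-10-17 | gzip > /backups/db-incr.zfs.gz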
8. Backups: Database-Specific + Filesystem
Database Backups (Examples)
PostgreSQL (pg_dump):
# Full backup
sudo -u postgres pg_dump mydb | gzip > /backups/mydb-$(date +%Y%m%d).sql.gz
# Point-in-time recovery (with WAL archiving)
# See PostgreSQL docs for setup
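For reference, the core of WAL archiving is a few postgresql.conf settings plus a directory for the archived segments; a minimal sketch (the /backups/wal path is an example, and the full setup including base backups is in the PostgreSQL docs):
# postgresql.conf (excerpt)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'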
MySQL (mysqldump):
# Full backup (all databases)
sudo mysqldump -u root -p'password' --all-databases | gzip > /backups/all-$(date +%Y%m%d).sql.gz
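Two common refinements to the above, shown as a sketch: --single-transaction takes a consistent dump of InnoDB tables without locking them, and reading credentials from an option file (path here is an example) keeps the password out of the process list:
# Consistent InnoDB dump; credentials come from /root/.my.cnf instead of the command line
sudo mysqldump --defaults-extra-file=/root/.my.cnf --single-transaction --all-databases | gzip > /backups/all-$(date +%Y%m%d).sql.gz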
Neo4j (neo4j-admin dump):
# Dump database (offline)
sudo neo4j-admin dump --database=neo4j --to=/backups/neo4j-$(date +%Y%m%d).dump
MongoDB (mongodump):
# Dump all databases
sudo mongodump --out /backups/mongo-$(date +%Y%m%d)
Filesystem Backups
rsync (incremental):
# Backup /var/lib/db to /backups
sudo rsync -av --delete /var/lib/db/ /backups/db/
# Or to remote server
sudo rsync -av --delete /var/lib/db/ user@backup-server:/backups/db/
tar (full + incremental):
# Full backup (day 1 - the -g snapshot file records what was backed up)
sudo tar -czf /backups/db-full-$(date +%Y%m%d).tar.gz -g /backups/db.snar /var/lib/db
# Incremental (daily - only changes since the last run against the same snapshot file)
sudo tar -czf /backups/db-incr-$(date +%Y%m%d).tar.gz -g /backups/db.snar /var/lib/db
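Restoring that chain means extracting the full archive first, then each incremental in order; GNU tar expects -g /dev/null during extraction so it replays the recorded changes without updating any snapshot file. A sketch with hypothetical file names:
# Restore order matters: full first, then incrementals oldest to newest
sudo mkdir -p /restore
sudo tar -xzf /backups/db-full-20251001.tar.gz -g /dev/null -C /restore
sudo tar -xzf /backups/db-incr-20251002.tar.gz -g /dev/null -C /restore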
With snapshot (safe, no locks):
# Create snapshot
sudo lvcreate -L 2G -s -n db_backup /dev/vg0/lv_db
# Mount & backup (application doesn't know)
sudo mount -o ro /dev/vg0/db_backup /mnt/backup
sudo tar -czf /backups/db-$(date +%Y%m%d).tar.gz -C /mnt/backup .
# Cleanup
sudo umount /mnt/backup
sudo lvremove -f /dev/vg0/db_backup
9. Test Restores: Critical (But Often Skipped)
Why Test?
Reality: Backups that haven’t been tested usually fail when needed.
Horror stories from production (all real):
1. The GitLab.com disaster (2017):
- Incident: Database replication lag → engineer deletes wrong database directory → 300GB of production data gone
- Backup status: 5 different backup methods configured
- Restore attempt: ALL 5 backup methods failed:
- LVM snapshots: disabled accidentally weeks earlier
- Regular backups: corrupted, wouldn’t restore
- Disk snapshots: accidentally disabled
- Backup to S3: sync had been failing for months (nobody checked)
- PostgreSQL WAL-E: restore failed (configuration error)
- Data loss: 6 hours of customer data (issues, merge requests, comments)
- Root cause: Backups never tested
- Lesson: “We verified backups exist” ≠ “We verified backups restore”
2. The MySQL backup that never was:
- Company: E-commerce startup (name withheld)
- Setup: Nightly mysqldump running for 2 years, cron job shows success
- Disaster: Database server dies, hardware failure
- Restore attempt: Backup file is 0 bytes
- Root cause: MySQL password changed 2 years ago, cron job failing silently (output redirected to /dev/null)
- Result: Company went out of business (no data = no customer orders)
3. The backup on the same disk:
- Setup: PostgreSQL database at /var/lib/postgresql, backup script saves to /backups (same disk)
- Disaster: Disk failure (physical damage)
- Result: Both production DB and all backups lost
- Lesson: Backups on the same disk = not a backup
4. The untested compression:
- Setup: Tar backups with compression: tar -czf backup.tar.gz /data
- Backup size: 500GB compressed to 50GB (90% compression - suspicious)
- Disaster: Need to restore after ransomware attack
- Restore attempt: tar -xzf backup.tar.gz → “gzip: invalid compressed data--format violated”
- Root cause: Disk was 95% full during backup, tar silently truncated the archive
- Result: Backup corrupted, unrecoverable
5. The permissions disaster:
- Setup: Backups created as root, restored as regular user
- Disaster: Restore successful, but all files owned by wrong user
- Result: Application can’t read config files, database won’t start (wrong permissions on data directory)
- Time to fix: 4 hours of manually fixing permissions on 2 million files
Key lesson: The only backup that matters is the one you’ve successfully restored.
How often should you test?
| Data criticality | Test frequency | Why |
|---|---|---|
| Critical (customer data, financial) | Weekly | Data loss = business loss |
| Important (internal tools, logs) | Monthly | Downtime acceptable, but painful |
| Nice-to-have (dev environments) | Quarterly | Can rebuild if needed |
Minimum test: Extract backup, verify basic structure (file count, database schema).
Better test: Full restore to staging environment, run smoke tests.
Best test: Full disaster recovery drill (restore to new server, start application, verify functionality).
Restore Checklist
Monthly test:
#!/bin/bash
echo "=== Backup Restore Test ==="
# Step 1: Verify backup exists & is accessible
BACKUP=/backups/db-$(date -d '7 days ago' +%Y%m%d).tar.gz
if [ ! -f "$BACKUP" ]; then
echo "ERROR: Backup not found: $BACKUP"
exit 1
fi
echo "âś“ Backup found: $BACKUP ($(du -h $BACKUP | cut -f1))"
# Step 2: Extract to staging
STAGING=/tmp/restore_test
mkdir -p $STAGING
tar -xzf $BACKUP -C $STAGING
echo "âś“ Backup extracted to $STAGING"
# Step 3: Verify database (PostgreSQL example)
echo "Verifying database..."
if [ -f "$STAGING/PG_VERSION" ]; then
echo "âś“ PostgreSQL database structure valid"
else
echo "ERROR: Database structure missing"
rm -rf $STAGING
exit 1
fi
# Step 4: Check file count (sanity check)
FILE_COUNT=$(find $STAGING -type f | wc -l)
echo "âś“ Database contains $FILE_COUNT files"
# Step 5: Cleanup
rm -rf $STAGING
echo "âś“ Test restore completed successfully"
# Step 6: Report
echo ""
echo "Restore test: PASSED âś“"
echo "Backup: $BACKUP"
echo "Date: $(date)"
Run monthly:
sudo /usr/local/bin/backup-restore-test.sh
# Log results
sudo /usr/local/bin/backup-restore-test.sh >> /var/log/backup-restore-test.log
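One way to schedule it is a root cron entry; the path matches the script above, and the schedule (03:00 on the 1st of each month) is just an example:
# /etc/cron.d/backup-restore-test
0 3 1 * * root /usr/local/bin/backup-restore-test.sh >> /var/log/backup-restore-test.log 2>&1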
RPO / RTO Definition
RPO (Recovery Point Objective): How much data can we afford to lose?
- Example: “Daily backups” = RPO 24 hours
RTO (Recovery Time Objective): How long to recover?
- Example: “Restore from backup + start services” = RTO 2 hours
Real-world RPO/RTO requirements by business type:
1. E-commerce site (high revenue per minute):
Business impact: $10,000/minute revenue loss during downtime
Customer tolerance: Very low (customers switch to competitors)
RPO: 5 minutes (continuous replication + point-in-time recovery)
RTO: 15 minutes (hot standby database ready to failover)
Backup strategy:
- Continuous: Database streaming replication (PostgreSQL WAL, MySQL binlog)
- Hourly: Snapshots (ZFS/Btrfs) for rollback
- Daily: Full backup to S3 (disaster recovery)
Cost: High (redundant infrastructure, hot standby)
Justification: 15-minute outage = $150k revenue loss
2. Internal analytics platform (data warehouse):
Business impact: Analytics reports delayed, no revenue impact
Customer tolerance: High (internal users can wait)
RPO: 24 hours (daily batch jobs, losing 1 day acceptable)
RTO: 4 hours (restore during business hours)
Backup strategy:
- Daily: Full database dump at 2 AM (low usage)
- Weekly: Filesystem backup to NAS
- Monthly: Offsite backup to cold storage
Cost: Low (single server, standard backups)
Justification: 4-hour outage = minor inconvenience
3. SaaS application (small startup):
Business impact: Customer churn risk, support tickets
Customer tolerance: Medium (understanding during beta, critical in production)
RPO: 1 hour (hourly snapshots)
RTO: 2 hours (restore from snapshot + restart)
Backup strategy:
- Hourly: LVM snapshots (quick rollback)
- Daily: Database dump to S3
- Weekly: Full system backup (disaster recovery)
Cost: Medium (snapshots cheap, S3 storage minimal)
Justification: Balance between cost and data safety
4. Financial services (regulatory requirements):
Business impact: Regulatory fines, audit failures, legal liability
Customer tolerance: Zero (trust is everything)
RPO: 0 seconds (synchronous replication, no data loss allowed)
RTO: 30 seconds (automatic failover)
Backup strategy:
- Real-time: Synchronous replication to 3+ servers (quorum)
- Hourly: Point-in-time snapshots (ZFS) for rollback
- Daily: Encrypted backup to geographically separate datacenter
- Weekly: Tape backup to offline vault (compliance)
Cost: Very high (multi-datacenter, compliance overhead)
Justification: Regulatory requirement (no choice)
5. Personal blog / portfolio site:
Business impact: Mild embarrassment, lost article drafts
Customer tolerance: Infinite (it's free)
RPO: 1 week (whenever you remember to backup)
RTO: "Whenever I get around to it"
Backup strategy:
- Weekly: `tar -czf backup.tar.gz /var/www` to USB drive
- Monthly: Copy to cloud storage (optional)
Cost: Near zero
Justification: It's a hobby project
How to calculate your RPO/RTO:
Step 1: Calculate hourly revenue/business impact
E-commerce: $600k/hour revenue
→ 1 hour downtime = $600k loss
→ RTO must be < 15 minutes (minimize loss)
→ RPO must be < 5 minutes (minimize lost orders)
Blog: $0/hour revenue
→ 24 hour downtime = mild inconvenience
→ RTO can be 24-48 hours (restore when convenient)
→ RPO can be 7 days (weekly backups fine)
Step 2: Calculate cost of backup infrastructure
Hot standby (RTO 1 min): $5000/month infrastructure
Daily backups (RTO 4 hours): $50/month S3 storage
Decision: If downtime costs > $5000/hour → hot standby makes sense
If downtime costs < $100/hour → daily backups sufficient
Step 3: Document and test
# Your RPO/RTO should be written down and tested quarterly
documented_rpo: 1 hour
actual_tested_rpo: 2 hours (test restore was from 2-hour-old backup)
→ Fix: Increase backup frequency to 30 minutes
documented_rto: 2 hours
actual_tested_rto: 4 hours (restore took longer than expected)
→ Fix: Practice restore procedure, optimize steps, update documentation
Document in runbook:
# Database Recovery Runbook
backup_location: /backups/postgres/
backup_frequency: daily (02:00 UTC)
retention: 30 days
RPO: 24 hours (acceptable data loss)
RTO: 2 hours (acceptable downtime)
restore_steps:
1. Stop application
2. Restore backup: pg_restore --clean --create -d postgres backup.dump
3. Verify data integrity
4. Start application
5. Test connectivity
contacts:
- DBA: [email protected]
- on-call: #database-incidents (Slack)
Storage Checklist
Pre-Deployment
- Storage architecture chosen (traditional/LVM/ZFS/Btrfs)
- Partition layout planned (/, /var, /var/log, /data separate)
- Filesystem chosen (ext4 or XFS)
- Mount options tuned (noatime, discard for SSD)
- LVM/ZFS pools created (if using)
- Filesystem formatted & mounted
- /etc/fstab updated with persistent mounts
- Backup strategy defined (database + filesystem)
- Backup script tested
- Restore procedure documented
Post-Deployment
- Filesystems mounted correctly (df -h)
- Disk usage reasonable (no >80% full)
- Mount options applied (cat /proc/mounts)
- Backup running daily (cron job verified)
- Snapshots working (if LVM/Btrfs/ZFS)
- Test restore completed successfully
- RPO/RTO documented
- On-call team trained on restore procedure
Ongoing Monitoring
- Weekly: Check disk usage (df -h)
- Weekly: Verify backup completed (ls -lt /backups)
- Monthly: Test restore from backup
- Quarterly: Full disaster recovery drill
- Quarterly: Review & adjust retention policy
Quick Reference Commands
# ===== PARTITION / DISK =====
lsblk # Show block devices
parted -l # Show partitions
sudo fdisk -l # Show MBR partitions
# ===== LVM =====
sudo pvcreate /dev/sdb1 # Create PV
sudo vgcreate vg0 /dev/sdb1 # Create VG
sudo lvcreate -L 5G -n lv_db vg0 # Create LV
sudo lvextend -L +5G /dev/vg0/lv_db # Grow LV
sudo resize2fs /dev/vg0/lv_db # Resize ext4
sudo xfs_growfs /var/lib/db # Resize XFS
# ===== FILESYSTEM =====
sudo mkfs.ext4 /dev/vg0/lv_db # Create ext4
sudo mkfs.xfs /dev/vg0/lv_db # Create XFS
sudo fsck -n /dev/vg0/lv_db # Check (read-only)
df -h # Disk usage
du -sh /var/lib/db # Directory size
# ===== MOUNT =====
sudo mount /dev/vg0/lv_db /mnt/db # Mount
sudo umount /mnt/db # Unmount
sudo fstrim -v / # TRIM SSD
# ===== SNAPSHOTS =====
sudo lvcreate -L 2G -s -n snap /dev/vg0/lv_db # LVM snapshot
sudo btrfs subvolume snapshot /mnt/data snap # Btrfs snapshot
sudo zfs snapshot data/db@snap # ZFS snapshot
# ===== BACKUP & RESTORE =====
sudo tar -czf /backups/db.tar.gz /var/lib/db # Full backup
sudo rsync -av /var/lib/db /backups/ # Incremental
sudo tar -xzf /backups/db.tar.gz -C /mnt/ # Restore