Executive Summary

Reliability means predictable, auditable behavior. This guide covers:

  • Time sync: Chrony for clock accuracy (critical for logging, security)
  • Networking: Stable interface names & hostnames (infrastructure consistency)
  • Logging: Persistent journald + logrotate (audit trail + disk management)
  • Shutdown: Clean hooks to prevent data loss
  • Patching: Kernel updates with livepatch (zero-downtime), tested rollback

1. Time Synchronization (chrony)

Why Time Matters

Critical for:

  • Logging: Accurate timestamps for debugging, compliance audits
  • Security: TLS cert validation, Kerberos, API token expiry
  • Distributed systems: Causality ordering (happens-before relationships)
  • Monitoring: Alert timing, metric correlation

Cost of poor time sync:

  • Logs from different servers appear out-of-order
  • API clients rejected (clock skew > tolerance)
  • Kerberos auth fails (clock > 5 min off)
  • Monitoring alerts triggered on old data

Chrony Installation & Configuration

What it is:

  • Modern NTP daemon (faster sync than ntpd)
  • Handles drifting clocks, network jitter
  • Accurate to milliseconds (suitable for most workloads)
  • Survives network outages, re-syncs quickly

Install & configure (Ubuntu/Debian):

apt install chrony

# Configure: /etc/chrony/chrony.conf
sudo tee /etc/chrony/chrony.conf > /dev/null << 'CHRONY'
# Default NTP servers (Debian/Ubuntu defaults)
pool ntp.ubuntu.com iburst

# Or, use specific servers (lower latency):
server time.cloudflare.com iburst
server time.google.com iburst
server time.aws.com iburst

# Allow local clients (NTP query from localhost)
allow 127.0.0.1
allow ::1

# For containers: allow from Docker/Kubernetes subnet
allow 172.17.0.0/16
allow 10.0.0.0/8

# Drift file (tracks clock rate)
driftfile /var/lib/chrony/chrony.drift

# Leap seconds file
leapsectz right/UTC

# Make time adjustments gradually (better for apps)
makestep 1.0 3  # Step the clock if offset > 1 s, but only during the first 3 updates

# RTC (real-time clock) sync
rtcsync

# Enable hardware timestamping (if supported)
hwtimestamp *
CHRONY

# Apply
sudo systemctl restart chrony
sudo systemctl enable chrony

# Verify
chronyc sources          # Show NTP sources
chronyc tracking         # Show sync details & estimated error
timedatectl show         # systemd time status

Install & configure (RHEL/Fedora):

dnf install chrony

# On RHEL/Fedora the config file is /etc/chrony.conf (same directives as above)
# Default servers: the 2.rhel.pool.ntp.org / 2.fedora.pool.ntp.org pools

sudo systemctl enable chronyd
sudo systemctl restart chronyd

# Verify
chronyc sources
timedatectl status

Time Sync Verification & Monitoring

Check sync status:

# Current time vs NTP source
timedatectl

# Chrony details
chronyc tracking
# Output:
#   Reference ID    : C0248D97 (time.cloudflare.com)
#   Stratum         : 3
#   Ref time (UTC)  : Wed Oct 16 12:34:56 2025
#   System time     : 0.000001234 seconds fast
#   Last offset     : +0.000001234 seconds
#   RMS offset      : 0.000000789 seconds
#   Residual freq   : 0.001 ppm
#   Residual skew   : 0.002 ppm
#   Root delay      : 0.023456 seconds
#   Root dispersion : 0.012345 seconds
#   Update interval : 64.2 seconds
#   Leap status     : Normal

# Offset should be < 1ms for production

Monitor in production:

# Prometheus exporter (node-exporter includes time metrics)
curl localhost:9100/metrics | grep node_time

# Manual check (e.g. from cron every minute): current offset in seconds
chronyc tracking | awk '/^System time/ {print $4}'

# Alert on large offset (> 10ms)
# Alert on sync loss (stratum > 15 = unsync)
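
One way to wire those alerts up is a minimal cron-friendly sketch like the following; the threshold, the script name check-time-sync.sh, and the logger-based alert are placeholders to adapt to your monitoring stack:

#!/bin/bash
# check-time-sync.sh - hypothetical sketch: warn if the chrony offset exceeds a threshold
set -euo pipefail

THRESHOLD=0.010   # seconds (10 ms) - adjust to your tolerance

# chronyc prints the offset magnitude, e.g. "System time : 0.000001234 seconds fast"
offset=$(chronyc tracking | awk '/^System time/ {print $4}')

# Compare against the threshold and raise an alert if exceeded
if awk -v o="$offset" -v t="$THRESHOLD" 'BEGIN {exit !(o > t)}'; then
    logger -p user.warning "chrony offset ${offset}s exceeds ${THRESHOLD}s"
    # hook your real alerting here (Alertmanager push, mail, pager, ...)
fi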

2. Networking: Stable Interface Names & Hostnames

Why Predictable Networking Matters

Problem: Driver probe order and hardware changes can swap legacy names → eth0 becomes eth1 → network config breaks
Solution: Predictable interface names + systemd-networkd/netplan or cloud-init
Bonus: Infrastructure-as-code can reference stable names

Predictable Interface Names

systemd naming scheme:

  • en = Ethernet, wl = Wireless, ww = Wireless WAN
  • o<index> = On-board (by device index)
  • s<slot> = PCI slot
  • p<bus>s<slot> = PCI address
  • Example: eno1 = on-board Ethernet #1, enp0s25 = PCI bus 0, slot 25

Check current scheme:

ip link show
# 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
# 3: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...

Verify predictability is enabled:

# Check kernel command line (should NOT have net.ifnames=0)
cat /proc/cmdline

# Check that predictable naming hasn't been disabled
# (e.g. /etc/udev/rules.d/80-net-setup-link.rules symlinked to /dev/null, or a .link override)
ls -la /etc/udev/rules.d/ /etc/systemd/network/ 2>/dev/null

If using DHCP or cloud-init, ensure stable config:

# /etc/netplan/00-installer-config.yaml (Ubuntu)
sudo tee /etc/netplan/00-installer-config.yaml > /dev/null << 'NETPLAN'
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true
      dhcp6: true
      optional: true
    enp0s25:
      dhcp4: true
      optional: true
NETPLAN

# Apply
sudo netplan apply

# Verify
ip addr show
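
If you need a name that stays fixed regardless of slot or firmware changes, a systemd .link file can bind a custom name to the NIC's MAC address. A minimal sketch, where the MAC address and the name lan0 are placeholders:

# /etc/systemd/network/10-lan0.link
sudo tee /etc/systemd/network/10-lan0.link > /dev/null << 'LINK'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0
LINK

# Applies at the next boot (renaming an interface that is already up is not supported);
# afterwards, reference lan0 in netplan/networkd config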

Stable Hostnames

What it is:

  • hostname command (transient, lost on reboot)
  • /etc/hostname file (persistent, survives reboot)
  • hostnamectl (systemd tool, recommended)
  • DNS A record (matches hostname for orchestration)

Set stable hostname:

# View current
hostnamectl

# Set persistent hostname
sudo hostnamectl set-hostname prod-app-01.example.com

# Verify
cat /etc/hostname
hostnamectl

# Update /etc/hosts (localhost resolution)
sudo sed -i 's/127.0.1.1.*/127.0.1.1 prod-app-01.example.com prod-app-01/' /etc/hosts
grep 127.0.1.1 /etc/hosts

# Test
hostname
hostname -f

In Kubernetes/cloud environments:

# Use cloud-init to set hostname from instance metadata
# cloud.yaml (cloud-init)
hostname: prod-app-01
fqdn: prod-app-01.example.com
prefer_fqdn_over_hostname: true

# Or via user-data script
#!/bin/bash
hostnamectl set-hostname $(ec2-metadata --instance-id | cut -d' ' -f2).example.com

3. Logging: journald Persistent & logrotate

journald Persistent Storage

What it is:

  • journald = systemd logging daemon (replaces syslog)
  • With the default Storage=auto and no /var/log/journal/ directory, logs live in /run/log/journal/ (volatile, lost on reboot)
  • Persistent mode: Logs in /var/log/journal/ (survives reboot)
  • Indexed, queryable, per-unit organization

Enable persistent journald:

# Create directory
sudo mkdir -p /var/log/journal

# Let systemd apply the expected ownership and ACLs (root:systemd-journal, mode 2755)
sudo systemd-tmpfiles --create --prefix /var/log/journal

# Configure: /etc/systemd/journald.conf
sudo tee /etc/systemd/journald.conf > /dev/null << 'JOURNALD'
[Journal]
# Storage (auto=persistent if /var/log/journal exists, volatile otherwise)
Storage=persistent

# Max size: 10% of /var or explicit value
SystemMaxUse=10G
RuntimeMaxUse=256M

# Max file size before rotation
SystemMaxFileSize=100M
RuntimeMaxFileSize=10M

# Retention: keep for N days
MaxRetentionSec=30day

# Forward to syslog (optional, for legacy log aggregation)
ForwardToSyslog=no

# Compress old journals
Compress=yes

# Split by UID
SplitMode=uid

# Sync to disk frequency (trade-off: speed vs. durability)
SyncIntervalSec=5min
JOURNALD

# Apply
sudo systemctl restart systemd-journald

# Verify
sudo journalctl --list-boots  # After the next reboot this should list multiple boots
sudo du -sh /var/log/journal/

Query persistent logs:

# Logs from last boot
journalctl -b

# Logs from previous boot (-1 = last, -2 = second-last)
journalctl -b -1

# Logs in time range
journalctl --since "2025-10-16 10:00:00" --until "2025-10-16 11:00:00"

# Per-unit
journalctl -u sshd.service -n 100  # Last 100 lines

# Priority filter
journalctl -p err          # Errors and more severe
journalctl -p warning      # Warnings and more severe (includes errors)

# Follow (tail -f style)
journalctl -f
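
journald can also report and trim its own disk usage on demand, independent of the SystemMaxUse limits above:

# How much space the journal currently uses
journalctl --disk-usage

# Trim on demand - by size, age, or number of archive files
sudo journalctl --vacuum-size=2G
sudo journalctl --vacuum-time=30d
sudo journalctl --vacuum-files=10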

logrotate: Disk Management

What it is:

  • Rotates log files when they grow too large
  • Compresses old logs
  • Deletes old logs after N days
  • Runs daily via cron/timer

Install & configure:

# Usually pre-installed
apt install logrotate

# Main config: /etc/logrotate.conf
sudo cat /etc/logrotate.conf

# Per-app config: /etc/logrotate.d/myapp
sudo tee /etc/logrotate.d/myapp > /dev/null << 'LOGROTATE'
/var/log/myapp/*.log {
    # Rotate when file > 100MB
    size 100M
    
    # Keep 30 old compressed logs
    rotate 30
    
    # Compress old logs (gzip)
    compress
    
    # Don't compress yesterday's log (keep readable)
    delaycompress
    
    # Don't error if the log file is missing
    missingok
    
    # Skip rotation when the log is empty
    notifempty
    
    # Create new log file with these perms
    create 0640 myapp myapp
    
    # Reload app after rotation (send signal)
    sharedscripts
    postrotate
        systemctl reload myapp > /dev/null 2>&1 || true
    endscript
}
LOGROTATE

# Test rotation (dry-run)
sudo logrotate -d /etc/logrotate.d/myapp

# Force rotation (for testing)
sudo logrotate -f /etc/logrotate.d/myapp

# Check schedule
cat /etc/cron.daily/logrotate  # Usually runs daily
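
# On systemd-based distros, rotation may be driven by a timer rather than cron
systemctl list-timers logrotate.timer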

Separate /var/log Partition

Why:

  • Prevents log disk-full from crashing system
  • Easier to manage quota independently
  • Can use different filesystem/performance tuning

Setup on existing system:

# 1. Create logical volume (if using LVM)
sudo lvcreate -L 50G -n var_log vg0

# 2. Format
sudo mkfs.ext4 /dev/vg0/var_log

# 3. Backup current logs
sudo tar -czf /tmp/var-log-backup.tar.gz /var/log

# 4. Mount new partition
sudo mkdir -p /var/log.new
sudo mount /dev/vg0/var_log /var/log.new

# 5. Restore logs (the archive stores paths as var/log/..., so strip two components)
sudo tar -xzf /tmp/var-log-backup.tar.gz -C /var/log.new --strip-components=2

# 6. Add to /etc/fstab for persistence
echo "/dev/vg0/var_log /var/log ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

# 7. Unmount the staging mountpoint, then mount /var/log from fstab
#    (do this in a maintenance window with log writers stopped, e.g. rsyslog/journald,
#     or from rescue mode, so no log lines are lost during the switch)
sudo umount /var/log.new
sudo mount /var/log
sudo df -h /var/log

4. Clean Shutdown Hooks

Why Graceful Shutdown Matters

Problem: SIGKILL (immediate shutdown) → data loss, corrupted files, incomplete transactions
Solution: Shutdown hooks → services flush data, connections close gracefully
Benefit: Faster recovery, no fsck/repair needed

systemd Shutdown Hooks

What it is:

  • ExecStop = graceful shutdown command
  • TimeoutStopSec = max wait (then SIGKILL)
  • KillMode = how to kill the process (control-group, process, mixed)

Example hardened service with graceful shutdown:

[Unit]
Description=My Database
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=postgres
ExecStart=/usr/bin/postgres -D /var/lib/postgresql

# Graceful shutdown: send SIGTERM, wait for clean exit
ExecStop=/usr/bin/pg_ctl -D /var/lib/postgresql stop -m fast

# Max 60 seconds for graceful shutdown (then SIGKILL)
TimeoutStopSec=60

# Kill entire process group (children too)
KillMode=control-group

# Kill signal (default SIGTERM, alternative SIGINT)
KillSignal=SIGTERM

# Restart policy
Restart=on-failure
RestartSec=5s

# Resource limits
MemoryMax=4G
CPUQuota=80%

[Install]
WantedBy=multi-user.target

Load and test:

sudo systemctl daemon-reload
sudo systemctl start mydb.service

# Trigger graceful shutdown
sudo systemctl stop mydb.service
# Watch logs
sudo journalctl -u mydb.service -f

# Verify exit was clean (should be 0)
sudo systemctl show -p ExecMainStatus mydb.service

Stop & Post-Stop Hooks (ExecStop / ExecStopPost)

Scenario: Notify the app before it exits (ExecStop), then archive its state once the process is dead (ExecStopPost)

[Service]
ExecStart=/opt/myapp/run

# Pre-stop: notify app of shutdown
ExecStop=/bin/sh -c 'curl -s http://localhost:8080/shutdown || true'

# Post-stop: archive state (after process dead)
ExecStopPost=/bin/sh -c 'tar -czf /var/backups/myapp-state.tar.gz /var/lib/myapp && logger "Saved state"'

TimeoutStopSec=30
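
A quick sanity check that both hooks ran, assuming the unit above is installed as myapp.service and uses the paths shown:

sudo systemctl stop myapp.service

# ExecStopPost should have produced the state archive
ls -lh /var/backups/myapp-state.tar.gz

# The unit's journal should show the shutdown sequence (and the "Saved state" log line)
journalctl -u myapp.service -n 20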

5. Watchdog Timer

Why Watchdog

Problem: Process hangs (deadlock, infinite loop) → looks running but unresponsive
Solution: Watchdog timer → if not "pinged", reboot system
Benefit: Self-healing (brief reboot better than hung state)

systemd Watchdog

What it is:

  • WatchdogSec (per service) = interval within which the service must "kick" the watchdog via sd_notify("WATCHDOG=1"); if it doesn't, systemd aborts it (and Restart= brings it back)
  • RuntimeWatchdogSec (system-wide) = hardware watchdog; the machine resets if systemd (PID 1) stops pinging it for N seconds

Enable watchdog (system-wide):

# /etc/systemd/system.conf
sudo tee -a /etc/systemd/system.conf > /dev/null << 'WATCHDOG'
# Hardware watchdog: reset the machine if PID 1 stops pinging for 20s
# (requires a watchdog device, /dev/watchdog)
RuntimeWatchdogSec=20
# Disable the reboot watchdog (which would force a hard reset if shutdown hangs)
ShutdownWatchdogSec=0
WATCHDOG

# system.conf changes require re-executing systemd (daemon-reload is not enough)
sudo systemctl daemon-reexec

Per-service watchdog:

[Service]
# Type=notify so systemd accepts sd_notify() messages from the main process
Type=notify
ExecStart=/usr/bin/myapp

# Service must "kick" the watchdog at least once every 10 seconds
WatchdogSec=10

# If the app doesn't send WATCHDOG=1 in time, systemd aborts it;
# Restart= then brings it back up
Restart=on-failure

In app code (example: Python):

import time

import systemd.daemon  # from the python3-systemd package

# Tell systemd the service is up (required with Type=notify)
systemd.daemon.notify('READY=1')

while True:
    # Do work
    process_request()

    # Kick the watchdog well inside WatchdogSec (here: every 5 s for WatchdogSec=10)
    systemd.daemon.notify('WATCHDOG=1')

    time.sleep(5)
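
To verify the watchdog actually recovers a hung service, one rough test (myapp.service is a placeholder unit name) is to freeze the process so it stops sending WATCHDOG=1 and watch systemd restart it:

# Confirm the watchdog interval systemd is enforcing
systemctl show -p WatchdogUSec myapp.service

# Simulate a hang: SIGSTOP the main process so it can't kick the watchdog
sudo kill -STOP $(systemctl show -p MainPID --value myapp.service)

# Within WatchdogSec systemd aborts the service (escalating to SIGKILL if needed),
# then Restart= brings it back; watch it happen:
journalctl -u myapp.service -f
# ... expect a message like "Watchdog timeout (limit 10s)!" followed by a restart

# Afterwards, confirm the restart counter went up
systemctl show -p NRestarts,ActiveState myapp.service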

6. Kernel & Livepatch Policy

Kernel Update Strategy

Options:

  1. Scheduled reboot (safest): Deploy patch, reboot at maintenance window
  2. Livepatch (zero-downtime): Hot-patch without reboot
  3. Hybrid (recommended): Use livepatch for urgent CVEs, schedule full reboot monthly

Trade-off matrix:

Approach           Downtime          Complexity   Rollback                            Risk
Scheduled reboot   Brief (seconds)   Low          Easy (boot old kernel)              Low
Livepatch          None              High         Complex (state inconsistency)       Medium
Hybrid             None/brief        Medium       Easy (revert live patch, reboot)    Low-Medium
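
For the scheduled-reboot and hybrid approaches on Debian/Ubuntu, one common sketch is to let unattended-upgrades install updates and reboot inside a maintenance window; the drop-in filename and the 03:30 window below are placeholders (RHEL has an analogous dnf-automatic setup):

# /etc/apt/apt.conf.d/52kernel-reboot (drop-in; security origins are already enabled
# in the stock 50unattended-upgrades file)
sudo tee /etc/apt/apt.conf.d/52kernel-reboot > /dev/null << 'APT'
// Reboot automatically when an installed update requires it, inside the window
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:30";
APT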

Livepatch Setup (RHEL/Ubuntu Pro)

What it is:

  • kpatch (Red Hat) or livepatch (Canonical): In-kernel hot patching
  • Patches applied without reboot
  • A loaded patch lasts until the next boot unless it is installed to load automatically; after a full kernel update + reboot it is no longer needed

RHEL with kpatch:

# Install the kpatch tooling
sudo dnf install kpatch

# Install the live-patch package for the running kernel (loads it, no reboot)
sudo dnf install "kpatch-patch = $(uname -r)"

# Verify: shows loaded and installed patch modules
kpatch list

# Load a patch module manually (if not delivered as a package)
sudo kpatch load /path/to/patch-module.ko

# Persist a manually loaded module across reboots
sudo kpatch install /path/to/patch-module.ko

# Revert (if issues) - pass the module name shown by `kpatch list`
sudo kpatch unload PATCH_MODULE_NAME

# Full kernel update + reboot (eventually)
sudo dnf update kernel
sudo reboot

Ubuntu with livepatch (Pro only):

# Attach the machine to Ubuntu Pro, then enable Livepatch
sudo pro attach YOUR_TOKEN
sudo pro enable livepatch

# (Alternative) install and enable the client directly
sudo snap install canonical-livepatch
sudo canonical-livepatch enable YOUR_TOKEN

# Check status & applied patches
canonical-livepatch status --verbose
ls /sys/kernel/livepatch/

# Disable
sudo canonical-livepatch disable

Tested Rollback Path

Mandatory for production:

  1. Pre-deployment test:
# 1. Test kernel in staging environment
# 2. Verify app starts, basic functionality works
# 3. Test rollback (reboot to previous kernel)
  2. Kernel menu entry (GRUB):
# Verify old kernel still in GRUB menu
sudo grep -A2 "^menuentry" /boot/grub/grub.cfg | head -20

# Test: Reboot and select old kernel from menu
# Verify system boots & app functional
  3. Automatic recovery (if the kernel panics):
# /etc/default/grub
GRUB_DEFAULT=0                          # Boot the newest kernel (first menu entry)
GRUB_RECORDFAIL_TIMEOUT=60              # After a failed boot, show the GRUB menu for 60s

# Update grub
sudo grub-mkconfig -o /boot/grub/grub.cfg

# panic=10 reboots the machine 10 seconds after a kernel panic
sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="panic=10 /' /etc/default/grub
sudo grub-mkconfig -o /boot/grub/grub.cfg

# Test (non-production): force a panic, e.g. `echo c | sudo tee /proc/sysrq-trigger`
# (with sysrq enabled), then verify the box reboots and GRUB shows the menu so the
# previous kernel can be selected
  4. Document rollback procedure:
# Runbook: /docs/kernel-rollback.md

# Manual rollback steps:
# 1. Reboot server
# 2. At GRUB menu, select older kernel entry
# 3. Boot
# 4. Verify app operational
# 5. (Optional) Remove new kernel: apt remove linux-image-NEW
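
To make sure a previous kernel is actually there to roll back to, it helps to pin kernel retention explicitly; a sketch for both families (linux-image-OLD-VERSION is a placeholder for your known-good kernel package):

# Debian/Ubuntu: hold the known-good kernel so autoremove can't take it
sudo apt-mark hold linux-image-OLD-VERSION

# RHEL/Fedora: dnf keeps several kernels installed (default limit is 3)
grep installonly_limit /etc/dnf/dnf.conf

# Verify both kernels are present on disk and in the GRUB menu
ls /boot/vmlinuz-*
sudo awk -F\' '/menuentry /{print $2}' /boot/grub/grub.cfg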

Reliability Checklist

Pre-Deployment

  • Chrony installed & syncing (offset < 1ms)
  • Hostname set persistently (hostnamectl)
  • Network interfaces have predictable names (eno1, enp0s25)
  • journald configured for persistent storage (/var/log/journal/)
  • /var/log on separate partition (if possible) with adequate space
  • logrotate configured for all log paths
  • Services have graceful shutdown hooks (ExecStop)
  • Critical services have watchdog timers
  • Kernel rollback path tested (GRUB old entry, manual reboot)
  • Livepatch staging tested (if using livepatch)

Post-Deployment

  • Logs persistent across reboot (sudo journalctl -b -1)
  • Time synced on multiple boots (chronyc tracking)
  • Hostname persists after reboot (hostnamectl)
  • Network interfaces keep same name after reboot (ip link)
  • Service graceful shutdown tested (no hard kills)
  • Watchdog recovery tested (force hang, verify reboot)
  • Kernel patches applied (check uname -r)
  • Previous kernel available in GRUB menu
  • Log rotation working (sudo logrotate -f /etc/logrotate.d/*)

Ongoing

  • Weekly: Check time sync drift (chronyc tracking → offset)
  • Weekly: Monitor disk usage (du -sh /var/log)
  • Monthly: Test service restart (verify graceful shutdown)
  • Monthly: Test kernel rollback (if not using auto-rollback)
  • Quarterly: Test full system reboot (boot, apps running, logs persist)
  • Quarterly: Verify livepatch/kpatch still patching

Quick Reference Commands

# Time Sync
chronyc sources          # NTP sources & sync status
chronyc tracking         # Detailed sync info
timedatectl status       # systemd time view

# Networking
hostnamectl              # View/set hostname
ip link show             # Interface names & status
cat /proc/cmdline        # Check that net.ifnames=0 / biosdevname=0 are absent

# Logging
sudo journalctl -b       # This boot's logs
sudo journalctl -b -1    # Previous boot
sudo du -sh /var/log/journal/  # Journal size
sudo logrotate -d /etc/logrotate.d/*  # Test rotation

# Shutdown
sudo systemctl show -p ExecMainStatus SERVICE  # Exit code
sudo journalctl -u SERVICE -n 50  # Last 50 lines

# Watchdog
systemctl show -p WatchdogUSec SERVICE  # Per-service watchdog interval
wdctl                                   # Hardware watchdog status (if present)

# Kernel & Livepatch
uname -r                 # Current kernel version
cat /proc/cmdline        # Kernel params at boot
sudo grub-set-default 0  # Set default kernel (GRUB)
kpatch list              # (Red Hat) Applied patches
canonical-livepatch status  # (Ubuntu Pro) Patch status

Common Pitfalls

Pitfall 1: journald Loses Logs on Reboot

Problem: Logs in /run/log/journal/ disappear after reboot
Fix: Enable persistent storage: mkdir -p /var/log/journal, systemd-tmpfiles --create --prefix /var/log/journal, then systemctl restart systemd-journald

Pitfall 2: Service Killed Abruptly on Shutdown

Problem: TimeoutStopSec too short → process is SIGKILLed before it finishes flushing
Fix: Increase timeout, verify ExecStop command works standalone: sudo -u SERVICE_USER /path/to/stop/cmd
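
A non-destructive way to raise the timeout is a drop-in override rather than editing the vendor unit; SERVICE and the 5-minute value are placeholders:

# Creates /etc/systemd/system/SERVICE.service.d/override.conf
sudo systemctl edit SERVICE.service

# In the editor, add:
#   [Service]
#   TimeoutStopSec=300

# Confirm the effective value
systemctl show -p TimeoutStopUSec SERVICE.service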

Pitfall 3: Watchdog Reboots Too Often

Problem: Legitimate workload takes > WatchdogSec to "kick"
Fix: Increase WatchdogSec, ensure app calls sd_notify(0, "WATCHDOG=1") periodically

Pitfall 4: Can’t Rollback to Old Kernel

Problem: Old kernel removed (apt autoremove) or not in GRUB
Fix: Keep at least 2 kernels installed (apt-mark hold linux-image-OLD), then regenerate GRUB config: sudo update-grub

Pitfall 5: Chrony Sync Lost in VM/Container

Problem: VM/container clocks drift or jump (paused or migrated VMs); containers can't set the system clock themselves
Fix: Sync the host and let containers inherit its clock; for VMs, sync against the hypervisor clock or accept a higher offset tolerance (see the sketch below)
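
For KVM guests, one sketch is to sync the guest directly from the hypervisor's PTP clock instead of (or in addition to) network NTP; this assumes the ptp_kvm module and a /dev/ptp0 device are available in the guest:

# Load the KVM PTP clock driver (and persist it across reboots)
sudo modprobe ptp_kvm
echo ptp_kvm | sudo tee /etc/modules-load.d/ptp_kvm.conf

# Add the host clock as a reference in /etc/chrony/chrony.conf:
#   refclock PHC /dev/ptp0 poll 2

sudo systemctl restart chrony
chronyc sources   # a PHC0 reference should appear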


Further Reading