Executive Summary
Reliability means predictable, auditable behavior. This guide covers:
- Time sync: Chrony for clock accuracy (critical for logging, security)
- Networking: Stable interface names & hostnames (infrastructure consistency)
- Logging: Persistent journald + logrotate (audit trail + disk management)
- Shutdown: Clean hooks to prevent data loss
- Patching: Kernel updates with livepatch (zero-downtime), tested rollback
1. Time Synchronization (chrony)
Why Time Matters
Critical for:
- Logging: Accurate timestamps for debugging, compliance audits
- Security: TLS cert validation, Kerberos, API token expiry
- Distributed systems: Causality ordering (happens-before relationships)
- Monitoring: Alert timing, metric correlation
Cost of poor time sync:
- Logs from different servers appear out-of-order
- API clients rejected (clock skew > tolerance)
- Kerberos auth fails (clock > 5 min off)
- Monitoring alerts triggered on old data
Chrony Installation & Configuration
What it is:
- Modern NTP daemon (faster sync than ntpd)
- Handles drifting clocks, network jitter
- Accurate to milliseconds (suitable for most workloads)
- Survives network outages, re-syncs quickly
Install & configure (Ubuntu/Debian):
apt install chrony
# Configure: /etc/chrony/chrony.conf
sudo tee /etc/chrony/chrony.conf > /dev/null << 'CHRONY'
# Default NTP servers (Debian/Ubuntu defaults)
pool ntp.ubuntu.com iburst
# Or, use specific servers (lower latency):
server time.cloudflare.com iburst
server time.google.com iburst
server time.aws.com iburst
# Allow local clients (NTP query from localhost)
allow 127.0.0.1
allow ::1
# For containers: allow from Docker/Kubernetes subnet
allow 172.17.0.0/16
allow 10.0.0.0/8
# Drift file (tracks clock rate)
driftfile /var/lib/chrony/chrony.drift
# Leap seconds file
leapsectz right/UTC
# Step the clock only during the first 3 updates if the offset exceeds 1 s;
# after that, always slew gradually (better for running apps)
makestep 1.0 3
# RTC (real-time clock) sync
rtcsync
# Enable hardware timestamping (if supported)
hwtimestamp *
CHRONY
# Apply
sudo systemctl restart chrony
sudo systemctl enable chrony
# Verify
chronyc sources # Show NTP sources
chronyc tracking # Show sync details & estimated error
timedatectl show # systemd time status
Install & configure (RHEL/Fedora):
dnf install chrony
# On RHEL/Fedora the config lives at /etc/chrony.conf (same directives as above)
# Default RHEL pool: 2.rhel.pool.ntp.org
sudo systemctl enable chronyd
sudo systemctl restart chronyd
# Verify
chronyc sources
timedatectl status
Time Sync Verification & Monitoring
Check sync status:
# Current time vs NTP source
timedatectl
# Chrony details
chronyc tracking
# Output:
# Reference ID : C0248D97 (time.cloudflare.com)
# Stratum : 3
# Ref time (UTC) : Wed Oct 16 12:34:56 2025
# System time : 0.000001234 seconds fast
# Last offset : +0.000001234 seconds
# RMS offset : 0.000000789 seconds
# Residual freq : 0.001 ppm
# Residual skew : 0.002 ppm
# Root delay : 0.023456 seconds
# Root dispersion : 0.012345 seconds
# Update interval : 64.2 seconds
# Leap status : Normal
# Offset should be < 1ms for production
Monitor in production:
# Prometheus exporter (node-exporter includes time metrics)
curl localhost:9100/metrics | grep node_time
# Manual check (every minute)
chronyc tracking | grep -E 'Last offset|RMS offset|Root delay' # offset, jitter, delay
# Alert on large offset (> 10ms); a cron-able sketch follows below
# Alert on sync loss (chronyc tracking shows "Leap status : Not synchronised")
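A minimal cron-able check for the 10 ms threshold above, as a sketch; the threshold and the logger call are illustrative, so wire it into your own alerting:
#!/bin/bash
# Flag a warning when the absolute NTP offset exceeds 10 ms
offset=$(chronyc tracking | awk '/Last offset/ {print $4}')   # signed seconds
abs=$(awk -v o="$offset" 'BEGIN {print (o < 0 ? -o : o)}')
if awk -v a="$abs" 'BEGIN {exit !(a > 0.010)}'; then
  logger -p user.warning "chrony offset ${offset}s exceeds 10ms threshold"
fi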
2. Networking: Stable Interface Names & Hostnames
Why Predictable Networking Matters
Problem: Driver probe order or BIOS/firmware reordering changes enumeration → eth0 becomes eth1 → network config breaks
Solution: Predictable names + systemd-networkd, netplan, or cloud-init
Bonus: Infrastructure-as-code can reference stable names
Predictable Interface Names
systemd naming scheme:
- Prefix: en = Ethernet, wl = Wireless LAN, ww = Wireless WAN
- Suffix: o<index> = on-board (by firmware/device index), s<slot> = hotplug slot, p<bus>s<slot> = PCI bus/slot address
- Example: eno1 = on-board Ethernet #1, enp0s25 = PCI bus 0, slot 25
Check current scheme:
ip link show
# 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
# 3: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
Verify predictability is enabled:
# Check kernel command line (should NOT have net.ifnames=0)
cat /proc/cmdline
# Verify no udev/systemd override disables predictable names (e.g. 80-net-setup-link.rules symlinked to /dev/null)
ls -la /etc/udev/rules.d/ | grep -i net
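If a name must survive hardware reshuffles entirely, a systemd .link file can pin it to the NIC's MAC address. A minimal sketch; the MAC and the name lan0 are placeholders you replace:
sudo tee /etc/systemd/network/10-lan0.link > /dev/null << 'LINK'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff
[Link]
Name=lan0
LINK
# Rebuild the initramfs so early-boot renaming sees the rule, then reboot
sudo update-initramfs -u   # Debian/Ubuntu; use dracut -f on RHEL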
If using DHCP or cloud-init, ensure stable config:
# /etc/netplan/00-installer-config.yaml (Ubuntu)
sudo tee /etc/netplan/00-installer-config.yaml > /dev/null << 'NETPLAN'
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true
      dhcp6: true
      optional: true
    enp0s25:
      dhcp4: true
      optional: true
NETPLAN
# Apply
sudo netplan apply
# Verify
ip addr show
Stable Hostnames
What it is:
- hostname command (transient, lost on reboot)
- /etc/hostname file (persistent, survives reboot)
- hostnamectl (systemd tool, recommended)
- DNS A record (matches hostname, useful for orchestration; see the check after the commands below)
Set stable hostname:
# View current
hostnamectl
# Set persistent hostname
sudo hostnamectl set-hostname prod-app-01.example.com
# Verify
cat /etc/hostname
hostnamectl
# Update /etc/hosts (localhost resolution)
sudo sed -i 's/127.0.1.1.*/127.0.1.1 prod-app-01.example.com prod-app-01/' /etc/hosts
grep 127.0.1.1 /etc/hosts
# Test
hostname
hostname -f
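A quick check that the hostname lines up with DNS, assuming an A record already exists for the FQDN:
# The address DNS returns for our FQDN should appear in the addresses we hold
dig +short "$(hostname -f)"
hostname -I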
In Kubernetes/cloud environments:
# Use cloud-init to set hostname from instance metadata
# cloud.yaml (cloud-init)
hostname: prod-app-01
fqdn: prod-app-01.example.com
prefer_fqdn_over_hostname: true
# Or via user-data script
#!/bin/bash
hostnamectl set-hostname $(ec2-metadata --instance-id | cut -d' ' -f2).example.com
3. Logging: journald Persistent & logrotate
journald Persistent Storage
What it is:
- journald = systemd logging daemon (replaces classic syslog)
- Without /var/log/journal, logs live in /run/log/journal/ (volatile, lost on reboot)
- Persistent mode: logs in /var/log/journal/ (survive reboot)
- Indexed, queryable, organized per unit
Enable persistent journald:
# Create directory
sudo mkdir -p /var/log/journal
# Apply the packaged ownership/ACLs (per man journald.conf; an alternative is chmod 2755)
sudo systemd-tmpfiles --create --prefix /var/log/journal
# Configure: /etc/systemd/journald.conf
sudo tee /etc/systemd/journald.conf > /dev/null << 'JOURNALD'
[Journal]
# Storage (auto=persistent if /var/log/journal exists, volatile otherwise)
Storage=persistent
# Max size: 10% of /var or explicit value
SystemMaxUse=10G
RuntimeMaxUse=256M
# Max file size before rotation
SystemMaxFileSize=100M
RuntimeMaxFileSize=10M
# Retention: keep for N days
MaxRetentionSec=30day
# Forward to syslog (optional, for legacy log aggregation)
ForwardToSyslog=no
# Compress old journals
Compress=yes
# Split by UID
SplitMode=uid
# Sync to disk frequency (trade-off: speed vs. durability)
SyncIntervalSec=5min
JOURNALD
# Apply
sudo systemctl restart systemd-journald
# Verify
sudo journalctl --list-boots # Should show multiple boots
sudo du -sh /var/log/journal/
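If the journal has already grown past the new limits, it can be trimmed immediately; the size and age below are illustrative:
sudo journalctl --vacuum-size=5G    # Keep at most 5 GB of archived journals
sudo journalctl --vacuum-time=30d   # Drop entries older than 30 days
sudo journalctl --disk-usage        # Confirm the result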
Query persistent logs:
# Logs from last boot
journalctl -b
# Logs from previous boot (-1 = last, -2 = second-last)
journalctl -b -1
# Logs in time range
journalctl --since "2025-10-16 10:00:00" --until "2025-10-16 11:00:00"
# Per-unit
journalctl -u sshd.service -n 100 # Last 100 lines
# Priority filter
journalctl -p err # Errors only
journalctl -p warning # Warnings and more severe (includes errors)
# Follow (tail -f style)
journalctl -f
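For shipping into a log aggregator, entries can be exported as JSON; a sketch assuming jq is installed and myapp.service is your unit:
# Dump today's entries for one unit as JSON, keep timestamp + message
journalctl -u myapp.service --since today -o json --no-pager \
  | jq -r '[.__REALTIME_TIMESTAMP, .MESSAGE] | @tsv'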
logrotate: Disk Management
What it is:
- Rotates log files when they grow too large
- Compresses old logs
- Deletes old logs after N days
- Runs daily via cron/timer
Install & configure:
# Usually pre-installed
apt install logrotate
# Main config: /etc/logrotate.conf
sudo cat /etc/logrotate.conf
# Per-app config: /etc/logrotate.d/myapp
sudo tee /etc/logrotate.d/myapp > /dev/null << 'LOGROTATE'
/var/log/myapp/*.log {
# Rotate when file > 100MB
size 100M
# Keep 30 old compressed logs
rotate 30
# Compress old logs (gzip)
compress
# Delay compression by one rotation (keep the newest rotated log readable)
delaycompress
# Don't error if the log file is missing
missingok
# Skip rotation when the log is empty
notifempty
# Create new log file with these perms
create 0640 myapp myapp
# Reload app after rotation (send signal)
sharedscripts
postrotate
systemctl reload myapp > /dev/null 2>&1 || true
endscript
}
LOGROTATE
# Test rotation (dry-run)
sudo logrotate -d /etc/logrotate.d/myapp
# Force rotation (for testing)
sudo logrotate -f /etc/logrotate.d/myapp
# Check schedule
cat /etc/cron.daily/logrotate # Daily cron job (on systemd distros: systemctl list-timers logrotate.timer)
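A quick end-to-end test of the config above; the paths are the ones from the example, and the myapp user from the create directive must exist:
sudo mkdir -p /var/log/myapp
echo "test entry" | sudo tee /var/log/myapp/app.log > /dev/null
sudo logrotate -f /etc/logrotate.d/myapp   # -f forces rotation regardless of size
ls -lh /var/log/myapp/                     # Expect a fresh app.log plus app.log.1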
Separate /var/log Partition
Why:
- Prevents log disk-full from crashing system
- Easier to manage quota independently
- Can use different filesystem/performance tuning
Setup on existing system:
# 1. Create logical volume (if using LVM)
sudo lvcreate -L 50G -n var_log vg0
# 2. Format
sudo mkfs.ext4 /dev/vg0/var_log
# 3. Backup current logs
sudo tar -czf /tmp/var-log-backup.tar.gz /var/log
# 4. Mount new partition
sudo mkdir -p /var/log.new
sudo mount /dev/vg0/var_log /var/log.new
# 5. Restore logs
sudo tar -xzf /tmp/var-log-backup.tar.gz -C /var/log.new --strip-components=2
# 6. Add to /etc/fstab for persistence
echo "/dev/vg0/var_log /var/log ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
# 7. Unmount the staging mount, then mount the new volume at /var/log (via the fstab entry)
sudo umount /var/log.new
sudo mount /var/log
sudo df -h /var/log
# Note: the old files remain hidden beneath the mount point; reclaim that space from a rescue shell or before the first mount
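Before rebooting, it's worth validating the fstab entry and the mount (findmnt ships with util-linux):
sudo findmnt --verify           # Syntax/target check of every fstab entry
findmnt /var/log                # Confirm /var/log sits on the new logical volume
sudo du -sh /var/log/journal/   # After the next reboot, confirm journald is writing here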
4. Clean Shutdown Hooks
Why Graceful Shutdown Matters
Problem: SIGKILL (abrupt termination) → data loss, corrupted files, incomplete transactions
Solution: Shutdown hooks → services flush data, connections close gracefully
Benefit: Faster recovery, no fsck/repair needed
systemd Shutdown Hooks
What it is:
- ExecStop = graceful shutdown command
- TimeoutStopSec = max wait (then SIGKILL)
- KillMode = how remaining processes are killed (control-group, process, mixed)
Example hardened service with graceful shutdown:
[Unit]
Description=My Database
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=postgres
ExecStart=/usr/bin/postgres -D /var/lib/postgresql
# Graceful shutdown: send SIGTERM, wait for clean exit
ExecStop=/usr/bin/pg_ctl -D /var/lib/postgresql stop -m fast
# Max 60 seconds for graceful shutdown (then SIGKILL)
TimeoutStopSec=60
# Kill entire process group (children too)
KillMode=control-group
# Kill signal (default SIGTERM, alternative SIGINT)
KillSignal=SIGTERM
# Restart policy
Restart=on-failure
RestartSec=5s
# Resource limits (MemoryLimit= is deprecated in favor of MemoryMax=)
MemoryMax=4G
CPUQuota=80%
[Install]
WantedBy=multi-user.target
Load and test:
sudo systemctl daemon-reload
sudo systemctl start mydb.service
# Trigger graceful shutdown
sudo systemctl stop mydb.service
# Watch logs
sudo journalctl -u mydb.service -f
# Verify exit was clean (should be 0)
sudo systemctl show -p ExecMainStatus mydb.service
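To confirm the stop path finishes inside TimeoutStopSec, time it and look for a forced kill in the journal (the grep pattern is a heuristic):
time sudo systemctl stop mydb.service
sudo journalctl -u mydb.service -n 20 --no-pager | grep -iE 'sigkill|killing' \
  || echo "stopped cleanly within the timeout"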
Pre/Post-Stop Hooks (ExecStop, ExecStopPost)
Scenario: extra work at shutdown time: ask the app to drain before it exits, then archive state once the process is gone
[Service]
ExecStart=/opt/myapp/run
# ExecStop: ask the app to drain; systemd still sends SIGTERM to anything left afterwards
ExecStop=/bin/sh -c 'curl -s http://localhost:8080/shutdown || true'
# Post-stop: archive state (after process dead)
ExecStopPost=/bin/sh -c 'tar -czf /var/backups/myapp-state.tar.gz /var/lib/myapp && logger "Saved state"'
TimeoutStopSec=30
5. Watchdog Timer
Why Watchdog
Problem: Process hangs (deadlock, infinite loop) → looks running but unresponsive
Solution: Watchdog timer → if the service isn't "pinged" in time, systemd restarts it (or, with the system-wide watchdog, reboots the machine)
Benefit: Self-healing (a brief restart or reboot beats a hung state)
systemd Watchdog
What it is:
- WatchdogSec = per-service: interval within which the service must "kick" the watchdog via sd_notify
- RuntimeWatchdogSec = system-wide: reboot (via the hardware watchdog) if PID 1 stops kicking for N seconds
Enable watchdog (system-wide):
# /etc/systemd/system.conf
sudo tee -a /etc/systemd/system.conf > /dev/null << 'WATCHDOG'
# Reboot if systemd hangs (not responding to signals)
RuntimeWatchdogSec=20
# Don't reboot on normal shutdown
ShutdownWatchdogSec=0
WATCHDOG
sudo systemctl daemon-reexec # system.conf changes require re-executing PID 1 (or a reboot)
Per-service watchdog:
[Service]
ExecStart=/usr/bin/myapp
# "Kick" watchdog every 10 seconds
WatchdogSec=10
# If the app doesn't send sd_notify("WATCHDOG=1") within 10 s, systemd marks it failed and kills it
Restart=on-watchdog
In app code (example: Python):
import time
import systemd.daemon  # python3-systemd bindings

def process_request():
    pass  # app-specific work goes here

while True:
    process_request()
    # Kick the watchdog every 5 seconds (must be < WatchdogSec)
    try:
        systemd.daemon.notify('WATCHDOG=1')
    except Exception:
        pass  # not running under systemd / bindings unavailable
    time.sleep(5)
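To see the watchdog fire, pause the process so WATCHDOG=1 stops arriving and watch systemd recover it; myapp.service is a placeholder:
systemctl show -p WatchdogUSec myapp.service                          # Configured interval
sudo kill -STOP "$(systemctl show -p MainPID --value myapp.service)"  # Freeze the main process
sleep 15                                                              # Wait past WatchdogSec
systemctl status myapp.service --no-pager | head -n 5                 # Expect a watchdog-triggered restart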
6. Kernel & Livepatch Policy
Kernel Update Strategy
Options:
- Scheduled reboot (safest): Deploy patch, reboot at maintenance window
- Livepatch (zero-downtime): Hot-patch without reboot
- Hybrid (recommended): Use livepatch for urgent CVEs, schedule full reboot monthly
Trade-off matrix:
| Approach | Downtime | Complexity | Rollback | Risk |
|---|---|---|---|---|
| Scheduled reboot | Brief (seconds) | Low | Easy (boot old kernel) | Low |
| Livepatch | None | High | Complex (state inconsistency) | Medium |
| Hybrid | None/brief | Medium | Easy (revert live patch, reboot) | Low-Medium |
Livepatch Setup (RHEL/Ubuntu Pro)
What it is:
- kpatch (Red Hat) or Canonical Livepatch (Ubuntu Pro): in-kernel hot patching
- Patches applied without a reboot
- Patches last only until the next boot; after a reboot you run the installed on-disk kernel
RHEL with kpatch:
# Install the tooling and subscribe to live patches for the running kernel
dnf install kpatch kpatch-patch
# RHEL loads the shipped patch modules automatically via kpatch.service; list what's applied
kpatch list
# Manually load a patch module (.ko) without a reboot, if needed
sudo kpatch load <patch-module>.ko
# Verify
kpatch list
# Revert (if issues)
sudo kpatch unload <patch-module>
# Full kernel update + reboot (eventually)
dnf update kernel
sudo reboot
Ubuntu with livepatch (Pro only):
# Install the client
sudo snap install canonical-livepatch
# Attach the machine to Ubuntu Pro and enable livepatch (requires a Pro token)
sudo pro attach YOUR_TOKEN
sudo pro enable livepatch
# Check status and applied fixes
canonical-livepatch status
# Kernel-level view of loaded live patches
ls /sys/kernel/livepatch/
# Disable
sudo canonical-livepatch disable
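Live patching covers the kernel, but some fixes still need a full reboot; a quick pending-reboot check (commands differ per distro):
# Ubuntu/Debian: flag file written by the update tooling
[ -f /var/run/reboot-required ] && cat /var/run/reboot-required
# RHEL/Fedora: exits non-zero and explains why if a reboot is needed
sudo dnf needs-restarting -r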
Tested Rollback Path
Mandatory for production:
- Pre-deployment test:
# 1. Test kernel in staging environment
# 2. Verify app starts, basic functionality works
# 3. Test rollback (reboot to previous kernel)
- Kernel menu entry (GRUB):
# Verify the old kernel still appears in the GRUB config (entries are often nested in a submenu)
sudo grep -E "^[[:space:]]*(menuentry|submenu)" /boot/grub/grub.cfg | head -20
# Test: Reboot and select old kernel from menu
# Verify system boots & app functional
- Automatic rollback (if kernel panics):
# /etc/default/grub
GRUB_DEFAULT=saved # Boot whichever entry grub-set-default / grub-reboot selected
GRUB_RECORDFAIL_TIMEOUT=60 # After a failed boot, show the GRUB menu for 60 s (Ubuntu)
# Update grub
sudo grub-mkconfig -o /boot/grub/grub.cfg
# Test: boot the new kernel exactly once; if it panics (panic=10 on its cmdline makes it
# reboot automatically), the next boot falls back to the saved default (known-good kernel)
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux NEW_VERSION"
sudo reboot
# If the new kernel comes up healthy, promote it to the saved default
sudo grub-set-default "Advanced options for Ubuntu>Ubuntu, with Linux NEW_VERSION"
- Document rollback procedure:
# Runbook: /docs/kernel-rollback.md
# Manual rollback steps:
# 1. Reboot server
# 2. At GRUB menu, select older kernel entry
# 3. Boot
# 4. Verify app operational
# 5. (Optional) Remove new kernel: apt remove linux-image-NEW
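A pre-flight worth running before any kernel update: confirm at least two bootable kernels are installed and pin the known-good one so autoremove can't delete it (Debian/Ubuntu shown; version strings come from your system):
dpkg --list 'linux-image-[0-9]*' | awk '/^ii/ {print $2}'   # Installed kernels
sudo apt-mark hold "linux-image-$(uname -r)"                # Keep the running kernel
grep installonly_limit /etc/dnf/dnf.conf                    # RHEL: how many kernels dnf retains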
Reliability Checklist
Pre-Deployment
- Chrony installed & syncing (offset < 1ms)
- Hostname set persistently (hostnamectl)
- Network interfaces have predictable names (eno1, enp0s25)
- journald configured for persistent storage (/var/log/journal/)
- /var/log on separate partition (if possible) with adequate space
- logrotate configured for all log paths
- Services have graceful shutdown hooks (ExecStop)
- Critical services have watchdog timers
- Kernel rollback path tested (GRUB old entry, manual reboot)
- Livepatch staging tested (if using livepatch)
Post-Deployment
- Logs persistent across reboot (sudo journalctl -b -1)
- Time synced on multiple boots (chronyc tracking)
- Hostname persists after reboot (hostnamectl)
- Network interfaces keep same name after reboot (ip link)
- Service graceful shutdown tested (no hard kills)
- Watchdog recovery tested (force hang, verify reboot)
- Kernel patches applied (check uname -r)
- Previous kernel available in GRUB menu
- Log rotation working (sudo logrotate -f /etc/logrotate.d/*)
Ongoing
- Weekly: Check time sync drift (chronyc tracking → offset)
- Weekly: Monitor disk usage (du -sh /var/log)
- Monthly: Test service restart (verify graceful shutdown)
- Monthly: Test kernel rollback (if not using auto-rollback)
- Quarterly: Test full system reboot (boot, apps running, logs persist)
- Quarterly: Verify livepatch/kpatch still patching
Quick Reference Commands
# Time Sync
chronyc sources # NTP sources & sync status
chronyc tracking # Detailed sync info
timedatectl status # systemd time view
# Networking
hostnamectl # View/set hostname
ip link show # Interface names & status
cat /proc/cmdline # Verify net.ifnames / biosdevname not disabled
# Logging
sudo journalctl -b # This boot's logs
sudo journalctl -b -1 # Previous boot
sudo du -sh /var/log/journal/ # Journal size
sudo logrotate -d /etc/logrotate.d/* # Test rotation
# Shutdown
sudo systemctl show -p ExecMainStatus SERVICE # Exit code
sudo journalctl -u SERVICE -n 50 # Last 50 lines
# Watchdog
systemctl show -p WatchdogUSec SERVICE # Per-service watchdog interval
systemctl show -p RuntimeWatchdogUSec # System-wide (PID 1) watchdog interval
# Kernel & Livepatch
uname -r # Current kernel version
cat /proc/cmdline # Kernel params at boot
sudo grub-set-default 0 # Set default kernel (GRUB)
kpatch list # (Red Hat) Applied patches
canonical-livepatch status # (Ubuntu Pro) Patch status
Common Pitfalls
Pitfall 1: journald Loses Logs on Reboot
Problem: Logs in /run/log/journal/ disappear after reboot
Fix: Enable persistent storage: mkdir -p /var/log/journal && systemctl restart systemd-journald
Pitfall 2: Service Killed Abruptly on Shutdown
Problem: TimeoutStopSec too short → process gets SIGKILL before flushing
Fix: Increase the timeout; verify the ExecStop command works standalone: sudo -u SERVICE_USER /path/to/stop/cmd
Pitfall 3: Watchdog Reboots Too Often
Problem: Legitimate workload takes longer than WatchdogSec to "kick"
Fix: Increase WatchdogSec; ensure the app calls sd_notify(0, "WATCHDOG=1") periodically
Pitfall 4: Can’t Rollback to Old Kernel
Problem: Old kernel removed (apt autoremove) or missing from GRUB
Fix: Keep 2 kernels: apt-mark hold linux-image-OLD, then verify GRUB: sudo grub-mkconfig
Pitfall 5: Chrony Sync Lost in VM/Container
Problem: VM clock jitter exceeds network jitter → chrony can't converge
Fix: Sync from the host (containers share the host clock; adjusting time inside one needs CAP_SYS_TIME / --privileged), use the hypervisor's PTP clock in VMs, or accept a higher offset tolerance (see the sketch below)
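A minimal chrony addition for KVM/Hyper-V guests that expose a paravirtual PTP device (assumes /dev/ptp0 exists in the guest):
ls /dev/ptp*   # Check the guest sees a PTP clock
echo "refclock PHC /dev/ptp0 poll 2" | sudo tee -a /etc/chrony/chrony.conf
sudo systemctl restart chrony
chronyc sources   # A PHC0 reference clock should appear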