Executive Summary
Reliability means predictable, auditable behavior. This guide covers:
- Time sync: Chrony for clock accuracy (critical for logging, security)
- Networking: Stable interface names & hostnames (infrastructure consistency)
- Logging: Persistent journald + logrotate (audit trail + disk management)
- Shutdown: Clean hooks to prevent data loss
- Patching: Kernel updates with livepatch (zero-downtime), tested rollback
1. Time Synchronization (chrony)
Why Time Matters
Critical for:
- Logging: Accurate timestamps for debugging, compliance audits
- Security: TLS cert validation, Kerberos, API token expiry
- Distributed systems: Causality ordering (happens-before relationships)
- Monitoring: Alert timing, metric correlation
Cost of poor time sync:
- Logs from different servers appear out-of-order
- API clients rejected (clock skew > tolerance)
- Kerberos auth fails (clock > 5 min off)
- Monitoring alerts triggered on old data
Chrony Installation & Configuration
What it is:
- Modern NTP daemon (faster sync than ntpd)
- Handles drifting clocks, network jitter
- Accurate to milliseconds (suitable for most workloads)
- Survives network outages, re-syncs quickly
Install & configure (Ubuntu/Debian):
apt install chrony
# Configure: /etc/chrony/chrony.conf
sudo tee /etc/chrony/chrony.conf > /dev/null << 'CHRONY'
# Default NTP servers (Debian/Ubuntu defaults)
pool ntp.ubuntu.com iburst
# Or, use specific servers (lower latency):
server time.cloudflare.com iburst
server time.google.com iburst
server time.aws.com iburst
# Allow local clients (NTP query from localhost)
allow 127.0.0.1
allow ::1
# For containers: allow from Docker/Kubernetes subnet
allow 172.17.0.0/16
allow 10.0.0.0/8
# Drift file (tracks clock rate)
driftfile /var/lib/chrony/chrony.drift
# Leap seconds file
leapsectz right/UTC
# Step the clock only during the first 3 updates if the offset exceeds 1 s;
# after that, always slew gradually (better for running apps)
makestep 1.0 3
# RTC (real-time clock) sync
rtcsync
# Enable hardware timestamping (if supported)
hwtimestamp *
CHRONY
# Apply
sudo systemctl restart chrony
sudo systemctl enable chrony
# Verify
chronyc sources # Show NTP sources
chronyc tracking # Show sync details & estimated error
timedatectl show # systemd time status
Install & configure (RHEL/Fedora):
dnf install chrony
# On RHEL/Fedora the config lives at /etc/chrony.conf (same directives as above)
# Default RHEL pool: 2.rhel.pool.ntp.org
sudo systemctl enable chronyd
sudo systemctl restart chronyd
# Verify
chronyc sources
timedatectl status
Time Sync Verification & Monitoring
Check sync status:
# Current time vs NTP source
timedatectl
# Chrony details
chronyc tracking
# Output:
# Reference ID : C0248D97 (time.cloudflare.com)
# Stratum : 3
# Ref time (UTC) : Wed Oct 16 12:34:56 2025
# System time : 0.000001234 seconds fast
# Last offset : +0.000001234 seconds
# RMS offset : 0.000000789 seconds
# Residual freq : 0.001 ppm
# Residual skew : 0.002 ppm
# Root delay : 0.023456 seconds
# Root dispersion : 0.012345 seconds
# Update interval : 64.2 seconds
# Leap status : Normal
# Offset should be < 1ms for production
Monitor in production:
# Prometheus exporter (node-exporter includes time metrics)
curl localhost:9100/metrics | grep node_time
# Manual check (every minute)
chronyc tracking | grep -E 'Last offset|RMS offset|Root delay' # offset, jitter, delay
# Alert on large offset (> 10ms); a cron-able sketch follows below
# Alert on sync loss (chronyc tracking shows "Leap status : Not synchronised")
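A minimal cron-able check for the 10 ms threshold above, as a sketch; the threshold and the logger call are illustrative, so wire it into your own alerting:
#!/bin/bash
# Flag a warning when the absolute NTP offset exceeds 10 ms
offset=$(chronyc tracking | awk '/Last offset/ {print $4}')   # signed seconds
abs=$(awk -v o="$offset" 'BEGIN {print (o < 0 ? -o : o)}')
if awk -v a="$abs" 'BEGIN {exit !(a > 0.010)}'; then
  logger -p user.warning "chrony offset ${offset}s exceeds 10ms threshold"
fi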
2. Networking: Stable Interface Names & Hostnames
Why Predictable Networking Matters
Problem: Driver probe order or BIOS/firmware reordering changes enumeration → eth0 becomes eth1 → network config breaks
Solution: Predictable names + systemd-networkd, netplan, or cloud-init
Bonus: Infrastructure-as-code can reference stable names
Predictable Interface Names
systemd naming scheme:
- Prefix: en = Ethernet, wl = Wireless LAN, ww = Wireless WAN
- Suffix: o<index> = on-board (by firmware/device index), s<slot> = hotplug slot, p<bus>s<slot> = PCI bus/slot address
- Example: eno1 = on-board Ethernet #1, enp0s25 = PCI bus 0, slot 25
Check current scheme:
ip link show
# 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
# 3: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
Verify predictability is enabled:
# Check kernel command line (should NOT have net.ifnames=0)
cat /proc/cmdline
# Verify no udev/systemd override disables predictable names (e.g. 80-net-setup-link.rules symlinked to /dev/null)
ls -la /etc/udev/rules.d/ | grep -i net
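If a name must survive hardware reshuffles entirely, a systemd .link file can pin it to the NIC's MAC address. A minimal sketch; the MAC and the name lan0 are placeholders you replace:
sudo tee /etc/systemd/network/10-lan0.link > /dev/null << 'LINK'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff
[Link]
Name=lan0
LINK
# Rebuild the initramfs so early-boot renaming sees the rule, then reboot
sudo update-initramfs -u   # Debian/Ubuntu; use dracut -f on RHEL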
If using DHCP or cloud-init, ensure stable config:
# /etc/netplan/00-installer-config.yaml (Ubuntu)
sudo tee /etc/netplan/00-installer-config.yaml > /dev/null << 'NETPLAN'
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: true
      dhcp6: true
      optional: true
    enp0s25:
      dhcp4: true
      optional: true
NETPLAN
# Apply
sudo netplan apply
# Verify
ip addr show
Stable Hostnames
What it is:
- hostname command (transient, lost on reboot)
- /etc/hostname file (persistent, survives reboot)
- hostnamectl (systemd tool, recommended)
- DNS A record (matches hostname, useful for orchestration; see the check after the commands below)
Set stable hostname:
# View current
hostnamectl
# Set persistent hostname
sudo hostnamectl set-hostname prod-app-01.example.com
# Verify
cat /etc/hostname
hostnamectl
# Update /etc/hosts (localhost resolution)
sudo sed -i 's/127.0.1.1.*/127.0.1.1 prod-app-01.example.com prod-app-01/' /etc/hosts
grep 127.0.1.1 /etc/hosts
# Test
hostname
hostname -f
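A quick check that the hostname lines up with DNS, assuming an A record already exists for the FQDN:
# The address DNS returns for our FQDN should appear in the addresses we hold
dig +short "$(hostname -f)"
hostname -I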
In Kubernetes/cloud environments:
# Use cloud-init to set hostname from instance metadata
# cloud.yaml (cloud-init)
hostname: prod-app-01
fqdn: prod-app-01.example.com
prefer_fqdn_over_hostname: true
# Or via user-data script
#!/bin/bash
hostnamectl set-hostname $(ec2-metadata --instance-id | cut -d' ' -f2).example.com
3. Logging: journald Persistent & logrotate
journald Persistent Storage
What it is:
- journald = systemd logging daemon (replaces classic syslog)
- Without /var/log/journal, logs live in /run/log/journal/ (volatile, lost on reboot)
- Persistent mode: logs in /var/log/journal/ (survive reboot)
- Indexed, queryable, organized per unit
Enable persistent journald:
# Create directory
sudo mkdir -p /var/log/journal
# Apply the packaged ownership/ACLs (per man journald.conf; an alternative is chmod 2755)
sudo systemd-tmpfiles --create --prefix /var/log/journal
# Configure: /etc/systemd/journald.conf
sudo tee /etc/systemd/journald.conf > /dev/null << 'JOURNALD'
[Journal]
# Storage (auto=persistent if /var/log/journal exists, volatile otherwise)
Storage=persistent
# Max size: 10% of /var or explicit value
SystemMaxUse=10G
RuntimeMaxUse=256M
# Max file size before rotation
SystemMaxFileSize=100M
RuntimeMaxFileSize=10M
# Retention: keep for N days
MaxRetentionSec=30day
# Forward to syslog (optional, for legacy log aggregation)
ForwardToSyslog=no
# Compress old journals
Compress=yes
# Split by UID
SplitMode=uid
# Sync to disk frequency (trade-off: speed vs. durability)
SyncIntervalSec=5min
JOURNALD
# Apply
sudo systemctl restart systemd-journald
# Verify
sudo journalctl --list-boots # Should show multiple boots
sudo du -sh /var/log/journal/
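If the journal has already grown past the new limits, it can be trimmed immediately; the size and age below are illustrative:
sudo journalctl --vacuum-size=5G    # Keep at most 5 GB of archived journals
sudo journalctl --vacuum-time=30d   # Drop entries older than 30 days
sudo journalctl --disk-usage        # Confirm the result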
Query persistent logs:
# Logs from last boot
journalctl -b
# Logs from previous boot (-1 = last, -2 = second-last)
journalctl -b -1
# Logs in time range
journalctl --since "2025-10-16 10:00:00" --until "2025-10-16 11:00:00"
# Per-unit
journalctl -u sshd.service -n 100 # Last 100 lines
# Priority filter
journalctl -p err # Errors only
journalctl -p warning # Warnings and more severe (includes errors)
# Follow (tail -f style)
journalctl -f
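For shipping into a log aggregator, entries can be exported as JSON; a sketch assuming jq is installed and myapp.service is your unit:
# Dump today's entries for one unit as JSON, keep timestamp + message
journalctl -u myapp.service --since today -o json --no-pager \
  | jq -r '[.__REALTIME_TIMESTAMP, .MESSAGE] | @tsv'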
logrotate: Disk Management
What it is:
- Rotates log files when they grow too large
- Compresses old logs
- Deletes old logs after N days
- Runs daily via cron/timer
Install & configure:
# Usually pre-installed
apt install logrotate
# Main config: /etc/logrotate.conf
sudo cat /etc/logrotate.conf
# Per-app config: /etc/logrotate.d/myapp
sudo tee /etc/logrotate.d/myapp > /dev/null << 'LOGROTATE'
/var/log/myapp/*.log {
# Rotate when file > 100MB
size 100M
# Keep 30 old compressed logs
rotate 30
# Compress old logs (gzip)
compress
# Delay compression by one rotation (keep the newest rotated log readable)
delaycompress
# Don't error if the log file is missing
missingok
# Skip rotation when the log is empty
notifempty
# Create new log file with these perms
create 0640 myapp myapp
# Reload app after rotation (send signal)
sharedscripts
postrotate
systemctl reload myapp > /dev/null 2>&1 || true
endscript
}
LOGROTATE
# Test rotation (dry-run)
sudo logrotate -d /etc/logrotate.d/myapp
# Force rotation (for testing)
sudo logrotate -f /etc/logrotate.d/myapp
# Check schedule
cat /etc/cron.daily/logrotate # Daily cron job (on systemd distros: systemctl list-timers logrotate.timer)
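A quick end-to-end test of the config above; the paths are the ones from the example, and the myapp user from the create directive must exist:
sudo mkdir -p /var/log/myapp
echo "test entry" | sudo tee /var/log/myapp/app.log > /dev/null
sudo logrotate -f /etc/logrotate.d/myapp   # -f forces rotation regardless of size
ls -lh /var/log/myapp/                     # Expect a fresh app.log plus app.log.1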
Separate /var/log Partition
Why:
- Prevents log disk-full from crashing system
- Easier to manage quota independently
- Can use different filesystem/performance tuning
Setup on existing system:
# 1. Create logical volume (if using LVM)
sudo lvcreate -L 50G -n var_log vg0
# 2. Format
sudo mkfs.ext4 /dev/vg0/var_log
# 3. Backup current logs
sudo tar -czf /tmp/var-log-backup.tar.gz /var/log
# 4. Mount new partition
sudo mkdir -p /var/log.new
sudo mount /dev/vg0/var_log /var/log.new
# 5. Restore logs
sudo tar -xzf /tmp/var-log-backup.tar.gz -C /var/log.new --strip-components=2
# 6. Add to /etc/fstab for persistence
echo "/dev/vg0/var_log /var/log ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
# 7. Unmount the staging mount, then mount the new volume at /var/log (via the fstab entry)
sudo umount /var/log.new
sudo mount /var/log
sudo df -h /var/log
# Note: the old files remain hidden beneath the mount point; reclaim that space from a rescue shell or before the first mount
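Before rebooting, it's worth validating the fstab entry and the mount (findmnt ships with util-linux):
sudo findmnt --verify           # Syntax/target check of every fstab entry
findmnt /var/log                # Confirm /var/log sits on the new logical volume
sudo du -sh /var/log/journal/   # After the next reboot, confirm journald is writing here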
4. Clean Shutdown Hooks
Why Graceful Shutdown Matters
Problem: SIGKILL (abrupt termination) → data loss, corrupted files, incomplete transactions
Solution: Shutdown hooks → services flush data, connections close gracefully
Benefit: Faster recovery, no fsck/repair needed
systemd Shutdown Hooks
What it is:
- ExecStop = graceful shutdown command
- TimeoutStopSec = max wait (then SIGKILL)
- KillMode = how remaining processes are killed (control-group, process, mixed)
Example hardened service with graceful shutdown:
[Unit]
Description=My Database
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=postgres
ExecStart=/usr/bin/postgres -D /var/lib/postgresql
# Graceful shutdown: send SIGTERM, wait for clean exit
ExecStop=/usr/bin/pg_ctl -D /var/lib/postgresql stop -m fast
# Max 60 seconds for graceful shutdown (then SIGKILL)
TimeoutStopSec=60
# Kill entire process group (children too)
KillMode=control-group
# Kill signal (default SIGTERM, alternative SIGINT)
KillSignal=SIGTERM
# Restart policy
Restart=on-failure
RestartSec=5s
# Resource limits (MemoryLimit= is deprecated in favor of MemoryMax=)
MemoryMax=4G
CPUQuota=80%
[Install]
WantedBy=multi-user.target
Load and test:
sudo systemctl daemon-reload
sudo systemctl start mydb.service
# Trigger graceful shutdown
sudo systemctl stop mydb.service
# Watch logs
sudo journalctl -u mydb.service -f
# Verify exit was clean (should be 0)
sudo systemctl show -p ExecMainStatus mydb.service
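To confirm the stop path finishes inside TimeoutStopSec, time it and look for a forced kill in the journal (the grep pattern is a heuristic):
time sudo systemctl stop mydb.service
sudo journalctl -u mydb.service -n 20 --no-pager | grep -iE 'sigkill|killing' \
  || echo "stopped cleanly within the timeout"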
Pre/Post-Stop Hooks (ExecStop, ExecStopPost)
Scenario: extra work at shutdown time: ask the app to drain before it exits, then archive state once the process is gone
[Service]
ExecStart=/opt/myapp/run
# ExecStop: ask the app to drain; systemd still sends SIGTERM to anything left afterwards
ExecStop=/bin/sh -c 'curl -s http://localhost:8080/shutdown || true'
# Post-stop: archive state (after process dead)
ExecStopPost=/bin/sh -c 'tar -czf /var/backups/myapp-state.tar.gz /var/lib/myapp && logger "Saved state"'
TimeoutStopSec=30
5. Watchdog Timer
Why Watchdog
Problem: Process hangs (deadlock, infinite loop) → looks running but unresponsive
Solution: Watchdog timer → if the service isn't "pinged" in time, systemd restarts it (or, with the system-wide watchdog, reboots the machine)
Benefit: Self-healing (a brief restart or reboot beats a hung state)
systemd Watchdog
What it is:
- WatchdogSec = per-service: interval within which the service must "kick" the watchdog via sd_notify
- RuntimeWatchdogSec = system-wide: reboot (via the hardware watchdog) if PID 1 stops kicking for N seconds
Enable watchdog (system-wide):
# /etc/systemd/system.conf
sudo tee -a /etc/systemd/system.conf > /dev/null << 'WATCHDOG'
# Reboot if systemd hangs (not responding to signals)
RuntimeWatchdogSec=20
# Don't reboot on normal shutdown
ShutdownWatchdogSec=0
WATCHDOG
sudo systemctl daemon-reexec # system.conf changes require re-executing PID 1 (or a reboot)
Per-service watchdog:
[Service]
ExecStart=/usr/bin/myapp
# "Kick" watchdog every 10 seconds
WatchdogSec=10
# If the app doesn't send sd_notify("WATCHDOG=1") within 10 s, systemd marks it failed and kills it
Restart=on-watchdog
In app code (example: Python):
import time
import systemd.daemon  # python3-systemd bindings

def process_request():
    pass  # app-specific work goes here

while True:
    process_request()
    # Kick the watchdog every 5 seconds (must be < WatchdogSec)
    try:
        systemd.daemon.notify('WATCHDOG=1')
    except Exception:
        pass  # not running under systemd / bindings unavailable
    time.sleep(5)
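To see the watchdog fire, pause the process so WATCHDOG=1 stops arriving and watch systemd recover it; myapp.service is a placeholder:
systemctl show -p WatchdogUSec myapp.service                          # Configured interval
sudo kill -STOP "$(systemctl show -p MainPID --value myapp.service)"  # Freeze the main process
sleep 15                                                              # Wait past WatchdogSec
systemctl status myapp.service --no-pager | head -n 5                 # Expect a watchdog-triggered restart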
6. Kernel & Livepatch Policy
Kernel Update Strategy
Options:
- Scheduled reboot (safest): Deploy patch, reboot at maintenance window
- Livepatch (zero-downtime): Hot-patch without reboot
- Hybrid (recommended): Use livepatch for urgent CVEs, schedule full reboot monthly
Trade-off matrix:
| Approach | Downtime | Complexity | Rollback | Risk |
|---|---|---|---|---|
| Scheduled reboot | Brief (seconds) | Low | Easy (boot old kernel) | Low |
| Livepatch | None | High | Complex (state inconsistency) | Medium |
| Hybrid | None/brief | Medium | Easy (revert live patch, reboot) | Low-Medium |
Livepatch Setup (RHEL/Ubuntu Pro)
What it is:
- kpatch (Red Hat) or Canonical Livepatch (Ubuntu Pro): in-kernel hot patching
- Patches applied without a reboot
- Patches last only until the next boot; after a reboot you run the installed on-disk kernel
RHEL with kpatch:
# Install the tooling and subscribe to live patches for the running kernel
dnf install kpatch kpatch-patch
# RHEL loads the shipped patch modules automatically via kpatch.service; list what's applied
kpatch list
# Manually load a patch module (.ko) without a reboot, if needed
sudo kpatch load <patch-module>.ko
# Verify
kpatch list
# Revert (if issues)
sudo kpatch unload <patch-module>
# Full kernel update + reboot (eventually)
dnf update kernel
sudo reboot
Ubuntu with livepatch (Pro only):
# Install the client
sudo snap install canonical-livepatch
# Attach the machine to Ubuntu Pro and enable livepatch (requires a Pro token)
sudo pro attach YOUR_TOKEN
sudo pro enable livepatch
# Check status and applied fixes
canonical-livepatch status
# Kernel-level view of loaded live patches
ls /sys/kernel/livepatch/
# Disable
sudo canonical-livepatch disable
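Live patching covers the kernel, but some fixes still need a full reboot; a quick pending-reboot check (commands differ per distro):
# Ubuntu/Debian: flag file written by the update tooling
[ -f /var/run/reboot-required ] && cat /var/run/reboot-required
# RHEL/Fedora: exits non-zero and explains why if a reboot is needed
sudo dnf needs-restarting -r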
Tested Rollback Path
Mandatory for production:
- Pre-deployment test:
# 1. Test kernel in staging environment
# 2. Verify app starts, basic functionality works
# 3. Test rollback (reboot to previous kernel)
- Kernel menu entry (GRUB):
# Verify the old kernel still appears in the GRUB config (entries are often nested in a submenu)
sudo grep -E "^[[:space:]]*(menuentry|submenu)" /boot/grub/grub.cfg | head -20
# Test: Reboot and select old kernel from menu
# Verify system boots & app functional
- Automatic rollback (if kernel panics):
# /etc/default/grub
GRUB_DEFAULT=saved # Boot whichever entry grub-set-default / grub-reboot selected
GRUB_RECORDFAIL_TIMEOUT=60 # After a failed boot, show the GRUB menu for 60 s (Ubuntu)
# Update grub
sudo grub-mkconfig -o /boot/grub/grub.cfg
# Test: boot the new kernel exactly once; if it panics (panic=10 on its cmdline makes it
# reboot automatically), the next boot falls back to the saved default (known-good kernel)
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux NEW_VERSION"
sudo reboot
# If the new kernel comes up healthy, promote it to the saved default
sudo grub-set-default "Advanced options for Ubuntu>Ubuntu, with Linux NEW_VERSION"
- Document rollback procedure:
# Runbook: /docs/kernel-rollback.md
# Manual rollback steps:
# 1. Reboot server
# 2. At GRUB menu, select older kernel entry
# 3. Boot
# 4. Verify app operational
# 5. (Optional) Remove new kernel: apt remove linux-image-NEW
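A pre-flight worth running before any kernel update: confirm at least two bootable kernels are installed and pin the known-good one so autoremove can't delete it (Debian/Ubuntu shown; version strings come from your system):
dpkg --list 'linux-image-[0-9]*' | awk '/^ii/ {print $2}'   # Installed kernels
sudo apt-mark hold "linux-image-$(uname -r)"                # Keep the running kernel
grep installonly_limit /etc/dnf/dnf.conf                    # RHEL: how many kernels dnf retains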
Reliability Checklist
Pre-Deployment
- Chrony installed & syncing (offset < 1ms)
- Hostname set persistently (hostnamectl)
- Network interfaces have predictable names (eno1, enp0s25)
- journald configured for persistent storage (/var/log/journal/)
- /var/log on separate partition (if possible) with adequate space
- logrotate configured for all log paths
- Services have graceful shutdown hooks (ExecStop)
- Critical services have watchdog timers
- Kernel rollback path tested (GRUB old entry, manual reboot)
- Livepatch staging tested (if using livepatch)
Post-Deployment
- Logs persistent across reboot (sudo journalctl -b -1)
- Time synced on multiple boots (chronyc tracking)
- Hostname persists after reboot (hostnamectl)
- Network interfaces keep same name after reboot (ip link)
- Service graceful shutdown tested (no hard kills)
- Watchdog recovery tested (force hang, verify reboot)
- Kernel patches applied (check uname -r)
- Previous kernel available in GRUB menu
- Log rotation working (sudo logrotate -f /etc/logrotate.d/*)
Ongoing
- Weekly: Check time sync drift (chronyc tracking → offset)
- Weekly: Monitor disk usage (du -sh /var/log)
- Monthly: Test service restart (verify graceful shutdown)
- Monthly: Test kernel rollback (if not using auto-rollback)
- Quarterly: Test full system reboot (boot, apps running, logs persist)
- Quarterly: Verify livepatch/kpatch still patching
Quick Reference Commands
# Time Sync
chronyc sources # NTP sources & sync status
chronyc tracking # Detailed sync info
timedatectl status # systemd time view
# Networking
hostnamectl # View/set hostname
ip link show # Interface names & status
cat /proc/cmdline # Verify net.ifnames / biosdevname not disabled
# Logging
sudo journalctl -b # This boot's logs
sudo journalctl -b -1 # Previous boot
sudo du -sh /var/log/journal/ # Journal size
sudo logrotate -d /etc/logrotate.d/* # Test rotation
# Shutdown
sudo systemctl show -p ExecMainStatus SERVICE # Exit code
sudo journalctl -u SERVICE -n 50 # Last 50 lines
# Watchdog
systemctl show -p WatchdogUSec SERVICE # Per-service watchdog interval
systemctl show -p RuntimeWatchdogUSec # System-wide (PID 1) watchdog interval
# Kernel & Livepatch
uname -r # Current kernel version
cat /proc/cmdline # Kernel params at boot
sudo grub-set-default 0 # Set default kernel (GRUB)
kpatch list # (Red Hat) Applied patches
canonical-livepatch status # (Ubuntu Pro) Patch status
Common Pitfalls
Pitfall 1: journald Loses Logs on Reboot
Problem: Logs in /run/log/journal/ disappear after reboot
Fix: Enable persistent storage: mkdir -p /var/log/journal && systemctl restart systemd-journald
Pitfall 2: Service Killed Abruptly on Shutdown
Problem: TimeoutStopSec too short → process gets SIGKILL before flushing
Fix: Increase the timeout; verify the ExecStop command works standalone: sudo -u SERVICE_USER /path/to/stop/cmd
Pitfall 3: Watchdog Reboots Too Often
Problem: Legitimate workload takes longer than WatchdogSec to "kick"
Fix: Increase WatchdogSec; ensure the app calls sd_notify(0, "WATCHDOG=1") periodically
Pitfall 4: Can’t Rollback to Old Kernel
Problem: Old kernel removed (apt autoremove) or missing from GRUB
Fix: Keep 2 kernels: apt-mark hold linux-image-OLD, then verify GRUB: sudo grub-mkconfig
Pitfall 5: Chrony Sync Lost in VM/Container
Problem: VM clock jitter exceeds network jitter → chrony can't converge
Fix: Sync from the host (containers share the host clock; adjusting time inside one needs CAP_SYS_TIME / --privileged), use the hypervisor's PTP clock in VMs, or accept a higher offset tolerance (see the sketch below)
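A minimal chrony addition for KVM/Hyper-V guests that expose a paravirtual PTP device (assumes /dev/ptp0 exists in the guest):
ls /dev/ptp*   # Check the guest sees a PTP clock
echo "refclock PHC /dev/ptp0 poll 2" | sudo tee -a /etc/chrony/chrony.conf
sudo systemctl restart chrony
chronyc sources   # A PHC0 reference clock should appear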