Executive Summary
Observability = seeing inside your systems: metrics (CPU, memory, I/O), logs (the audit trail), traces (syscall and latency tracing).
This guide covers:
- Metrics: node_exporter → Prometheus (system-level health)
- Logs: journald → rsyslog/Vector/Fluent Bit (aggregation)
- eBPF tools: 5 quick wins (trace syscalls, network, I/O)
- Triage: 5-minute flowchart to diagnose CPU, memory, I/O, network issues
1. Metrics: node_exporter & Prometheus
What It Is
- node_exporter: Exposes OS metrics (CPU, memory, disk, network) as Prometheus scrape target
- Prometheus: Time-series database; collects metrics, queries, alerts
- Dashboard: Grafana visualizes Prometheus data
Install node_exporter
Ubuntu/Debian:
# Download latest release
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
# Extract & install
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'SERVICE'
[Unit]
Description=Node Exporter
After=network-online.target
Wants=network-online.target
[Service]
User=node-exporter
Group=node-exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
# Auto-restart
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
SERVICE
# Create user
sudo useradd -rs /bin/false node-exporter
# Enable & start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
# Verify
curl http://localhost:9100/metrics | head -20
RHEL/Fedora:
# Via DNF (if available)
dnf install node_exporter
# Or manual (same as above)
Key Metrics to Monitor
CPU:
node_cpu_seconds_total{cpu,mode} # CPU time by core & mode (user, system, idle, iowait)
node_load1 # 1-minute load average
node_context_switches_total # Context switches (high = contention)
Memory:
node_memory_MemTotal_bytes # Total RAM
node_memory_MemAvailable_bytes # Available RAM (for new processes)
node_memory_MemFree_bytes # Free RAM
node_memory_SwapTotal_bytes # Total swap
node_memory_SwapFree_bytes # Free swap
Disk I/O:
node_disk_io_time_seconds_total{device} # Time spent doing I/O
node_disk_read_bytes_total{device} # Bytes read
node_disk_written_bytes_total{device} # Bytes written
node_disk_reads_completed_total{device} # Read operations completed
Network:
node_network_receive_bytes_total # Bytes received (per interface)
node_network_transmit_bytes_total # Bytes transmitted
node_network_receive_errs_total # RX errors (should be 0)
node_network_transmit_errs_total # TX errors (should be 0)
Filesystem:
node_filesystem_avail_bytes # Available space
node_filesystem_size_bytes # Total size
node_filesystem_files # Total inodes
node_filesystem_files_free # Free inodes
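These raw counters become useful once turned into rates and ratios. A minimal sketch below, assuming Prometheus is already scraping this node and listening on localhost:9090 (adjust the host to your setup), queries two derived values through the HTTP API; the same expressions work directly in Grafana panels:
# CPU utilization % per instance (100% minus idle time)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
# Memory used % (based on MemAvailable, not MemFree)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'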
Prometheus Scrape Config
/etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # node_exporter
        labels:
          environment: 'production'
          service: 'linux'

  # Multiple servers
  - job_name: 'production-nodes'
    static_configs:
      - targets: ['server1:9100', 'server2:9100', 'server3:9100']
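Before reloading Prometheus, it is worth validating the file; a quick check, assuming promtool is installed alongside Prometheus:
promtool check config /etc/prometheus/prometheus.yml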
Alerting Rules (Prometheus)
/etc/prometheus/alert-rules.yml:
groups:
  - name: linux-alerts
    interval: 30s
    rules:
      # CPU > 80% for 5 minutes
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"

      # Memory usage > 85% (less than 15% available)
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory on {{ $labels.instance }}"

      # Disk > 90% full
      - alert: HighDiskUsage
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 1m
        annotations:
          summary: "Disk {{ $labels.device }} on {{ $labels.instance }} almost full"

      # I/O wait > 50%
      - alert: HighIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 50
        for: 5m
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"

      # Network errors
      - alert: NetworkErrors
        expr: increase(node_network_receive_errs_total[5m]) > 0 or increase(node_network_transmit_errs_total[5m]) > 0
        for: 1m
        annotations:
          summary: "Network errors on {{ $labels.instance }}"
2. Logging: journald → Aggregation
journald (Local, Already Covered)
Recap:
- Persistent storage: /var/log/journal/
- Query: journalctl -u SERVICE, journalctl -b, journalctl --since
Log Aggregation: Choices
| Tool | Setup | Complexity | Cost | Notes |
|---|---|---|---|---|
| rsyslog | TCP/syslog forwarding | Low | Free | Traditional, works with old systems |
| Vector | TOML config, Rust | Medium | Free | Modern, efficient, rich transforms |
| Fluent Bit | Lightweight C daemon | Medium | Free | Small footprint, good for containers |
| Loki | Log aggregator (Grafana) | High | Free/Paid | Queryable like Prometheus |
Option 1: rsyslog (Simple, Traditional)
Forward journald logs to remote syslog server:
Client (/etc/rsyslog.d/forwarding.conf):
# Send all logs to remote syslog server
*.* @@syslog.example.com:514 # TCP
# or
*.* @syslog.example.com:514 # UDP (faster, unreliable)
Server (/etc/rsyslog.d/server.conf):
# Listen for syslog
$ModLoad imtcp
$InputTCPServerRun 514
# Save by hostname & program
$DirCreateMode 0755
$FileCreateMode 0644
$Umask 0022
$WorkDirectory /var/lib/rsyslog
template(name="DynFile" type="string" string="/var/log/rsyslog/%HOSTNAME%/%syslogtag%.log")
*.* ?DynFile
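To confirm forwarding end to end, a simple smoke test, assuming the client and server configs above are in place (restart rsyslog on both hosts first):
# Client: emit a tagged test message
logger -t forward-test "rsyslog forwarding check from $(hostname)"
# Server: the message should land under the per-host directory
grep -r "forward-test" /var/log/rsyslog/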
Option 2: Vector (Modern, Flexible)
Install:
# Install Vector
curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | sh
# Or DNF/APT
dnf install vector
apt install vector
Config (/etc/vector/vector.toml):
# Read from journald
[sources.journald]
type = "journald"
# Optional: keep only error-priority messages
# include_matches = { "PRIORITY" = ["3"] }
# Parse JSON in some logs
[transforms.parse_json]
type = "remap"
inputs = ["journald"]
source = '''
# Not every journald message is JSON; fall back to null instead of dropping the event
.json = parse_json(.message) ?? null
'''
# Ship to HTTP endpoint (e.g., Loki, Datadog)
[sinks.http]
type = "http"
inputs = ["parse_json"]
uri = "https://logs.example.com/api/v1/push"
method = "post"
encoding.codec = "json"
# Or, file rotation (local backup)
[sinks.files]
type = "file"
inputs = ["journald"]
path = "/var/log/vector/{{ host }}/%Y-%m-%d.log"   # templated per host, new file per day
# Size-based rotation is best handled externally (e.g. logrotate)
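Before starting the service, the config can be checked; a quick sanity test, assuming a standard package install:
vector validate /etc/vector/vector.toml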
Start:
sudo systemctl enable vector
sudo systemctl start vector
sudo systemctl status vector
# Monitor
journalctl -u vector -f
Option 3: Fluent Bit (Lightweight, Container-Friendly)
Install:
apt install fluent-bit
# or
dnf install fluent-bit
Config (/etc/fluent-bit/fluent-bit.conf):
[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name           systemd
    Tag            journald.*
    Read_From_Tail On

[FILTER]
    Name  modify
    Match *
    Add   hostname ${HOSTNAME}
    Add   environment production

[OUTPUT]
    Name                        stackdriver
    Match                       *
    google_service_credentials  /path/to/credentials.json
    resource                    k8s_node
    resource_labels             project_id=my-project,location=us-central1,node_id=${HOSTNAME}

# Or, forward to Loki instead
[OUTPUT]
    Name   loki
    Match  *
    Host   loki.example.com
    Port   3100
    Labels job=systemd, hostname=${HOSTNAME}
Start:
sudo systemctl enable fluent-bit
sudo systemctl start fluent-bit
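A config test before (re)starting avoids silent parse failures; a sketch, assuming the binary is on PATH (some packages place it under /opt/fluent-bit/bin/ instead):
fluent-bit --dry-run -c /etc/fluent-bit/fluent-bit.conf
# Then watch its own logs after starting
journalctl -u fluent-bit -f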
3. eBPF Tools: 5 Quick Wins
Why eBPF
- Zero-code-change tracing: Hook syscalls, kernel functions without recompiling
- Production-safe: Runs in sandboxed kernel VM
- Low-overhead: Minimal CPU impact
Prerequisites
# Install BCC (eBPF Compiler Collection)
apt install bpfcc-tools linux-headers-$(uname -r)
# or
dnf install bcc bcc-tools kernel-devel
# Or, install bpftrace (higher-level)
apt install bpftrace
# or
dnf install bpftrace
# Kernel requirement: ≥ 4.9 (ideally ≥ 5.0)
uname -r
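A quick sanity check that the toolchain actually works on this kernel; a sketch using bpftrace and the default BCC tools directory:
# bpftrace can attach a trivial program
sudo bpftrace -e 'BEGIN { printf("bpftrace OK\n"); exit(); }'
# BCC tools are installed here on most distros
ls /usr/share/bcc/tools | head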
5 Quick Wins
1. execsnoop: Trace Process Execution
What: Shows every process spawned (command, PID, UID, exit code)
Use case: Debug which processes are running, unexplained resource usage
Command:
sudo /usr/share/bcc/tools/execsnoop -h
# or
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'
# Real example:
sudo /usr/share/bcc/tools/execsnoop
# Output:
# PCOMM PID PPID RET ARGS
# bash 1234 1200 0 /bin/ls -la
# python 1235 1234 0 python script.py
5-minute triage:
- Run execsnoop for 60 seconds (see the capture sketch below)
- Look for unexpected processes (malware, resource hogs)
- Check PID, PPID, command line
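A minimal capture sketch for that 60-second window, assuming BCC's default install path; timeout -s INT stops the tool the same way Ctrl+C would:
sudo timeout -s INT 60 /usr/share/bcc/tools/execsnoop | tee /tmp/execsnoop-$(date +%F-%H%M).log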
2. opensnoop: Trace File Open Operations
What: Shows every file opened (path, flags, PID, success/failure)
Use case: Find performance bottlenecks (config file re-reads), unexpected file access
Command:
sudo /usr/share/bcc/tools/opensnoop -h
# Filter by process name
sudo /usr/share/bcc/tools/opensnoop -n nginx
# Real example:
sudo /usr/share/bcc/tools/opensnoop
# Output:
# PID COMM FD ERR PATH
# 1234 nginx 10 0 /etc/nginx/nginx.conf
# 1234 nginx 11 0 /var/log/nginx/access.log
# 1235 python -1 2 /tmp/config.json (file not found)
5-minute triage:
- Look for repeated file opens (caching miss, config re-load)
- Check for permission errors (FD=-1, ERR=13)
- Identify I/O bottlenecks
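When the full stream is too noisy, failed opens alone are often enough; a sketch using opensnoop's -x (failed-only) flag:
# Only show opens that returned an error (ERR column = errno)
sudo timeout -s INT 30 /usr/share/bcc/tools/opensnoop -x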
3. tcplife: Trace TCP Connection Lifecycle
What: Shows TCP connections (source, dest, port, duration, bytes)
Use case: Slow connections, connection leaks, bandwidth analysis
Command:
sudo /usr/share/bcc/tools/tcplife -h
# Show active connections
sudo /usr/share/bcc/tools/tcplife
# Real example:
# PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 1234 curl 10.0.0.5 45678 1.2.3.4 443 5 10 1234
# 1235 nginx 0.0.0.0 80 10.0.0.6 54321 100 200 500
5-minute triage:
- Check connection duration (MS column): slow = timeout issue?
- Look at bytes transferred (TX_KB, RX_KB): bandwidth saturation?
- Find long-lived idle connections (might need keep-alive tuning)
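Filtering keeps the output readable on busy hosts; a sketch using tcplife's PID and remote-port filters (check -h if your BCC version names the flags differently):
# Only connections made by one process
sudo /usr/share/bcc/tools/tcplife -p 1234
# Only connections to remote port 443 (e.g. outbound HTTPS)
sudo /usr/share/bcc/tools/tcplife -D 443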
4. biolatency: Measure Block I/O Latency
What: Histogram of I/O latency (how long disk requests take)
Use case: Identify I/O bottlenecks (slow disk, queue saturation)
Command:
sudo /usr/share/bcc/tools/biolatency -h
# Run for 10 seconds, show histogram
sudo /usr/share/bcc/tools/biolatency 10 1
# Real example:
# Tracing block I/O latency... Hit Ctrl+C to end
# usecs : count distribution
# 0->1K : 0 | |
# 1K->2K : 0 | |
# 2K->4K : 100 |*** |
# 4K->8K : 500 |***************************** |
# 8K->16K : 200 |************* |
# 16K->32K: 50 |*** |
# 32K->64K: 10 | |
5-minute triage:
- Most I/O in the 4K-16K usecs range (4-16 ms) = spinning disk or light queueing; a healthy SSD mostly sits well under 1 ms
- Most I/O > 32K usecs (32+ ms) = slow disk or I/O queue saturation
- Compare before/after tuning to measure improvement
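When several devices share a host, a per-disk, millisecond-scale view is easier to read; a sketch using biolatency's -D and -m flags (check -h if your BCC version differs):
# One histogram per disk, buckets in milliseconds, 10-second window
sudo /usr/share/bcc/tools/biolatency -D -m 10 1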
5. offcputime: Find Threads Blocked/Sleeping
What: Histogram of time threads spent OFF-CPU (blocked, waiting)
Use case: Identify why app is slow (lock contention, I/O wait, context switch)
Command:
sudo /usr/share/bcc/tools/offcputime -h
# Trace user threads only (-u) for 10 seconds
sudo /usr/share/bcc/tools/offcputime -u 10
# Real example:
# Tracing off-CPU time (excluding idle) for 10 seconds
# Thread stacks (>= 100 usecs):
#
# kernel function stack
# sys_futex
# do_futex
# futex_wait_queue_me
# schedule
#
# user function stack
# pthread_cond_wait
# worker_thread
# main
#
# off-CPU time: 5.234 seconds
5-minute triage:
- Large off-CPU time = app blocked (waiting for locks, I/O, network)
- Identify which mutex/condition variable is blocking
- If threads are runnable but not getting scheduled, that is CPU contention; runqlat (also in BCC) measures that run-queue delay directly
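Off-CPU stacks are much easier to digest as a flame graph; a sketch assuming Brendan Gregg's FlameGraph scripts are cloned locally (the flamegraph.pl path below is an assumption):
# Folded stacks for one process over 30 seconds
sudo /usr/share/bcc/tools/offcputime -f -p $PID 30 > offcpu.folded
# Render with the FlameGraph scripts (path assumed)
./FlameGraph/flamegraph.pl offcpu.folded > offcpu.svg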
4. systemd Unit Status Monitoring
View All Units
# List all running services
systemctl list-units --type=service --state=running
# List failed services
systemctl list-units --type=service --state=failed
systemctl list-units --failed
# Show service details
systemctl status myapp.service
systemctl show myapp.service
# Show active cgroup (resource usage)
systemctl show --property=MemoryCurrent myapp.service
systemctl show --property=CPUUsageNSec myapp.service
Monitor in Real-Time
# Follow service logs
journalctl -u myapp.service -f
# Show service restarts
journalctl -u myapp.service | grep -i restart
# Monitor resource usage (requires cgroup v2)
watch -n 1 'systemctl show --property=MemoryCurrent,CPUUsageNSec myapp.service'
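For a cron job or a simple external check, systemd's own health summary is enough; a minimal sketch (hook the echo up to your alerting of choice):
#!/bin/bash
# "running" means no failed units; anything else (usually "degraded") is worth a look
state=$(systemctl is-system-running)
if [ "$state" != "running" ]; then
  echo "systemd state: $state"
  systemctl list-units --type=service --state=failed --no-legend
fi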
5. 5-Minute Performance Triage Flowchart
Quick Decision Tree
START: System Slow / High CPU / Out of Memory / I/O Stalled
├─ Q1: Check Load Average
│   ├─ load > num_cores? YES → Continue
│   └─ load ~ 0? → Check latency (query slow? upstream lag?)
│
├─ Q2: CPU Usage High (> 80%)?
│   ├─ YES → Go to CPU Triage
│   └─ NO → Continue to Q3
│
├─ Q3: Memory Usage High (> 85%)?
│   ├─ YES → Go to Memory Triage
│   └─ NO → Continue to Q4
│
├─ Q4: I/O Saturation High (iostat: %util > 80%)?
│   ├─ YES → Go to I/O Triage
│   └─ NO → Continue to Q5
│
└─ Q5: Network Errors or Saturation?
    ├─ YES → Go to Network Triage
    └─ NO → Latency issue (app-level) or external
─────────────────────────────────────────────────
CPU TRIAGE (if load high):
1. top -bn1 | head -20                                        # Find hot process
2. perf top -p PID                                            # Where is CPU spent?
3. sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'  # Kernel stacks
Decision:
├─ Busy kernel → Check syscalls (strace, eBPF)
├─ Busy user → Check code (perf, profiler)
└─ Many context switches → CPU contention, reduce threads
─────────────────────────────────────────────────
MEMORY TRIAGE (if > 85% used):
1. free -h                               # Overall memory
2. ps aux | sort -k6 -rn | head -10      # Top processes by RSS
3. cat /proc/pressure/memory             # PSI stall %
Decision:
├─ One process huge → Kill/restart/profile
├─ Multiple processes → OOM killer pending
│    - Check dmesg / journalctl -k for OOM kills
│    - Increase swap or reduce workload
└─ Swapping high (si/so in vmstat) → Reduce memory pressure or tune vm.swappiness
─────────────────────────────────────────────────
I/O TRIAGE (if %util > 80%):
1. iostat -x 1 5                              # Device %util, await
2. iotop                                      # Processes by I/O
3. sudo /usr/share/bcc/tools/biolatency 10 1  # I/O latency distribution
Decision:
├─ High await but device not saturated → I/O queue backlog (reduce load, batch writes)
├─ High latency even at low queue depth → Slow disk (upgrade, add cache)
├─ Wide latency spread → Scheduler/queueing issue (check mq-deadline vs. none)
└─ One process hogging → Kill/throttle (cgroup I/O limits)
─────────────────────────────────────────────────
NETWORK TRIAGE (if errors or packet loss):
1. ip a                                       # Interface status
2. ethtool -S eth0 | grep -i drop             # NIC drops
3. ss -s                                      # Socket stats
4. sudo /usr/share/bcc/tools/tcplife          # Connection lifecycle (Ctrl+C to stop)
Decision:
├─ NIC drops > 0 → Ring buffer full; grow it (ethtool -G) or reduce load
├─ Connection timeouts → Check routing (ip route), firewall
├─ High retransmits → Network latency/loss (check upstream)
└─ Port exhaustion (TIME_WAIT) → tcp_tw_reuse, reduce connection churn
ASCII Triage Commands (Copy-Paste Ready)
# ==== 1-MINUTE SYSTEM OVERVIEW ====
echo "=== LOAD & CPU ==="; uptime; echo; echo "=== MEMORY ==="; free -h; echo; echo "=== DISK ==="; df -h /; echo; echo "=== TOP PROCS ==="; ps aux | sort -k3 -rn | head -5
# ==== CPU HOT SPOTS (5 seconds) ====
sudo perf top -p $PID --delay 5
# ==== I/O LATENCY (10 seconds) ====
sudo /usr/share/bcc/tools/biolatency 10 1
# ==== PROCESS SYSCALLS (60 seconds) ====
sudo strace -c -p $PID   # attach, wait ~60s, then Ctrl+C to print the summary
# ==== NETWORK CONNECTIONS (active) ====
ss -tulnp | head -20
# ==== FULL SYSTEM TRIAGE (60 seconds) ====
echo "1. Load:"; uptime; echo "2. CPU:"; top -bn1 | head -3; echo "3. Memory:"; free -h; echo "4. Disk:"; iostat -x 1 1; echo "5. Network:"; ss -s
Observability Checklist
Day 1: Setup
- node_exporter installed & running (localhost:9100/metrics)
- Prometheus scraping node_exporter (15-second interval)
- Alert rules loaded (high CPU, memory, disk)
- journald persistent storage enabled (/var/log/journal/)
- Log aggregation chosen (rsyslog/Vector/Fluent Bit) and configured
- eBPF tools installed (bcc-tools or bpftrace)
Weekly: Validation
- Prometheus scraping all targets (Targets page)
- No missing metrics (check dashboard)
- Logs flowing to aggregation system
- Alerts not firing (baseline)
- Historical data available (past 7 days)
On-Call: Triage
- 5-minute flowchart followed (CPU/memory/I/O decision)
- eBPF tool run (execsnoop, opensnoop, biolatency)
- Root cause identified (app, system, external)
- Metrics/logs captured for postmortem
Quick Reference: Commands by Use Case
“System Slow”
uptime # Load average
top -bn1 | head -15 # CPU, memory, top procs
iostat -x 1 3 # Disk I/O
ss -s # Network stats
“Disk Full”
df -h # Disk usage
du -sh /var/log /var/lib # Largest dirs
sudo ncdu / # Interactive disk usage
“Process Unresponsive”
strace -p $PID # Syscalls (live)
sudo perf record -p $PID -F 99 -- sleep 10 # CPU profile
ps aux | grep $PID # Memory, CPU%
“Network Slow”
ping -c 10 8.8.8.8 # Latency to external
ss -tulnp # Listening ports
ethtool eth0 # NIC speed
sudo /usr/share/bcc/tools/tcplife # Connection lifecycle (Ctrl+C to stop)