Executive Summary

Observability = see inside your systems: metrics (CPU, memory, I/O), logs (audit trail), traces (syscalls, latency).

This guide covers:

  • Metrics: node_exporter β†’ Prometheus (system-level health)
  • Logs: journald β†’ rsyslog/Vector/Fluent Bit (aggregation)
  • eBPF tools: 5 quick wins (trace syscalls, network, I/O)
  • Triage: 5-minute flowchart to diagnose CPU, memory, I/O, network issues

1. Metrics: node_exporter & Prometheus

What It Is

  • node_exporter: Exposes OS metrics (CPU, memory, disk, network) as Prometheus scrape target
  • Prometheus: Time-series database; collects metrics, queries, alerts
  • Dashboard: Grafana visualizes Prometheus data

Install node_exporter

Ubuntu/Debian:

# Download latest release
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract & install
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'SERVICE'
[Unit]
Description=Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node-exporter
Group=node-exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes

# Auto-restart
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
SERVICE

# Create user
sudo useradd -rs /bin/false node-exporter

# Enable & start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics | head -20

RHEL/Fedora:

# Via DNF (if packaged in your repos, e.g. EPEL/Fedora)
dnf install node_exporter

# Or manual (same as above)

Key Metrics to Monitor

CPU:

node_cpu_seconds_total{cpu,mode}      # CPU time by core & mode (user, system, idle, iowait)
node_load1                             # 1-minute load average
node_context_switches_total            # Context switches (high = contention)

Memory:

node_memory_MemTotal_bytes             # Total RAM
node_memory_MemAvailable_bytes         # Available RAM (for new processes)
node_memory_MemFree_bytes              # Free RAM
node_memory_SwapTotal_bytes            # Total swap
node_memory_SwapFree_bytes             # Free swap

Disk I/O:

node_disk_io_time_seconds_total{device}  # Seconds spent doing I/O
node_disk_read_bytes_total{device}       # Bytes read
node_disk_written_bytes_total{device}    # Bytes written
node_disk_reads_completed_total{device}  # Read operations completed

Network:

node_network_receive_bytes_total      # Bytes received (per interface)
node_network_transmit_bytes_total     # Bytes transmitted
node_network_receive_errs_total       # RX errors (should be 0)
node_network_transmit_errs_total      # TX errors (should be 0)

Filesystem:

node_filesystem_avail_bytes           # Available space
node_filesystem_size_bytes            # Total size
node_filesystem_files                 # Total inodes
node_filesystem_files_free            # Free inodes

Prometheus Scrape Config

/etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']  # node_exporter
        labels:
          environment: 'production'
          service: 'linux'

  # Multiple servers
  - job_name: 'production-nodes'
    static_configs:
      - targets: ['server1:9100', 'server2:9100', 'server3:9100']
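
A quick sanity check, assuming promtool is installed alongside Prometheus and the server listens on the default localhost:9090:

# Validate the scrape config before reloading Prometheus
promtool check config /etc/prometheus/prometheus.yml

# Confirm the node job is being scraped (value 1 = up)
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'

# Example query using the memory metrics above: available memory as a percentage of total
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100'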

Alerting Rules (Prometheus)

/etc/prometheus/alert-rules.yml:

groups:
  - name: linux-alerts
    interval: 30s
    rules:
      # CPU > 80% for 5 minutes
      - alert: HighCPUUsage
        expr: (100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"

      # Memory usage > 85% (less than 15% available)
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory on {{ $labels.instance }}"

      # Disk > 90% full
      - alert: HighDiskUsage
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 1m
        annotations:
          summary: "Disk {{ $labels.device }} {{ $labels.instance }} almost full"

      # I/O wait > 50%
      - alert: HighIOWait
        expr: avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100 > 50
        for: 5m
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"

      # Network errors
      - alert: NetworkErrors
        expr: increase(node_network_receive_errs_total[5m]) > 0 or increase(node_network_transmit_errs_total[5m]) > 0
        for: 1m
        annotations:
          summary: "Network errors on {{ $labels.instance }}"

2. Logging: journald β†’ Aggregation

journald (Local, Already Covered)

Recap:

  • Persistent storage: /var/log/journal/
  • Query: journalctl -u SERVICE, journalctl -b, journalctl --since
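
If persistent storage is not enabled yet, creating the journal directory (or setting Storage=persistent in /etc/systemd/journald.conf) switches journald to persistent mode:

# Enable persistent journal storage
sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald

# Verify
journalctl --disk-usage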

Log Aggregation: Choices

Tool         Setup                      Complexity  Cost       Notes
rsyslog      TCP/syslog forwarding      Low         Free       Traditional, works with old systems
Vector       TOML config, Rust          Medium      Free       Modern, efficient, rich transforms
Fluent Bit   Lightweight C daemon       Medium      Free       Small footprint, good for containers
Loki         Log aggregator (Grafana)   High        Free/Paid  Queryable like Prometheus

Option 1: rsyslog (Simple, Traditional)

Forward journald logs to remote syslog server:

Client (/etc/rsyslog.d/forwarding.conf):

# Send all logs to the remote syslog server over TCP (@@ = TCP)
*.* @@syslog.example.com:514
# Or over UDP (@ = UDP; lower overhead, but lossy):
# *.* @syslog.example.com:514

Server (/etc/rsyslog.d/server.conf):

# Listen for syslog
$ModLoad imtcp
$InputTCPServerRun 514

# Save by hostname & program
$DirCreateMode 0755
$FileCreateMode 0644
$Umask 0022
$WorkDirectory /var/lib/rsyslog

template(name="DynFile" type="string" string="/var/log/rsyslog/%HOSTNAME%/%syslogtag%.log")
*.* ?DynFile
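
A quick end-to-end check (hostnames as in the placeholders above):

# Client: apply the forwarding config, then emit a test message
sudo systemctl restart rsyslog
logger -t forward-test "hello from $(hostname)"

# Server: apply server.conf, then confirm the message landed under the client's hostname directory
sudo systemctl restart rsyslog
sudo grep -r "hello from" /var/log/rsyslog/ | tail -1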

Option 2: Vector (Modern, Flexible)

Install:

# Install Vector
curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | sh

# Or via DNF/APT (after adding the Vector package repository)
dnf install vector
apt install vector

Config (/etc/vector/vector.toml):

# Read from journald
[sources.journald]
type = "journald"

# Optional: only collect err (priority 3) and worse
# include_matches.PRIORITY = ["0", "1", "2", "3"]

# Parse JSON in some logs
[transforms.parse_json]
type = "remap"
inputs = ["journald"]
source = '''
# Attach parsed JSON when the message body is JSON; otherwise leave .json null
.json = parse_json(.message) ?? null
'''

# Ship to HTTP endpoint (e.g., Loki, Datadog)
[sinks.http]
type = "http"
inputs = ["parse_json"]
uri = "https://logs.example.com/api/v1/push"
method = "post"
encoding.codec = "json"

# Or, file rotation (local backup)
[sinks.files]
type = "file"
inputs = ["journald"]
path = "/var/log/vector/{{ host }}/%Y-%m-%d.log"   # per-host daily files (strftime + {{ field }} templating)
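
Before enabling the service, Vector can check the config itself:

# Validate sources, transforms, and sinks
vector validate /etc/vector/vector.toml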

Start:

sudo systemctl enable vector
sudo systemctl start vector
sudo systemctl status vector

# Monitor
journalctl -u vector -f

Option 3: Fluent Bit (Lightweight, Container-Friendly)

Install:

apt install fluent-bit
# or
dnf install fluent-bit

Config (/etc/fluent-bit/fluent-bit.conf):

[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name              systemd
    Tag               journald.*
    Read_From_Tail    On

[FILTER]
    Name             modify
    Match            *
    Add              hostname ${HOSTNAME}
    Add              environment production

[OUTPUT]
    Name   stackdriver
    Match  *
    google_service_credentials /path/to/credentials.json
    resource             k8s_node
    resource_labels      project_id=my-project,location=us-central1,node_id=${HOSTNAME}

# Or, forward to Loki
[OUTPUT]
    Name   loki
    Match  *
    Host   loki.example.com
    Port   3100
    Labels job=systemd, hostname=${HOSTNAME}

Start:

sudo systemctl enable fluent-bit
sudo systemctl start fluent-bit
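
Fluent Bit can parse the config without starting the pipeline, which catches typos before a restart (the binary may live under /opt/fluent-bit/bin/ depending on the package):

# Dry run: load and validate the config, then exit
fluent-bit -c /etc/fluent-bit/fluent-bit.conf --dry-run

# Watch the daemon's own logs once running
journalctl -u fluent-bit -f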

3. eBPF Tools: 5 Quick Wins

Why eBPF

  • Zero-code-change tracing: Hook syscalls, kernel functions without recompiling
  • Production-safe: Runs in sandboxed kernel VM
  • Low-overhead: Minimal CPU impact

Prerequisites

# Install BCC (BPF Compiler Collection) tools
apt install bpfcc-tools linux-headers-$(uname -r)   # Ubuntu/Debian: tools install as /usr/sbin/<tool>-bpfcc
# or
dnf install bcc bcc-tools kernel-devel              # RHEL/Fedora: tools install under /usr/share/bcc/tools/

# Or, install bpftrace (higher-level)
apt install bpftrace
# or
dnf install bpftrace

# Kernel requirement: β‰₯ 4.9 (ideally β‰₯ 5.0)
uname -r
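
A quick smoke test that the toolchain works against the running kernel (bpftrace shown; root required):

# List execve tracepoints: confirms bpftrace can see kernel tracepoints
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_exec*'

# Count syscalls per process for 5 seconds, then print the map and exit
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { exit(); }'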

5 Quick Wins

1. execsnoop: Trace Process Execution

What: Shows every process spawned (command, PID, PPID, execve return value, arguments)
Use case: Debug which processes are running, unexplained resource usage
Command:

sudo /usr/share/bcc/tools/execsnoop -h
# or
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'

# Real example:
sudo /usr/share/bcc/tools/execsnoop
# Output:
# PCOMM         PID     PPID    RET ARGS
# bash          1234    1200    0   /bin/ls -la
# python        1235    1234    0   python script.py

5-minute triage:

  • Run execsnoop for 60 seconds (one-liner below)
  • Look for unexpected processes (malware, resource hogs)
  • Check PID, PPID, command line
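
One way to run the 60-second capture and summarize it afterwards (the log path is just an example):

# Capture process executions; press Ctrl-C after ~60 seconds
sudo /usr/share/bcc/tools/execsnoop > /tmp/execsnoop.log

# Most frequently spawned commands during the window (column 1 = PCOMM)
awk 'NR > 1 {print $1}' /tmp/execsnoop.log | sort | uniq -c | sort -rn | head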

2. opensnoop: Trace File Open Operations

What: Shows every file opened (path, flags, PID, success/failure)
Use case: Find performance bottlenecks (config file re-reads), unexpected file access
Command:

sudo /usr/share/bcc/tools/opensnoop -h

# Filter by process name
sudo /usr/share/bcc/tools/opensnoop -n nginx

# Real example:
sudo /usr/share/bcc/tools/opensnoop
# Output:
# PID    COMM    FD ERR PATH
# 1234   nginx   10 0   /etc/nginx/nginx.conf
# 1234   nginx   11 0   /var/log/nginx/access.log
# 1235   python  -1 2   /tmp/config.json (file not found)

5-minute triage:

  • Look for repeated opens of the same file (cache miss, config re-read); see the count one-liner below
  • Check for permission errors (FD=-1, ERR=13)
  • Identify I/O bottlenecks
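
To make repeated opens stand out, capture a short window and count opens per path (example log path; PATH is the last column):

# Capture file opens; press Ctrl-C after ~10 seconds
sudo /usr/share/bcc/tools/opensnoop > /tmp/opensnoop.log

# Paths opened most often during the window
awk 'NR > 1 {print $NF}' /tmp/opensnoop.log | sort | uniq -c | sort -rn | head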

3. tcplife: Trace TCP Connection Lifecycle

What: Shows TCP connections (source, dest, port, duration, bytes)
Use case: Slow connections, connection leaks, bandwidth analysis
Command:

sudo /usr/share/bcc/tools/tcplife -h

# Show active connections
sudo /usr/share/bcc/tools/tcplife

# Real example:
# PID   COMM    LADDR    LPORT RADDR    RPORT TX_KB RX_KB MS
# 1234  curl    10.0.0.5 45678 1.2.3.4  443   5     10    1234
# 1235  nginx   0.0.0.0  80    10.0.0.6 54321 100   200   500

5-minute triage:

  • Check connection duration (MS column): slow = timeout issue?
  • Look at bytes transferred (TX_KB, RX_KB): bandwidth saturation?
  • Find long-lived idle connections (might need keep-alive tuning)
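
To focus on a single service, tcplife can filter by port; the ports below (443, 5432) are just examples:

# Only connections terminating on local port 443 (e.g. an HTTPS frontend)
sudo /usr/share/bcc/tools/tcplife -L 443

# Only connections to remote port 5432 (e.g. an upstream PostgreSQL)
sudo /usr/share/bcc/tools/tcplife -D 5432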

4. biolatency: Measure Block I/O Latency

What: Histogram of I/O latency (how long disk requests take)
Use case: Identify I/O bottlenecks (slow disk, queue saturation)
Command:

sudo /usr/share/bcc/tools/biolatency -h

# Run for 10 seconds, show histogram
sudo /usr/share/bcc/tools/biolatency 10 1

# Real example:
# Tracing block I/O latency... Hit Ctrl+C to end
# usecs         : count     distribution
# 0 -> 63       : 0         |                                      |
# 64 -> 127     : 120       |*****                                 |
# 128 -> 255    : 520       |**************************************|
# 256 -> 511    : 180       |*************                         |
# 512 -> 1023   : 40        |***                                   |
# 1024 -> 2047  : 8         |                                      |

5-minute triage:

  • Most I/O under ~1 ms (buckets up to ~1024 usecs) = healthy SSD/NVMe
  • A large share above ~10 ms = slow disk or I/O queue saturation
  • Compare before/after tuning to measure improvement
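
With several disks attached, per-device histograms make the slow one obvious; -D (per-disk) and -Q (include OS-queued time) are standard biolatency flags:

# 10-second window, one histogram per disk
sudo /usr/share/bcc/tools/biolatency -D 10 1

# Same window, including time spent queued in the OS (vs. pure device latency)
sudo /usr/share/bcc/tools/biolatency -Q 10 1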

5. offcputime: Find Threads Blocked/Sleeping

What: Sums the time threads spend OFF-CPU (blocked, sleeping, waiting), grouped by stack trace
Use case: Identify why app is slow (lock contention, I/O wait, context switch)
Command:

sudo /usr/share/bcc/tools/offcputime -h

# Run for 10 seconds (-u limits output to user-space threads)
sudo /usr/share/bcc/tools/offcputime -u 10

# Real example:
# Tracing off-CPU time (excluding idle) for 10 seconds
# Thread stacks (>= 100 usecs):
#
#  kernel function stack
#    sys_futex
#    do_futex
#    futex_wait_queue_me
#    schedule
#
#   user function stack
#    pthread_cond_wait
#    worker_thread
#    main
#
#   off-CPU time: 5.234 seconds

5-minute triage:

  • Large off-CPU time = app blocked (waiting for locks, I/O, network)
  • Identify which mutex/condition variable is blocking
  • Look for runnable threads (ready but not scheduled): CPU contention
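
For longer investigations, folded output plus the FlameGraph scripts renders a browsable off-CPU flame graph (-d delimits kernel/user stacks, -f emits folded stacks; paths are examples):

# 30-second capture in folded stack format
sudo /usr/share/bcc/tools/offcputime -df 30 > /tmp/offcpu.stacks

# Render with the FlameGraph scripts (https://github.com/brendangregg/FlameGraph)
./FlameGraph/flamegraph.pl --color=io --countname=us < /tmp/offcpu.stacks > /tmp/offcpu.svg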

4. systemd Unit Status Monitoring

View All Units

# List all running services
systemctl list-units --type=service --state=running

# List failed services
systemctl list-units --type=service --state=failed
systemctl list-units --failed

# Show service details
systemctl status myapp.service
systemctl show myapp.service

# Show active cgroup (resource usage)
systemctl show --property=MemoryCurrent myapp.service
systemctl show --property=CPUUsageNSec myapp.service

Monitor in Real-Time

# Follow service logs
journalctl -u myapp.service -f

# Show service restarts
journalctl -u myapp.service | grep -i restart

# Monitor resource usage (requires cgroup v2)
watch -n 1 'systemctl show --property=MemoryCurrent,CPUUsageNSec myapp.service'
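
For a live, top-like view across all units (driven by cgroup accounting, no per-service setup needed):

# Per-unit CPU, memory, tasks, and I/O, refreshed every second
systemd-cgtop -d 1

# One-shot batch snapshot, heaviest memory users first
systemd-cgtop -m -b -n 1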

5. 5-Minute Performance Triage Flowchart

Quick Decision Tree

START: System Slow / High CPU / Out of Memory / I/O Stalled

β”œβ”€ Q1: Check Load Average
β”‚  β”œβ”€ load > num_cores? YES β†’ Continue
β”‚  └─ load ~ 0? β†’ Check latency (query slow? upstream lag?)
β”‚
β”œβ”€ Q2: CPU Usage High (> 80%)?
β”‚  β”œβ”€ YES β†’ Go to CPU Triage
β”‚  └─ NO β†’ Continue to Q3
β”‚
β”œβ”€ Q3: Memory Usage High (> 85%)?
β”‚  β”œβ”€ YES β†’ Go to Memory Triage
β”‚  └─ NO β†’ Continue to Q4
β”‚
β”œβ”€ Q4: I/O Saturation High (iostat: %util > 80%)?
β”‚  β”œβ”€ YES β†’ Go to I/O Triage
β”‚  └─ NO β†’ Continue to Q5
β”‚
└─ Q5: Network Errors or Saturation?
   β”œβ”€ YES β†’ Go to Network Triage
   └─ NO β†’ Latency issue (app-level) or external

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CPU TRIAGE (if load high):
1. top -bn1 | head -20                    # Find hot process
2. perf top -p PID                        # Where is CPU spent?
3. sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'  # Kernel stacks sampled at 99 Hz

Decision:
β”œβ”€ Busy kernel β†’ Check syscalls (strace, eBPF)
β”œβ”€ Busy user β†’ Check code (perf, profiler)
└─ Many context switches β†’ CPU contention, reduce threads

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

MEMORY TRIAGE (if > 85% used):
1. free -h                                # Overall memory
2. ps aux | sort -k6 -rn | head -10      # Top processes by RSS
3. cat /proc/pressure/memory              # PSI stall %

Decision:
β”œβ”€ One process huge β†’ Kill/restart/profile
β”œβ”€ Multiple processes β†’ OOM killer pending
β”‚  - Check /proc/sysrq-trigger (dmesg for OOM kills)
β”‚  - Increase swap or reduce workload
└─ Swapping high (si/so in vmstat) → add RAM / shrink the working set; tune vm.swappiness

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

I/O TRIAGE (if %util > 80%):
1. iostat -x 1 5                          # Per-device await, aqu-sz, %util
2. iotop                                  # Processes by I/O
3. sudo /usr/share/bcc/tools/biolatency 10 1   # I/O latency distribution

Decision:
β”œβ”€ High await + low svctm β†’ I/O queue saturation (reduce load)
β”œβ”€ High svctm β†’ Slow disk (replace SSD, add cache)
β”œβ”€ High latency spread β†’ Scheduler issue (check: deadline vs. noop)
└─ One process hogging β†’ Kill/throttle (cgroup limits)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

NETWORK TRIAGE (if errors or packet loss):
1. ip a                                   # Interface status
2. ethtool -S eth0 | grep -i drop        # NIC drops
3. ss -s                                  # Socket stats
4. sudo /usr/share/bcc/tools/tcplife      # Connection lifecycle (Ctrl-C after ~10 s)

Decision:
β”œβ”€ NIC drops > 0 β†’ Buffer full, increase MTU, reduce load
β”œβ”€ Connection timeouts β†’ Check routing (ip route), firewall
β”œβ”€ High retransmits β†’ Network latency/loss (check upstream)
└─ Port exhaustion (TIME_WAIT) β†’ tcp_tw_reuse, reduce connections

ASCII Triage Commands (Copy-Paste Ready)

# ==== 1-MINUTE SYSTEM OVERVIEW ====
echo "=== LOAD & CPU ==="; uptime; echo; echo "=== MEMORY ==="; free -h; echo; echo "=== DISK ==="; df -h /; echo; echo "=== TOP PROCS ==="; ps aux | sort -k3 -rn | head -5

# ==== CPU HOT SPOTS (interactive; press 'q' to quit) ====
sudo perf top -p $PID --delay 5           # --delay sets the refresh interval, not a run length

# ==== I/O LATENCY (10 seconds) ====
sudo /usr/share/bcc/tools/biolatency 10 1

# ==== PROCESS SYSCALL SUMMARY (~60 seconds) ====
sudo timeout -s INT 60 strace -c -p $PID  # SIGINT makes strace print its -c summary

# ==== NETWORK CONNECTIONS (active) ====
ss -tulnp | head -20

# ==== FULL SYSTEM TRIAGE (60 seconds) ====
echo "1. Load:"; uptime; echo "2. CPU:"; top -bn1 | head -3; echo "3. Memory:"; free -h; echo "4. Disk:"; iostat -x 1 1; echo "5. Network:"; ss -s

Observability Checklist

Day 1: Setup

  • node_exporter installed & running (localhost:9100/metrics)
  • Prometheus scraping node_exporter (15-second interval)
  • Alert rules loaded (high CPU, memory, disk)
  • journald persistent storage enabled (/var/log/journal/)
  • Log aggregation chosen (rsyslog/Vector/Fluent Bit) and configured
  • eBPF tools installed (bcc-tools or bpftrace)

Weekly: Validation

  • Prometheus scraping all targets (Targets page)
  • No missing metrics (check dashboard)
  • Logs flowing to aggregation system
  • Alerts not firing (baseline)
  • Historical data available (past 7 days)

On-Call: Triage

  • 5-minute flowchart followed (CPU/memory/I/O decision)
  • eBPF tool run (execsnoop, opensnoop, biolatency)
  • Root cause identified (app, system, external)
  • Metrics/logs captured for postmortem

Quick Reference: Commands by Use Case

“System Slow”

uptime                          # Load average
top -bn1 | head -15             # CPU, memory, top procs
iostat -x 1 3                   # Disk I/O
ss -s                           # Network stats

“Disk Full”

df -h                           # Disk usage
du -sh /var/log /var/lib        # Largest dirs
sudo ncdu /                     # Interactive disk usage

“Process Unresponsive”

sudo strace -p $PID             # Syscalls (live)
sudo perf record -p $PID -F 99 -- sleep 10  # CPU profile
ps aux | grep $PID              # Memory, CPU%

“Network Slow”

ping -c 10 8.8.8.8              # Latency to external
ss -tulnp                       # Listening ports
ethtool eth0                    # NIC speed
sudo /usr/share/bcc/tools/tcplife  # Connection lifecycle (Ctrl-C after ~30 s)

Further Reading