Executive Summary
Linux is a layered system: from firmware through kernel subsystems to containerized applications. Understanding these layers—and their interdependencies—is critical for reliable, secure, performant infrastructure.
This guide covers:
- Layered architecture (firmware → kernel → userspace → containers)
- Core subsystems: process scheduling, memory, filesystems, networking
- systemd: unit management and service lifecycle
- Production best practices: security, reliability, performance, observability
Note: For detailed boot flow and debugging, see the Linux Boot Flow & Debugging guide.
Layered Architecture
Visual Model
graph TB
subgraph Hardware["🖥️ Hardware & Firmware"]
CPU["CPU<br/>Multi-core"]
MEM["RAM"]
DISK["Storage<br/>NVMe/SSD"]
NIC["Network Card"]
FIRMWARE["UEFI/BIOS"]
end
subgraph Bootloader["🔧 Bootloader"]
GRUB["GRUB 2<br/>Boot selection"]
PARAMS["Kernel Parameters"]
end
subgraph Kernel["🔌 Kernel"]
SCHED["Scheduler<br/>CFS/RT"]
MM["Memory Mgmt<br/>VM/Paging"]
VFS["Filesystems<br/>VFS/ext4/btrfs"]
NET["Network Stack<br/>TCP/IP/eBPF"]
BLOCK["Block Layer<br/>I/O Queue"]
EBPF["eBPF Runtime<br/>Tracing/Policies"]
end
subgraph Userspace["👤 Userspace"]
INIT["init (PID 1)<br/>systemd"]
DAEMONS["Services/Daemons<br/>sshd, nginx, etc"]
SHELL["Shells & CLIs<br/>bash, zsh"]
end
subgraph Containers["📦 Containers"]
NS["Namespaces<br/>PID/Net/Mnt/IPC"]
CGROUPS["Control Groups<br/>CPU/Mem/I/O Limits"]
IMAGES["Container Images<br/>docker/OCI"]
end
subgraph Stacks["🌐 Stacks"]
NETSTACK["Networking<br/>Routing/Firewalls"]
STORAGE["Storage Subsystem<br/>LVM/MD/Ceph"]
end
Hardware --> Bootloader
Bootloader --> Kernel
Kernel --> Userspace
Userspace --> Containers
Kernel -.->|syscalls| Containers
Containers --> Stacks
Kernel --> Stacks
style Hardware fill:#f5f5f5
style Bootloader fill:#ffe6e6
style Kernel fill:#e6f3ff
style Userspace fill:#e6ffe6
style Containers fill:#fff0e6
style Stacks fill:#f0e6ff
Layer Legend
Layer | Purpose | Key Component | Interface |
---|---|---|---|
Hardware | Physical resources | CPU, RAM, NIC, storage | I/O ports, interrupts |
Bootloader | Load kernel into RAM | GRUB 2 | Bootloader → Kernel |
Kernel | OS core (process, memory, I/O, networking) | Scheduler, MM, VFS, TCP/IP | Syscalls (syscall instruction; legacy SYSENTER/int 0x80) |
Userspace | Services and applications | systemd, daemons, shells | Syscalls → Kernel |
Containers | Isolated process groups | Namespaces, cgroups | Kernel subsystems |
Stacks | Networking & storage topology | Routing, LVM, Ceph | Network packets, block I/O |
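To see the syscall interface from the table above in action, strace can summarize the kernel crossings a simple command makes (a quick illustration; exact counts vary by system and coreutils version):
# Summarize syscalls made by a simple command
strace -c ls /tmp
# Summary columns include: % time, seconds, usecs/call, calls, errors, syscall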
Kernel Subsystems
1. Process Scheduler (CFS - Completely Fair Scheduler)
Purpose: Fairly distribute CPU time among processes. (CFS was the default scheduler through kernel 6.5; EEVDF replaced it in 6.6, keeping the same tuning interfaces.)
Key Concepts:
- Run queue: Per-CPU list of runnable tasks
- Virtual runtime (vruntime): Process’s accumulated CPU time (weighted)
- Nice value: -20 (highest priority) to +19 (lowest)
- Load average: Average runnable (plus uninterruptible I/O) tasks over 1/5/15 minutes
Best Practices:
# Check load average
uptime
# output: 12:34 up 10 days, 3:45, 2 users, load average: 0.80, 1.20, 1.50
# Set nice (only superuser can lower nice, i.e., increase priority)
nice -n +5 command # Lower priority
sudo nice -n -5 command # Higher priority
# Realtime priority (only for critical tasks)
chrt -r 10 command # Realtime priority 10
chrt -r 99 command # 99 = highest realtime priority (use with extreme caution)
# Check process scheduling class
ps -o cmd,class,rtprio -C nginx
2. Memory Management (Virtual Memory, Paging)
Purpose: Abstract physical RAM using virtual address spaces.
Key Concepts:
- Pages: 4KB memory units (default on x86)
- Page table: Maps virtual → physical addresses (MMU walks)
- Swapping: Evicts least-used pages to disk (slow!)
- OOM killer: Kills the process with the highest oom_score when memory is exhausted
Best Practices:
# Monitor memory
free -h
# output:
# total used free shared buff/cache available
# Mem: 31Gi 12Gi 2.5Gi 1.2Gi 8.3Gi 17Gi
# Swap status
swapon -s
# Per-process memory
ps aux | awk '{print $6, $11}' | sort -rn | head -20
# RSS (resident set size) = actual RAM used
# Disable swap entirely (dangerous: removes headroom, so OOM kills come sooner)
sudo swapoff -a
# Tune swap eagerness
sudo sysctl vm.swappiness=10 # 0-100: higher = kernel swaps more readily
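Building on the OOM killer concept above, a minimal sketch of shielding a critical daemon from OOM kills (sshd is used purely as an illustration):
# oom_score_adj ranges -1000..1000; -1000 exempts the process from OOM kills
echo -1000 | sudo tee /proc/$(pidof -s sshd)/oom_score_adj
# Inspect the kernel's current badness score for the process
cat /proc/$(pidof -s sshd)/oom_score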
3. Virtual Filesystem (VFS)
Purpose: Abstract block device I/O through a unified filesystem interface.
Key Concepts:
- Inode: File metadata (permissions, size, pointers to blocks)
- Dentry: Filename → inode mapping (cached in dcache)
- Mountpoint: Attach filesystem to directory tree
Best Practices:
# Check filesystem space
df -h
# output:
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 50G 35G 12G 75% /
# /dev/sda2 100G 80G 15G 82% /home
# Inode exhaustion check
df -i
# If inodes 100% used, you can't create files (even with space left!)
# Monitor filesystem I/O
iostat -x 1
# key columns: r/s (reads/sec), w/s (writes/sec), %util (busy%)
# Identify slow I/O operations
iotop
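To connect this back to inodes, stat prints a file's inode metadata directly:
# Show inode number, link count, permissions, and timestamps
stat /etc/hostname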
4. Network Stack (TCP/IP)
Purpose: Handle network packet transmission and reception.
Key Concepts:
- Network interface: eth0, wlan0 (L2 - MAC)
- IP routing: Forwarding packets based on destination IP
- TCP/UDP: L4 protocols (connection-oriented vs connectionless)
- Netfilter: Kernel packet filter (iptables/nftables)
Best Practices:
# Check network interfaces
ip link show
# output:
# 1: lo: <LOOPBACK,UP,LOWER_UP> ...
# 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
# Check IP routing
ip route show
# output:
# default via 10.0.0.1 dev eth0
# 10.0.0.0/24 dev eth0 scope link
# Monitor network traffic (per-interface)
ifstat -i eth0 1
# Monitor active connections
ss -tuln
# output:
# LISTEN 0 128 0.0.0.0:22 0.0.0.0:* (SSH)
# LISTEN 0 128 0.0.0.0:80 0.0.0.0:* (HTTP)
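The Netfilter bullet above maps to nftables on modern kernels; a quick way to inspect the active ruleset (empty output simply means no rules are loaded):
# Dump all loaded netfilter/nftables rules
sudo nft list ruleset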
5. Block Layer (I/O Scheduling)
Purpose: Queue and prioritize disk I/O requests.
Key Concepts:
- I/O scheduler: none (fast NVMe), mq-deadline (low latency), kyber (throughput), bfq (fairness)
- Request queue: Pending read/write operations
- Deadline: Fairness guarantees (starvation prevention)
Best Practices:
# Check I/O scheduler
cat /sys/block/sda/queue/scheduler
# output: [none] mq-deadline kyber bfq
# Change scheduler (for SSD)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# Persistent change (systemd-udev)
# /etc/udev/rules.d/60-ioscheduler.rules:
# ACTION=="add|change", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"
# Check I/O performance
iostat -x 1 | grep sda
# key metric: %util (% time device had I/O in progress)
6. eBPF Runtime (Extended Berkeley Packet Filter)
Purpose: Run sandboxed programs in kernel for tracing, networking, security policies.
Key Concepts:
- In-kernel VM: eBPF bytecode compiled to native CPU instructions
- Low-overhead tracing: Attach to kernel functions without breakpoints
- XDP (eXpress Data Path): High-speed packet processing before stack
Best Practices:
# List loaded eBPF programs
bpftool prog list
# Trace syscalls with trace-cmd (ftrace-based; strace-like, but kernel-level)
sudo trace-cmd record -e syscalls sleep 1
sudo trace-cmd report | head -20
# XDP load-balancing (advanced)
# (Requires XDP-capable NIC driver and libbpf)
# Monitor with perf (Linux profiler)
sudo perf top # Top CPU functions (kernel + userspace)
sudo perf stat sleep 1 # Aggregate event counts (cache misses, context switches)
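If bpftrace is available, a classic one-liner demonstrates low-overhead eBPF tracing (attaches to the openat tracepoint; requires root):
# Print every file open on the system, with the opening process name
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'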
Systemd: Init System & Service Manager
Purpose: Start services in dependency order, manage resources, handle restarts.
Unit Types
- service: Long-running daemon
- socket: Listen on socket, spawn service on connection
- timer: Periodic task (cron replacement; see the timer sketch after the service example below)
- mount: Filesystem mount
- target: Group of units (like runlevel)
Example Service File
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target myapp-setup.service
Requires=myapp-setup.service
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp.conf
Restart=on-failure
RestartSec=5s
User=myapp
Group=myapp
# Resource limits
MemoryMax=512M
CPUQuota=50%
TasksMax=1000
# Logging
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
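The unit-type list above calls timers a cron replacement; here is a minimal sketch, assuming a hypothetical myapp-cleanup.service exists for it to activate:
# /etc/systemd/system/myapp-cleanup.timer
[Unit]
Description=Run myapp cleanup daily
[Timer]
OnCalendar=daily
# Run immediately on boot if a scheduled run was missed while powered off
Persistent=true
[Install]
WantedBy=timers.target
Enable it with systemctl enable --now myapp-cleanup.timer; systemd starts the matching .service at each trigger.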
Best Practices:
# Reload systemd units
sudo systemctl daemon-reload
# Enable (start on boot)
sudo systemctl enable myapp.service
# Start, stop, restart
sudo systemctl start myapp
sudo systemctl status myapp
sudo systemctl restart myapp
# View service logs
journalctl -u myapp.service -n 50
journalctl -u myapp.service -f # Follow (tail -f)
# Check dependencies
systemctl list-dependencies multi-user.target
# Analyze system state
systemd-analyze plot > system-boot.svg
systemd-analyze critical-chain
Containers: Namespaces & cgroups
Linux Namespaces (Process Isolation)
Namespace | Isolates |
---|---|
PID | Process IDs (container PID 1 = host PID N) |
Network | Network interfaces, routing, iptables |
Mount | Filesystem mounts (container gets its own root via pivot_root) |
UTS | Hostname and domain name |
IPC | Inter-process communication (message queues, shared memory) |
User | UIDs/GIDs (container root ≠ host root) |
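A hands-on way to feel namespaces without a container runtime is unshare from util-linux:
# New PID + mount namespaces; --mount-proc remounts /proc so ps sees only this namespace
sudo unshare --pid --fork --mount-proc bash
ps aux # Inside: bash is PID 1 and host processes are invisible
exit # Leave the namespace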
cgroups (Resource Limits)
# Check cgroup hierarchy
mount | grep cgroup2
# Kubernetes pod cgroup example
# /sys/fs/cgroup/kubepods/pod-xxx/container-yyy/
# Set CPU limit (cgroup v2: quota and period in one file; 100000/100000 µs = 1 full CPU)
echo "100000 100000" > cpu.max
# Set memory limit (512MB)
echo 512M > memory.max
# Monitor cgroup usage
cat memory.current
cat cpu.stat
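In day-to-day use you rarely echo into cgroup files by hand; systemd-run wraps the same mechanism in a transient unit (a sketch; command stands for your workload):
# Run a command in its own cgroup with memory and CPU caps
sudo systemd-run --scope -p MemoryMax=512M -p CPUQuota=50% command
# Browse the resulting cgroup tree
systemd-cgls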
Production Best Practices
Security Hardening
# 1. Kernel parameters (/etc/sysctl.conf)
net.ipv4.ip_forward = 0 # Disable IP forwarding (if not router)
net.ipv4.tcp_syncookies = 1 # SYN flood protection
kernel.dmesg_restrict = 1 # Restrict kernel logs
kernel.kptr_restrict = 2 # Hide kernel pointers (protects KASLR)
fs.file-max = 2097152 # Increase file descriptor limit
# Apply
sudo sysctl -p
# 2. LSM (Linux Security Module): AppArmor or SELinux
sudo aa-status # AppArmor
getenforce # SELinux
# 3. Firewall (iptables/nftables)
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT # SSH
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
sudo iptables -P INPUT DROP # Default deny
# 4. File integrity monitoring (AIDE; detects tampering, including many rootkits)
sudo aide -i # Create baseline
sudo aide --check # Detect tampering
# 5. Update kernel regularly
sudo apt update && sudo apt upgrade
# Subscribe to kernel security bulletins
Reliability & Observability
# 1. Monitor system health
top, htop, btop # Process monitoring
iostat, iotop # I/O monitoring
vmstat # Virtual memory stats
netstat, ss # Network statistics
# 2. Logging
journalctl -n 100 # System journal (systemd)
tail -f /var/log/auth.log # Authentication log
# 3. Alerts
systemctl status # Service status
systemd-analyze verify myapp.service # Unit file syntax check
# 4. Predictive maintenance
sudo smartctl -a /dev/sda # Disk health (SMART)
sudo dmesg | tail # Kernel warnings/errors (sudo required once dmesg_restrict=1)
Performance Tuning
# CPU affinity (pin process to cores)
taskset -c 0,1 command # Run on cores 0,1
taskset -cp 0,1 $$ # Current shell to cores 0,1
# Transparent Hugepage (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
# output: always [madvise] never
# Disable THP (if causing latency)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Disable C-states (CPU idle states) for latency-sensitive apps
# (Requires BIOS setting + idle=poll)
# Network tuning
sudo ethtool -C eth0 adaptive-rx on
sudo ethtool -C eth0 rx-usecs 100 # RX interrupt coalescing
Checklists
Pre-Production Deployment
- Kernel version pinned (security updates on schedule)
- Filesystem type chosen (ext4 for stability, btrfs for COW, NFS for scale)
- Memory overcommit disabled (vm.overcommit_memory = 2; the default 0 is heuristic, 1 always overcommits)
- Swap configured (ratio: 0.5× RAM for throughput, 1× RAM for batch jobs)
- I/O scheduler optimized (mq-deadline for latency, kyber for throughput)
- Firewall rules active (default deny inbound)
- SELinux or AppArmor enabled
- Log rotation configured (logrotate)
- Monitoring agents installed (Prometheus node-exporter, etc.)
- SSH hardened (password auth disabled, key-based only; see the sshd_config snippet after this list)
- NTP/chrony synced (time critical for logs, certs)
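For the SSH hardening item above, a minimal sketch of the relevant sshd_config directives (check them against your distribution's defaults before deploying):
# /etc/ssh/sshd_config (excerpt)
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
# Apply with: sudo systemctl reload sshd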
Incident Response
- OOM killer tuning (/proc/sys/vm/panic_on_oom)
- Disk space alerts (keep >15% free)
- Runaway process identification (ps aux, perf)
- Network packet capture (tcpdump; see the example after this list)
- Kernel logs (dmesg, journalctl)
- Service restart strategy (systemd auto-restart configured)
- Backups automated (snapshots, replication)
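For the packet-capture item above, a typical tcpdump workflow (eth0 and port 443 are illustrative; adjust to your interface and service):
# Capture to a file for offline analysis; Ctrl-C to stop
sudo tcpdump -i eth0 -w /tmp/capture.pcap port 443
# Read the capture back
tcpdump -r /tmp/capture.pcap | head -20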
FAQ
Q: How do I reduce boot time?
A: Profile (systemd-analyze blame), disable unused services (systemctl disable), increase I/O readahead (hdparm), use an SSD, or resume from hibernation instead of cold-booting (trade-off: power and disk space).
Q: Why is my system slow after a year?
A: Common causes: disk fragmentation, swap thrashing, a memory leak in a daemon, or excessive context switching. Diagnose with iostat, vmstat, and perf.
Q: Can I run Linux without swap? A: Yes, if you have spare RAM and OOM killer configured. Risk: processes killed without warning. Better: small swap on SSD + memory cgroups.
Q: How does containerization differ from VMs? A: Containers: lightweight, share kernel, ~50MB overhead. VMs: heavier, full OS per instance, ~500MB+ overhead. Containers faster to boot and scale.
Conclusion
Linux is complex, but the layered model makes it understandable:
- Hardware → Bootloader → Kernel: devices → code in RAM → OS
- Kernel subsystems: Abstract resources (CPU, memory, storage, network)
- Userspace: Services, systemd orchestration
- Containers: Isolated processes sharing kernel
For production:
- Secure: harden kernel, firewall, monitor
- Reliable: monitor SLOs, auto-restart, backup
- Observable: logs, metrics, tracing (eBPF)
- Performant: profile, tune I/O scheduler, pin processes
Further Reading: