Executive Summary

Linux is a layered system: from firmware through kernel subsystems to containerized applications. Understanding these layers—and their interdependencies—is critical for reliable, secure, performant infrastructure.

This guide covers:

  • Layered architecture (firmware → kernel → userspace → containers)
  • Core subsystems: process scheduling, memory, filesystems, networking
  • systemd: unit management and service lifecycle
  • Production best practices: security, reliability, performance, observability

Note: For detailed boot flow and debugging, see the Linux Boot Flow & Debugging guide.


Layered Architecture

Visual Model

graph TB
    subgraph Hardware["🖥️ Hardware & Firmware"]
        CPU["CPU<br/>Multi-core"]
        MEM["RAM"]
        DISK["Storage<br/>NVMe/SSD"]
        NIC["Network Card"]
        FIRMWARE["UEFI/BIOS"]
    end

    subgraph Bootloader["🔧 Bootloader"]
        GRUB["GRUB 2<br/>Boot selection"]
        PARAMS["Kernel Parameters"]
    end

    subgraph Kernel["🔌 Kernel"]
        SCHED["Scheduler<br/>CFS/RT"]
        MM["Memory Mgmt<br/>VM/Paging"]
        VFS["Filesystems<br/>VFS/ext4/btrfs"]
        NET["Network Stack<br/>TCP/IP/eBPF"]
        BLOCK["Block Layer<br/>I/O Queue"]
        EBPF["eBPF Runtime<br/>Tracing/Policies"]
    end

    subgraph Userspace["👤 Userspace"]
        INIT["init (PID 1)<br/>systemd"]
        DAEMONS["Services/Daemons<br/>sshd, nginx, etc"]
        SHELL["Shells & CLIs<br/>bash, zsh"]
    end

    subgraph Containers["📦 Containers"]
        NS["Namespaces<br/>PID/Net/Mnt/IPC"]
        CGROUPS["Control Groups<br/>CPU/Mem/I/O Limits"]
        IMAGES["Container Images<br/>docker/OCI"]
    end

    subgraph Stacks["🌐 Stacks"]
        NETSTACK["Networking<br/>Routing/Firewalls"]
        STORAGE["Storage Subsystem<br/>LVM/MD/Ceph"]
    end

    Hardware --> Bootloader
    Bootloader --> Kernel
    Kernel --> Userspace
    Userspace --> Containers
    Kernel -.->|syscalls| Containers
    Containers --> Stacks
    Kernel --> Stacks

    style Hardware fill:#f5f5f5
    style Bootloader fill:#ffe6e6
    style Kernel fill:#e6f3ff
    style Userspace fill:#e6ffe6
    style Containers fill:#fff0e6
    style Stacks fill:#f0e6ff

Layer Legend

Layer       | Purpose                                     | Key Components               | Interface
Hardware    | Physical resources                          | CPU, RAM, NIC, storage       | I/O ports, interrupts
Bootloader  | Load the kernel into RAM                    | GRUB 2                       | Bootloader → Kernel handoff
Kernel      | OS core (process, memory, I/O, networking)  | Scheduler, MM, VFS, TCP/IP   | Syscalls (syscall, SYSENTER, int 0x80)
Userspace   | Services and applications                   | systemd, daemons, shells     | Syscalls → Kernel
Containers  | Isolated process groups                     | Namespaces, cgroups          | Kernel subsystems
Stacks      | Networking & storage topology               | Routing, LVM, Ceph           | Network packets, block I/O
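
The syscall interface in the legend is easy to observe directly. A quick sketch, assuming strace is installed:

# Watch a userspace command cross into the kernel via syscalls
strace -e trace=openat,read,write cat /etc/hostname
# Summarize syscall counts and time instead of listing each call
strace -c cat /etc/hostname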

Kernel Subsystems

1. Process Scheduler (CFS - Completely Fair Scheduler)

Purpose: Fairly distribute CPU time among processes.

Key Concepts:

  • Run queue: Per-CPU list of runnable tasks
  • Virtual runtime (vruntime): Process’s accumulated CPU time (weighted)
  • Nice value: -20 (highest priority) to +19 (lowest)
  • Load average: Avg runnable (and uninterruptible-sleep) tasks over 1/5/15 minutes

Best Practices:

# Check load average
uptime
# output: 12:34  up 10 days,  3:45,  2 users,  load average: 0.80, 1.20, 1.50

# Set nice (only superuser can lower nice, i.e., increase priority)
nice -n +5 command      # Lower priority
sudo nice -n -5 command # Higher priority

# Realtime priority (only for critical tasks)
chrt -r 10 command      # Realtime priority 10
chrt -r 99 command      # 99 is the highest realtime priority (1 is the lowest); use with extreme care

# Check process scheduling class
ps -o cmd,class,rtprio -C nginx
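
Per-task scheduler state is also visible under /proc. A sketch using a hypothetical PID 1234 (the sched file requires scheduler debugging support, which most distributions enable):

# Per-task scheduler statistics (vruntime, context switches, policy)
cat /proc/1234/sched | head -20
# Change the nice value of an already-running process
sudo renice -n 10 -p 1234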

2. Memory Management (Virtual Memory, Paging)

Purpose: Abstract physical RAM using virtual address spaces.

Key Concepts:

  • Pages: 4KB memory units (default on x86)
  • Page table: Maps virtual → physical addresses (MMU walks)
  • Swapping: Evicts least-used pages to disk (slow!)
  • OOM killer: Kills process if system runs out of memory

Best Practices:

# Monitor memory
free -h
# output:
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi       12Gi       2.5Gi       1.2Gi       8.3Gi        17Gi

# Swap status
swapon -s

# Per-process memory
ps aux | awk '{print $6, $11}' | sort -rn | head -20
# RSS (resident set size) = actual RAM used

# Disable swap (stops swapping of anonymous pages; risky: the OOM killer fires sooner)
sudo swapoff -a

# Tune swap eagerness
sudo sysctl vm.swappiness=10  # 0-100: higher values swap anonymous memory more aggressively
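
To check whether the OOM killer has fired and how heavy a given process is, a sketch (PID 1234 is hypothetical):

# Look for past OOM kills
sudo dmesg -T | grep -i 'killed process'
journalctl -k | grep -i 'out of memory'
# Per-process resident and swapped memory, plus OOM score
grep -E 'VmRSS|VmSwap' /proc/1234/status
cat /proc/1234/oom_score   # Higher score = more likely OOM victim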

3. Virtual Filesystem (VFS)

Purpose: Abstract block device I/O through a unified filesystem interface.

Key Concepts:

  • Inode: File metadata (permissions, size, pointers to blocks)
  • Dentry: Filename → inode mapping (cached in dcache)
  • Mountpoint: Attach filesystem to directory tree

Best Practices:

# Check filesystem space
df -h
# output:
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   35G   12G  75% /
# /dev/sda2       100G   80G   15G  82% /home

# Inode exhaustion check
df -i
# If inodes 100% used, you can't create files (even with space left!)

# Monitor filesystem I/O
iostat -x 1
# key columns: r/s (reads/sec), w/s (writes/sec), %util (busy%)

# Identify slow I/O operations
iotop
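
Inodes, mountpoints, and the dentry cache can all be inspected from the shell; a brief sketch:

# Show a file's inode number and metadata
stat /etc/hostname
# View the mount tree (which filesystem backs which directory)
findmnt
# Dentry cache counters (used vs. unused entries)
cat /proc/sys/fs/dentry-state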

4. Network Stack (TCP/IP)

Purpose: Handle network packet transmission and reception.

Key Concepts:

  • Network interface: eth0, wlan0 (L2 - MAC)
  • IP routing: Forwarding packets based on destination IP
  • TCP/UDP: L4 protocols (connection-oriented vs connectionless)
  • Netfilter: Kernel packet filter (iptables/nftables)

Best Practices:

# Check network interfaces
ip link show
# output:
# 1: lo: <LOOPBACK,UP,LOWER_UP> ...
# 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...

# Check IP routing
ip route show
# output:
# default via 10.0.0.1 dev eth0
# 10.0.0.0/24 dev eth0 scope link

# Monitor network traffic (per-interface)
ifstat -i eth0 1

# Monitor active connections
ss -tuln
# output:
# LISTEN  0  128  0.0.0.0:22  0.0.0.0:*  (SSH)
# LISTEN  0  128  0.0.0.0:80  0.0.0.0:*  (HTTP)
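
A few additional commands help when chasing packet-level problems; a sketch (the interface name eth0 is an assumption):

# Per-interface packet and error counters
ip -s link show eth0
# Protocol-level summary (TCP retransmits, resets, etc.)
ss -s
nstat | head -20
# Capture 10 packets on port 443 (assumes tcpdump is installed)
sudo tcpdump -i eth0 -nn -c 10 port 443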

5. Block Layer (I/O Scheduling)

Purpose: Queue and prioritize disk I/O requests.

Key Concepts:

  • I/O scheduler: mq-deadline (simple, bounded latency), kyber (fast multi-queue devices), bfq (fairness), none (pass-through)
  • Request queue: Pending read/write operations
  • Deadline: Per-request expiry times that prevent starvation

Best Practices:

# Check I/O scheduler
cat /sys/block/sda/queue/scheduler
# output: [none] mq-deadline kyber bfq

# Change scheduler (for SSD)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Persistent change (systemd-udev)
# /etc/udev/rules.d/60-ioscheduler.rules:
# ACTION=="add|change", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"

# Check I/O performance
iostat -x 1 | grep sda
# key metric: %util (% time device had I/O in progress)
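
For a quick device baseline, fio is a common benchmark. A sketch, assuming fio is installed and /tmp is disk-backed with spare space:

# 4K random-read test against a temporary file
fio --name=randread --filename=/tmp/fio.test --rw=randread --bs=4k \
    --size=256M --ioengine=libaio --direct=1 --runtime=15 --time_based
rm /tmp/fio.test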

6. eBPF Runtime (Extended Berkeley Packet Filter)

Purpose: Run sandboxed programs in kernel for tracing, networking, security policies.

Key Concepts:

  • In-kernel VM: eBPF bytecode compiled to native CPU instructions
  • Low-overhead tracing: Attach to kernel functions without breakpoints
  • XDP (eXpress Data Path): High-speed packet processing before stack

Best Practices:

# List loaded eBPF programs
bpftool prog list

# Trace syscalls via ftrace/trace-cmd (kernel-level; complements eBPF tooling)
sudo trace-cmd record -e syscalls sleep 1
sudo trace-cmd report | head -20

# XDP load-balancing (advanced)
# (Requires XDP-capable NIC driver and libbpf)

# Monitor with perf (Linux profiler)
sudo perf top          # Top CPU functions (kernel + userspace)
sudo perf stat sleep 1 # Aggregate event counts (cache misses, context switches)
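
For eBPF-based tracing specifically (trace-cmd and perf above use ftrace and perf events), bpftrace provides one-liners. A sketch, assuming bpftrace is installed:

# List available syscall tracepoints
sudo bpftrace -l 'tracepoint:syscalls:*' | head
# Print every exec() on the system with the calling command
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'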

Systemd: Init System & Service Manager

Purpose: Start services in dependency order, manage resources, handle restarts.

Unit Types

  • service: Long-running daemon
  • socket: Listen on socket, spawn service on connection
  • timer: Periodic task (cron replacement)
  • mount: Filesystem mount
  • target: Group of units (like runlevel)

Example Service File

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
Requires=myapp-setup.service

[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp.conf
Restart=on-failure
RestartSec=5s
User=myapp
Group=myapp

# Resource limits
MemoryMax=512M
CPUQuota=50%
TasksMax=1000

# Logging
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Best Practices:

# Reload systemd units
sudo systemctl daemon-reload

# Enable (start on boot)
sudo systemctl enable myapp.service

# Start, stop, restart
sudo systemctl start myapp
sudo systemctl status myapp
sudo systemctl restart myapp

# View service logs
journalctl -u myapp.service -n 50
journalctl -u myapp.service -f  # Follow (tail -f)

# Check dependencies
systemctl list-dependencies multi-user.target

# Analyze system state
systemd-analyze plot > system-boot.svg
systemd-analyze critical-chain
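
As a sketch of the timer unit type listed above (the unit names are hypothetical), a periodic job pairs a .timer with a matching .service:

# /etc/systemd/system/myapp-cleanup.timer (runs myapp-cleanup.service)
[Unit]
Description=Run myapp cleanup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# Enable and inspect:
#   sudo systemctl enable --now myapp-cleanup.timer
#   systemctl list-timers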

Containers: Namespaces & cgroups

Linux Namespaces (Process Isolation)

Namespace | Isolates
PID       | Process IDs (container PID 1 maps to some host PID N)
Network   | Network interfaces, routing, iptables
Mount     | Filesystem mount points (the container gets its own root via pivot_root)
UTS       | Hostname and domain name
IPC       | Inter-process communication (message queues, shared memory)
User      | UIDs/GIDs (container root ≠ host root)
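
Namespaces can be explored directly with util-linux tools; a sketch (PID 1234 is hypothetical):

# List namespaces in use on the host
lsns
# Start a shell in new PID and mount namespaces; inside, ps sees only itself
sudo unshare --pid --fork --mount-proc bash
# Enter an existing process's network namespace
sudo nsenter -t 1234 -n ip addr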

cgroups (Resource Limits)

# Check cgroup hierarchy
mount | grep cgroup2

# Kubernetes pod cgroup example
# /sys/fs/cgroup/kubepods/pod-xxx/container-yyy/

# Set CPU limit (cgroup v2 cpu.max: "<quota> <period>" in microseconds; 100000 100000 = 1 full CPU)
echo "100000 100000" > cpu.max

# Set memory limit (512MB)
echo 512M > memory.max

# Monitor cgroup usage
cat memory.current
cat cpu.stat
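
On systemd hosts you rarely write these files by hand; systemd-run creates a transient unit with the same limits. A sketch (the stress workload is an assumption; any command works):

# Run a command in its own cgroup with memory and CPU caps
sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=50% -- stress --cpu 1 --timeout 30
# Live per-cgroup resource view
systemd-cgtop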

Production Best Practices

Security Hardening

# 1. Kernel parameters (/etc/sysctl.conf)
net.ipv4.ip_forward = 0           # Disable IP forwarding (if not router)
net.ipv4.tcp_syncookies = 1       # SYN flood protection
kernel.dmesg_restrict = 1          # Restrict kernel logs
kernel.kptr_restrict = 2           # Hide kernel pointers (avoids leaking KASLR offsets)
fs.file-max = 2097152             # Increase file descriptor limit

# Apply
sudo sysctl -p

# 2. LSM (Linux Security Module): AppArmor or SELinux
sudo aa-status                     # AppArmor
getenforce                        # SELinux

# 3. Firewall (iptables/nftables; an nftables sketch follows this block)
sudo iptables -A INPUT -i lo -j ACCEPT                                # Loopback
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT                    # SSH
sudo iptables -P INPUT DROP                                           # Default deny (set after the accept rules)

# 4. File integrity monitoring / rootkit detection (AIDE)
sudo aide --init                   # Create baseline database
sudo aide --check                  # Detect tampering against the baseline

# 5. Update kernel regularly
sudo apt update && sudo apt upgrade
# Subscribe to kernel security bulletins
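
An nftables sketch equivalent to the iptables rules above (on a remote host, add the accept rules before flipping the policy to drop):

sudo nft add table inet filter
sudo nft add chain inet filter input '{ type filter hook input priority 0; policy accept; }'
sudo nft add rule inet filter input iif lo accept
sudo nft add rule inet filter input ct state established,related accept
sudo nft add rule inet filter input tcp dport 22 accept
sudo nft add chain inet filter input '{ policy drop; }'
sudo nft list ruleset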

Reliability & Observability

# 1. Monitor system health
top, htop, btop          # Process monitoring
iostat, iotop            # I/O monitoring
vmstat                   # Virtual memory stats
netstat, ss              # Network statistics

# 2. Logging
journalctl -n 100        # System journal (systemd)
tail -f /var/log/auth.log # Authentication log

# 3. Service health checks
systemctl status                        # Overall system and unit state
systemd-analyze verify myapp.service    # Unit file syntax check

# 4. Predictive maintenance
smartctl -a /dev/sda     # Disk health (SMART)
dmesg | tail             # Kernel warnings/errors
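
A minimal triage sketch combining the tools above:

# Failed units and error-level messages since boot
systemctl --failed
journalctl -p err -b | tail -50
# Fullest filesystems first
df -h --output=target,pcent,avail | sort -k2 -rn | head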

Performance Tuning

# CPU affinity (pin process to cores)
taskset -c 0,1 command   # Run on cores 0,1
taskset -cp 0,1 $$       # Current shell to cores 0,1

# Transparent Hugepage (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
# output: always [madvise] never

# Disable THP (if causing latency)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Disable C-states (CPU idle states) for latency-sensitive apps
# (Requires BIOS setting + idle=poll)

# Network tuning
sudo ethtool -C eth0 adaptive-rx on
sudo ethtool -C eth0 rx-usecs 100  # RX interrupt coalescing
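
CPU frequency scaling also affects latency. A sketch, assuming the cpupower utility (from linux-tools) is installed:

# Show the current governor and frequency limits
cpupower frequency-info
# Prefer consistent latency over power savings
sudo cpupower frequency-set -g performance
# Per-core governor files (no extra tools needed)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor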

Checklists

Pre-Production Deployment

  • Kernel version pinned (security updates on schedule)
  • Filesystem type chosen (ext4 for stability, btrfs for COW, NFS for scale)
  • Memory overcommit policy set (vm.overcommit_memory = 2 disables overcommit; 0 is the heuristic default, 1 always overcommits)
  • Swap configured (ratio: 0.5× RAM for throughput, 1× RAM for batch jobs)
  • I/O scheduler optimized (mq-deadline for latency, kyber for throughput)
  • Firewall rules active (default deny inbound)
  • SELinux or AppArmor enabled
  • Log rotation configured (logrotate)
  • Monitoring agents installed (Prometheus node-exporter, etc.)
  • SSH hardened (password auth disabled, key-based only; see the sshd_config sketch after this list)
  • NTP/chrony synced (time critical for logs, certs)
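
For the SSH hardening item, a minimal sshd_config sketch (the exact reload command varies by distribution):

# /etc/ssh/sshd_config (excerpt)
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
MaxAuthTries 3
# Validate and apply: sudo sshd -t && sudo systemctl reload sshd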

Incident Response

  • OOM killer tuning (/proc/sys/vm/panic_on_oom)
  • Disk space alerts (keep >15% free)
  • Runaway process identification (ps aux, perf)
  • Network packet capture (tcpdump)
  • Kernel logs (dmesg, journalctl)
  • Service restart strategy (systemd auto-restart configured)
  • Backups automated (snapshots, replication)

FAQ

Q: How do I reduce boot time? A: Profile (systemd-analyze blame), disable unused services (systemctl disable), increase I/O readahead (hdparm), use an SSD, or suspend/hibernate instead of shutting down (trade-off: power).

Q: Why is my system slow after a year? A: Common causes: disk fragmentation, swap thrashing, memory leak in daemon, excessive context switching. Diagnose with iostat, vmstat, perf.

Q: Can I run Linux without swap? A: Yes, if you have spare RAM and OOM killer configured. Risk: processes killed without warning. Better: small swap on SSD + memory cgroups.

Q: How does containerization differ from VMs? A: Containers: lightweight, share kernel, ~50MB overhead. VMs: heavier, full OS per instance, ~500MB+ overhead. Containers faster to boot and scale.


Conclusion

Linux is complex, but the layered model makes it understandable:

  1. Hardware→Bootloader→Kernel: Devices → code in RAM → OS
  2. Kernel subsystems: Abstract resources (CPU, memory, storage, network)
  3. Userspace: Services, systemd orchestration
  4. Containers: Isolated processes sharing kernel

For production:

  • Secure: harden kernel, firewall, monitor
  • Reliable: monitor SLOs, auto-restart, backup
  • Observable: logs, metrics, tracing (eBPF)
  • Performant: profile, tune I/O scheduler, pin processes

Further Reading: