Overview
This is a one-page cheat sheet for Linux kernel subsystems. Each subsystem controls a critical resource; understanding them is essential for troubleshooting, optimization, and security.
Why understanding subsystems matters:
Imagine your server is slow. Without subsystem knowledge, you’re guessing:
- “Maybe add more RAM?” (might be CPU scheduler issue)
- “Maybe faster disk?” (might be memory cache problem)
- “Maybe more CPU?” (might be I/O scheduler misconfiguration)
With subsystem knowledge, you diagnose systematically:
Symptom: Application slow
    ↓
Check: top shows 80% CPU "wait" (wa)
    ↓
Diagnosis: NOT a CPU problem → I/O wait means disk subsystem
    ↓
Check: iostat shows %util=100% on /dev/sda
    ↓
Diagnosis: Block I/O subsystem bottleneck
    ↓
Fix: Check I/O scheduler, investigate slow queries, add SSD
What this guide covers:
Each of the 9 subsystems below answers one question:
- Scheduler: How does Linux decide which process runs next?
- Memory: How does Linux manage RAM, cache, and swap?
- VFS: How does Linux present files to applications?
- Block I/O: How does Linux talk to disks?
- Networking: How does Linux send/receive packets?
- Namespaces: How does Linux isolate containers?
- cgroups: How does Linux limit resource usage?
- LSM: How does Linux enforce security policies?
- eBPF: How does Linux enable custom kernel-level observability?
How to use this guide:
- Learning: Read each subsystem to understand what it does
- Troubleshooting: Jump to subsystem matching your symptom (use the table at bottom)
- Reference: Copy-paste commands when investigating issues
1. Process Scheduler (CPU)
What it is:
- Fairly distributes CPU time among processes using CFS (Completely Fair Scheduler)
- Tracks virtual runtime (vruntime); the process with the lowest vruntime runs next
- Supports real-time (RT) scheduling for critical tasks (separate scheduler)
- Balances load across CPU cores while respecting NUMA locality
Detailed explanation:
CFS (Completely Fair Scheduler) - The “fairness” guarantee:
Think of CFS like a teacher distributing speaking time among students. The student who has spoken the least gets to speak next.
How it works:
- Each process has a vruntime (virtual runtime) counter that tracks total CPU time used
- When the CPU becomes available, CFS picks the process with the LOWEST vruntime
- Process runs for a time slice (~4-6ms by default), then vruntime increases
- Process goes back to queue, next lowest vruntime runs
Example:
Process A: vruntime = 100ms (ran a lot)
Process B: vruntime = 50ms (ran less)
Process C: vruntime = 75ms (middle)
→ CFS picks Process B (lowest = 50ms)
→ B runs for 5ms → vruntime becomes 55ms
→ CFS picks B again (still lowest)
→ B runs for 5ms → vruntime becomes 60ms
→ CFS picks B again... until B's vruntime catches up to the others
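The selection rule above can be sketched as a toy shell loop (an illustration of pick-lowest-vruntime, not kernel code; the 5ms slice is the assumed default from above):

```shell
# Toy CFS model: always run the task with the lowest vruntime,
# then charge it a 5ms slice. Values match the A/B/C example above.
A=100; B=50; C=75
for tick in 1 2 3; do
  min=$A; next=A
  [ "$B" -lt "$min" ] && { min=$B; next=B; }
  [ "$C" -lt "$min" ] && { min=$C; next=C; }
  echo "tick $tick: run $next (vruntime ${min}ms)"
  case $next in A) A=$((A+5));; B) B=$((B+5));; C) C=$((C+5));; esac
done
# B is picked all three times (50 -> 55 -> 60), exactly as traced above.
```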
Why this matters: CPU-intensive processes don’t starve I/O-bound processes. A background video encoder (high vruntime) won’t block your SSH session (low vruntime).
Real-time scheduling - When “fairness” isn’t enough:
What it is: Some tasks can't wait: audio processing must happen within 10ms or you hear glitches. The RT scheduler bypasses CFS entirely.
RT scheduling classes:
- SCHED_FIFO (First In First Out): Process runs until it yields or higher-priority RT process arrives. No time slicing.
- SCHED_RR (Round Robin): Like FIFO but with time slicing among same-priority RT processes.
- SCHED_DEADLINE: Advanced: specify the CPU time needed plus a deadline; the kernel schedules to meet the deadline.
Real-world example:
# Audio processing daemon needs guaranteed CPU
sudo chrt -f 80 /usr/bin/audio-daemon
What happens:
- audio-daemon gets priority 80 (higher = more important)
- When audio-daemon wants CPU, it preempts ALL normal processes
- CFS processes wait until audio-daemon sleeps/finishes
Danger: RT process in infinite loop = system hang. All other processes starve.
Load balancing across cores - Why your 8-core CPU matters:
What it is: If you have 8 CPU cores, scheduler tries to use all 8 evenly. Otherwise core 0 might be 100% busy while core 7 idles.
How it works:
- Scheduler periodically (every ~4ms) checks if cores are imbalanced
- If core 0 has 10 processes and core 1 has 2, the scheduler migrates some from core 0 to core 1
- Respects CPU affinity (taskset binds a process to specific cores)
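Affinity is safe to inspect read-only from the shell; this sketch assumes taskset (util-linux) is installed and leaves the actual pinning command commented out:

```shell
# Show which cores the current shell may run on:
taskset -cp $$
# Example output: pid 4321's current affinity list: 0-7
# To pin this shell to cores 0 and 1, you would run (commented out on purpose):
# taskset -cp 0,1 $$
```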
NUMA locality - Why memory proximity matters:
- Modern servers have multiple NUMA nodes (CPU+RAM pairs)
- Accessing local RAM (same NUMA node) is 2-3x faster than remote RAM
- Scheduler tries to keep process on same NUMA node as its memory
Example:
NUMA node 0: CPU 0-7, RAM 0-64GB
NUMA node 1: CPU 8-15, RAM 64-128GB
Process on CPU 0 accessing RAM at 10GB: Fast (local)
Process on CPU 8 accessing RAM at 10GB: Slow (crosses NUMA boundary)
β Scheduler prefers keeping process on node 0
Pitfall 1: Ignoring load average
- High load (> num_cores) indicates CPU contention
- uptime shows 1/5/15-min averages; the trend matters more than the absolute value
- Fix: Use pidstat and perf to identify CPU-bound processes
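A minimal contention check that reads /proc/loadavg directly (Linux /proc and coreutils assumed; POSIX sh can't compare floats, so the fractional part is stripped for a rough comparison):

```shell
# Flag possible CPU contention when the 1-minute load meets or exceeds core count.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
if [ "${load1%%.*}" -ge "$cores" ]; then
  echo "possible CPU contention: load $load1 on $cores cores"
else
  echo "load $load1 is fine for $cores cores"
fi
```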
Pitfall 2: Misusing real-time priority
- RT tasks bypass CFS and can starve other processes
- Setting chrt -f 99 command can hang the system if the command never yields
- Fix: Reserve RT for genuinely critical, bounded-time work; advanced users can use SCHED_DEADLINE
Key metrics/tools:
uptime # Load average
top -d 1 -p PID # Per-process CPU% (user/system time)
ps -eo pid,comm,class # Scheduling class (TS=SCHED_OTHER, FF=FIFO, RR=round-robin)
perf stat command # IPC, cache misses, context switches
systemd-analyze plot # Boot parallelization
2. Memory Management (RAM, Swap, Cache)
What it is:
- Provides virtual address spaces to each process (MMU translates to physical RAM)
- Page cache: OS caches file data in RAM for speed; pages evicted when RAM needed
- Swap: Moves inactive pages to disk (slow fallback when RAM full)
- THP (Transparent Hugepages): Automatically uses 2MB pages instead of 4KB to reduce TLB misses
Detailed explanation:
Virtual memory - Why every process thinks it owns all RAM:
What it is: Each process sees its own private address space (like 0x00000000 to 0xFFFFFFFF on 32-bit). The MMU (Memory Management Unit, hardware) translates these “virtual” addresses to actual physical RAM locations.
Why it matters:
Process A reads from address 0x1000 → MMU translates to physical RAM 0x500000
Process B reads from address 0x1000 → MMU translates to physical RAM 0x700000
Same virtual address, different physical RAM = processes isolated from each other
Real-world benefit: Process crashes can’t corrupt other processes’ memory. Container at virtual 0x1000 can’t read host memory.
Page cache - Why your second ls is instant:
What it is: When you read a file, Linux copies it into RAM. Next time you read, Linux serves from RAM (instant) instead of disk (milliseconds).
Example:
# First read: 500ms (from disk)
time cat /var/log/syslog > /dev/null
# real 0m0.500s
# Second read: 5ms (from page cache)
time cat /var/log/syslog > /dev/null
# real 0m0.005s   ← 100x faster!
How it works:
- File reads → Linux copies disk blocks into RAM pages
- RAM pages marked as “page cache”
- When RAM needed for applications, kernel evicts least-recently-used cache pages
- Modified pages (writes) must be flushed to disk first (dirty pages)
This is why free -h shows "used" RAM as high: Linux uses "free" RAM for caching. It's not wasted; it's optimized.
$ free -h
total used free shared buff/cache
Mem: 31Gi 5.0Gi 1.0Gi 100Mi 25Gi
25GB in "buff/cache" = page cache
If application needs RAM, kernel evicts cache automatically
This is GOOD, not bad
Swap - The emergency pressure valve:
What it is: When RAM is full, Linux moves inactive memory pages to disk (swap partition or swap file). This frees RAM for active processes.
Why it’s slow:
RAM access: ~100 nanoseconds
Disk access: ~10 milliseconds
→ Swap is ~100,000x slower than RAM
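The ratio is just the two latencies above divided out:

```shell
# Assumed figures from above: RAM ~100ns, disk ~10ms.
ram_ns=100
disk_ns=$((10 * 1000 * 1000))   # 10ms expressed in nanoseconds
echo "swap is $((disk_ns / ram_ns))x slower than RAM"
# prints: swap is 100000x slower than RAM
```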
When swap is good:
- Inactive background process (like an old SSH session) swapped out → active database gets more RAM
- Temporary RAM spike → swap absorbs it, prevents an OOM kill
When swap is bad:
- Active process swapping in/out repeatedly (thrashing)
- Example: Database with an 8GB working set but only 4GB RAM → constant swap I/O → queries take 100x longer
How to detect thrashing:
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 500000 10000 20000 30000 5000 5000 100 200 500 1000 10 20 50 20 0
si=5000, so=5000 (swap in/out) = THRASHING
→ Application using RAM faster than available
→ Fix: Add RAM or reduce the working set
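vmstat's si/so columns come from cumulative counters in /proc/vmstat (Linux-specific), which you can read directly; steadily growing values between samples mean active swapping:

```shell
# Cumulative pages swapped in/out since boot; vmstat reports the per-second delta.
grep -E '^pswp(in|out) ' /proc/vmstat
```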
Transparent Hugepages (THP) - Bigger pages, fewer lookups:
What it is: Normally Linux uses 4KB memory pages. THP uses 2MB pages automatically (512x larger).
Why bigger is sometimes better:
- CPU has a Translation Lookaside Buffer (TLB) that caches virtual→physical address mappings
- TLB has limited entries (like ~1024)
- With 4KB pages: 1024 entries × 4KB = 4MB coverage
- With 2MB pages: 1024 entries × 2MB = 2GB coverage
- Larger coverage = fewer TLB misses = faster memory access
Example benefit:
Database with 10GB dataset, sequential access:
- 4KB pages: Frequent TLB misses (10GB > 4MB TLB coverage)
- 2MB pages: Fewer TLB misses (10GB fits better in 2GB coverage)
→ 10-20% performance improvement in some workloads
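The coverage figures in the bullets are simple multiplication (the 1024-entry TLB is an assumed round number):

```shell
entries=1024
echo "4KB pages: ${entries} entries x 4KB = $((entries * 4 / 1024))MB reach"
echo "2MB pages: ${entries} entries x 2MB = $((entries * 2 / 1024))GB reach"
```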
When THP hurts:
- Fragmentation: Kernel tries to find contiguous 2MB physical RAM
- If RAM fragmented (lots of small allocations), kernel spends time defragmenting
- Defragmentation can cause 10-100ms latency spikes
- Databases (Redis, MongoDB) with strict latency SLAs disable THP
Pitfall 1: Excessive swapping kills performance
- Reading a swapped page back from disk is orders of magnitude slower than RAM access (~100,000x on spinning disk)
- High si/so (swap in/out) in vmstat indicates thrashing
- Fix: Disable swap for latency-sensitive apps (swapoff -a); monitor PSI (Pressure Stall Information)
Pitfall 2: THP backfires on databases
- Defragmentation latency spikes when THP pages get fragmented
- Some DBs (MongoDB, Redis) prefer 4KB pages for predictability
- Fix: Disable: echo never > /sys/kernel/mm/transparent_hugepage/enabled
Key metrics/tools:
free -h # Overall RAM/swap/cache
vmstat 1 5 # si/so (swap I/O), wa (I/O wait)
ps aux | sort -k6 -rn | head -10 # Top memory consumers (RSS)
cat /proc/pressure/memory # PSI: CPU, I/O, memory stall percentages
sar -B 1 5 # Page faults, THP usage
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled # THP only for apps that opt in via madvise()
3. Virtual Filesystem (VFS) & Filesystems
What it is:
- VFS: Unified interface above ext4, XFS, Btrfs, NFS (abstraction layer)
- Inode: File metadata (permissions, size, block pointers, timestamps)
- Dentry: Filename β inode mapping (cached in dcache for speed)
- Mount: Attach filesystem to directory tree; multiple filesystems can coexist
Detailed explanation:
VFS - Why cat works on ext4, XFS, NFS, and even /proc:
What it is: VFS is an abstraction layer. Applications call open("/path/to/file"), and VFS translates that into the appropriate filesystem-specific operation.
Why it matters:
Application: open("/etc/passwd")
    ↓
VFS: "Which filesystem owns /etc?"
    ↓
VFS: "ext4 filesystem on /dev/sda1"
    ↓
VFS calls: ext4_open()
    ↓
ext4 driver reads the inode, returns a file descriptor
Same for all filesystems:
open("/mnt/nfs/file")  → VFS → nfs_open()           → Network request to NFS server
open("/proc/cpuinfo")  → VFS → proc_open()          → Kernel generates CPU info on the fly
open("/dev/sda")       → VFS → block_device_open()  → Direct disk access
The application doesn't care; VFS handles it.
Inode - The “real” file (not the filename):
What it is: An inode is a data structure that stores everything about a file EXCEPT its name:
- File type (regular, directory, symlink)
- Permissions (rwxr-xr-x)
- Owner (UID/GID)
- Size (in bytes)
- Timestamps (accessed, modified, inode changed)
- Pointers to data blocks on disk
Filename is separate: Directory entries (dentries) map names β inode numbers.
Example:
$ ls -li /etc/passwd
12345678 -rw-r--r-- 1 root root 2048 Oct 16 12:00 /etc/passwd
↑
Inode number
$ stat /etc/passwd
File: /etc/passwd
Size: 2048 Blocks: 8
Inode: 12345678 Links: 1
Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root)
Hard links - Multiple names, one inode:
$ ln /etc/passwd /tmp/passwd-hardlink
$ ls -li /etc/passwd /tmp/passwd-hardlink
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /etc/passwd
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /tmp/passwd-hardlink
→ Same inode = same file, two names
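You can reproduce this safely in a throwaway directory (GNU stat assumed for the -c format flags):

```shell
dir=$(mktemp -d)
echo hello > "$dir/a"
ln "$dir/a" "$dir/b"                 # hard link: a second name for the same inode
ia=$(stat -c %i "$dir/a")
ib=$(stat -c %i "$dir/b")
[ "$ia" = "$ib" ] && echo "same inode: $ia"
echo "link count: $(stat -c %h "$dir/a")"   # 2: two names point at this inode
rm -r "$dir"
```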
Why running out of inodes breaks things:
$ touch /tmp/newfile
touch: cannot touch '/tmp/newfile': No space left on device
$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 50G 50G 50% /tmp
$ df -i /tmp
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 6500000 6500000 0 100% /tmp
→ All inodes used!
Even though 50GB free, can't create files (no inodes available)
Dentry cache (dcache) - Why ls the second time is instant:
What it is: Linux caches the mapping of filename β inode in RAM.
Example:
# First lookup: Must read directory blocks from disk
$ ls /var/log/syslog
(kernel reads /var/log directory inode, finds syslog entry, caches it)
# Second lookup: Served from dcache (RAM)
$ ls /var/log/syslog
(instant - no disk I/O)
How it helps: Applications frequently access the same files (/etc/hosts, /lib/x86_64-linux-gnu/libc.so.6). The dcache avoids repeated disk reads.
Mount - Attaching filesystems to the directory tree:
What it is: Linux has one unified directory tree starting at /. You "mount" filesystems at specific paths.
Example:
# Root filesystem (ext4 on /dev/sda1) mounted at /
# Home directories (XFS on /dev/sdb1) mounted at /home
# NFS share mounted at /mnt/shared
$ mount
/dev/sda1 on / type ext4 (rw,relatime)
/dev/sdb1 on /home type xfs (rw,noatime)
nfs-server:/export on /mnt/shared type nfs (rw,soft,timeo=30)
When you access /home/user/file:
→ VFS sees "/home" is a mount point
→ Redirects to /dev/sdb1 (XFS filesystem)
→ XFS handles the request
Why this matters: You can have different filesystems for different directories (fast SSD for /var/lib/postgresql, slow HDD for /var/log).
Pitfall 1: Running out of inodes
- Each file/dir/link = 1 inode; if df -i shows 100%, you can't create files even with free space
- Millions of small files (temp logs, session stores) exhaust inodes quickly
- Fix: tune2fs -l /dev/sda1 shows the inode count; recreate the fs with more inodes: mkfs.ext4 -N 1000000 /dev/sda1
Pitfall 2: Suboptimal mount options
- Default relatime still updates atime on reads (small I/O overhead)
- Disabling write barriers without a battery-backed cache risks corruption on power loss (the nobarrier option was removed from recent kernels)
- Fix: Use mount -o noatime,nodiratime for non-critical data; keep barriers enabled for databases
Key metrics/tools:
df -h # Disk space by filesystem
df -i # Inode usage (critical!)
mount | grep -E 'ext4|xfs' # Show mount options
lsof | head -20 # Files open by process
sync; echo 3 > /proc/sys/vm/drop_caches # Clear page cache (test)
fstrim -v /mount # Discard unused blocks (SSDs)
4. Block I/O (Disk Scheduling)
What it is:
- I/O scheduler: Orders disk requests to minimize seek time (mq-deadline, bfq, kyber, none; legacy CFQ/deadline/noop were removed with blk-mq)
- io_uring: Modern async I/O interface (replaces aio); supports polling, registered buffers, and batched submission with fewer syscalls
- Request queue: Batches I/O requests before sending to device
- Throughput vs latency: High throughput needs batching; low latency needs quick service
Pitfall 1: Wrong scheduler for your device
- Spinning disk: use mq-deadline (prioritizes reads)
- SSD/NVMe: use none (let the device schedule)
- A sorting scheduler on fast SSDs adds unnecessary latency (sorting overhead)
- Fix: Check cat /sys/block/sda/queue/scheduler; for production NVMe, none is usually safe
Pitfall 2: io_uring without proper resource limits
- io_uring registered buffers pin kernel memory; many async ops can exhaust it
- Fix: Check the locked-memory limit with ulimit -l (RLIMIT_MEMLOCK)
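Checking the limit is harmless; ulimit -l is a shell builtin, so no extra tools are assumed:

```shell
# RLIMIT_MEMLOCK caps how much memory io_uring may register/pin per user.
memlock=$(ulimit -l)
echo "locked-memory limit: $memlock"   # a KB value, or "unlimited"
```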
Key metrics/tools:
iostat -x 1 5 # %util, await (avg request time incl. queueing; svctm is deprecated)
iotop # Top processes by disk I/O
blktrace -d /dev/sda -o - | blkparse # Detailed I/O tracing
perf record -e block:block_rq_* -- command # I/O event tracing
fio --name=random-read --ioengine=libaio # Disk benchmark
5. Networking Stack (Network I/O)
What it is:
- nftables: Modern packet filtering framework (replaces iptables); rules run in an in-kernel bytecode VM
- conntrack: Tracks TCP/UDP connection state; enables stateful firewall
- qdisc (queuing discipline): Schedules outbound packets (pfifo, fq, cake, htb for traffic shaping)
- tc (traffic control): Linux traffic shaping tool; applies qdiscs, classes, filters
Pitfall 1: Conntrack table exhaustion
- Malicious/buggy clients create many short-lived connections; the conntrack table fills
- Result: new connections are dropped ("nf_conntrack: table full, dropping packet" in dmesg)
- Fix: Monitor /proc/sys/net/netfilter/nf_conntrack_count against nf_conntrack_max; increase: sysctl -w net.netfilter.nf_conntrack_max=2000000
Pitfall 2: No egress rate limiting β noisy neighbor
- One container/VM burns all bandwidth; others starve
- Fix: Apply a tc qdisc: tc qdisc add dev eth0 root tbf rate 100mbit burst 32kb latency 400ms
Key metrics/tools:
ss -tulnp # TCP/UDP sockets, listening ports
cat /proc/net/netstat # IP stats (dropped, errors)
nft list ruleset # View nftables rules
ip netns list; ip netns exec NS ss -an # Namespace inspection
ethtool -S eth0 # NIC driver stats (RX/TX drops, errors)
tc -s qdisc show dev eth0 # Queue discipline stats
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { ... }' # Trace connects
6. Namespaces (Isolation for Containers)
What it is:
- PID namespace: Process sees only its own PID tree; PID 1 in container ≠ PID 1 on host
- Network namespace: Isolated network stack (veth, lo, routing table)
- Mount namespace: Private root filesystem view; mount /dev/sda1 in container doesn’t affect host
- UTS namespace: Isolated hostname, domainname
- IPC namespace: Private semaphores, message queues, shared memory
- User namespace: UID/GID remapping (container root = unprivileged host user)
Detailed explanation:
Namespaces - The foundation of container isolation:
What it is: Namespaces make each container think it’s running on its own dedicated machine. Processes inside a container can’t see (or affect) processes in other containers or on the host.
How Docker/Kubernetes use namespaces:
Docker run creates:
1. New PID namespace β container sees PID 1 as its own init
2. New network namespace β container gets its own lo, eth0
3. New mount namespace β container sees its own /etc, /usr, /var
4. New UTS namespace β container has its own hostname
5. New IPC namespace β container's shared memory isolated
6. New user namespace → container root ≠ host root (optional)
Result: Container thinks it's a separate machine
Visual diagram of namespace isolation:
+----------------------------------------------------------------------+
| HOST SYSTEM (Real Linux Kernel)                                      |
|                                                                      |
|  +------------------------+        +------------------------+        |
|  | Container A            |        | Container B            |        |
|  | (Namespace Set #1)     |        | (Namespace Set #2)     |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | PID Namespace      | |        | | PID Namespace      | |        |
|  | |  PID 1: nginx      | |        | |  PID 1: postgres   | |        |
|  | |  PID 2: worker     | |        | |  PID 2: worker     | |        |
|  | |  (Isolated PIDs)   | |        | |  (Isolated PIDs)   | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | Net Namespace      | |        | | Net Namespace      | |        |
|  | |  eth0: 172.17.0.2  | |        | |  eth0: 172.17.0.3  | |        |
|  | |  lo:   127.0.0.1   | |        | |  lo:   127.0.0.1   | |        |
|  | |  (Own IP stack)    | |        | |  (Own IP stack)    | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | Mount Namespace    | |        | | Mount Namespace    | |        |
|  | |  /: overlay2 fs    | |        | |  /: overlay2 fs    | |        |
|  | |  /etc: container   | |        | |  /etc: container   | |        |
|  | |  /var: container   | |        | |  /var: container   | |        |
|  | |  (Own root FS)     | |        | |  (Own root FS)     | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  +------------------------+        +------------------------+        |
|                                                                      |
|  +----------------------------------------------------------------+  |
|  | ACTUAL KERNEL RESOURCES                                        |  |
|  |                                                                |  |
|  | PIDs: 1234 (nginx), 1235 (worker), 1236 (postgres)...          |  |
|  | Network: Real NICs (eth0), bridges (docker0), veth pairs       |  |
|  | Mounts: /var/lib/docker/overlay2/abc123, /var/lib/docker...    |  |
|  +----------------------------------------------------------------+  |
+----------------------------------------------------------------------+
Key insight:
- Container A's PID 1 → real kernel PID 1234
- Container B's PID 1 → real kernel PID 1236
- Containers can't see each other's PIDs, networks, or filesystems
1. PID Namespace - Process isolation:
What it is: Each PID namespace has its own process tree starting at PID 1. Processes in the namespace only see other processes in the same namespace.
Important distinction: PID Namespace vs PID (Process ID):
Many people confuse these two concepts. Let’s clarify:
PID (Process ID):
- A number assigned to a running process
- Example: nginx process has PID 1234
- Every running process has a PID
- PIDs are unique within a namespace
PID Namespace:
- An isolation mechanism (like a container)
- Groups processes together
- Each namespace has its own PID numbering starting from 1
- Same process can have different PID numbers in different namespaces
Visual comparison:
+----------------------------------------------------------------------+
| Concept: PID (Process ID)                                            |
+----------------------------------------------------------------------+
|                                                                      |
| What it is: A NUMBER assigned to a running process                   |
|                                                                      |
| Example:                                                             |
|   $ ps aux                                                           |
|   PID    USER      COMMAND                                           |
|   1234   root      nginx: master process                             |
|   1235   www       nginx: worker process                             |
|   1236   postgres  postgres -D /var/lib/postgresql                   |
|   ^-- these are PIDs (just numbers)                                  |
|                                                                      |
| Analogy: Like a house number (123 Main Street)                       |
|          The number identifies the house                             |
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
| Concept: PID Namespace                                               |
+----------------------------------------------------------------------+
|                                                                      |
| What it is: An ISOLATION CONTAINER for processes                     |
|             Each namespace has its own PID numbering                 |
|                                                                      |
| Example:                                                             |
|                                                                      |
|   +-------------------+       +-------------------+                  |
|   | PID Namespace A   |       | PID Namespace B   |                  |
|   | (Container 1)     |       | (Container 2)     |                  |
|   |                   |       |                   |                  |
|   | PID 1: nginx      |       | PID 1: postgres   |                  |
|   | PID 2: worker     |       | PID 2: worker     |                  |
|   | PID 3: bash       |       | PID 3: bash       |                  |
|   +-------------------+       +-------------------+                  |
|                                                                      |
| These are PID namespaces (containers); each has its own PID          |
| numbering (both start at 1)                                          |
|                                                                      |
| Analogy: Like different cities (New York vs Tokyo)                   |
|          Each city has its own "123 Main Street"; the street         |
|          address repeats, but the cities are different               |
+----------------------------------------------------------------------+
Real-world example showing the difference:
# HOST SYSTEM (Default PID Namespace)
$ ps aux
PID    USER      COMMAND
1      root      /sbin/init        ← Host init (real PID 1)
1234   root      nginx: master     ← nginx in container A
1235   www       nginx: worker
1236   postgres  postgres: main    ← postgres in container B
1237   postgres  postgres: worker

# CONTAINER A (New PID Namespace)
$ docker exec -it container-a ps aux
PID  USER  COMMAND
1    root  nginx: master   ← Same process as host PID 1234
2    www   nginx: worker   ← Same process as host PID 1235

# CONTAINER B (Another PID Namespace)
$ docker exec -it container-b ps aux
PID  USER      COMMAND
1    postgres  postgres: main     ← Same process as host PID 1236
2    postgres  postgres: worker   ← Same process as host PID 1237
Key insight - Same process, multiple PIDs:
            +----------------------+
            |   ACTUAL PROCESS     |
            |   (nginx master)     |
            |  Running in kernel   |
            +----------------------+
              /                  \
             /                    \
+---------------------+    +----------------------+
| How HOST sees it:   |    | How CONTAINER sees:  |
| PID 1234            |    | PID 1                |
+---------------------+    +----------------------+
Same process = different PID numbers depending on the namespace viewing it
Why this matters:
- Isolation: Container A’s PID 1 can’t kill Container B’s PID 1 (different namespaces)
- Security: Container sees PID 1-3, can’t see host PIDs 1234-1237
- Debugging: When you see “PID 1” in container logs, it’s NOT the host’s PID 1
Common confusion resolved:
✗ Wrong thinking: "My container has PID 1, that means it's the system's init process"
✓ Correct thinking: "My container has PID 1 in its namespace. On the host, it's probably PID 12345"
✗ Wrong: "I killed PID 1234 on the host, why did the container die?"
✓ Correct: "PID 1234 on the host is PID 1 in the container's namespace. Killing the container's PID 1 kills the entire container"
Summary table:
Aspect     | PID (Process ID)            | PID Namespace
-----------|-----------------------------|---------------------------------------
What it is | A number (identifier)       | An isolation container
Example    | 1234, 1235, 1236            | Container A, Container B, Host
Uniqueness | Unique within a namespace   | Each namespace is separate
Purpose    | Identify a specific process | Isolate groups of processes
Analogy    | House number (123)          | City/neighborhood (New York, Tokyo)
Created by | Kernel when process starts  | unshare(CLONE_NEWPID) or Docker
Lifetime   | Until process exits         | Until last process in namespace exits
Real-world example:
# On host
$ ps aux | grep nginx
root 1234 0.0 0.1 nginx: master process
www-data 1235 0.0 0.1 nginx: worker process
# Inside container
$ ps aux
PID USER COMMAND
1 root nginx: master process ← This is PID 1234 on the host
2 www-data nginx: worker process ← This is PID 1235 on the host
Container sees: PID 1, 2
Host sees: PID 1234, 1235
Same process, different PID numbers in different namespaces
Why PID 1 matters:
- PID 1 in Unix is special: it's the init process
- When PID 1 exits, kernel kills all processes in that namespace
- Docker container exits when PID 1 (entrypoint) exits
Diagram of PID namespace mapping:
Container PID Namespace Host PID Namespace
βββββββββββββββββββββββ ββββββββββββββββββββββββ
β β β β
β PID 1 (nginx) βββΌβββββββββΆβ PID 1234 (nginx) β
β PID 2 (worker) βββΌβββββββββΆβ PID 1235 (worker) β
β PID 3 (bash) βββΌβββββββββΆβ PID 1236 (bash) β
β β β β
β Can only see β β Can see ALL PIDs: β
β PIDs 1, 2, 3 β β 1, 1234, 1235, 1236 β
βββββββββββββββββββββββ ββββββββββββββββββββββββ
Security benefit: Container process can’t send signals (kill) to host processes because it can’t see their PIDs.
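Namespace identity is visible through magic symlinks under /proc: two processes share a namespace exactly when the links resolve to the same inode. A quick self-check (Linux /proc assumed):

```shell
# Compare the PID-namespace identity of this shell and a child process.
shell_ns=$(readlink /proc/$$/ns/pid)     # the shell itself
child_ns=$(readlink /proc/self/ns/pid)   # /proc/self = the readlink child
echo "shell: $shell_ns"
echo "child: $child_ns"
[ "$shell_ns" = "$child_ns" ] && echo "same PID namespace"
```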
2. Network Namespace - Network stack isolation:
What it is: Each network namespace has its own network interfaces, IP addresses, routing tables, firewall rules, and sockets.
How containers get networking:
Step 1: Create network namespace
  → Container gets an isolated network stack (empty)
  → No interfaces except loopback (lo)
Step 2: Create a veth pair (virtual ethernet cable)
  → One end in the container namespace (eth0)
  → Other end in the host namespace (vethXXX)
Step 3: Connect the host end to a bridge (docker0)
  → Now the container can reach the host and other containers
Step 4: Configure NAT on the host
  → Container can reach the internet (host translates addresses)
Visual diagram of container networking:
+---------------------------------------------------------------------+
| HOST                                                                |
|                                                                     |
|  +---------------------+          +---------------------+           |
|  | Container A         |          | Container B         |           |
|  | Network Namespace   |          | Network Namespace   |           |
|  |                     |          |                     |           |
|  | eth0: 172.17.0.2    |          | eth0: 172.17.0.3    |           |
|  | gateway: 172.17.0.1 |          | gateway: 172.17.0.1 |           |
|  +----------+----------+          +----------+----------+           |
|             |                                |                      |
|             | veth pair                      | veth pair            |
|             | (virtual cable)                | (virtual cable)      |
|             |                                |                      |
|  +----------+--------------------------------+----------+           |
|  |          docker0 bridge (172.17.0.1)                 |           |
|  |     (Virtual switch connecting containers)           |           |
|  +--------------------------+---------------------------+           |
|                             |                                       |
|                             | NAT (iptables)                        |
|                             |                                       |
|                      +------+------+                                |
|                      | eth0        |  (Host physical NIC)           |
|                      | Public IP   |                                |
|                      +------+------+                                |
|                             |                                       |
+-----------------------------+---------------------------------------+
                              |
                              v
                          Internet
Data flow example:
Container A (172.17.0.2) → docker0 bridge → NAT (source IP changed to host IP) → eth0 → Internet
Real-world example:
# On host
$ ip addr show docker0
docker0: <BROADCAST,MULTICAST,UP>
inet 172.17.0.1/16 scope global docker0
$ ip addr show veth1a2b3c
veth1a2b3c: <BROADCAST,MULTICAST,UP>
(connected to container A's eth0)
# In container
$ ip addr show eth0
eth0: <BROADCAST,MULTICAST,UP>
inet 172.17.0.2/16 scope global eth0
(connected to host's veth1a2b3c)
$ ip route
default via 172.17.0.1 dev eth0
(Container routes all traffic through docker0)
Security benefit: Container can’t sniff packets from other containers. Each has isolated network stack.
3. Mount Namespace - Filesystem isolation:
What it is: Each mount namespace sees its own filesystem tree. Mounting a filesystem in one namespace doesn’t affect others.
Real-world example:
# On host
$ mount | grep /var/lib/docker
overlay on /var/lib/docker/overlay2/abc123/merged type overlay
# In container
$ mount
overlay on / type overlay (rw,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k)
proc on /proc type proc (rw,nosuid,nodev,noexec)
Container sees "/" as root
Host sees it as /var/lib/docker/overlay2/abc123/merged
Diagram of mount namespace:
Container Mount Namespace      Host Mount Namespace
+----------------------+       +------------------------------------+
| /                    |       | /                                  |
|  +- /etc             |       |  +- /etc  (real host /etc)         |
|  +- /usr             |       |  +- /usr                           |
|  +- /var             |       |  +- /var                           |
|  +- /app             |       |  +- /var/lib/docker/overlay2/      |
|  (container FS)      |       |      +- abc123/merged/  <- A root  |
|                      |       |      +- def456/merged/  <- B root  |
+----------------------+       +------------------------------------+

When the container reads /etc/passwd:
→ Container sees /etc/passwd (in its namespace)
→ Kernel translates it to /var/lib/docker/overlay2/abc123/merged/etc/passwd
→ Container can't access the host's real /etc/passwd
Security benefit: The container can't modify the host filesystem. Even if it writes to /etc, it's writing to its own overlay, not the host's /etc.
4. UTS Namespace - Hostname isolation:
What it is: Each UTS namespace can have its own hostname and domain name.
Real-world example:
# On host
$ hostname
production-server-01
# In container A
$ hostname
web-container
# In container B
$ hostname
db-container
Each container has its own hostname
Host hostname unchanged
Why it matters: Applications that log hostnames (like distributed systems) can identify which container logged what, even when multiple containers run same image.
5. IPC Namespace - Shared memory isolation:
What it is: Each IPC namespace has isolated System V IPC objects (shared memory segments, semaphores, message queues).
Real-world example:
# On host
$ ipcs -m
Shared Memory Segments
key shmid owner bytes
0x00000000 32768 postgres 16777216
# In container
$ ipcs -m
Shared Memory Segments
key shmid owner bytes
0x00000000 65536 postgres 8388608
Different shared memory segments
Container can't access host's shared memory
Why it matters: Prevents containers from using shared memory to communicate (potential side channel for attacks).
6. User Namespace - UID/GID remapping (most complex):
What it is: Maps user IDs inside container to different IDs on host. Container root (UID 0) can be mapped to unprivileged user on host (UID 100000).
Real-world example:
# In container (user namespace enabled)
$ id
uid=0(root) gid=0(root) groups=0(root)
$ whoami
root
$ touch /tmp/file
$ ls -la /tmp/file
-rw-r--r-- 1 root root 0 Oct 17 12:00 /tmp/file
# On host (looking at container's file)
$ ls -la /var/lib/docker/.../merged/tmp/file
-rw-r--r-- 1 100000 100000 0 Oct 17 12:00 file
             ↑
Container UID 0 → Host UID 100000
Diagram of UID mapping:
Container User Namespace Host User Namespace
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β β β β
β UID 0 (root) βββΌββββββββΆβ UID 100000 β
β UID 1 (daemon) βββΌββββββββΆβ UID 100001 β
β UID 1000 (user) βββΌββββββββΆβ UID 101000 β
β β β β
β Container thinks β β Host sees container β
β process is root β β as unprivileged β
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
Mapping configured in /etc/subuid and /etc/subgid
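The active mapping for a running process can be inspected directly; this is a sketch reading `/proc/self/uid_map` (the remapped-container line in the comments is illustrative, not output from a real host):

```shell
# Each line: <first UID inside the ns> <first UID outside> <count>
cat /proc/self/uid_map
# In the initial user namespace this is the identity map:
#          0          0 4294967295
# In a remapped container it would look something like:
#          0     100000      65536
```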
Security benefit (huge):
Without user namespaces:
Container root (UID 0) = Host root (UID 0)
If container escapes, attacker is real root
Game over
With user namespaces:
Container root (UID 0) → Host UID 100000 (unprivileged)
If container escapes, attacker has UID 100000 (can't sudo, can't read /etc/shadow)
Much safer
Why Docker doesn’t enable it by default: Compatibility issues with volume mounts (file ownership gets confusing).
Pitfall 1: Leaking namespaces via socket files
- If a process dies but a socket inside its netns remains, the netns persists (hidden)
- Fix: `ip netns list` shows active namespaces; delete with `ip netns delete NETNS`
Pitfall 2: User namespace security misconfiguration
- Mapping container root (UID 0) to a host unprivileged user is complex
- Misconfiguration allows container escape
- Fix: Use `podman`, which enables user namespaces by default (unlike Docker)
Key metrics/tools:
lsns -t pid # List PID namespaces
ip netns list # List network namespaces
nsenter -t PID -a /bin/bash # Enter all namespaces of PID
unshare -p -f /bin/bash # Create new PID namespace
ls -l /proc/PID/ns/          # Inode numbers (same inode = same namespace)
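The "same inode = same namespace" rule from the table above can be demonstrated without root; this sketch compares the UTS namespace links of two processes:

```shell
# /proc/PID/ns/* are magic symlinks like uts:[4026531838];
# identical link targets mean identical namespaces.
a=$(readlink "/proc/$$/ns/uts")     # the current shell
b=$(readlink /proc/self/ns/uts)     # readlink's own process (same ns)
if [ "$a" = "$b" ]; then
    echo "same UTS namespace: $a"
else
    echo "different UTS namespaces: $a vs $b"
fi
```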
7. cgroups v2 (Resource Limits & Accounting)
What it is:
- cgroup: Control group; limits/accounts CPU, memory, I/O, pids, network for a process group
- v2 unified hierarchy: Single tree (v1 had a separate hierarchy per controller)
- CPU limit: `cpu.max=50000 100000` = 50% of one core
- Memory limit: `memory.max=1G` enforces a hard limit (OOM killer if exceeded)
- PSI (Pressure Stall Information): Metrics on resource contention (CPU throttling, memory pressure, I/O wait)
Detailed explanation:
cgroups - How containers limit resources (and don’t crash the host):
What it is: A cgroup is a way to group processes and apply resource limits to the group as a whole.
Why it matters: Without cgroups, a runaway container process can use 100% CPU and starve all other processes on the host. With cgroups, you can say “this container gets max 2 CPU cores and 4GB RAM, period.”
Real-world example:
Host has 8 CPU cores, 32GB RAM
Container A (web server):
- cgroup limits: 2 CPU cores, 4GB RAM
- Actual usage: 1.5 cores, 2GB RAM → OK
Container B (batch job goes wild):
- cgroup limits: 2 CPU cores, 4GB RAM
- Tries to use: 8 cores, 10GB RAM
- cgroup enforces: Only gets 2 cores max, throttled
- Memory: Process killed (OOM) when exceeds 4GB
Host remains responsive: Containers can't steal unlimited resources
CPU limits - The two numbers explained:
Format: cpu.max=50000 100000
What it means:
- First number (50000): CPU quota in microseconds per period
- Second number (100000): Period length in microseconds
Math:
50000 microseconds / 100000 microseconds = 0.5 = 50% of one CPU core
100000 microseconds = 0.1 seconds = period resets 10 times per second
In each 0.1 second period:
- cgroup can use max 50000 microseconds (0.05 seconds) of CPU
- After using 50000 µs, process is throttled until next period
More examples:
cpu.max=200000 100000 = 200% = 2 full CPU cores
cpu.max=400000 100000 = 400% = 4 full CPU cores
cpu.max=10000 100000 = 10% = 0.1 CPU cores
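The quota/period arithmetic above can be wrapped in a tiny helper (a sketch; `cpu_pct` is a made-up name, not a real tool):

```shell
# cpu_pct QUOTA PERIOD -> percent of one core the cgroup may consume
cpu_pct() {
    awk -v q="$1" -v p="$2" 'BEGIN { printf "%g%%\n", q / p * 100 }'
}
cpu_pct 50000 100000    # -> 50%
cpu_pct 200000 100000   # -> 200% (two full cores)
```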
What “throttled” means:
Process runs for 50ms → reaches quota → scheduler stops running it
Wait 50ms (rest of period) → new period starts → process can run again
Result: Bursty performance (run, pause, run, pause...)
Check: cat /sys/fs/cgroup/cpu.stat
nr_throttled: 1234 → Number of times throttled
throttled_usec: 5000000 → Microseconds spent throttled
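Those two counters are more useful as a ratio; this sketch parses a sample cgroup v2 `cpu.stat` from a heredoc (point it at the real file under `/sys/fs/cgroup/` in practice):

```shell
# Fraction of scheduling periods in which the group hit its quota.
awk '
    /^nr_periods/   { periods   = $2 }
    /^nr_throttled/ { throttled = $2 }
    END { if (periods) printf "throttled in %.1f%% of periods\n", throttled / periods * 100 }
' <<'EOF'
usage_usec 1200000
nr_periods 1000
nr_throttled 250
throttled_usec 5000000
EOF
# -> throttled in 25.0% of periods
```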
Memory limits - Hard vs soft:
memory.max (hard limit):
memory.max=1G
Process allocates memory:
500MB → OK
800MB → OK
1.1GB → OOM KILL (exceeds limit)
Kernel reclaims what it can, then OOM-kills when the limit still can't be met
No warning to the application, no grace period
memory.high (soft limit):
memory.high=1G
memory.max=2G
Process allocates memory:
500MB → OK
1.1GB → Kernel starts aggressive reclamation (swapping, cache eviction)
      → Process slows down but is not killed
2.1GB → OOM KILL (exceeded hard limit)
Think of memory.high as the "warning track" before the hard limit
Why use soft limits:
- Gives application chance to release memory (GC, cache flush)
- Avoids abrupt crashes
- Production pattern:
memory.high = 0.8 × memory.max
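In shell arithmetic the pattern looks like this (a sketch; the cgroup path `/sys/fs/cgroup/myapp/` is hypothetical, and the writes require root, so they are left commented out):

```shell
MAX=$((4 * 1024 * 1024 * 1024))   # memory.max: 4 GiB hard limit
HIGH=$((MAX * 8 / 10))            # memory.high: 80% of the hard limit
echo "memory.max=$MAX memory.high=$HIGH"
# Apply as root (path is illustrative):
# echo "$MAX"  > /sys/fs/cgroup/myapp/memory.max
# echo "$HIGH" > /sys/fs/cgroup/myapp/memory.high
```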
PSI (Pressure Stall Information) - Early warning system:
What it is: PSI tells you when processes are waiting for resources.
Example:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
$ cat /proc/pressure/memory
some avg10=15.50 avg60=8.23 avg300=3.45 total=125000
full avg10=5.00 avg60=2.10 avg300=0.80 total=50000
Interpretation:
some avg10=15.50 → for 15.5% of the last 10 seconds, some process waited for memory
full avg10=5.00 → for 5% of the last 10 seconds, ALL processes stalled (thrashing)
Action:
some > 10% → memory pressure building, consider adding RAM
full > 1% → critical, processes constantly stalled
Why this is better than traditional metrics:
Traditional: free -h shows 100MB free → "Is that bad?"
PSI: avg10=0.00 → "No processes waiting, system fine despite low free RAM"
Traditional: free -h shows 10GB free → "Looks OK"
PSI: avg10=20.00 → "Processes waiting 20% of the time, something is wrong!"
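A PSI threshold check is easy to script; this sketch parses the `some` line (sample data inlined so it runs anywhere; read `/proc/pressure/memory` instead on a real host):

```shell
# Split on '=' and spaces: $1=some, $2=avg10, $3=<value>, ...
awk -F'[= ]' '
    $1 == "some" { avg10 = $3 }
    END {
        if (avg10 > 10) printf "WARN: memory pressure avg10=%s%%\n", avg10
        else            printf "OK: avg10=%s%%\n", avg10
    }
' <<'EOF'
some avg10=15.50 avg60=8.23 avg300=3.45 total=125000
full avg10=5.00 avg60=2.10 avg300=0.80 total=50000
EOF
# -> WARN: memory pressure avg10=15.50%
```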
Unified hierarchy (v2) vs v1:
cgroups v1 problem: Each controller (CPU, memory, I/O) had a separate hierarchy. You could have:
/sys/fs/cgroup/cpu/container1/
/sys/fs/cgroup/memory/container2/
A process in container1 for CPU but container2 for memory → confusing!
cgroups v2 solution: Single hierarchy:
/sys/fs/cgroup/container1/
├── cpu.max
├── memory.max
├── io.max
└── pids.max
All controllers unified; a process belongs to exactly one cgroup
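Which version a host runs can be detected from the filesystem type mounted at `/sys/fs/cgroup` (a sketch; `cgroup2fs` indicates pure v2, `tmpfs` indicates v1 or a hybrid layout):

```shell
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
case "$fstype" in
    cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroup v1 or hybrid" ;;
    *)         echo "cgroup filesystem not mounted?" ;;
esac
```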
Pitfall 1: Memory limit too aggressive
- Setting `memory.max=1G` for a 2GB app = guaranteed OOM kill
- Fix: Set `memory.high` as a soft limit (kernel reclaims first); use `memory.max` as the absolute last resort
Pitfall 2: Ignoring the swap limit
- If the cgroup allows swap, the process can spill to disk (slow) and the memory limit no longer caps total usage
- Fix: In production, set `memory.swap.max=0` (cgroup v2) to forbid swap for the group
Key metrics/tools:
mount -t cgroup2 # Verify cgroup v2 mounted
cat /proc/PID/cgroup # Show PID's cgroup
echo "+memory +cpu +io" > /sys/fs/cgroup/cgroup.subtree_control # Enable controllers
cat /sys/fs/cgroup/GROUP/memory.stat # Memory stats (memory.events shows OOM kills)
docker inspect CONTAINER | grep -i memory # Container limits
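A process can also find its own cgroup without root; on a v2 host `/proc/self/cgroup` contains a single `0::` entry (example paths below are illustrative):

```shell
cat /proc/self/cgroup
# v2 example:  0::/user.slice/user-1000.slice/session-1.scope
# v1 example:  4:memory:/some/group   (one line per controller)
```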
8. Linux Security Modules (LSM)
What it is:
- SELinux: Type enforcement (TE); security contexts on files, processes, ports
- AppArmor: Path-based access control (simpler than SELinux)
- Capabilities: Fine-grained privileges; e.g., `CAP_NET_ADMIN` for network config
- seccomp: Syscall filtering; restricts which syscalls a process can invoke
Pitfall 1: SELinux in enforcing mode causes mysterious failures
- Default policies may block legitimate app behavior
- Error: “Permission denied” but no clear root cause
- Fix: Check denials with `audit2why`; use `audit2allow` to generate rules; test in `permissive` mode first
Pitfall 2: Overpermissive capabilities
- Containers run with unnecessary capabilities (e.g., `CAP_SYS_ADMIN` = almost root)
- Fix: Use `docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE` for a strict baseline
Key metrics/tools:
getenforce # SELinux status (Disabled/Permissive/Enforcing)
setenforce 0 # Switch to permissive (temp)
tail -f /var/log/audit/audit.log | grep denied # Live denial logging
apparmor_status # AppArmor profiles
getcap /path/to/binary # Show binary capabilities
setcap cap_net_bind_service=+ep /bin/myapp # Grant capability
grep Seccomp /proc/PID/status # Show seccomp mode (0=off, 1=strict, 2=filter)
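Capability sets for the current process are likewise visible without extra tooling; this sketch reads them from `/proc/self/status` (the hex masks can be decoded with `capsh --decode` where libcap is installed):

```shell
# CapEff = effective set; all zeros means fully unprivileged.
grep '^Cap' /proc/self/status
# Example output (values are hex bitmasks, host-dependent):
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
```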
9. eBPF (In-Kernel Observability & Enforcement)
What it is:
- eBPF: Sandboxed VM in kernel; run custom programs without module recompilation
- BCC (BPF Compiler Collection): Python wrapper; easy-to-use eBPF tools
- bpftrace: High-level eBPF language; one-liners for tracing
- XDP (eXpress Data Path): Attach eBPF to NIC driver; ultra-low-latency packet processing
- Tracing: Hook syscalls, kernel functions, events; track execution flow
Pitfall 1: eBPF programs have kernel resource limits
- Large eBPF program β verifier rejection (“stack size too large”)
- Maps too large β OOM or allocation failure
- Fix: Use BCC/bpftrace for small, focused tracing; production tools (Cilium) handle complexity
Pitfall 2: Unbounded eBPF loops are rejected
- The verifier rejects loops it can't prove terminate; on older kernels the workaround is `#pragma unroll` (fixed count only)
- Fix: On kernels ≥5.3, use loops with a bounded iteration count; otherwise unroll
Key metrics/tools:
bpftool prog show # List loaded eBPF programs
bpftool map dump name MAPNAME # Dump map contents
bpftrace -l # List available tracepoints
bpftrace -e 'tracepoint:syscalls:sys_enter_open* { printf("%s\n", str(args->filename)); }' -c "ls /" # Trace file opens
sudo /usr/share/bcc/tools/opensnoop # Trace open() syscalls
sudo /usr/share/bcc/tools/biotop # Top processes by block I/O
Quick Troubleshooting Matrix
| Symptom | Suspect Subsystem | First Check | Quick Fix |
|---|---|---|---|
| High load but low CPU% | Scheduler OR I/O wait | `iostat -x 1`, `vmstat` (wa column) | Check disk/network; run `perf record` |
| OOM kills despite free RAM | Memory / cgroups | `cat /proc/pressure/memory`, cgroup v2 limits | Increase `memory.max` or add swap |
| Can't create files | VFS / inodes | `df -i` shows 100% | Clean up files or recreate fs with more inodes |
| Slow disk random I/O | Block I/O scheduler | `cat /sys/block/sda/queue/scheduler` | Set to `none` or `mq-deadline` on SSDs |
| Connection refused (random ports) | Networking / conntrack | `cat /proc/net/stat/nf_conntrack` | Increase `nf_conntrack_max` |
| Container can't reach network | Namespaces | `ip netns list`, `ip route show` | Check veth pair, routes in netns |
| Permission denied (no obvious reason) | LSM (SELinux/AppArmor) | `getenforce`, `setenforce 0` | Temporarily set permissive, check /var/log/audit/audit.log |
| Can't trace with eBPF | eBPF / kernel version | `uname -r` (need ≥4.9), `bpftool prog show` | Upgrade kernel or use ftrace as fallback |
One-Page Cheat: Critical Limits
# Memory
sysctl vm.swappiness=10 # Reduce swapping
sysctl vm.watermark_scale_factor=10 # Kswapd reclaim threshold
sysctl vm.max_map_count=262144 # Max memory maps per process
# Networking
sysctl net.core.somaxconn=65535 # Backlog for listening sockets
sysctl net.ipv4.tcp_max_syn_backlog=65535 # SYN backlog (DDoS mitigation)
sysctl net.netfilter.nf_conntrack_max=2000000 # Conntrack table
# File descriptors
ulimit -n 65535 # Per-process FD limit
# Kernel debug
echo 1 > /proc/sys/kernel/perf_event_paranoid # Relax perf restrictions for unprivileged users (-1 = none)
# cgroups
echo 262144 > /sys/fs/cgroup/GROUP/pids.max # Max PIDs in a cgroup (GROUP = cgroup path)
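Before applying any of these, it helps to record the current values; they are all readable without root through `/proc` (a sketch covering a small selection of the tunables above):

```shell
for f in /proc/sys/vm/swappiness \
         /proc/sys/net/core/somaxconn \
         /proc/sys/fs/file-max; do
    [ -r "$f" ] && printf '%-32s %s\n' "$f" "$(cat "$f")"
done
ulimit -n    # current per-process FD limit
```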