Overview
This is a one-page cheat sheet for Linux kernel subsystems. Each subsystem controls a critical resource; understanding them is essential for troubleshooting, optimization, and security.
Why understanding subsystems matters:
Imagine your server is slow. Without subsystem knowledge, you’re guessing:
- “Maybe add more RAM?” (might be CPU scheduler issue)
- “Maybe faster disk?” (might be memory cache problem)
- “Maybe more CPU?” (might be I/O scheduler misconfiguration)
With subsystem knowledge, you diagnose systematically:
Symptom: Application slow
    ↓
Check: top shows 80% CPU "wait" (wa)
    ↓
Diagnosis: NOT a CPU problem → I/O wait means disk subsystem
    ↓
Check: iostat shows %util=100% on /dev/sda
    ↓
Diagnosis: Block I/O subsystem bottleneck
    ↓
Fix: Check I/O scheduler, investigate slow queries, add SSD
What this guide covers:
Each of the 9 subsystems below answers one question:
- Scheduler: How does Linux decide which process runs next?
- Memory: How does Linux manage RAM, cache, and swap?
- VFS: How does Linux present files to applications?
- Block I/O: How does Linux talk to disks?
- Networking: How does Linux send/receive packets?
- Namespaces: How does Linux isolate containers?
- cgroups: How does Linux limit resource usage?
- LSM: How does Linux enforce security policies?
- eBPF: How does Linux enable custom kernel-level observability?
How to use this guide:
- Learning: Read each subsystem to understand what it does
- Troubleshooting: Jump to subsystem matching your symptom (use the table at bottom)
- Reference: Copy-paste commands when investigating issues
1. Process Scheduler (CPU)
What it is:
- Fairly distributes CPU time among processes using CFS (Completely Fair Scheduler)
- Tracks virtual runtime (vruntime); the process with the lowest vruntime runs next
- Supports real-time (RT) scheduling for critical tasks (separate scheduler)
- Balances load across CPU cores while respecting NUMA locality
Detailed explanation:
CFS (Completely Fair Scheduler) - The “fairness” guarantee:
Think of CFS like a teacher distributing speaking time among students. The student who has spoken the least gets to speak next.
How it works:
- Each process has a vruntime (virtual runtime) counter that tracks total CPU time used
- When the CPU becomes available, CFS picks the process with the LOWEST vruntime
- Process runs for a time slice (~4-6ms by default), then vruntime increases
- Process goes back to queue, next lowest vruntime runs
Example:
Process A: vruntime = 100ms (ran a lot)
Process B: vruntime = 50ms (ran less)
Process C: vruntime = 75ms (middle)
→ CFS picks Process B (lowest = 50ms)
→ B runs for 5ms → vruntime becomes 55ms
→ CFS picks B again (still lowest)
→ B runs for 5ms → vruntime becomes 60ms
→ CFS picks B again... until B's vruntime catches up to the others
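The selection rule above can be sketched as a toy shell loop (an illustration of pick-lowest-vruntime, not kernel code; the 5ms slice is the assumed default from above):

```shell
# Toy CFS model: always run the task with the lowest vruntime,
# then charge it a 5ms slice. Values match the A/B/C example above.
A=100; B=50; C=75
for tick in 1 2 3; do
  min=$A; next=A
  [ "$B" -lt "$min" ] && { min=$B; next=B; }
  [ "$C" -lt "$min" ] && { min=$C; next=C; }
  echo "tick $tick: run $next (vruntime ${min}ms)"
  case $next in A) A=$((A+5));; B) B=$((B+5));; C) C=$((C+5));; esac
done
# B is picked all three times (50 -> 55 -> 60), exactly as traced above.
```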
Why this matters: CPU-intensive processes don’t starve I/O-bound processes. A background video encoder (high vruntime) won’t block your SSH session (low vruntime).
Real-time scheduling - When “fairness” isn’t enough:
What it is: Some tasks can't wait: audio processing must happen within 10ms or you hear glitches. The RT scheduler bypasses CFS entirely.
RT scheduling classes:
- SCHED_FIFO (First In First Out): Process runs until it yields or higher-priority RT process arrives. No time slicing.
- SCHED_RR (Round Robin): Like FIFO but with time slicing among same-priority RT processes.
- SCHED_DEADLINE: Advanced: specify the CPU time needed plus a deadline; the kernel schedules to meet the deadline.
Real-world example:
# Audio processing daemon needs guaranteed CPU
sudo chrt -f 80 /usr/bin/audio-daemon
What happens:
- audio-daemon gets priority 80 (higher = more important)
- When audio-daemon wants CPU, it preempts ALL normal processes
- CFS processes wait until audio-daemon sleeps/finishes
Danger: RT process in infinite loop = system hang. All other processes starve.
Load balancing across cores - Why your 8-core CPU matters:
What it is: If you have 8 CPU cores, scheduler tries to use all 8 evenly. Otherwise core 0 might be 100% busy while core 7 idles.
How it works:
- Scheduler periodically (every ~4ms) checks if cores are imbalanced
- If core 0 has 10 processes and core 1 has 2, the scheduler migrates some from core 0 to core 1
- Respects CPU affinity (taskset binds a process to specific cores)
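Affinity is safe to inspect read-only from the shell; this sketch assumes taskset (util-linux) is installed and leaves the actual pinning command commented out:

```shell
# Show which cores the current shell may run on:
taskset -cp $$
# Example output: pid 4321's current affinity list: 0-7
# To pin this shell to cores 0 and 1, you would run (commented out on purpose):
# taskset -cp 0,1 $$
```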
NUMA locality - Why memory proximity matters:
- Modern servers have multiple NUMA nodes (CPU+RAM pairs)
- Accessing local RAM (same NUMA node) is 2-3x faster than remote RAM
- Scheduler tries to keep process on same NUMA node as its memory
Example:
NUMA node 0: CPU 0-7, RAM 0-64GB
NUMA node 1: CPU 8-15, RAM 64-128GB
Process on CPU 0 accessing RAM at 10GB: Fast (local)
Process on CPU 8 accessing RAM at 10GB: Slow (crosses NUMA boundary)
β Scheduler prefers keeping process on node 0
Pitfall 1: Ignoring load average
- High load (> num_cores) indicates CPU contention
- uptime shows 1/5/15-min averages; the trend matters more than the absolute value
- Fix: Use pidstat and perf to identify CPU-bound processes
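A minimal contention check that reads /proc/loadavg directly (Linux /proc and coreutils assumed; POSIX sh can't compare floats, so the fractional part is stripped for a rough comparison):

```shell
# Flag possible CPU contention when the 1-minute load meets or exceeds core count.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
if [ "${load1%%.*}" -ge "$cores" ]; then
  echo "possible CPU contention: load $load1 on $cores cores"
else
  echo "load $load1 is fine for $cores cores"
fi
```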
Pitfall 2: Misusing real-time priority
- RT tasks bypass CFS and can starve other processes
- Setting chrt -f 99 command can hang the system if the command never yields
- Fix: Reserve RT for genuinely critical, bounded-time work; advanced users can use SCHED_DEADLINE
Key metrics/tools:
uptime # Load average
top -d 1 -p PID # Per-process CPU% (user/system time)
ps -eo pid,comm,class # Scheduling class (TS=SCHED_OTHER, FF=FIFO, RR=round-robin)
perf stat command # IPC, cache misses, context switches
systemd-analyze plot # Boot parallelization
2. Memory Management (RAM, Swap, Cache)
What it is:
- Provides virtual address spaces to each process (MMU translates to physical RAM)
- Page cache: OS caches file data in RAM for speed; pages evicted when RAM needed
- Swap: Moves inactive pages to disk (slow fallback when RAM full)
- THP (Transparent Hugepages): Automatically uses 2MB pages instead of 4KB to reduce TLB misses
Detailed explanation:
Virtual memory - Why every process thinks it owns all RAM:
What it is: Each process sees its own private address space (like 0x00000000 to 0xFFFFFFFF on 32-bit). The MMU (Memory Management Unit, hardware) translates these “virtual” addresses to actual physical RAM locations.
Why it matters:
Process A reads from address 0x1000 → MMU translates to physical RAM 0x500000
Process B reads from address 0x1000 → MMU translates to physical RAM 0x700000
Same virtual address, different physical RAM = processes isolated from each other
Real-world benefit: Process crashes can’t corrupt other processes’ memory. Container at virtual 0x1000 can’t read host memory.
Page cache - Why your second ls is instant:
What it is: When you read a file, Linux copies it into RAM. Next time you read, Linux serves from RAM (instant) instead of disk (milliseconds).
Example:
# First read: 500ms (from disk)
time cat /var/log/syslog > /dev/null
# real 0m0.500s
# Second read: 5ms (from page cache)
time cat /var/log/syslog > /dev/null
# real 0m0.005s   ← 100x faster!
How it works:
- File reads → Linux copies disk blocks into RAM pages
- RAM pages marked as “page cache”
- When RAM needed for applications, kernel evicts least-recently-used cache pages
- Modified pages (writes) must be flushed to disk first (dirty pages)
This is why free -h shows "used" RAM as high: Linux uses "free" RAM for caching. It's not wasted; it's optimized.
$ free -h
total used free shared buff/cache
Mem: 31Gi 5.0Gi 1.0Gi 100Mi 25Gi
25GB in "buff/cache" = page cache
If application needs RAM, kernel evicts cache automatically
This is GOOD, not bad
Swap - The emergency pressure valve:
What it is: When RAM is full, Linux moves inactive memory pages to disk (swap partition or swap file). This frees RAM for active processes.
Why it’s slow:
RAM access: ~100 nanoseconds
Disk access: ~10 milliseconds
→ Swap is ~100,000x slower than RAM
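The ratio is just the two latencies above divided out:

```shell
# Assumed figures from above: RAM ~100ns, disk ~10ms.
ram_ns=100
disk_ns=$((10 * 1000 * 1000))   # 10ms expressed in nanoseconds
echo "swap is $((disk_ns / ram_ns))x slower than RAM"
# prints: swap is 100000x slower than RAM
```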
When swap is good:
- Inactive background process (like an old SSH session) swapped out → active database gets more RAM
- Temporary RAM spike → swap absorbs it, prevents an OOM kill
When swap is bad:
- Active process swapping in/out repeatedly (thrashing)
- Example: Database with an 8GB working set but only 4GB RAM → constant swap I/O → queries take 100x longer
How to detect thrashing:
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 500000 10000 20000 30000 5000 5000 100 200 500 1000 10 20 50 20 0
si=5000, so=5000 (swap in/out) = THRASHING
→ Application using RAM faster than available
→ Fix: Add RAM or reduce the working set
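vmstat's si/so columns come from cumulative counters in /proc/vmstat (Linux-specific), which you can read directly; steadily growing values between samples mean active swapping:

```shell
# Cumulative pages swapped in/out since boot; vmstat reports the per-second delta.
grep -E '^pswp(in|out) ' /proc/vmstat
```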
Transparent Hugepages (THP) - Bigger pages, fewer lookups:
What it is: Normally Linux uses 4KB memory pages. THP uses 2MB pages automatically (512x larger).
Why bigger is sometimes better:
- CPU has a Translation Lookaside Buffer (TLB) that caches virtual→physical address mappings
- TLB has limited entries (like ~1024)
- With 4KB pages: 1024 entries × 4KB = 4MB coverage
- With 2MB pages: 1024 entries × 2MB = 2GB coverage
- Larger coverage = fewer TLB misses = faster memory access
Example benefit:
Database with 10GB dataset, sequential access:
- 4KB pages: Frequent TLB misses (10GB > 4MB TLB coverage)
- 2MB pages: Fewer TLB misses (10GB fits better in 2GB coverage)
→ 10-20% performance improvement in some workloads
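The coverage figures in the bullets are simple multiplication (the 1024-entry TLB is an assumed round number):

```shell
entries=1024
echo "4KB pages: ${entries} entries x 4KB = $((entries * 4 / 1024))MB reach"
echo "2MB pages: ${entries} entries x 2MB = $((entries * 2 / 1024))GB reach"
```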
When THP hurts:
- Fragmentation: Kernel tries to find contiguous 2MB physical RAM
- If RAM fragmented (lots of small allocations), kernel spends time defragmenting
- Defragmentation can cause 10-100ms latency spikes
- Databases (Redis, MongoDB) with strict latency SLAs disable THP
Pitfall 1: Excessive swapping kills performance
- Reading a swapped page back from disk is orders of magnitude slower than RAM access (~100,000x on spinning disk)
- High si/so (swap in/out) in vmstat indicates thrashing
- Fix: Disable swap for latency-sensitive apps (swapoff -a); monitor PSI (Pressure Stall Information)
Pitfall 2: THP backfires on databases
- Defragmentation latency spikes when THP pages get fragmented
- Some DBs (MongoDB, Redis) prefer 4KB pages for predictability
- Fix: Disable: echo never > /sys/kernel/mm/transparent_hugepage/enabled
Key metrics/tools:
free -h # Overall RAM/swap/cache
vmstat 1 5 # si/so (swap I/O), wa (I/O wait)
ps aux | sort -k6 -rn | head -10 # Top memory consumers (RSS)
cat /proc/pressure/memory # PSI: CPU, I/O, memory stall percentages
sar -B 1 5 # Page faults, THP usage
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled # THP only for apps that opt in via madvise()
3. Virtual Filesystem (VFS) & Filesystems
What it is:
- VFS: Unified interface above ext4, XFS, Btrfs, NFS (abstraction layer)
- Inode: File metadata (permissions, size, block pointers, timestamps)
- Dentry: Filename β inode mapping (cached in dcache for speed)
- Mount: Attach filesystem to directory tree; multiple filesystems can coexist
Detailed explanation:
VFS - Why cat works on ext4, XFS, NFS, and even /proc:
What it is: VFS is an abstraction layer. Applications call open("/path/to/file"), and VFS translates that into the appropriate filesystem-specific operation.
Why it matters:
Application: open("/etc/passwd")
    ↓
VFS: "Which filesystem owns /etc?"
    ↓
VFS: "ext4 filesystem on /dev/sda1"
    ↓
VFS calls: ext4_open()
    ↓
ext4 driver reads the inode, returns a file descriptor
Same for all filesystems:
open("/mnt/nfs/file")  → VFS → nfs_open()           → Network request to NFS server
open("/proc/cpuinfo")  → VFS → proc_open()          → Kernel generates CPU info on the fly
open("/dev/sda")       → VFS → block_device_open()  → Direct disk access
The application doesn't care; VFS handles it.
Inode - The “real” file (not the filename):
What it is: An inode is a data structure that stores everything about a file EXCEPT its name:
- File type (regular, directory, symlink)
- Permissions (rwxr-xr-x)
- Owner (UID/GID)
- Size (in bytes)
- Timestamps (accessed, modified, inode changed)
- Pointers to data blocks on disk
Filename is separate: Directory entries (dentries) map names β inode numbers.
Example:
$ ls -li /etc/passwd
12345678 -rw-r--r-- 1 root root 2048 Oct 16 12:00 /etc/passwd
↑
Inode number
$ stat /etc/passwd
File: /etc/passwd
Size: 2048 Blocks: 8
Inode: 12345678 Links: 1
Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root)
Hard links - Multiple names, one inode:
$ ln /etc/passwd /tmp/passwd-hardlink
$ ls -li /etc/passwd /tmp/passwd-hardlink
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /etc/passwd
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /tmp/passwd-hardlink
→ Same inode = same file, two names
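You can reproduce this safely in a throwaway directory (GNU stat assumed for the -c format flags):

```shell
dir=$(mktemp -d)
echo hello > "$dir/a"
ln "$dir/a" "$dir/b"                 # hard link: a second name for the same inode
ia=$(stat -c %i "$dir/a")
ib=$(stat -c %i "$dir/b")
[ "$ia" = "$ib" ] && echo "same inode: $ia"
echo "link count: $(stat -c %h "$dir/a")"   # 2: two names point at this inode
rm -r "$dir"
```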
Why running out of inodes breaks things:
$ touch /tmp/newfile
touch: cannot touch '/tmp/newfile': No space left on device
$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 50G 50G 50% /tmp
$ df -i /tmp
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 6500000 6500000 0 100% /tmp
→ All inodes used!
Even though 50GB free, can't create files (no inodes available)
Dentry cache (dcache) - Why ls the second time is instant:
What it is: Linux caches the mapping of filename β inode in RAM.
Example:
# First lookup: Must read directory blocks from disk
$ ls /var/log/syslog
(kernel reads /var/log directory inode, finds syslog entry, caches it)
# Second lookup: Served from dcache (RAM)
$ ls /var/log/syslog
(instant - no disk I/O)
How it helps: Applications frequently access the same files (/etc/hosts, /lib/x86_64-linux-gnu/libc.so.6). The dcache avoids repeated disk reads.
Mount - Attaching filesystems to the directory tree:
What it is: Linux has one unified directory tree starting at /. You "mount" filesystems at specific paths.
Example:
# Root filesystem (ext4 on /dev/sda1) mounted at /
# Home directories (XFS on /dev/sdb1) mounted at /home
# NFS share mounted at /mnt/shared
$ mount
/dev/sda1 on / type ext4 (rw,relatime)
/dev/sdb1 on /home type xfs (rw,noatime)
nfs-server:/export on /mnt/shared type nfs (rw,soft,timeo=30)
When you access /home/user/file:
→ VFS sees "/home" is a mount point
→ Redirects to /dev/sdb1 (XFS filesystem)
→ XFS handles the request
Why this matters: You can have different filesystems for different directories (fast SSD for /var/lib/postgresql, slow HDD for /var/log).
Pitfall 1: Running out of inodes
- Each file/dir/link = 1 inode; if df -i shows 100%, you can't create files even with free space
- Millions of small files (temp logs, session stores) exhaust inodes quickly
- Fix: tune2fs -l /dev/sda1 shows the inode count; recreate the fs with more inodes: mkfs.ext4 -N 1000000 /dev/sda1
Pitfall 2: Suboptimal mount options
- Default relatime still updates atime on reads (small I/O overhead)
- Disabling write barriers without a battery-backed cache risks corruption on power loss (the nobarrier option was removed from recent kernels)
- Fix: Use mount -o noatime,nodiratime for non-critical data; keep barriers enabled for databases
Key metrics/tools:
df -h # Disk space by filesystem
df -i # Inode usage (critical!)
mount | grep -E 'ext4|xfs' # Show mount options
lsof | head -20 # Files open by process
sync; echo 3 > /proc/sys/vm/drop_caches # Clear page cache (test)
fstrim -v /mount # Discard unused blocks (SSDs)
4. Block I/O (Disk Scheduling)
What it is:
- I/O scheduler: Orders disk requests to minimize seek time (mq-deadline, bfq, kyber, none; legacy CFQ/deadline/noop were removed with blk-mq)
- io_uring: Modern async I/O interface (replaces aio); supports polling, registered buffers, and batched submission with fewer syscalls
- Request queue: Batches I/O requests before sending to device
- Throughput vs latency: High throughput needs batching; low latency needs quick service
Pitfall 1: Wrong scheduler for your device
- Spinning disk: use mq-deadline (prioritizes reads)
- SSD/NVMe: use none (let the device schedule)
- A sorting scheduler on fast SSDs adds unnecessary latency (sorting overhead)
- Fix: Check cat /sys/block/sda/queue/scheduler; for production NVMe, none is usually safe
Pitfall 2: io_uring without proper resource limits
- io_uring registered buffers pin kernel memory; many async ops can exhaust it
- Fix: Check the locked-memory limit with ulimit -l (RLIMIT_MEMLOCK)
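Checking the limit is harmless; ulimit -l is a shell builtin, so no extra tools are assumed:

```shell
# RLIMIT_MEMLOCK caps how much memory io_uring may register/pin per user.
memlock=$(ulimit -l)
echo "locked-memory limit: $memlock"   # a KB value, or "unlimited"
```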
Key metrics/tools:
iostat -x 1 5 # %util, await (avg request time incl. queueing; svctm is deprecated)
iotop # Top processes by disk I/O
blktrace -d /dev/sda -o - | blkparse # Detailed I/O tracing
perf record -e block:block_rq_* -- command # I/O event tracing
fio --name=random-read --ioengine=libaio # Disk benchmark
5. Networking Stack (Network I/O)
What it is:
- nftables: Modern packet filtering framework (replaces iptables); rules run in an in-kernel bytecode VM
- conntrack: Tracks TCP/UDP connection state; enables stateful firewall
- qdisc (queuing discipline): Schedules outbound packets (pfifo, fq, cake, htb for traffic shaping)
- tc (traffic control): Linux traffic shaping tool; applies qdiscs, classes, filters
Pitfall 1: Conntrack table exhaustion
- Malicious/buggy clients create many short-lived connections; the conntrack table fills
- Result: new connections are dropped ("nf_conntrack: table full, dropping packet" in dmesg)
- Fix: Monitor /proc/sys/net/netfilter/nf_conntrack_count against nf_conntrack_max; increase: sysctl -w net.netfilter.nf_conntrack_max=2000000
Pitfall 2: No egress rate limiting β noisy neighbor
- One container/VM burns all bandwidth; others starve
- Fix: Apply a tc qdisc: tc qdisc add dev eth0 root tbf rate 100mbit burst 32kb latency 400ms
Key metrics/tools:
ss -tulnp # TCP/UDP sockets, listening ports
cat /proc/net/netstat # IP stats (dropped, errors)
nft list ruleset # View nftables rules
ip netns list; ip netns exec NS ss -an # Namespace inspection
ethtool -S eth0 # NIC driver stats (RX/TX drops, errors)
tc -s qdisc show dev eth0 # Queue discipline stats
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { ... }' # Trace connects
6. Namespaces (Isolation for Containers)
What it is:
- PID namespace: Process sees only its own PID tree; PID 1 in container ≠ PID 1 on host
- Network namespace: Isolated network stack (veth, lo, routing table)
- Mount namespace: Private root filesystem view; mount /dev/sda1 in container doesn’t affect host
- UTS namespace: Isolated hostname, domainname
- IPC namespace: Private semaphores, message queues, shared memory
- User namespace: UID/GID remapping (container root = unprivileged host user)
Detailed explanation:
Namespaces - The foundation of container isolation:
What it is: Namespaces make each container think it’s running on its own dedicated machine. Processes inside a container can’t see (or affect) processes in other containers or on the host.
How Docker/Kubernetes use namespaces:
Docker run creates:
1. New PID namespace β container sees PID 1 as its own init
2. New network namespace β container gets its own lo, eth0
3. New mount namespace β container sees its own /etc, /usr, /var
4. New UTS namespace β container has its own hostname
5. New IPC namespace β container's shared memory isolated
6. New user namespace → container root ≠ host root (optional)
Result: Container thinks it's a separate machine
Visual diagram of namespace isolation:
+----------------------------------------------------------------------+
| HOST SYSTEM (Real Linux Kernel)                                      |
|                                                                      |
|  +------------------------+        +------------------------+        |
|  | Container A            |        | Container B            |        |
|  | (Namespace Set #1)     |        | (Namespace Set #2)     |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | PID Namespace      | |        | | PID Namespace      | |        |
|  | |  PID 1: nginx      | |        | |  PID 1: postgres   | |        |
|  | |  PID 2: worker     | |        | |  PID 2: worker     | |        |
|  | |  (Isolated PIDs)   | |        | |  (Isolated PIDs)   | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | Net Namespace      | |        | | Net Namespace      | |        |
|  | |  eth0: 172.17.0.2  | |        | |  eth0: 172.17.0.3  | |        |
|  | |  lo:   127.0.0.1   | |        | |  lo:   127.0.0.1   | |        |
|  | |  (Own IP stack)    | |        | |  (Own IP stack)    | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  |                        |        |                        |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  | | Mount Namespace    | |        | | Mount Namespace    | |        |
|  | |  /: overlay2 fs    | |        | |  /: overlay2 fs    | |        |
|  | |  /etc: container   | |        | |  /etc: container   | |        |
|  | |  /var: container   | |        | |  /var: container   | |        |
|  | |  (Own root FS)     | |        | |  (Own root FS)     | |        |
|  | +--------------------+ |        | +--------------------+ |        |
|  +------------------------+        +------------------------+        |
|                                                                      |
|  +----------------------------------------------------------------+  |
|  | ACTUAL KERNEL RESOURCES                                        |  |
|  |                                                                |  |
|  | PIDs: 1234 (nginx), 1235 (worker), 1236 (postgres)...          |  |
|  | Network: Real NICs (eth0), bridges (docker0), veth pairs       |  |
|  | Mounts: /var/lib/docker/overlay2/abc123, /var/lib/docker...    |  |
|  +----------------------------------------------------------------+  |
+----------------------------------------------------------------------+
Key insight:
- Container A's PID 1 → real kernel PID 1234
- Container B's PID 1 → real kernel PID 1236
- Containers can't see each other's PIDs, networks, or filesystems
1. PID Namespace - Process isolation:
What it is: Each PID namespace has its own process tree starting at PID 1. Processes in the namespace only see other processes in the same namespace.
Important distinction: PID Namespace vs PID (Process ID):
Many people confuse these two concepts. Let’s clarify:
PID (Process ID):
- A number assigned to a running process
- Example: nginx process has PID 1234
- Every running process has a PID
- PIDs are unique within a namespace
PID Namespace:
- An isolation mechanism (like a container)
- Groups processes together
- Each namespace has its own PID numbering starting from 1
- Same process can have different PID numbers in different namespaces
Visual comparison:
+----------------------------------------------------------------------+
| Concept: PID (Process ID)                                            |
+----------------------------------------------------------------------+
|                                                                      |
| What it is: A NUMBER assigned to a running process                   |
|                                                                      |
| Example:                                                             |
|   $ ps aux                                                           |
|   PID    USER      COMMAND                                           |
|   1234   root      nginx: master process                             |
|   1235   www       nginx: worker process                             |
|   1236   postgres  postgres -D /var/lib/postgresql                   |
|   ^-- these are PIDs (just numbers)                                  |
|                                                                      |
| Analogy: Like a house number (123 Main Street)                       |
|          The number identifies the house                             |
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
| Concept: PID Namespace                                               |
+----------------------------------------------------------------------+
|                                                                      |
| What it is: An ISOLATION CONTAINER for processes                     |
|             Each namespace has its own PID numbering                 |
|                                                                      |
| Example:                                                             |
|                                                                      |
|   +-------------------+       +-------------------+                  |
|   | PID Namespace A   |       | PID Namespace B   |                  |
|   | (Container 1)     |       | (Container 2)     |                  |
|   |                   |       |                   |                  |
|   | PID 1: nginx      |       | PID 1: postgres   |                  |
|   | PID 2: worker     |       | PID 2: worker     |                  |
|   | PID 3: bash       |       | PID 3: bash       |                  |
|   +-------------------+       +-------------------+                  |
|                                                                      |
| These are PID namespaces (containers); each has its own PID          |
| numbering (both start at 1)                                          |
|                                                                      |
| Analogy: Like different cities (New York vs Tokyo)                   |
|          Each city has its own "123 Main Street"; the street         |
|          address repeats, but the cities are different               |
+----------------------------------------------------------------------+
Real-world example showing the difference:
# HOST SYSTEM (Default PID Namespace)
$ ps aux
PID    USER      COMMAND
1      root      /sbin/init        ← Host init (real PID 1)
1234   root      nginx: master     ← nginx in container A
1235   www       nginx: worker
1236   postgres  postgres: main    ← postgres in container B
1237   postgres  postgres: worker

# CONTAINER A (New PID Namespace)
$ docker exec -it container-a ps aux
PID  USER  COMMAND
1    root  nginx: master   ← Same process as host PID 1234
2    www   nginx: worker   ← Same process as host PID 1235

# CONTAINER B (Another PID Namespace)
$ docker exec -it container-b ps aux
PID  USER      COMMAND
1    postgres  postgres: main     ← Same process as host PID 1236
2    postgres  postgres: worker   ← Same process as host PID 1237
Key insight - Same process, multiple PIDs:
            +----------------------+
            |   ACTUAL PROCESS     |
            |   (nginx master)     |
            |  Running in kernel   |
            +----------------------+
              /                  \
             /                    \
+---------------------+    +----------------------+
| How HOST sees it:   |    | How CONTAINER sees:  |
| PID 1234            |    | PID 1                |
+---------------------+    +----------------------+
Same process = different PID numbers depending on the namespace viewing it
Why this matters:
- Isolation: Container A’s PID 1 can’t kill Container B’s PID 1 (different namespaces)
- Security: Container sees PID 1-3, can’t see host PIDs 1234-1237
- Debugging: When you see “PID 1” in container logs, it’s NOT the host’s PID 1
Common confusion resolved:
✗ Wrong thinking: "My container has PID 1, that means it's the system's init process"
✓ Correct thinking: "My container has PID 1 in its namespace. On the host, it's probably PID 12345"
✗ Wrong: "I killed PID 1234 on the host, why did the container die?"
✓ Correct: "PID 1234 on the host is PID 1 in the container's namespace. Killing the container's PID 1 kills the entire container"
Summary table:
Aspect     | PID (Process ID)            | PID Namespace
-----------|-----------------------------|---------------------------------------
What it is | A number (identifier)       | An isolation container
Example    | 1234, 1235, 1236            | Container A, Container B, Host
Uniqueness | Unique within a namespace   | Each namespace is separate
Purpose    | Identify a specific process | Isolate groups of processes
Analogy    | House number (123)          | City/neighborhood (New York, Tokyo)
Created by | Kernel when process starts  | unshare(CLONE_NEWPID) or Docker
Lifetime   | Until process exits         | Until last process in namespace exits
Real-world example:
# On host
$ ps aux | grep nginx
root 1234 0.0 0.1 nginx: master process
www-data 1235 0.0 0.1 nginx: worker process
# Inside container
$ ps aux
PID USER COMMAND
1 root nginx: master process ← This is PID 1234 on the host
2 www-data nginx: worker process ← This is PID 1235 on the host
Container sees: PID 1, 2
Host sees: PID 1234, 1235
Same process, different PID numbers in different namespaces
Why PID 1 matters:
- PID 1 in Unix is special: it's the init process
- When PID 1 exits, kernel kills all processes in that namespace
- Docker container exits when PID 1 (entrypoint) exits
Diagram of PID namespace mapping:
Container PID Namespace Host PID Namespace
βββββββββββββββββββββββ ββββββββββββββββββββββββ
β β β β
β PID 1 (nginx) βββΌβββββββββΆβ PID 1234 (nginx) β
β PID 2 (worker) βββΌβββββββββΆβ PID 1235 (worker) β
β PID 3 (bash) βββΌβββββββββΆβ PID 1236 (bash) β
β β β β
β Can only see β β Can see ALL PIDs: β
β PIDs 1, 2, 3 β β 1, 1234, 1235, 1236 β
βββββββββββββββββββββββ ββββββββββββββββββββββββ
Security benefit: Container process can’t send signals (kill) to host processes because it can’t see their PIDs.
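Namespace identity is visible through magic symlinks under /proc: two processes share a namespace exactly when the links resolve to the same inode. A quick self-check (Linux /proc assumed):

```shell
# Compare the PID-namespace identity of this shell and a child process.
shell_ns=$(readlink /proc/$$/ns/pid)     # the shell itself
child_ns=$(readlink /proc/self/ns/pid)   # /proc/self = the readlink child
echo "shell: $shell_ns"
echo "child: $child_ns"
[ "$shell_ns" = "$child_ns" ] && echo "same PID namespace"
```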
2. Network Namespace - Network stack isolation:
What it is: Each network namespace has its own network interfaces, IP addresses, routing tables, firewall rules, and sockets.
How containers get networking:
Step 1: Create network namespace
  → Container gets an isolated network stack (empty)
  → No interfaces except loopback (lo)
Step 2: Create a veth pair (virtual ethernet cable)
  → One end in the container namespace (eth0)
  → Other end in the host namespace (vethXXX)
Step 3: Connect the host end to a bridge (docker0)
  → Now the container can reach the host and other containers
Step 4: Configure NAT on the host
  → Container can reach the internet (host translates addresses)
Visual diagram of container networking:
+---------------------------------------------------------------------+
| HOST                                                                |
|                                                                     |
|  +---------------------+          +---------------------+           |
|  | Container A         |          | Container B         |           |
|  | Network Namespace   |          | Network Namespace   |           |
|  |                     |          |                     |           |
|  | eth0: 172.17.0.2    |          | eth0: 172.17.0.3    |           |
|  | gateway: 172.17.0.1 |          | gateway: 172.17.0.1 |           |
|  +----------+----------+          +----------+----------+           |
|             |                                |                      |
|             | veth pair                      | veth pair            |
|             | (virtual cable)                | (virtual cable)      |
|             |                                |                      |
|  +----------+--------------------------------+----------+           |
|  |          docker0 bridge (172.17.0.1)                 |           |
|  |     (Virtual switch connecting containers)           |           |
|  +--------------------------+---------------------------+           |
|                             |                                       |
|                             | NAT (iptables)                        |
|                             |                                       |
|                      +------+------+                                |
|                      | eth0        |  (Host physical NIC)           |
|                      | Public IP   |                                |
|                      +------+------+                                |
|                             |                                       |
+-----------------------------+---------------------------------------+
                              |
                              v
                          Internet
Data flow example:
Container A (172.17.0.2) → docker0 bridge → NAT (source IP changed to host IP) → eth0 → Internet
Real-world example:
# On host
$ ip addr show docker0
docker0: <BROADCAST,MULTICAST,UP>
inet 172.17.0.1/16 scope global docker0
$ ip addr show veth1a2b3c
veth1a2b3c: <BROADCAST,MULTICAST,UP>
(connected to container A's eth0)
# In container
$ ip addr show eth0
eth0: <BROADCAST,MULTICAST,UP>
inet 172.17.0.2/16 scope global eth0
(connected to host's veth1a2b3c)
$ ip route
default via 172.17.0.1 dev eth0
(Container routes all traffic through docker0)
Security benefit: Container can’t sniff packets from other containers. Each has isolated network stack.
3. Mount Namespace - Filesystem isolation:
What it is: Each mount namespace sees its own filesystem tree. Mounting a filesystem in one namespace doesn’t affect others.
Real-world example:
# On host
$ mount | grep /var/lib/docker
overlay on /var/lib/docker/overlay2/abc123/merged type overlay
# In container
$ mount
overlay on / type overlay (rw,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k)
proc on /proc type proc (rw,nosuid,nodev,noexec)
Container sees "/" as root
Host sees it as /var/lib/docker/overlay2/abc123/merged
Diagram of mount namespace:
Container Mount Namespace      Host Mount Namespace
+----------------------+       +------------------------------------+
| /                    |       | /                                  |
|  +- /etc             |       |  +- /etc  (real host /etc)         |
|  +- /usr             |       |  +- /usr                           |
|  +- /var             |       |  +- /var                           |
|  +- /app             |       |  +- /var/lib/docker/overlay2/      |
|  (container FS)      |       |      +- abc123/merged/  <- A root  |
|                      |       |      +- def456/merged/  <- B root  |
+----------------------+       +------------------------------------+

When the container reads /etc/passwd:
→ Container sees /etc/passwd (in its namespace)
→ Kernel translates it to /var/lib/docker/overlay2/abc123/merged/etc/passwd
→ Container can't access the host's real /etc/passwd
Security benefit: The container can't modify the host filesystem. Even if it writes to /etc, it's writing to its own overlay, not the host's /etc.
4. UTS Namespace - Hostname isolation:
What it is: Each UTS namespace can have its own hostname and domain name.
Real-world example:
# On host
$ hostname
production-server-01
# In container A
$ hostname
web-container
# In container B
$ hostname
db-container
Each container has its own hostname
Host hostname unchanged
Why it matters: Applications that log hostnames (like distributed systems) can identify which container logged what, even when multiple containers run same image.
5. IPC Namespace - Shared memory isolation:
What it is: Each IPC namespace has isolated System V IPC objects (shared memory segments, semaphores, message queues).
Real-world example:
# On host
$ ipcs -m
Shared Memory Segments
key shmid owner bytes
0x00000000 32768 postgres 16777216
# In container
$ ipcs -m
Shared Memory Segments
key shmid owner bytes
0x00000000 65536 postgres 8388608
Different shared memory segments
Container can't access host's shared memory
Why it matters: Prevents containers from using shared memory to communicate (potential side channel for attacks).
6. User Namespace - UID/GID remapping (most complex):
What it is: Maps user IDs inside container to different IDs on host. Container root (UID 0) can be mapped to unprivileged user on host (UID 100000).
Real-world example:
# In container (user namespace enabled)
$ id
uid=0(root) gid=0(root) groups=0(root)
$ whoami
root
$ touch /tmp/file
$ ls -la /tmp/file
-rw-r--r-- 1 root root 0 Oct 17 12:00 /tmp/file
# On host (looking at container's file)
$ ls -la /var/lib/docker/.../merged/tmp/file
-rw-r--r-- 1 100000 100000 0 Oct 17 12:00 file
             ↑
Container UID 0 → Host UID 100000
Diagram of UID mapping:
Container User Namespace Host User Namespace
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β β β β
β UID 0 (root) βββΌββββββββΆβ UID 100000 β
β UID 1 (daemon) βββΌββββββββΆβ UID 100001 β
β UID 1000 (user) βββΌββββββββΆβ UID 101000 β
β β β β
β Container thinks β β Host sees container β
β process is root β β as unprivileged β
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
Mapping configured in /etc/subuid and /etc/subgid
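The active mapping for a running process can be inspected directly; this is a sketch reading `/proc/self/uid_map` (the remapped-container line in the comments is illustrative, not output from a real host):

```shell
# Each line: <first UID inside the ns> <first UID outside> <count>
cat /proc/self/uid_map
# In the initial user namespace this is the identity map:
#          0          0 4294967295
# In a remapped container it would look something like:
#          0     100000      65536
```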
Security benefit (huge):
Without user namespaces:
Container root (UID 0) = Host root (UID 0)
If container escapes, attacker is real root
Game over
With user namespaces:
Container root (UID 0) → Host UID 100000 (unprivileged)
If container escapes, attacker has UID 100000 (can't sudo, can't read /etc/shadow)
Much safer
Why Docker doesn’t enable it by default: Compatibility issues with volume mounts (file ownership gets confusing).
Pitfall 1: Leaking namespaces via socket files
- If a process dies but a socket inside its netns remains, the netns persists (hidden)
- Fix: `ip netns list` shows active namespaces; delete with `ip netns delete NETNS`
Pitfall 2: User namespace security misconfiguration
- Mapping container root (UID 0) to a host unprivileged user is complex
- Misconfiguration allows container escape
- Fix: Use `podman`, which enables user namespaces by default (unlike Docker)
Key metrics/tools:
lsns -t pid # List PID namespaces
ip netns list # List network namespaces
nsenter -t PID -a /bin/bash # Enter all namespaces of PID
unshare -p -f /bin/bash # Create new PID namespace
ls -l /proc/PID/ns/          # Inode numbers (same inode = same namespace)
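The "same inode = same namespace" rule from the table above can be demonstrated without root; this sketch compares the UTS namespace links of two processes:

```shell
# /proc/PID/ns/* are magic symlinks like uts:[4026531838];
# identical link targets mean identical namespaces.
a=$(readlink "/proc/$$/ns/uts")     # the current shell
b=$(readlink /proc/self/ns/uts)     # readlink's own process (same ns)
if [ "$a" = "$b" ]; then
    echo "same UTS namespace: $a"
else
    echo "different UTS namespaces: $a vs $b"
fi
```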
7. cgroups v2 (Resource Limits & Accounting)
What it is:
- cgroup: Control group; limits/accounts CPU, memory, I/O, pids, network for a process group
- v2 unified hierarchy: Single tree (v1 had a separate hierarchy per controller)
- CPU limit: `cpu.max=50000 100000` = 50% of one core
- Memory limit: `memory.max=1G` enforces a hard limit (OOM killer if exceeded)
- PSI (Pressure Stall Information): Metrics on resource contention (CPU throttling, memory pressure, I/O wait)
Detailed explanation:
cgroups - How containers limit resources (and don’t crash the host):
What it is: A cgroup is a way to group processes and apply resource limits to the group as a whole.
Why it matters: Without cgroups, a runaway container process can use 100% CPU and starve all other processes on the host. With cgroups, you can say “this container gets max 2 CPU cores and 4GB RAM, period.”
Real-world example:
Host has 8 CPU cores, 32GB RAM
Container A (web server):
- cgroup limits: 2 CPU cores, 4GB RAM
- Actual usage: 1.5 cores, 2GB RAM → OK
Container B (batch job goes wild):
- cgroup limits: 2 CPU cores, 4GB RAM
- Tries to use: 8 cores, 10GB RAM
- cgroup enforces: Only gets 2 cores max, throttled
- Memory: Process killed (OOM) when exceeds 4GB
Host remains responsive: Containers can't steal unlimited resources
CPU limits - The two numbers explained:
Format: cpu.max=50000 100000
What it means:
- First number (50000): CPU quota in microseconds per period
- Second number (100000): Period length in microseconds
Math:
50000 microseconds / 100000 microseconds = 0.5 = 50% of one CPU core
100000 microseconds = 0.1 seconds = period resets 10 times per second
In each 0.1 second period:
- cgroup can use max 50000 microseconds (0.05 seconds) of CPU
- After using 50000 µs, process is throttled until next period
More examples:
cpu.max=200000 100000 = 200% = 2 full CPU cores
cpu.max=400000 100000 = 400% = 4 full CPU cores
cpu.max=10000 100000 = 10% = 0.1 CPU cores
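The quota/period arithmetic above can be wrapped in a tiny helper (a sketch; `cpu_pct` is a made-up name, not a real tool):

```shell
# cpu_pct QUOTA PERIOD -> percent of one core the cgroup may consume
cpu_pct() {
    awk -v q="$1" -v p="$2" 'BEGIN { printf "%g%%\n", q / p * 100 }'
}
cpu_pct 50000 100000    # -> 50%
cpu_pct 200000 100000   # -> 200% (two full cores)
```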
What “throttled” means:
Process runs for 50ms → reaches quota → scheduler stops running it
Wait 50ms (rest of period) → new period starts → process can run again
Result: Bursty performance (run, pause, run, pause...)
Check: cat /sys/fs/cgroup/cpu.stat
nr_throttled: 1234 → Number of times throttled
throttled_usec: 5000000 → Microseconds spent throttled
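Those two counters are more useful as a ratio; this sketch parses a sample cgroup v2 `cpu.stat` from a heredoc (point it at the real file under `/sys/fs/cgroup/` in practice):

```shell
# Fraction of scheduling periods in which the group hit its quota.
awk '
    /^nr_periods/   { periods   = $2 }
    /^nr_throttled/ { throttled = $2 }
    END { if (periods) printf "throttled in %.1f%% of periods\n", throttled / periods * 100 }
' <<'EOF'
usage_usec 1200000
nr_periods 1000
nr_throttled 250
throttled_usec 5000000
EOF
# -> throttled in 25.0% of periods
```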
Memory limits - Hard vs soft:
memory.max (hard limit):
memory.max=1G
Process allocates memory:
500MB → OK
800MB → OK
1.1GB → OOM KILL (exceeds limit)
Kernel reclaims what it can, then OOM-kills when the limit still can't be met
No warning to the application, no grace period
memory.high (soft limit):
memory.high=1G
memory.max=2G
Process allocates memory:
500MB → OK
1.1GB → Kernel starts aggressive reclamation (swapping, cache eviction)
      → Process slows down but is not killed
2.1GB → OOM KILL (exceeded hard limit)
Think of memory.high as the "warning track" before the hard limit
Why use soft limits:
- Gives application chance to release memory (GC, cache flush)
- Avoids abrupt crashes
- Production pattern:
memory.high = 0.8 × memory.max
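In shell arithmetic the pattern looks like this (a sketch; the cgroup path `/sys/fs/cgroup/myapp/` is hypothetical, and the writes require root, so they are left commented out):

```shell
MAX=$((4 * 1024 * 1024 * 1024))   # memory.max: 4 GiB hard limit
HIGH=$((MAX * 8 / 10))            # memory.high: 80% of the hard limit
echo "memory.max=$MAX memory.high=$HIGH"
# Apply as root (path is illustrative):
# echo "$MAX"  > /sys/fs/cgroup/myapp/memory.max
# echo "$HIGH" > /sys/fs/cgroup/myapp/memory.high
```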
PSI (Pressure Stall Information) - Early warning system:
What it is: PSI tells you when processes are waiting for resources.
Example:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
$ cat /proc/pressure/memory
some avg10=15.50 avg60=8.23 avg300=3.45 total=125000
full avg10=5.00 avg60=2.10 avg300=0.80 total=50000
Interpretation:
some avg10=15.50 → for 15.5% of the last 10 seconds, some process waited for memory
full avg10=5.00 → for 5% of the last 10 seconds, ALL processes stalled (thrashing)
Action:
some > 10% → memory pressure building, consider adding RAM
full > 1% → critical, processes constantly stalled
Why this is better than traditional metrics:
Traditional: free -h shows 100MB free → "Is that bad?"
PSI: avg10=0.00 → "No processes waiting, system fine despite low free RAM"
Traditional: free -h shows 10GB free → "Looks OK"
PSI: avg10=20.00 → "Processes waiting 20% of the time, something is wrong!"
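A PSI threshold check is easy to script; this sketch parses the `some` line (sample data inlined so it runs anywhere; read `/proc/pressure/memory` instead on a real host):

```shell
# Split on '=' and spaces: $1=some, $2=avg10, $3=<value>, ...
awk -F'[= ]' '
    $1 == "some" { avg10 = $3 }
    END {
        if (avg10 > 10) printf "WARN: memory pressure avg10=%s%%\n", avg10
        else            printf "OK: avg10=%s%%\n", avg10
    }
' <<'EOF'
some avg10=15.50 avg60=8.23 avg300=3.45 total=125000
full avg10=5.00 avg60=2.10 avg300=0.80 total=50000
EOF
# -> WARN: memory pressure avg10=15.50%
```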
Unified hierarchy (v2) vs v1:
cgroups v1 problem: Each controller (CPU, memory, I/O) had a separate hierarchy. You could have:
/sys/fs/cgroup/cpu/container1/
/sys/fs/cgroup/memory/container2/
A process in container1 for CPU but container2 for memory → confusing!
cgroups v2 solution: Single hierarchy:
/sys/fs/cgroup/container1/
├── cpu.max
├── memory.max
├── io.max
└── pids.max
All controllers unified; a process belongs to exactly one cgroup
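Which version a host runs can be detected from the filesystem type mounted at `/sys/fs/cgroup` (a sketch; `cgroup2fs` indicates pure v2, `tmpfs` indicates v1 or a hybrid layout):

```shell
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
case "$fstype" in
    cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroup v1 or hybrid" ;;
    *)         echo "cgroup filesystem not mounted?" ;;
esac
```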
Pitfall 1: Memory limit too aggressive
- Setting `memory.max=1G` for a 2GB app = guaranteed OOM kill
- Fix: Set `memory.high` as a soft limit (kernel reclaims first); use `memory.max` as the absolute last resort
Pitfall 2: Ignoring the swap limit
- If the cgroup allows swap, the process can spill to disk (slow) and the memory limit no longer caps total usage
- Fix: In production, set `memory.swap.max=0` (cgroup v2) to forbid swap for the group
Key metrics/tools:
mount -t cgroup2 # Verify cgroup v2 mounted
cat /proc/PID/cgroup # Show PID's cgroup
echo "+memory +cpu +io" > /sys/fs/cgroup/cgroup.subtree_control # Enable controllers
cat /sys/fs/cgroup/GROUP/memory.stat # Memory stats (memory.events shows OOM kills)
docker inspect CONTAINER | grep -i memory # Container limits
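A process can also find its own cgroup without root; on a v2 host `/proc/self/cgroup` contains a single `0::` entry (example paths below are illustrative):

```shell
cat /proc/self/cgroup
# v2 example:  0::/user.slice/user-1000.slice/session-1.scope
# v1 example:  4:memory:/some/group   (one line per controller)
```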
8. Linux Security Modules (LSM)
What it is:
- SELinux: Type enforcement (TE); security contexts on files, processes, ports
- AppArmor: Path-based access control (simpler than SELinux)
- Capabilities: Fine-grained privileges; e.g., `CAP_NET_ADMIN` for network config
- seccomp: Syscall filtering; restricts which syscalls a process can invoke
Pitfall 1: SELinux in enforcing mode causes mysterious failures
- Default policies may block legitimate app behavior
- Error: “Permission denied” but no clear root cause
- Fix: Check denials with `audit2why`; use `audit2allow` to generate rules; test in `permissive` mode first
Pitfall 2: Overpermissive capabilities
- Containers run with unnecessary capabilities (e.g., `CAP_SYS_ADMIN` = almost root)
- Fix: Use `docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE` for a strict baseline
Key metrics/tools:
getenforce # SELinux status (Disabled/Permissive/Enforcing)
setenforce 0 # Switch to permissive (temp)
tail -f /var/log/audit/audit.log | grep denied # Live denial logging
apparmor_status # AppArmor profiles
getcap /path/to/binary # Show binary capabilities
setcap cap_net_bind_service=+ep /bin/myapp # Grant capability
grep Seccomp /proc/PID/status # Show seccomp mode (0=off, 1=strict, 2=filter)
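Capability sets for the current process are likewise visible without extra tooling; this sketch reads them from `/proc/self/status` (the hex masks can be decoded with `capsh --decode` where libcap is installed):

```shell
# CapEff = effective set; all zeros means fully unprivileged.
grep '^Cap' /proc/self/status
# Example output (values are hex bitmasks, host-dependent):
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
```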
9. eBPF (In-Kernel Observability & Enforcement)
What it is:
- eBPF: Sandboxed VM in kernel; run custom programs without module recompilation
- BCC (BPF Compiler Collection): Python wrapper; easy-to-use eBPF tools
- bpftrace: High-level eBPF language; one-liners for tracing
- XDP (eXpress Data Path): Attach eBPF to NIC driver; ultra-low-latency packet processing
- Tracing: Hook syscalls, kernel functions, events; track execution flow
Pitfall 1: eBPF programs have kernel resource limits
- Large eBPF program β verifier rejection (“stack size too large”)
- Maps too large β OOM or allocation failure
- Fix: Use BCC/bpftrace for small, focused tracing; production tools (Cilium) handle complexity
Pitfall 2: Unbounded eBPF loops are rejected
- The verifier rejects loops it can't prove terminate; on older kernels the workaround is `#pragma unroll` (fixed count only)
- Fix: On kernels ≥5.3, use loops with a bounded iteration count; otherwise unroll
Key metrics/tools:
bpftool prog show # List loaded eBPF programs
bpftool map dump name MAPNAME # Dump map contents
bpftrace -l # List available tracepoints
bpftrace -e 'tracepoint:syscalls:sys_enter_open* { printf("%s\n", str(args->filename)); }' -c "ls /" # Trace file opens
sudo /usr/share/bcc/tools/opensnoop # Trace open() syscalls
sudo /usr/share/bcc/tools/biotop # Top processes by block I/O
Quick Troubleshooting Matrix
| Symptom | Suspect Subsystem | First Check | Quick Fix |
|---|---|---|---|
| High load but low CPU% | Scheduler OR I/O wait | `iostat -x 1`, `vmstat` (wa column) | Check disk/network; run `perf record` |
| OOM kills despite free RAM | Memory / cgroups | `cat /proc/pressure/memory`, cgroup v2 limits | Increase `memory.max` or add swap |
| Can't create files | VFS / inodes | `df -i` shows 100% | Clean up files or recreate fs with more inodes |
| Slow disk random I/O | Block I/O scheduler | `cat /sys/block/sda/queue/scheduler` | Set to `none` or `mq-deadline` on SSDs |
| Connection refused (random ports) | Networking / conntrack | `cat /proc/net/stat/nf_conntrack` | Increase `nf_conntrack_max` |
| Container can't reach network | Namespaces | `ip netns list`, `ip route show` | Check veth pair, routes in netns |
| Permission denied (no obvious reason) | LSM (SELinux/AppArmor) | `getenforce`, `setenforce 0` | Temporarily set permissive, check /var/log/audit/audit.log |
| Can't trace with eBPF | eBPF / kernel version | `uname -r` (need ≥4.9), `bpftool prog show` | Upgrade kernel or use ftrace as fallback |
One-Page Cheat: Critical Limits
# Memory
sysctl vm.swappiness=10 # Reduce swapping
sysctl vm.watermark_scale_factor=10 # Kswapd reclaim threshold
sysctl vm.max_map_count=262144 # Max memory maps per process
# Networking
sysctl net.core.somaxconn=65535 # Backlog for listening sockets
sysctl net.ipv4.tcp_max_syn_backlog=65535 # SYN backlog (DDoS mitigation)
sysctl net.netfilter.nf_conntrack_max=2000000 # Conntrack table
# File descriptors
ulimit -n 65535 # Per-process FD limit
# Kernel debug
echo 1 > /proc/sys/kernel/perf_event_paranoid # Relax perf restrictions for unprivileged users (-1 = none)
# cgroups
echo 262144 > /sys/fs/cgroup/GROUP/pids.max # Max PIDs in a cgroup (GROUP = cgroup path)
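Before applying any of these, it helps to record the current values; they are all readable without root through `/proc` (a sketch covering a small selection of the tunables above):

```shell
for f in /proc/sys/vm/swappiness \
         /proc/sys/net/core/somaxconn \
         /proc/sys/fs/file-max; do
    [ -r "$f" ] && printf '%-32s %s\n' "$f" "$(cat "$f")"
done
ulimit -n    # current per-process FD limit
```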