Overview

This is a one-page cheat sheet for Linux kernel subsystems. Each subsystem controls a critical resource; understanding them is essential for troubleshooting, optimization, and security.

Why understanding subsystems matters:

Imagine your server is slow. Without subsystem knowledge, you’re guessing:

  • “Maybe add more RAM?” (might be CPU scheduler issue)
  • “Maybe faster disk?” (might be memory cache problem)
  • “Maybe more CPU?” (might be I/O scheduler misconfiguration)

With subsystem knowledge, you diagnose systematically:

Symptom: Application slow
↓
Check: top shows 80% CPU "wait" (wa)
↓
Diagnosis: NOT a CPU problem → I/O wait means disk subsystem
↓
Check: iostat shows %util=100% on /dev/sda
↓
Diagnosis: Block I/O subsystem bottleneck
↓
Fix: Check I/O scheduler, investigate slow queries, add SSD

What this guide covers:

Each of the 9 subsystems below answers one question:

  1. Scheduler: How does Linux decide which process runs next?
  2. Memory: How does Linux manage RAM, cache, and swap?
  3. VFS: How does Linux present files to applications?
  4. Block I/O: How does Linux talk to disks?
  5. Networking: How does Linux send/receive packets?
  6. Namespaces: How does Linux isolate containers?
  7. cgroups: How does Linux limit resource usage?
  8. LSM: How does Linux enforce security policies?
  9. eBPF: How does Linux enable custom kernel-level observability?

How to use this guide:

  • Learning: Read each subsystem to understand what it does
  • Troubleshooting: Jump to subsystem matching your symptom (use the table at bottom)
  • Reference: Copy-paste commands when investigating issues

1. Process Scheduler (CPU)

What it is:

  • Fairly distributes CPU time among processes using CFS (Completely Fair Scheduler)
  • Tracks virtual runtime (vruntime); process with lowest vruntime runs next
  • Supports real-time (RT) scheduling for critical tasks (separate scheduler)
  • Balances load across CPU cores while respecting NUMA locality

Detailed explanation:

CFS (Completely Fair Scheduler) - The “fairness” guarantee:

Think of CFS like a teacher distributing speaking time among students. The student who has spoken the least gets to speak next.

How it works:

  • Each process has a vruntime (virtual runtime) counter that tracks total CPU time used
  • When CPU becomes available, CFS picks the process with the LOWEST vruntime
  • Process runs for a time slice (~4-6ms by default), then vruntime increases
  • Process goes back to queue, next lowest vruntime runs

Example:

Process A: vruntime = 100ms (ran a lot)
Process B: vruntime = 50ms  (ran less)
Process C: vruntime = 75ms  (middle)

→ CFS picks Process B (lowest = 50ms)
→ B runs for 5ms → vruntime becomes 55ms
→ CFS picks B again (still lowest)
→ B runs for 5ms → vruntime becomes 60ms
→ CFS picks B again... until B's vruntime catches up to others

Why this matters: CPU-intensive processes don’t starve I/O-bound processes. A background video encoder (high vruntime) won’t block your SSH session (low vruntime).
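
To see these numbers on a live system, a rough sketch (field names in /proc/<pid>/sched vary by kernel version and need CONFIG_SCHED_DEBUG; the target process and the gzip job are just examples):

# Inspect CFS accounting for one process
pid=$(pgrep -o sshd)                       # example target; pick any busy PID
grep -E 'policy|se.vruntime|nr_switches' /proc/$pid/sched

# Nice values change the weight: a nicer process accumulates vruntime faster
nice -n 10 gzip -9 big-archive.tar         # hypothetical CPU-heavy job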

Real-time scheduling - When “fairness” isn’t enough:

What it is: Some tasks can’t wait: audio processing must happen within 10ms or you hear glitches. RT scheduler bypasses CFS entirely.

RT scheduling classes:

  • SCHED_FIFO (First In First Out): Process runs until it yields or higher-priority RT process arrives. No time slicing.
  • SCHED_RR (Round Robin): Like FIFO but with time slicing among same-priority RT processes.
  • SCHED_DEADLINE: Advanced: specify CPU time needed + deadline. Kernel schedules to meet deadline.

Real-world example:

# Audio processing daemon needs guaranteed CPU
sudo chrt -f 80 /usr/bin/audio-daemon

What happens:
- audio-daemon gets priority 80 (higher = more important)
- When audio-daemon wants CPU, it preempts ALL normal processes
- CFS processes wait until audio-daemon sleeps/finishes

Danger: RT process in infinite loop = system hang. All other processes starve.
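
Hedged examples of the other RT policies (binary names are placeholders; SCHED_DEADLINE values are in nanoseconds):

# SCHED_RR at priority 50: time-sliced among equal-priority RT tasks
sudo chrt -r 50 ./control-loop

# SCHED_DEADLINE: needs 10ms of CPU in every 100ms period
sudo chrt -d --sched-runtime 10000000 --sched-deadline 100000000 \
             --sched-period 100000000 0 ./sensor-reader

# Built-in safety valve: RT tasks get at most 950ms of every 1000ms by default
cat /proc/sys/kernel/sched_rt_runtime_us /proc/sys/kernel/sched_rt_period_us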

Load balancing across cores - Why your 8-core CPU matters:

What it is: If you have 8 CPU cores, scheduler tries to use all 8 evenly. Otherwise core 0 might be 100% busy while core 7 idles.

How it works:

  • Scheduler periodically (every ~4ms) checks if cores are imbalanced
  • If core 0 has 10 processes and core 1 has 2, scheduler migrates some from 0→1
  • Respects CPU affinity (taskset binds process to specific cores)

NUMA locality - Why memory proximity matters:

  • Modern servers have multiple NUMA nodes (CPU+RAM pairs)
  • Accessing local RAM (same NUMA node) is 2-3x faster than remote RAM
  • Scheduler tries to keep process on same NUMA node as its memory

Example:

NUMA node 0: CPU 0-7, RAM 0-64GB
NUMA node 1: CPU 8-15, RAM 64-128GB

Process on CPU 0 accessing RAM at 10GB: Fast (local)
Process on CPU 8 accessing RAM at 10GB: Slow (crosses NUMA boundary)

→ Scheduler prefers keeping process on node 0
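
Commands to inspect and control placement (the binary names are placeholders):

# Show NUMA nodes, their CPUs, and free memory per node
numactl --hardware

# Run a process with CPUs and memory pinned to node 0
numactl --cpunodebind=0 --membind=0 ./db-server

# Or pin to specific cores (the scheduler won't migrate it elsewhere)
taskset -c 0-3 ./worker
taskset -cp 1234          # show/change affinity of an already-running PID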

Pitfall 1: Ignoring load average

  • Load average counts runnable tasks plus tasks in uninterruptible (disk) sleep, so load > num_cores means CPU contention or heavy I/O wait
  • uptime shows 1/5/15-min averages; the trend matters more than the absolute value
  • Fix: Use top or pidstat to find the busy processes, then perf to see where they spend their CPU time

Pitfall 2: Misusing real-time priority

  • RT tasks bypass CFS, can starve other processes
  • Setting chrt -f 99 command can hang the system if not careful
  • Fix: Reserve RT for genuinely critical, bounded-time work; use SCHED_DEADLINE for advanced users

Key metrics/tools:

uptime                    # Load average
top -d 1 -p PID          # Per-process CPU% (user/system time), 1-second refresh
ps -eo pid,comm,class    # Scheduling class (TS=SCHED_OTHER, FF=FIFO, RR=round-robin)
perf stat command        # IPC, cache misses, context switches
systemd-analyze plot     # Boot parallelization

2. Memory Management (RAM, Swap, Cache)

What it is:

  • Provides virtual address spaces to each process (MMU translates to physical RAM)
  • Page cache: OS caches file data in RAM for speed; pages evicted when RAM needed
  • Swap: Moves inactive pages to disk (slow fallback when RAM full)
  • THP (Transparent Hugepages): Automatically backs memory with 2MB pages instead of 4KB to reduce TLB misses

Detailed explanation:

Virtual memory - Why every process thinks it owns all RAM:

What it is: Each process sees its own private address space (like 0x00000000 to 0xFFFFFFFF on 32-bit). The MMU (Memory Management Unit, hardware) translates these “virtual” addresses to actual physical RAM locations.

Why it matters:

Process A reads from address 0x1000 → MMU translates to physical RAM 0x500000
Process B reads from address 0x1000 → MMU translates to physical RAM 0x700000

Same virtual address, different physical RAM = processes isolated from each other

Real-world benefit: Process crashes can’t corrupt other processes’ memory. Container at virtual 0x1000 can’t read host memory.

Page cache - Why your second ls is instant:

What it is: When you read a file, Linux copies it into RAM. Next time you read, Linux serves from RAM (instant) instead of disk (milliseconds).

Example:

# First read: 500ms (from disk)
time cat /var/log/syslog > /dev/null
# real    0m0.500s

# Second read: 5ms (from page cache)
time cat /var/log/syslog > /dev/null
# real    0m0.005s  ← 100x faster!

How it works:

  • File reads → Linux copies disk blocks into RAM pages
  • RAM pages marked as “page cache”
  • When RAM needed for applications, kernel evicts least-recently-used cache pages
  • Modified pages (writes) must be flushed to disk first (dirty pages)

This is why free -h shows “used” RAM as high: Linux uses “free” RAM for caching. It’s not wasted; it’s optimized.

$ free -h
              total        used        free      shared  buff/cache
Mem:           31Gi       5.0Gi       1.0Gi       100Mi        25Gi

25GB in "buff/cache" = page cache
If application needs RAM, kernel evicts cache automatically
This is GOOD, not bad
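
To watch the cache in action (root is only needed for the cache-drop step; do this on a test box, not in production):

# How much RAM is cache, and how much is dirty (written but not yet flushed)?
grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo

# Reproduce the cold/warm read difference
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop clean caches (testing only)
time cat /var/log/syslog > /dev/null                 # cold: reads from disk
time cat /var/log/syslog > /dev/null                 # warm: served from page cache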

Swap - The emergency pressure valve:

What it is: When RAM is full, Linux moves inactive memory pages to disk (swap partition or swap file). This frees RAM for active processes.

Why it’s slow:

RAM access:  ~100 nanoseconds
Disk access: ~10 milliseconds
→ Swap is 100,000x slower than RAM

When swap is good:

  • Inactive background process (like old SSH session) swapped out → active database gets more RAM
  • Temporary RAM spike → swap absorbs it, prevents OOM kill

When swap is bad:

  • Active process swapping in/out repeatedly (thrashing)
  • Example: Database with 8GB working set but only 4GB RAM → constant swap I/O → queries take 100x longer

How to detect thrashing:

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 500000  10000  20000  30000  5000 5000   100   200  500 1000 10 20 50 20  0

si=5000, so=5000 (swap in/out) = THRASHING
→ Application using RAM faster than available
→ Fix: Add RAM or reduce working set
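
A rough way to see who is actually swapped out, and how eager the kernel is to swap:

# Per-process swap usage (VmSwap), largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name:/{n=$2} /^VmSwap:/{print n, $2, $3}' "$f"
done | sort -k2 -rn | head

# Swappiness: 0 = avoid swap, higher = swap more eagerly (default 60)
cat /proc/sys/vm/swappiness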

Transparent Hugepages (THP) - Bigger pages, fewer lookups:

What it is: Normally Linux uses 4KB memory pages. THP uses 2MB pages automatically (512x larger).

Why bigger is sometimes better:

  • CPU has Translation Lookaside Buffer (TLB) that caches virtual→physical address mappings
  • TLB has limited entries (like ~1024)
  • With 4KB pages: 1024 entries × 4KB = 4MB coverage
  • With 2MB pages: 1024 entries × 2MB = 2GB coverage
  • Larger coverage = fewer TLB misses = faster memory access

Example benefit:

Database with 10GB dataset, sequential access:
- 4KB pages: Frequent TLB misses (10GB > 4MB TLB coverage)
- 2MB pages: Fewer TLB misses (10GB fits better in 2GB coverage)
→ 10-20% performance improvement in some workloads
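
To check what THP is doing on a given box:

# Current THP policy; brackets mark the active setting, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled

# How much anonymous memory is currently backed by huge pages?
grep AnonHugePages /proc/meminfo

# Common middle ground: only applications that ask (madvise) get huge pages
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled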

When THP hurts:

  • Fragmentation: Kernel tries to find contiguous 2MB physical RAM
  • If RAM fragmented (lots of small allocations), kernel spends time defragmenting
  • Defragmentation can cause 10-100ms latency spikes
  • Databases (Redis, MongoDB) with strict latency SLAs disable THP

Pitfall 1: Excessive swapping kills performance

  • Faulting a page back in from disk is orders of magnitude slower than a RAM access
  • High si/so (swap in/out) in vmstat indicates thrashing
  • Fix: Disable swap for latency-sensitive apps (swapoff -a); monitor PSI (Pressure Stall Info)

Pitfall 2: THP backfires on databases

  • Defragmentation latency spikes when THP pages get fragmented
  • Some DBs (MongoDB, Redis) prefer 4KB pages for predictability
  • Fix: Disable: echo never > /sys/kernel/mm/transparent_hugepage/enabled

Key metrics/tools:

free -h                              # Overall RAM/swap/cache
vmstat 1 5                           # si/so (swap I/O), wa (I/O wait)
ps aux | sort -k6 -rn | head -10    # Top memory consumers (RSS)
cat /proc/pressure/memory            # PSI memory stall percentages (see also /proc/pressure/cpu and io)
sar -B 1 5                           # Page faults, THP usage
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled  # THP only for apps that opt in via madvise()

3. Virtual Filesystem (VFS) & Filesystems

What it is:

  • VFS: Unified interface above ext4, XFS, Btrfs, NFS (abstraction layer)
  • Inode: File metadata (permissions, size, block pointers, timestamps)
  • Dentry: Filename β†’ inode mapping (cached in dcache for speed)
  • Mount: Attach filesystem to directory tree; multiple filesystems can coexist

Detailed explanation:

VFS - Why cat works on ext4, XFS, NFS, and even /proc:

What it is: VFS is an abstraction layer. Applications call open("/path/to/file"), and VFS translates that into the appropriate filesystem-specific operation.

Why it matters:

Application: open("/etc/passwd")
    ↓
VFS: "Which filesystem owns /etc?"
    ↓
VFS: "ext4 filesystem on /dev/sda1"
    ↓
VFS calls: ext4_open()
    ↓
ext4 driver reads inode, returns file descriptor

Same for all filesystems:

open("/mnt/nfs/file") β†’ VFS β†’ nfs_open() β†’ Network request to NFS server
open("/proc/cpuinfo") β†’ VFS β†’ proc_open() β†’ Kernel generates CPU info on-the-fly
open("/dev/sda") β†’ VFS β†’ block_device_open() β†’ Direct disk access

Application doesn't care; VFS handles it
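
You can watch VFS dispatching the same syscall to different filesystems with strace (output below is trimmed and approximate):

$ strace -e trace=openat cat /etc/hostname /proc/cpuinfo > /dev/null
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3     ← ext4/XFS/... on disk
openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3     ← procfs, generated on the fly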

Inode - The “real” file (not the filename):

What it is: An inode is a data structure that stores everything about a file EXCEPT its name:

  • File type (regular, directory, symlink)
  • Permissions (rwxr-xr-x)
  • Owner (UID/GID)
  • Size (in bytes)
  • Timestamps (created, modified, accessed)
  • Pointers to data blocks on disk

Filename is separate: Directory entries (dentries) map names → inode numbers.

Example:

$ ls -li /etc/passwd
12345678 -rw-r--r-- 1 root root 2048 Oct 16 12:00 /etc/passwd
         ↑
         Inode number

$ stat /etc/passwd
  File: /etc/passwd
  Size: 2048        Blocks: 8
  Inode: 12345678   Links: 1
  Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)

Hard links - Multiple names, one inode:

$ ln /etc/passwd /tmp/passwd-hardlink
$ ls -li /etc/passwd /tmp/passwd-hardlink
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /etc/passwd
12345678 -rw-r--r-- 2 root root 2048 Oct 16 12:00 /tmp/passwd-hardlink
         ↑ Same inode = same file, two names

Why running out of inodes breaks things:

$ touch /tmp/newfile
touch: cannot touch '/tmp/newfile': No space left on device

$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   50G   50G  50% /tmp

$ df -i /tmp
Filesystem      Inodes  IUsed  IFree IUse% Mounted on
/dev/sda1      6500000 6500000    0  100% /tmp
                       ↑ All inodes used!

Even though 50GB free, can't create files (no inodes available)
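
A quick way to find where all the inodes went (needs GNU du 8.22 or newer for --inodes):

# Count inodes (files/dirs) per top-level directory, biggest offenders first
sudo du --inodes -x -d1 / 2>/dev/null | sort -rn | head
# Then drill into the worst directory, e.g.:
sudo du --inodes -d1 /var 2>/dev/null | sort -rn | head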

Dentry cache (dcache) - Why ls the second time is instant:

What it is: Linux caches the mapping of filename β†’ inode in RAM.

Example:

# First lookup: Must read directory blocks from disk
$ ls /var/log/syslog
(kernel reads /var/log directory inode, finds syslog entry, caches it)

# Second lookup: Served from dcache (RAM)
$ ls /var/log/syslog
(instant - no disk I/O)

How it helps: Applications frequently access same files (/etc/hosts, /lib/x86_64-linux-gnu/libc.so.6). Dcache avoids disk reads.
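
To look at the dentry/inode caches themselves (counter meanings vary a little by kernel version):

# Current dentry cache counters: nr_dentry, nr_unused, age_limit, want_pages, ...
cat /proc/sys/fs/dentry-state

# Which slab caches are biggest? dentry and inode caches usually top the list
sudo slabtop -o | head -15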

Mount - Attaching filesystems to the directory tree:

What it is: Linux has one unified directory tree starting at /. You “mount” filesystems at specific paths.

Example:

# Root filesystem (ext4 on /dev/sda1) mounted at /
# Home directories (XFS on /dev/sdb1) mounted at /home
# NFS share mounted at /mnt/shared

$ mount
/dev/sda1 on / type ext4 (rw,relatime)
/dev/sdb1 on /home type xfs (rw,noatime)
nfs-server:/export on /mnt/shared type nfs (rw,soft,timeo=30)

When you access /home/user/file:
→ VFS sees "/home" is a mount point
→ Redirects to /dev/sdb1 (XFS filesystem)
→ XFS handles the request

Why this matters: You can have different filesystems for different directories (fast SSD for /var/lib/postgresql, slow HDD for /var/log).
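
findmnt gives a clearer picture than raw mount output:

# Tree of all mount points with filesystem type and options
findmnt

# Which mounted filesystem would serve this path?
findmnt -T /var/lib/postgresql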

Pitfall 1: Running out of inodes

  • Each file/dir/link = 1 inode; if df -i shows 100%, can’t create files even with free space
  • Millions of small files (temp logs, session stores) exhaust inodes quickly
  • Fix: tune2fs -l /dev/sda1 shows inode count; recreate fs with more inodes: mkfs.ext4 -N 1000000 /dev/sda1 (re-running mkfs destroys existing data, so back up first)

Pitfall 2: Suboptimal mount options

  • Default relatime still updates atime occasionally (when atime is older than mtime, or once a day), a small I/O overhead on read-heavy workloads
  • Write barriers/cache flushes (the safe default) make fsync wait for the device; the old nobarrier option trades crash safety for speed and has been removed from recent kernels
  • Fix: Use mount -o noatime,nodiratime for non-critical data; keep the default flush behaviour for databases

Key metrics/tools:

df -h                        # Disk space by filesystem
df -i                        # Inode usage (critical!)
mount | grep -E 'ext4|xfs'   # Show mount options
lsof | head -20              # Files open by process
sync; echo 3 > /proc/sys/vm/drop_caches  # Clear page cache (test)
fstrim -v /mount             # Discard unused blocks (SSDs)

4. Block I/O (Disk Scheduling)

What it is:

  • I/O scheduler: Orders and merges disk requests (mq-deadline, bfq, kyber, none; legacy names: deadline, cfq, noop)
  • io_uring: Modern async I/O interface (replaces aio); supports polling, fixed buffers, and batched submissions with far fewer syscalls
  • Request queue: Batches I/O requests before sending to device
  • Throughput vs latency: High throughput needs batching; low latency needs quick service

Pitfall 1: Wrong scheduler for your device

  • Spinning disk: use mq-deadline or bfq (reorders requests to cut seek time)
  • SSD/NVMe: use none (noop on older kernels) and let the device schedule
  • A heavyweight scheduler (bfq, or cfq on old kernels) on fast SSDs adds unnecessary sorting latency
  • Fix: Check cat /sys/block/sda/queue/scheduler; for production NVMe, none is usually safe (see the example below)
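
A minimal sketch of checking and switching the scheduler (device names and the udev rule filename are assumptions):

# Brackets mark the active scheduler
cat /sys/block/sda/queue/scheduler            # e.g. [mq-deadline] kyber bfq none
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Persist across reboots with a udev rule
echo 'ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"' \
  | sudo tee /etc/udev/rules.d/60-io-scheduler.rules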

Pitfall 2: io_uring without proper resource limits

  • io_uring buffers can pin kernel memory; too many in-flight async ops → allocation failures or OOM
  • Fix: Bound pinned memory with RLIMIT_MEMLOCK (check ulimit -l or /proc/<pid>/limits)

Key metrics/tools:

iostat -x 1 5                     # %util, await (avg request latency incl. queueing), queue size
iotop                             # Top processes by disk I/O
blktrace -d /dev/sda -o - | blkparse  # Detailed I/O tracing
perf record -e block:block_rq_* -- command  # I/O event tracing
fio --name=random-read --ioengine=libaio    # Disk benchmark

5. Networking Stack (Network I/O)

What it is:

  • nftables: Modern packet filtering framework (replaces iptables); rules run in an in-kernel bytecode VM
  • conntrack: Tracks TCP/UDP connection state; enables stateful firewall
  • qdisc (queuing discipline): Schedules outbound packets (pfifo, fq, cake, htb for traffic shaping)
  • tc (traffic control): Linux traffic shaping tool; applies qdiscs, classes, filters

Pitfall 1: Conntrack table exhaustion

  • Malicious/buggy clients create many short-lived connections; conntrack table fills
  • Result: dropped packets and “nf_conntrack: table full, dropping packet” messages in dmesg; new connections through NAT start failing
  • Fix: Monitor the entry count against the limit (see the check below); increase: sysctl -w net.netfilter.nf_conntrack_max=2000000
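
Quick conntrack health check (the conntrack CLI needs the conntrack-tools package):

# Current entries vs. the table limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
sudo conntrack -C        # same count, via conntrack-tools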

Pitfall 2: No egress rate limiting β†’ noisy neighbor

  • One container/VM burns all bandwidth; others starve
  • Fix: Apply tc qdisc: tc qdisc add dev eth0 root tbf rate 100mbit burst 32kb latency 400ms

Key metrics/tools:

ss -tulnp                              # TCP/UDP sockets, listening ports
cat /proc/net/netstat                  # IP stats (dropped, errors)
nft list ruleset                       # View nftables rules
ip netns list; ip netns exec NS ss -an # Namespace inspection
ethtool -S eth0                        # NIC driver stats (RX/TX drops, errors)
tc -s qdisc show dev eth0              # Queue discipline stats
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { ... }'  # Trace connects

6. Namespaces (Isolation for Containers)

What it is:

  • PID namespace: Process sees only its own PID tree; PID 1 in container ≠ PID 1 on host
  • Network namespace: Isolated network stack (veth, lo, routing table)
  • Mount namespace: Private root filesystem view; mount /dev/sda1 in container doesn’t affect host
  • UTS namespace: Isolated hostname, domainname
  • IPC namespace: Private semaphores, message queues, shared memory
  • User namespace: UID/GID remapping (container root = unprivileged host user)

Detailed explanation:

Namespaces - The foundation of container isolation:

What it is: Namespaces make each container think it’s running on its own dedicated machine. Processes inside a container can’t see (or affect) processes in other containers or on the host.

How Docker/Kubernetes use namespaces:

Docker run creates:
1. New PID namespace → container sees PID 1 as its own init
2. New network namespace → container gets its own lo, eth0
3. New mount namespace → container sees its own /etc, /usr, /var
4. New UTS namespace → container has its own hostname
5. New IPC namespace → container's shared memory isolated
6. New user namespace → container root ≠ host root (optional)

Result: Container thinks it's a separate machine
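
You can reproduce the core of this with unshare alone (needs root; a sketch, not a full container runtime):

sudo unshare --pid --fork --mount-proc --uts --net --ipc bash
hostname demo && hostname     # UTS: hostname change stays inside the namespace
ps aux                        # PID: bash is PID 1, host processes are invisible
ip link                       # NET: only an isolated, down loopback interface
exit                          # last process exits, namespaces are destroyed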

Visual diagram of namespace isolation:

HOST SYSTEM (one real Linux kernel)
│
├── Container A (namespace set #1)
│     PID namespace:    PID 1: nginx,     PID 2: worker    (isolated PIDs)
│     Net namespace:    eth0: 172.17.0.2, lo: 127.0.0.1    (own IP stack)
│     Mount namespace:  /: overlay2, own /etc and /var     (own root FS)
│
├── Container B (namespace set #2)
│     PID namespace:    PID 1: postgres,  PID 2: worker    (isolated PIDs)
│     Net namespace:    eth0: 172.17.0.3, lo: 127.0.0.1    (own IP stack)
│     Mount namespace:  /: overlay2, own /etc and /var     (own root FS)
│
└── ACTUAL KERNEL RESOURCES behind both containers
      PIDs:    1234 (nginx), 1235 (worker), 1236 (postgres), ...
      Network: real NICs (eth0), bridges (docker0), veth pairs
      Mounts:  /var/lib/docker/overlay2/abc123, /var/lib/docker/...

Key insight:
- Container A's PID 1 → Real kernel PID 1234
- Container B's PID 1 → Real kernel PID 1236
- Containers can't see each other's PIDs, networks, or filesystems

1. PID Namespace - Process isolation:

What it is: Each PID namespace has its own process tree starting at PID 1. Processes in the namespace only see other processes in the same namespace.

Important distinction: PID Namespace vs PID (Process ID):

Many people confuse these two concepts. Let’s clarify:

PID (Process ID):

  • A number assigned to a running process
  • Example: nginx process has PID 1234
  • Every running process has a PID
  • PIDs are unique within a namespace

PID Namespace:

  • An isolation mechanism (like a container)
  • Groups processes together
  • Each namespace has its own PID numbering starting from 1
  • Same process can have different PID numbers in different namespaces

Visual comparison:

Concept: PID (Process ID)

  What it is: a NUMBER assigned to a running process

  Example:
    $ ps aux
    PID   USER      COMMAND
    1234  root      nginx: master process
    1235  www       nginx: worker process
    1236  postgres  postgres -D /var/lib/postgresql
      ↑ these are PIDs (just numbers)

  Analogy: like a house number (123 Main Street); the number identifies the house

Concept: PID Namespace

  What it is: an ISOLATION CONTAINER for processes; each namespace has its own PID numbering, starting from 1

  Example:
    PID Namespace A (Container 1)      PID Namespace B (Container 2)
      PID 1: nginx                       PID 1: postgres
      PID 2: worker                      PID 2: worker
      PID 3: bash                        PID 3: bash

  Analogy: like different cities (New York vs Tokyo); each city has its own "123 Main Street", with the same street number but in a different city

Real-world example showing the difference:

# HOST SYSTEM (Default PID Namespace)
$ ps aux
PID   USER     COMMAND
1     root     /sbin/init                    ← Host init (real PID 1)
1234  root     nginx: master                 ← nginx in container A
1235  www      nginx: worker
1236  postgres postgres: main                ← postgres in container B
1237  postgres postgres: worker

# CONTAINER A (New PID Namespace)
$ docker exec -it container-a ps aux
PID   USER     COMMAND
1     root     nginx: master                 ← Same process as host PID 1234
2     www      nginx: worker                 ← Same process as host PID 1235

# CONTAINER B (Another PID Namespace)
$ docker exec -it container-b ps aux
PID   USER     COMMAND
1     postgres postgres: main                ← Same process as host PID 1236
2     postgres postgres: worker              ← Same process as host PID 1237

Key insight - Same process, multiple PIDs:

                 ACTUAL PROCESS (nginx master), running in the kernel
                          │                          │
                          ↓                          ↓
              How the HOST sees it:       How the CONTAINER sees it:
                    PID 1234                        PID 1

Same process = Different PID numbers depending on namespace viewing it

Why this matters:

  1. Isolation: Container A’s PID 1 can’t kill Container B’s PID 1 (different namespaces)
  2. Security: Container sees PID 1-3, can’t see host PIDs 1234-1237
  3. Debugging: When you see “PID 1” in container logs, it’s NOT the host’s PID 1

Common confusion resolved:

❌ Wrong thinking: “My container has PID 1, that means it’s the system’s init process”
✅ Correct thinking: “My container has PID 1 in its namespace. On the host, it’s probably PID 12345”

❌ Wrong: “I killed PID 1234 on host, why did the container die?”
✅ Correct: “PID 1234 on host is PID 1 in container namespace. Killing container’s PID 1 kills entire container”

Summary table:

Aspect        PID (Process ID)               PID Namespace
What it is    A number (identifier)          An isolation container
Example       1234, 1235, 1236               Container A, Container B, Host
Uniqueness    Unique within a namespace      Each namespace is separate
Purpose       Identify a specific process    Isolate groups of processes
Analogy       House number (123)             City/neighborhood (New York, Tokyo)
Created by    Kernel when process starts     unshare(CLONE_NEWPID) or Docker
Lifetime      Until process exits            Until last process in namespace exits

Real-world example:

# On host
$ ps aux | grep nginx
root      1234  0.0  0.1  nginx: master process
www-data  1235  0.0  0.1  nginx: worker process

# Inside container
$ ps aux
PID   USER     COMMAND
1     root     nginx: master process    ← This is PID 1234 on host
2     www-data nginx: worker process    ← This is PID 1235 on host

Container sees: PID 1, 2
Host sees: PID 1234, 1235
Same process, different PID numbers in different namespaces

Why PID 1 matters:

  • PID 1 in Unix is special: it’s the init process
  • When PID 1 exits, kernel kills all processes in that namespace
  • Docker container exits when PID 1 (entrypoint) exits

Diagram of PID namespace mapping:

Container PID Namespace               Host PID Namespace

  PID 1 (nginx)    ──────────────→      PID 1234 (nginx)
  PID 2 (worker)   ──────────────→      PID 1235 (worker)
  PID 3 (bash)     ──────────────→      PID 1236 (bash)

  Can only see PIDs 1, 2, 3             Can see ALL PIDs: 1, 1234, 1235, 1236

Security benefit: Container process can’t send signals (kill) to host processes because it can’t see their PIDs.
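
To translate between the two views on a Docker host (the container name is just an example):

# Host PID of the container's PID 1
docker inspect --format '{{.State.Pid}}' container-a

# Compare namespaces: same inode number = same namespace
sudo readlink /proc/$(docker inspect --format '{{.State.Pid}}' container-a)/ns/pid
sudo readlink /proc/1/ns/pid

# Or list all PID namespaces with their member processes
lsns -t pid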

2. Network Namespace - Network stack isolation:

What it is: Each network namespace has its own network interfaces, IP addresses, routing tables, firewall rules, and sockets.

How containers get networking:

Step 1: Create network namespace
  → Container gets isolated network stack (empty)
  → No interfaces except loopback (lo)

Step 2: Create veth pair (virtual ethernet cable)
  → One end in container namespace (eth0)
  → Other end in host namespace (vethXXX)

Step 3: Connect host end to bridge (docker0)
  → Now container can reach host and other containers

Step 4: Configure NAT on host
  → Container can reach internet (host translates addresses)
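
The same four steps by hand with the ip tool (the names and the 10.200.0.0/24 addresses are made up for this sketch):

sudo ip netns add demo
sudo ip link add veth-host type veth peer name veth-demo
sudo ip link set veth-demo netns demo
sudo ip addr add 10.200.0.1/24 dev veth-host && sudo ip link set veth-host up
sudo ip netns exec demo ip addr add 10.200.0.2/24 dev veth-demo
sudo ip netns exec demo ip link set veth-demo up
sudo ip netns exec demo ip link set lo up
sudo ip netns exec demo ip route add default via 10.200.0.1
sudo ip netns exec demo ping -c1 10.200.0.1     # namespace-to-host works
# (Internet access additionally needs IP forwarding + NAT on the host)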

Visual diagram of container networking:

HOST
  Container A netns                     Container B netns
    eth0: 172.17.0.2                      eth0: 172.17.0.3
    gateway: 172.17.0.1                   gateway: 172.17.0.1
        │  veth pair                          │  veth pair
        │  (virtual cable)                    │  (virtual cable)
        ↓                                     ↓
    docker0 bridge (172.17.0.1)
    (virtual switch connecting the containers)
        │
        │  NAT (iptables/nftables)
        ↓
    eth0 (host physical NIC, public IP)
        │
        ↓
    Internet

Data flow example:
Container A (172.17.0.2) → docker0 bridge → NAT (source IP changed to host IP) → eth0 → Internet

Real-world example:

# On host
$ ip addr show docker0
docker0: <BROADCAST,MULTICAST,UP>
    inet 172.17.0.1/16 scope global docker0

$ ip addr show veth1a2b3c
veth1a2b3c: <BROADCAST,MULTICAST,UP>
    (connected to container A's eth0)

# In container
$ ip addr show eth0
eth0: <BROADCAST,MULTICAST,UP>
    inet 172.17.0.2/16 scope global eth0
    (connected to host's veth1a2b3c)

$ ip route
default via 172.17.0.1 dev eth0
(Container routes all traffic through docker0)

Security benefit: Container can’t sniff packets from other containers. Each has isolated network stack.

3. Mount Namespace - Filesystem isolation:

What it is: Each mount namespace sees its own filesystem tree. Mounting a filesystem in one namespace doesn’t affect others.

Real-world example:

# On host
$ mount | grep /var/lib/docker
overlay on /var/lib/docker/overlay2/abc123/merged type overlay

# In container
$ mount
overlay on / type overlay (rw,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k)
proc on /proc type proc (rw,nosuid,nodev,noexec)

Container sees "/" as root
Host sees it as /var/lib/docker/overlay2/abc123/merged

Diagram of mount namespace:

Container Mount Namespace             Host Mount Namespace

  /                                     /
  ├── /etc                              ├── /etc  (real host /etc)
  ├── /usr                              ├── /usr
  ├── /var                              ├── /var
  └── /app   (container FS)             └── /var/lib/docker/overlay2/
                                              ├── abc123/merged/  ← Container A root
                                              └── def456/merged/  ← Container B root

When container reads /etc/passwd:
  → Container sees /etc/passwd (in its namespace)
  → Kernel translates to /var/lib/docker/overlay2/abc123/merged/etc/passwd
  → Container can't access host's real /etc/passwd

Security benefit: Container can’t modify host filesystem. Even if it writes to /etc, it’s writing to its own overlay, not host’s /etc.
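
A minimal demonstration with unshare (util-linux makes the new namespace's mounts private by default):

sudo unshare --mount bash
mount -t tmpfs tmpfs /mnt
findmnt /mnt                  # visible inside this namespace
exit
findmnt /mnt                  # nothing: the host never saw the mount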

4. UTS Namespace - Hostname isolation:

What it is: Each UTS namespace can have its own hostname and domain name.

Real-world example:

# On host
$ hostname
production-server-01

# In container A
$ hostname
web-container

# In container B
$ hostname
db-container

Each container has its own hostname
Host hostname unchanged

Why it matters: Applications that log hostnames (like distributed systems) can identify which container logged what, even when multiple containers run same image.

5. IPC Namespace - Shared memory isolation:

What it is: Each IPC namespace has isolated System V IPC objects (shared memory segments, semaphores, message queues).

Real-world example:

# On host
$ ipcs -m
Shared Memory Segments
key        shmid      owner      bytes
0x00000000 32768     postgres   16777216

# In container
$ ipcs -m
Shared Memory Segments
key        shmid      owner      bytes
0x00000000 65536     postgres   8388608

Different shared memory segments
Container can't access host's shared memory

Why it matters: Prevents containers from using shared memory to communicate (potential side channel for attacks).

6. User Namespace - UID/GID remapping (most complex):

What it is: Maps user IDs inside container to different IDs on host. Container root (UID 0) can be mapped to unprivileged user on host (UID 100000).

Real-world example:

# In container (user namespace enabled)
$ id
uid=0(root) gid=0(root) groups=0(root)
$ whoami
root
$ touch /tmp/file
$ ls -la /tmp/file
-rw-r--r-- 1 root root 0 Oct 17 12:00 /tmp/file

# On host (looking at container's file)
$ ls -la /var/lib/docker/.../merged/tmp/file
-rw-r--r-- 1 100000 100000 0 Oct 17 12:00 file
                ↑      ↑
        Container UID 0 → Host UID 100000

Diagram of UID mapping:

Container User Namespace              Host User Namespace

  UID 0    (root)    ──────────────→     UID 100000
  UID 1    (daemon)  ──────────────→     UID 100001
  UID 1000 (user)    ──────────────→     UID 101000

  Container thinks the process          Host sees the container
  is root                               as unprivileged

Mapping configured in /etc/subuid and /etc/subgid

Security benefit (huge):

Without user namespaces:
  Container root (UID 0) = Host root (UID 0)
  If container escapes, attacker is real root
  Game over

With user namespaces:
  Container root (UID 0) → Host UID 100000 (unprivileged)
  If container escapes, attacker has UID 100000 (can't sudo, can't read /etc/shadow)
  Much safer

Why Docker doesn’t enable by default: Compatibility issues with volume mounts (file ownership gets confusing).
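
To see a real mapping, rootless Podman is the easiest demo (the UID values shown are typical, not universal):

# Inside the user namespace podman creates for you:
podman unshare cat /proc/self/uid_map
#        0       1000          1     → container UID 0 maps to your host UID
#        1     100000      65536     → UIDs 1+ map to the range from /etc/subuid

# The host-side allocation that makes this possible
grep "$USER" /etc/subuid /etc/subgid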

Pitfall 1: Leaking namespaces via socket files

  • If process dies but socket in netns remains, netns persists (hidden)
  • Fix: ip netns list shows active namespaces; delete: ip netns delete NETNS

Pitfall 2: User namespace security misconfiguration

  • Mapping container root (UID 0) to an unprivileged host user is fiddly (subuid/subgid ranges, volume ownership)
  • A bad mapping (e.g. to host UID 0, or overlapping ranges) removes the protection and can enable container escape
  • Fix: Prefer rootless Podman, which uses user namespaces by default, or enable userns-remap in the Docker daemon

Key metrics/tools:

lsns -t pid                    # List PID namespaces
ip netns list                  # List network namespaces
nsenter -t PID -a /bin/bash    # Enter all namespaces of PID
unshare -p -f /bin/bash        # Create new PID namespace
cat /proc/PID/ns/*             # Inode numbers (same inode = same namespace)

7. cgroups v2 (Resource Limits & Accounting)

What it is:

  • cgroup: Control group; limits/accounts CPU, memory, I/O, pids, network for process group
  • v2 unified hierarchy: Single tree (v1 had multiple controllers)
  • CPU limit: cpu.max=50000 100000 = 50% of one core
  • Memory limit: memory.max=1G enforces hard limit (OOM killer if exceeded)
  • PSI (Pressure Stall Info): Metrics on resource contention (CPU throttling, memory pressure, I/O wait)

Detailed explanation:

cgroups - How containers limit resources (and don’t crash the host):

What it is: A cgroup is a way to group processes and apply resource limits to the group as a whole.

Why it matters: Without cgroups, a runaway container process can use 100% CPU and starve all other processes on the host. With cgroups, you can say “this container gets max 2 CPU cores and 4GB RAM, period.”

Real-world example:

Host has 8 CPU cores, 32GB RAM

Container A (web server):
  - cgroup limits: 2 CPU cores, 4GB RAM
  - Actual usage: 1.5 cores, 2GB RAM → OK

Container B (batch job goes wild):
  - cgroup limits: 2 CPU cores, 4GB RAM
  - Tries to use: 8 cores, 10GB RAM
  - cgroup enforces: Only gets 2 cores max, throttled
  - Memory: Process killed (OOM) when exceeds 4GB

Host remains responsive: Containers can't steal unlimited resources

CPU limits - The two numbers explained:

Format: cpu.max=50000 100000

What it means:

  • First number (50000): CPU quota in microseconds per period
  • Second number (100000): Period length in microseconds

Math:

50000 microseconds / 100000 microseconds = 0.5 = 50% of one CPU core

100000 microseconds = 0.1 seconds = period resets 10 times per second

In each 0.1 second period:
- cgroup can use max 50000 microseconds (0.05 seconds) of CPU
- After using 50000µs, process is throttled until next period

More examples:

cpu.max=200000 100000 = 200% = 2 full CPU cores
cpu.max=400000 100000 = 400% = 4 full CPU cores
cpu.max=10000  100000 = 10%  = 0.1 CPU cores

What “throttled” means:

Process runs for 50ms → reaches quota → scheduler stops running it
Wait 50ms (rest of period) → new period starts → process can run again

Result: Bursty performance (run, pause, run, pause...)
Check: cat /sys/fs/cgroup/cpu.stat
  nr_throttled: 1234        ← Number of times throttled
  throttled_usec: 5000000   ← Microseconds spent throttled
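
A hands-on sketch of cpu.max (assumes cgroup v2 is mounted at /sys/fs/cgroup and the cpu controller is enabled for child groups via cgroup.subtree_control):

sudo mkdir /sys/fs/cgroup/demo
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max   # 50% of one core
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs          # move this shell in
yes > /dev/null &                     # generate load inside the cgroup
cat /sys/fs/cgroup/demo/cpu.stat      # nr_throttled / throttled_usec climb
kill %1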

Memory limits - Hard vs soft:

memory.max (hard limit):

memory.max=1G

Process allocates memory:
  500MB → OK
  800MB → OK
  1.1GB → OOM KILL (exceeds limit)

Kernel kills process immediately when limit exceeded
No warning, no grace period

memory.high (soft limit):

memory.high=1G
memory.max=2G

Process allocates memory:
  500MB → OK
  1.1GB → Kernel starts aggressive reclamation (swapping, cache eviction)
        → Process slows down but not killed
  2.1GB → OOM KILL (exceeded hard limit)

Think of memory.high as "warning track" before hard limit
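
The same pair of knobs in the cgroup v2 filesystem (a sketch; assumes the memory controller is enabled for child groups):

sudo mkdir /sys/fs/cgroup/web
echo 2G | sudo tee /sys/fs/cgroup/web/memory.max    # hard limit: OOM kill above this
echo 1G | sudo tee /sys/fs/cgroup/web/memory.high   # soft limit: heavy reclaim above this
echo $$ | sudo tee /sys/fs/cgroup/web/cgroup.procs  # move this shell (and children) in
cat /sys/fs/cgroup/web/memory.events                # low/high/max/oom/oom_kill counters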

Why use soft limits:

  • Gives application chance to release memory (GC, cache flush)
  • Avoids abrupt crashes
  • Production pattern: memory.high = 0.8 × memory.max

PSI (Pressure Stall Information) - Early warning system:

What it is: PSI tells you when processes are waiting for resources.

Example:

$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

$ cat /proc/pressure/memory
some avg10=15.50 avg60=8.23 avg300=3.45 total=125000
full avg10=5.00 avg60=2.10 avg300=0.80 total=50000

Interpretation:
  some avg10=15.50 → 15.5% of last 10 seconds, some process waited for memory
  full avg10=5.00  → 5% of last 10 seconds, ALL processes stalled (thrashing)

Action:
  some > 10% → Memory pressure building, consider adding RAM
  full > 1%  → Critical, processes constantly stalled

Why this is better than traditional metrics:

Traditional: free -h shows 100MB free → "Is that bad?"
PSI: avg10=0.00 → "No processes waiting, system fine despite low free RAM"

Traditional: free -h shows 10GB free → "Looks OK"
PSI: avg10=20.00 → "Processes waiting 20% of time, something wrong!"
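
A simple way to keep an eye on all three pressure files at once:

# Refresh CPU, memory, and I/O pressure every second (Ctrl-C to stop)
watch -n1 'head /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io'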

Unified hierarchy (v2) vs v1:

cgroups v1 problem: Each controller (CPU, memory, I/O) had separate hierarchies. You could have:

/sys/fs/cgroup/cpu/container1/
/sys/fs/cgroup/memory/container2/

Process in container1 for CPU, container2 for memory → confusing!

cgroups v2 solution: Single hierarchy:

/sys/fs/cgroup/container1/
  ├── cpu.max
  ├── memory.max
  ├── io.max
  └── pids.max

All controllers unified, process can't be in multiple groups

Pitfall 1: Memory limit too aggressive

  • Setting memory.max=1G for a 2GB app = guaranteed OOM kill
  • Fix: Set memory.high for soft limit (kernel reclaims); use memory.max as absolute last resort

Pitfall 2: Ignoring memory swap limit

  • If the cgroup is allowed swap, the process quietly spills to disk (slow) instead of staying within its RAM budget
  • Fix: In production, set memory.swap.max=0 on the cgroup (v2) to forbid swap for that workload

Key metrics/tools:

mount -t cgroup2               # Verify cgroup v2 mounted (usually at /sys/fs/cgroup)
cat /proc/PID/cgroup           # Show a PID's cgroup
echo "+memory +cpu +io" > /sys/fs/cgroup/cgroup.subtree_control  # Enable controllers for child groups
cat /sys/fs/cgroup/<group>/memory.stat   # Memory breakdown; memory.events counts OOM kills
docker inspect CONTAINER | grep -i memory  # Container limits

8. Linux Security Modules (LSM)

What it is:

  • SELinux: Type enforcement (TE); security contexts on files, processes, ports
  • AppArmor: Path-based access control (simpler than SELinux)
  • Capabilities: Fine-grained privileges; e.g., CAP_NET_ADMIN for network config
  • seccomp: Syscall filtering; restrict which syscalls a process can invoke

Pitfall 1: SELinux in enforcing mode causes mysterious failures

  • Default policies may block legitimate app behavior
  • Error: “Permission denied” but no clear root cause
  • Fix: Check audit2why for denials; use audit2allow to generate rules; test in permissive first

Pitfall 2: Overpermissive capabilities

  • Containers run with unnecessary capabilities (e.g., CAP_SYS_ADMIN = almost root)
  • Fix: Use docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE for a strict baseline (verify with the check below)
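
To verify what a process actually ended up with (the nginx target is just an example; capsh ships with libcap):

# Effective capability bitmask of a running process...
grep CapEff /proc/$(pgrep -o nginx)/status
# ...decoded into human-readable capability names
capsh --decode=$(grep CapEff /proc/$(pgrep -o nginx)/status | awk '{print $2}')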

Key metrics/tools:

getenforce                              # SELinux status (Disabled/Permissive/Enforcing)
setenforce 0                            # Switch to permissive (temp)
tail -f /var/log/audit/audit.log | grep denied  # Live denial logging
apparmor_status                         # AppArmor profiles
getcap /path/to/binary                  # Show binary capabilities
setcap cap_net_bind_service=+ep /bin/myapp  # Grant capability
grep Seccomp /proc/PID/status           # Seccomp mode of a process (0=off, 1=strict, 2=filter)

9. eBPF (In-Kernel Observability & Enforcement)

What it is:

  • eBPF: Sandboxed, verified VM in the kernel; run custom programs without writing kernel modules or recompiling
  • BCC (BPF Compiler Collection): Python front end; easy-to-use, ready-made eBPF tools
  • bpftrace: High-level eBPF language; one-liners for tracing
  • XDP (eXpress Data Path): Attach eBPF to NIC driver; ultra-low-latency packet processing
  • Tracing: Hook syscalls, kernel functions, events; track execution flow

Pitfall 1: eBPF programs have kernel resource limits

  • Large eBPF program → verifier rejection (“stack size too large”)
  • Maps too large → OOM or allocation failure
  • Fix: Use BCC/bpftrace for small, focused tracing; production tools (Cilium) handle complexity

Pitfall 2: Unbounded eBPF loops are rejected

  • The verifier rejects loops it cannot prove terminate; older kernels require #pragma unroll (fixed iteration count only)
  • Fix: On kernels ≥5.3 use bounded loops (or the bpf_loop helper on newer kernels) and keep iteration counts provably finite

Key metrics/tools:

bpftool prog show              # List loaded eBPF programs
bpftool map dump name MAPNAME  # Dump map contents
bpftrace -l                    # List available tracepoints
bpftrace -e 'tracepoint:syscalls:sys_enter_open* { printf("%s\n", str(args->filename)); }' -c "ls /"  # Trace file opens
sudo /usr/share/bcc/tools/opensnoop  # Trace open() syscalls
sudo /usr/share/bcc/tools/biotop     # Top processes by block I/O

Quick Troubleshooting Matrix

Symptom                                Suspect subsystem        First check                                   Quick fix
High load but low CPU%                 Scheduler OR I/O wait    iostat -x 1, vmstat (wa column)               Check disk/network; run perf record
OOM kills despite free RAM             Memory / cgroups         cat /proc/pressure/memory, cgroup v2 limits   Increase memory.max or add swap
Can’t create files                     VFS / inodes             df -i shows 100%                              Cleanup files or recreate fs with more inodes
Slow disk random I/O                   Block I/O scheduler      cat /sys/block/sda/queue/scheduler            Set to noop or mq-deadline on SSDs
Connection refused (random ports)      Networking / conntrack   cat /proc/net/stat/nf_conntrack               Increase nf_conntrack_max
Container can’t reach network          Namespaces               ip netns list, ip route show                  Check veth pair, routes in netns
Permission denied (no obvious reason)  LSM (SELinux/AppArmor)   getenforce, setenforce 0                      Temporarily disable, check /var/log/audit/audit.log
Can’t trace with eBPF                  eBPF / kernel version    uname -r (need ≥4.9), bpftool prog show       Upgrade kernel or use ftrace as fallback

One-Page Cheat: Critical Limits

# Memory
sysctl vm.swappiness=10                 # Reduce swapping
sysctl vm.watermark_scale_factor=10     # How early kswapd starts reclaiming (watermark spacing; default 10)
sysctl vm.max_map_count=262144          # Max memory maps per process

# Networking
sysctl net.core.somaxconn=65535         # Backlog for listening sockets
sysctl net.ipv4.tcp_max_syn_backlog=65535  # SYN backlog (DDoS mitigation)
sysctl net.netfilter.nf_conntrack_max=2000000  # Conntrack table

# File descriptors
ulimit -n 65535                         # Per-process FD limit

# Kernel debug
echo 1 > /proc/sys/kernel/perf_event_paranoid  # Relax perf restrictions for non-root users (lower = less restricted)

# cgroups
echo 262144 > /sys/fs/cgroup/<group>/pids.max  # Max PIDs in a cgroup

Further Reading