Executive Summary
Performance baseline = safe defaults that work for most workloads, with clear tuning for specific scenarios.
This guide covers:
- sysctl: Kernel parameters (network, filesystem, VM) with production-safe values
- ulimits: Resource limits (open files, processes, memory locks)
- CPU Governor: Frequency scaling & power management on servers
- NUMA: Awareness for multi-socket systems (big apps, databases)
- I/O Scheduler: NVMe/SSD vs. spinning disk tuning
1. sysctl Kernel Parameters
Why sysctl Matters
Problem: Default kernel parameters are conservative (sized for desktops and small systems, not busy servers)
Solution: Tune for your workload (databases, web servers, HPC)
Trade-off: Throughput vs. latency, and memory use vs. stability
Network Parameters (Most Critical)
What they do:
- somaxconn = accept-queue backlog for listening sockets (completed connections waiting for accept(); the SYN queue is tcp_max_syn_backlog)
- tcp_fin_timeout = how long to hold connections in FIN_WAIT state
- ip_local_port_range = available ephemeral ports (client-side)
- tcp_tw_reuse = reuse TIME_WAIT connections (for clients)
You can record the current values before tuning, as shown below.
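Before changing anything, it helps to record what the kernel is currently using; a read-only check with plain sysctl:
# Record current values before tuning (read-only)
sysctl net.core.somaxconn net.ipv4.tcp_fin_timeout \
       net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse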
Production-safe defaults:
# /etc/sysctl.d/99-production.conf
# ===== NETWORK TUNING =====
# Listening socket backlog (default 128, too low for high concurrency)
# Increase to handle traffic spikes
net.core.somaxconn = 32768
# TCP backlog (for SYN flood mitigation & high concurrency)
net.ipv4.tcp_max_syn_backlog = 32768
# How long connections linger in FIN_WAIT state (default 60s)
# Lower for fast reconnections (e.g., load balancers); higher for stability
net.ipv4.tcp_fin_timeout = 30
# Ephemeral port range (default 32768-60999, only ~28k ports)
# Increase if you need many concurrent client connections
net.ipv4.ip_local_port_range = 10000 65535
# Reuse TIME_WAIT sockets for new outbound (client-side) connections
# WARNING: Only safe if no data in flight; use with caution
net.ipv4.tcp_tw_reuse = 1
# TCP keepalive (detect dead connections; seconds between probes)
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 600
# TCP retransmission timeout (adjust for high-latency networks)
# net.ipv4.tcp_retries2 = 15 # Default is good for most cases
# Nagle's algorithm (TCP_NODELAY) is a per-socket option, not a sysctl;
# latency-sensitive apps disable it themselves via setsockopt(TCP_NODELAY)
# ===== FILESYSTEM TUNING =====
# inotify: max watched files (default 8192, too low for large apps)
# Each watched file = inode; file descriptors for monitoring
fs.inotify.max_user_watches = 524288
# Max open file descriptors (system-wide; the default scales with RAM)
# Increase only if needed; each FD consumes kernel memory
fs.file-max = 2097152
# Dentry/inode cache reclaim pressure (filename → inode mappings)
# Usually fine at the default (100); lower values keep caches around longer
# vm.vfs_cache_pressure = 100
# ===== VM (MEMORY) TUNING =====
# Swappiness: lower = avoid swapping anonymous memory; higher = swap more readily
# (0 does not fully disable swap). Default 60; for databases/low-latency use 1-10
# For general servers with plenty of RAM, 10 is a reasonable choice
vm.swappiness = 10
# Watermark scale: controls when kswapd wakes up (memory reclaim)
# Higher = reclaim earlier (keep more free); default 10
# Increase if OOM killer triggered too late
vm.watermark_scale_factor = 50
# Memory overcommit: 0=heuristic (default), 1=always allow, 2=strict accounting
# 1 suits fork-heavy apps (e.g., Redis); 2 is most predictable but needs adequate swap
vm.overcommit_memory = 1
# Dirty page writeback interval (centiseconds between writeback wakeups)
# Lower = more frequent flushes (less data at risk); higher = more batching
vm.dirty_writeback_centisecs = 500 # 5 seconds (the default)
# Dirty page ratio: when to start writeback (% of RAM)
# Default 20; lower = more frequent writes; higher = batching
vm.dirty_ratio = 10
# Background dirty ratio: when dirty pages exceed this % of RAM,
# start asynchronous writeback in the background (default 10)
vm.dirty_background_ratio = 5
# ===== TRANSPARENT HUGEPAGES (THP) =====
# THP is not a sysctl: it is controlled via
#   /sys/kernel/mm/transparent_hugepage/enabled  (always | madvise | never)
#   /sys/kernel/mm/transparent_hugepage/defrag
# or the transparent_hugepage= kernel boot parameter.
# Many databases (MongoDB, Redis, often PostgreSQL) recommend "never" because
# defrag can cause latency spikes; see the Quick Reference section for the commands.
# ===== KERNEL HARDENING =====
# Disable core dumps from setuid programs (good for security)
fs.suid_dumpable = 0
# Restrict ptrace (prevent debugging other processes)
kernel.yama.ptrace_scope = 2
# Restrict eBPF (prevent user eBPF; only root/admin)
kernel.unprivileged_bpf_disabled = 1
# ===== MISCELLANEOUS =====
# Max memory maps per process (default 65530)
# Increase for JVM/large memory apps
vm.max_map_count = 262144
# TCP performance tweaks
net.ipv4.tcp_slow_start_after_idle = 0 # Don't reset cwnd after idle (throughput)
net.core.default_qdisc = fq # Pairs well with BBR
net.ipv4.tcp_congestion_control = bbr # Google BBR (requires the tcp_bbr module)
Apply safely:
# Copy to sysctl.d/
sudo cp 99-production.conf /etc/sysctl.d/
# Note: sysctl has no true dry-run; spot-check current values first
# (a snapshot/rollback sketch follows this block)
sysctl vm.swappiness fs.file-max
# Apply
sudo sysctl -p /etc/sysctl.d/99-production.conf
# Or reload all
sudo sysctl --system
# Verify
sudo sysctl -a | grep somaxconn
sysctl net.core.somaxconn
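If you want an easy rollback path, you can snapshot the current value of every key in the file before applying; a minimal sketch (the backup path is arbitrary):
# Snapshot current values of the keys you are about to change (for rollback)
grep -Ev '^[[:space:]]*(#|$)' /etc/sysctl.d/99-production.conf | cut -d= -f1 \
  | xargs sudo sysctl | sudo tee /root/sysctl-backup-$(date +%F).conf
# Roll back later with: sudo sysctl -p /root/sysctl-backup-<date>.conf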
Scenario-Specific Tuning
High-traffic web server (nginx, Apache):
# Maximize listening socket backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Many ephemeral ports (don't worry about running out)
net.ipv4.ip_local_port_range = 1024 65535
# Fast time-wait reuse (safe: no client-server persistence)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
# Nagle (TCP_NODELAY) is disabled per-socket by the web server, not via a sysctl
# Writeback tuning (batching for throughput)
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
Low-latency service (trading API, real-time apps):
# Conservative memory tuning (avoid swaps)
vm.swappiness = 1
# Disable THP (latency spikes); set via sysfs, not sysctl:
# echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Fast memory reclaim
vm.watermark_scale_factor = 100
# Disable TIME_WAIT reuse (safer, no reuse of stale connections)
net.ipv4.tcp_tw_reuse = 0
# Tune for latency, not throughput
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
Database server (PostgreSQL, MySQL):
# Minimize swapping (this does not disable swap outright; 1 rather than 0,
# since swappiness=0 can invite OOM kills under memory pressure)
vm.swappiness = 1
# Disable THP (most database vendors recommend); set via sysfs, see Quick Reference:
# echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Conservative (strict) memory overcommit (see the note after this block)
vm.overcommit_memory = 2
# Increase max memory maps (shared buffers, connections)
vm.max_map_count = 262144
# Faster writeback (minimize dirty pages in memory)
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
# Many connections (ephemeral ports)
net.ipv4.ip_local_port_range = 10000 65535
# Backlog for many connections
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 32768
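One caveat on the strict overcommit setting above: with vm.overcommit_memory = 2 the commit limit is swap plus overcommit_ratio percent of RAM, so hosts with little swap usually need the ratio raised as well; a sketch (the 80 is illustrative, not a vendor recommendation):
# Companion setting for vm.overcommit_memory = 2
# CommitLimit = swap + (overcommit_ratio / 100) * RAM
vm.overcommit_ratio = 80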
2. Resource Limits (ulimits)
Why ulimits Matter
Problem: Process hits limit (e.g., max 1024 open files) → app crashes
Solution: Increase limits based on workload
Trade-off: More resources per process vs. system-wide quota
Critical Limits
What they do:
- nofile = max open files (file descriptors)
- nproc = max processes/threads per user
- memlock = max locked memory (for real-time apps, databases using hugepages)
- msgqueue = max POSIX message queue size
- as = max virtual memory (address space) per process
A quick way to check what a running process is actually granted is shown below.
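To inspect the limits of a shell or an already-running process, a read-only check with prlimit from util-linux (the PID below is a placeholder):
# Limits of the current shell
prlimit --pid $$ --nofile --nproc --memlock
# Limits of a running process (replace 1234 with the real PID)
sudo prlimit --pid 1234 --nofile --nproc --memlock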
Set globally (/etc/security/limits.conf):
# /etc/security/limits.conf
# Format: domain type item value
# ===== FOR ALL USERS =====
# Soft limit: warning; can be increased by user
# Hard limit: absolute ceiling; requires admin to increase
* soft nofile 65536 # Open files
* hard nofile 65536 # Max open files
* soft nproc 32768 # Processes per user
* hard nproc 32768 # Max processes
* soft memlock unlimited # Locked memory (for hugepages)
* hard memlock unlimited
* soft msgqueue 819200 # Message queue size
* hard msgqueue 819200
* soft sigpending 32768 # Pending signals
* hard sigpending 32768
# ===== FOR SPECIFIC USERS =====
# Database user
postgres soft nofile 65536
postgres hard nofile 65536
postgres soft nproc 32768
postgres hard nproc 32768
postgres soft memlock unlimited
postgres hard memlock unlimited
# App user
appuser soft nofile 16384
appuser hard nofile 32768
appuser soft nproc 4096
appuser hard nproc 8192
# Root (be careful!)
root soft nofile 65536
root hard nofile unlimited
root soft nproc unlimited
root hard nproc unlimited
Apply:
# Method 1: Edit /etc/security/limits.conf (system-wide)
sudo tee -a /etc/security/limits.conf > /dev/null << 'LIMITS'
* soft nofile 65536
* hard nofile 65536
LIMITS
# Method 2: Per-service (systemd)
sudo mkdir -p /etc/systemd/system/myapp.service.d
sudo tee /etc/systemd/system/myapp.service.d/limits.conf > /dev/null << 'SLIMITS'
[Service]
LimitNOFILE=65536
LimitNPROC=32768
LimitMEMLOCK=infinity
SLIMITS
# Reload
sudo systemctl daemon-reload
sudo systemctl restart myapp
# Verify (as the user)
ulimit -a
# or
cat /proc/PID/limits
Verify limits are applied:
# Check current limits (for current user)
ulimit -a
# Check specific limit
ulimit -n # Open files
ulimit -u # Processes
# Check for running process
cat /proc/PID/limits
# Output:
# Limit Soft Limit Hard Limit Units
# Max cpu time unlimited unlimited seconds
# Max file size unlimited unlimited bytes
# Max data size unlimited unlimited bytes
# Max stack size 8388608 unlimited bytes
# Max core file size 0 unlimited bytes
# Max resident set unlimited unlimited bytes
# Max processes 32768 32768 processes
# Max open files 65536 65536 files
# Max locked memory unlimited unlimited bytes
Scenario-Specific Limits
Web server (nginx, Apache):
www-data soft nofile 65536
www-data hard nofile 65536
www-data soft nproc 32768
www-data hard nproc 32768
Database (PostgreSQL, MySQL):
postgres soft nofile 65536
postgres hard nofile 65536
postgres soft nproc 32768
postgres hard nproc 32768
postgres soft memlock unlimited # For shared buffers
postgres hard memlock unlimited
JVM application:
appuser soft nofile 65536
appuser hard nofile 65536
appuser soft nproc 32768
appuser hard nproc 32768
appuser soft memlock unlimited
appuser hard memlock unlimited
appuser soft as unlimited # Virtual memory (for heap)
appuser hard as unlimited
3. CPU Governor & Frequency Scaling
Why CPU Governor Matters
Problem: The default governor (often powersave or schedutil) favors power savings and can add latency under bursty server load
Solution: Use the "performance" governor for consistent latency
Trade-off: Maximum performance vs. power consumption
Check Current Governor
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: powersave, powersave, ...
# Check available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Output: performance powersave
# Check CPU frequencies
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Output: 1200000 (in kHz)
# More detail: cpupower tool
cpupower frequency-info
Set Performance Governor (Persistent)
Method 1: cpupower (one-shot, not persistent across reboots):
# Install (package names vary by distro)
sudo apt install linux-tools-generic
# or
sudo dnf install kernel-tools
# Set performance on all cores
sudo cpupower frequency-set -g performance
# Verify
cpupower frequency-info
# Set back to powersave
sudo cpupower frequency-set -g powersave
Method 2: GRUB bootloader (persistent):
# Edit GRUB
sudo vi /etc/default/grub
# Add to the kernel command line (passive mode exposes the generic cpufreq
# governors, which you then set with cpupower or the systemd unit below)
GRUB_CMDLINE_LINUX="... intel_pstate=passive" # For Intel
# or
GRUB_CMDLINE_LINUX="... amd_pstate=passive" # For AMD
# Update GRUB
sudo grub-mkconfig -o /boot/grub/grub.cfg
# Reboot
sudo reboot
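On newer kernels (roughly 5.9+) the default governor can also be selected directly on the kernel command line; a sketch, assuming your kernel was built with that option:
# Alternative: pick the default cpufreq governor at boot (verify kernel support first)
GRUB_CMDLINE_LINUX="... cpufreq.default_governor=performance"
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot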
Method 3: systemd (modern, recommended):
# Create service
sudo tee /etc/systemd/system/cpu-perf-governor.service > /dev/null << 'CPUPERF'
[Unit]
Description=Set CPU Governor to Performance
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
CPUPERF
# Enable & start
sudo systemctl daemon-reload
sudo systemctl enable cpu-perf-governor
sudo systemctl start cpu-perf-governor
# Verify
cpupower frequency-info
Intel vs. AMD Frequency Scaling
Intel (modern):
- Default: intel_pstate driver (hardware-assisted)
- For max performance: use the performance governor; for consistent latency, some shops also disable turbo boost or switch to passive mode
- Check: cat /sys/devices/system/cpu/intel_pstate/status
AMD (modern):
- Default: amd_pstate driver (newer, more efficient)
- For max performance: set the governor to performance
- Check: cat /sys/devices/system/cpu/amd_pstate/status
Fallback (older systems):
- Use cpufreq-set (if available)
- Or set via sysfs:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
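With intel_pstate or amd_pstate in active mode, the per-CPU energy_performance_preference knob also influences latency; a read-mostly sketch (the files only exist where the driver exposes EPP):
# Driver status and energy/performance preference, if exposed
cat /sys/devices/system/cpu/intel_pstate/status 2>/dev/null
cat /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference 2>/dev/null | sort | uniq -c
# Prefer performance on CPUs that expose the knob
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference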
4. I/O Scheduler (NVMe/SSD Tuning)
Why I/O Scheduler Matters
Problem: Wrong scheduler adds latency, reduces throughput
Solution: Match scheduler to device type
Trade-off: Latency vs. throughput optimization
Check Current Scheduler
# Show all block devices
lsblk
# Check scheduler for /dev/sda
cat /sys/block/sda/queue/scheduler
# Output (modern kernel): none [mq-deadline] kyber bfq
# Brackets = current scheduler
# Check for NVMe
cat /sys/block/nvme0n1/queue/scheduler
# Output: [none] mq-deadline
Scheduler Selection
Device Type | Scheduler | Reason |
---|---|---|
NVMe | none or mq-deadline | Device handles scheduling; minimal overhead |
SSD (SATA) | none or mq-deadline | Fast random access; no seek optimization needed |
Spinning Disk | mq-deadline or bfq | Request ordering mitigates seek time |
Legacy (< 5.0 kernel) | noop / deadline / CFQ | Single-queue schedulers; removed in kernel 5.0 |
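To check every block device at once, a small read-only loop over sysfs:
# Print the active scheduler (in brackets) for each block device
for f in /sys/block/*/queue/scheduler; do
  dev=$(basename "$(dirname "$(dirname "$f")")")
  printf '%-12s %s\n' "$dev" "$(cat "$f")"
done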
Set I/O Scheduler
Method 1: Temporary (runtime):
# Set for /dev/sda
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler
# Verify
cat /sys/block/sda/queue/scheduler
Method 2: Persistent (GRUB; legacy kernels only):
# Edit GRUB
sudo vi /etc/default/grub
# Add (note: elevator= only works on pre-5.0 single-queue kernels;
# on modern blk-mq kernels use the udev rule in Method 3 instead)
GRUB_CMDLINE_LINUX="... elevator=deadline"
# Update & reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
Method 3: udev rule (persistent):
# Create rule
sudo tee /etc/udev/rules.d/60-iosched.rules > /dev/null << 'UDEVRULE'
# Set scheduler for NVMe
SUBSYSTEM=="block", KERNEL=="nvme*", ATTR{queue/scheduler}="mq-deadline"
# Set scheduler for SSD (/dev/sda, /dev/sdb)
SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"
# Set for virtio (KVM/QEMU)
SUBSYSTEM=="block", KERNEL=="vd*", ATTR{queue/scheduler}="mq-deadline"
UDEVRULE
# Reload rules
sudo udevadm control --reload-rules
sudo udevadm trigger
5. NUMA Awareness (Big Systems)
Why NUMA Matters
Problem: Multi-socket systems (e.g., 2 × 64 cores) have local & remote RAM
- Local access: ~100 ns; remote (cross-socket) access is typically 1.5-2x slower
Solution: Bind processes/memory to local NUMA nodes
Benefit: Better cache locality, predictable latency
Check NUMA Configuration
# List NUMA nodes
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 ... 31
# node 0 size: 256000 MB
# node 0 free: 100000 MB
# node 1 cpus: 32 33 34 35 ... 63
# node 1 size: 256000 MB
# node 1 free: 150000 MB
# Or use lscpu
lscpu | grep -i numa
# NUMA node0 CPU(s): 0-31
# NUMA node1 CPU(s): 32-63
Bind Process to NUMA Node
Method 1: numactl (one-time):
# Bind to node 0 (CPUs 0-31)
numactl --cpunodebind=0 --membind=0 /path/to/app
# Bind to node 1
numactl --cpunodebind=1 --membind=1 /path/to/app
# Verify (in another terminal)
numastat -p PID # Per-node memory usage for the process
ps -o pid,psr,cmd -p PID # PSR = current CPU (shows which node it is running on)
Method 2: systemd service:
[Service]
ExecStart=/path/to/app
# Bind to NUMA node 0
CPUAffinity=0-31
NUMAPolicy=bind
NUMAMask=0
# Or, NUMA node 1
# CPUAffinity=32-63
# NUMAMask=1
Method 3: Container (Docker/Kubernetes):
# Docker: pin CPUs and memory to NUMA node 0
docker run --cpuset-cpus=0-31 --cpuset-mems=0 myapp
# Kubernetes: no direct cpuset flag; use the kubelet static CPU Manager policy
# with a Guaranteed QoS pod (integer CPU requests == limits) to get pinned CPUs
NUMA-Aware Application Design
For big apps (databases, data processing):
- Start N worker threads/processes, one per NUMA node
- Bind each to its local node (CPUs + memory)
- Minimize cross-node memory access
- Example: PostgreSQL with parallel_workers, each bound to its local node (see the sketch below)
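As a sketch of that per-node worker pattern (the worker binary and its --shard flag are placeholders, not a PostgreSQL recipe):
# Launch one worker per NUMA node, each bound to local CPUs and memory
nodes=$(numactl --hardware | awk '/available:/ {print $2}')
for n in $(seq 0 $((nodes - 1))); do
  numactl --cpunodebind="$n" --membind="$n" /path/to/worker --shard "$n" &
done
wait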
Check memory locality (after running app):
# Show memory by NUMA node
numastat
# Per_node and system-wide stats
# Show per-process memory distribution
cat /proc/PID/numa_maps
# Tells you which nodes have app's memory
Performance Baseline Checklist
Pre-Deployment
- sysctl tuned for workload (web, DB, real-time)
- /etc/sysctl.d/99-production.conf deployed & verified
- ulimits set (nofile, nproc, memlock) per-service
- CPU governor set to performance (if server)
- I/O scheduler matched to device type (NVMe → none, SSD → mq-deadline)
- THP disabled if database vendor recommends
- NUMA binding tested (if multi-socket system)
- Baseline performance measured (latency, throughput)
Post-Deployment
- sysctl applied: sysctl --system succeeds
- ulimits verified: ulimit -a shows expected values
- CPU governor active: cpupower frequency-info shows performance
- I/O scheduler confirmed: cat /sys/block/*/queue/scheduler
- Performance matches baseline (no regressions)
- No swapping (if swappiness tuned): vmstat 1 5 → si/so near 0
- NUMA memory local: numastat shows > 95% local access
A combined verification script is sketched below.
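The post-deployment checks can also be run as one read-only script; a minimal sketch assembled from the commands above (adjust the sysctl keys to match your config):
#!/usr/bin/env bash
# Read-only post-deployment spot checks
set -u
echo "== sysctl =="; sysctl net.core.somaxconn vm.swappiness vm.max_map_count
echo "== ulimits (current shell) =="; ulimit -n; ulimit -u
echo "== CPU governor =="; cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "== I/O schedulers =="; grep . /sys/block/*/queue/scheduler
echo "== swap activity (si/so) =="; vmstat 1 5
echo "== NUMA locality =="; numastat 2>/dev/null || echo "numastat not installed"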
Ongoing Monitoring
- Weekly: Check CPU throttling (cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
- Weekly: Monitor disk I/O (iostat -x 1 5 → await, %util; svctm is deprecated in recent sysstat)
- Monthly: Measure application latency (p99, p95)
- Quarterly: Review sysctl changes (kernel updates may reset some)
Quick Reference: Production Defaults
# ===== sysctl (essential) =====
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 32768
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 10000 65535
fs.inotify.max_user_watches = 524288
vm.swappiness = 10
vm.max_map_count = 262144
# ===== ulimits (essential) =====
* soft nofile 65536
* hard nofile 65536
* soft nproc 32768
* hard nproc 32768
* soft memlock unlimited
* hard memlock unlimited
# ===== CPU Governor =====
sudo cpupower frequency-set -g performance
# ===== I/O Scheduler (NVMe: none or mq-deadline) =====
echo "mq-deadline" | sudo tee /sys/block/nvme0n1/queue/scheduler
# ===== Disable THP (for databases) =====
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# ===== NUMA Binding =====
numactl --cpunodebind=0 --membind=0 /path/to/app
Common Pitfalls
Pitfall 1: Excessive TCP TIME_WAIT Reuse
Problem: tcp_tw_reuse=1 + data in flight → packet reordering, connection resets
Fix: Only use for stateless services (load balancers, proxies); test thoroughly first
Pitfall 2: THP Defrag Causing Latency Spikes
Problem: Defragmentation pauses app for milliseconds
Fix: Disable THP for latency-sensitive workloads: echo never > /sys/kernel/mm/transparent_hugepage/enabled
Pitfall 3: I/O Scheduler Causing Jitter
Problem: CFQ on NVMe/SSD = added latency due to sorting overhead
Fix: Use none or mq-deadline for NVMe/SSD (noop only exists on pre-5.0 kernels)
Pitfall 4: ulimits Not Applied to Service
Problem: systemd service still has the default 1024 open files
Fix: Set LimitNOFILE= in /etc/systemd/system/SERVICE.service.d/limits.conf (not /etc/security/limits.conf, which only applies to PAM login sessions)
Pitfall 5: NUMA Remote Memory Access
Problem: App threads run on node 0, but memory is allocated from node 1 → slow
Fix: Start the app with numactl --membind=0; vm.zone_reclaim_mode=1 forces node-local reclaim but is generally discouraged for databases and file servers
Measurement & Validation
Measure Baseline Performance
Before tuning:
# Latency
ping -c 100 localhost | grep min/avg/max
# Network throughput
iperf3 -c 127.0.0.1 -t 10
# Disk I/O
fio --name=random-read --ioengine=libaio --direct=1 --rw=randread --bs=4k --size=1G --numjobs=4 --runtime=60 --group_reporting
After tuning:
# Compare same test; expect 5-30% improvement (depends on tuning)
# Examples:
# - somaxconn increase: 10-20% more concurrent connections
# - I/O scheduler change: 5-15% lower latency on random I/O
# - CPU governor: 5-10% faster requests (no frequency scaling)
# - THP disable: 50%+ lower latency spikes (for DBs)