Executive Summary

Performance baseline = safe defaults that work for most workloads, with clear tuning for specific scenarios.

This guide covers:

  • sysctl: Kernel parameters (network, filesystem, VM) with production-safe values
  • ulimits: Resource limits (open files, processes, memory locks)
  • CPU Governor: Frequency scaling & power management on servers
  • NUMA: Awareness for multi-socket systems (big apps, databases)
  • I/O Scheduler: NVMe/SSD vs. spinning disk tuning

1. sysctl Kernel Parameters

Why sysctl Matters

Problem: Default kernel parameters are conservative (sized for laptops and embedded systems)
Solution: Tune for your workload (databases, web servers, HPC)
Trade-off: Throughput vs. latency, memory use vs. stability

Network Parameters (Most Critical)

What they do:

  • somaxconn = backlog for listening sockets (the accept queue of completed connections; the SYN queue is sized separately)
  • tcp_fin_timeout = how long orphaned connections linger in FIN_WAIT_2
  • ip_local_port_range = available ephemeral ports (client-side)
  • tcp_tw_reuse = reuse TIME_WAIT connections (for clients)
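
A quick check that the backlog is actually the bottleneck, using standard ss and netstat counters:

# Listening sockets: Recv-Q = current accept-queue depth, Send-Q = its ceiling
ss -lnt

# Cumulative overflow/drop counters; growing numbers mean the backlog
# (somaxconn and/or the app's listen() argument) is too small
netstat -s | grep -i 'overflowed\|SYNs to LISTEN'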

Production-safe defaults:

# /etc/sysctl.d/99-production.conf

# ===== NETWORK TUNING =====

# Listening socket accept-queue backlog (default 128; 4096 since kernel 5.4)
# Increase to handle traffic spikes
net.core.somaxconn = 32768

# SYN queue size (half-open connections; helps under SYN floods & high concurrency)
net.ipv4.tcp_max_syn_backlog = 32768

# How long orphaned connections linger in FIN_WAIT_2 (default 60s)
# Lower for fast reconnections (e.g., load balancers); higher for stability
net.ipv4.tcp_fin_timeout = 30

# Ephemeral port range (default 32768-60999, only ~28k ports)
# Increase if you need many concurrent client connections
net.ipv4.ip_local_port_range = 10000 65535

# Reuse sockets in TIME_WAIT for new outgoing connections (requires TCP
# timestamps; affects outbound connections only)
# WARNING: Only safe if no data in flight; use with caution
net.ipv4.tcp_tw_reuse = 1

# TCP keepalive (detect dead connections: seconds idle before probing,
# interval between probes, and probe count)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
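# With these values an idle dead peer is detected after roughly
# 600s + 5 probes x 15s = ~675s; lower tcp_keepalive_time to detect faster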

# TCP retransmission attempts on an established connection before giving up
# net.ipv4.tcp_retries2 = 15  # Default is good for most cases

# Nagle's algorithm (batching of small packets) has no global sysctl switch;
# latency-sensitive apps disable it per-socket with the TCP_NODELAY option

# ===== FILESYSTEM TUNING =====

# inotify: max watched files per user (default 8192, too low for large apps)
# Each watch consumes kernel memory, so raise deliberately
fs.inotify.max_user_watches = 524288

# Max open file descriptors (system-wide; the default scales with RAM)
# Increase only if needed; each FD consumes kernel memory
fs.file-max = 2097152

# Dentry/inode cache reclaim pressure (filename → inode mappings)
# Usually fine at the default; raise above 100 only under memory pressure
# vm.vfs_cache_pressure = 100

# ===== VM (MEMORY) TUNING =====

# Swappiness: low values avoid swapping (0 is risky: it can invite the OOM
# killer under memory pressure); 100 = swap aggressively
# Default 60; for databases/low-latency, use 1-10
# For servers with plenty of RAM: 10 is a sane baseline
vm.swappiness = 10

# Watermark scale: controls when kswapd wakes up (memory reclaim)
# Higher = reclaim earlier (keeps more memory free); default 10
# Increase if processes stall in direct reclaim under memory pressure
vm.watermark_scale_factor = 50

# Memory overcommit: 0=heuristic (default), 1=always allow (risky), 2=strict
# For predictability (e.g., databases), use 2 (but provision swap)
vm.overcommit_memory = 0

# Dirty page writeback interval: how often flusher threads wake (centisecs)
# Lower = more frequent flushes (less data at risk); higher = batching
# Default 500 (5 seconds)
vm.dirty_writeback_centisecs = 500

# Dirty page ratio: % of RAM dirty before writers are throttled and forced
# to flush synchronously; default 20
vm.dirty_ratio = 10

# Background dirty ratio: % of RAM dirty before async writeback starts
# Default 10; keep well below vm.dirty_ratio
vm.dirty_background_ratio = 5
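# On large-RAM hosts percentage knobs are coarse; the byte-based variants
# vm.dirty_bytes / vm.dirty_background_bytes override the ratios when nonzero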

# ===== TRANSPARENT HUGEPAGES (THP) =====

# THP is not a sysctl; it is controlled via sysfs:
#   /sys/kernel/mm/transparent_hugepage/enabled  (always | madvise | never)
#   /sys/kernel/mm/transparent_hugepage/defrag   (always | defer | defer+madvise | madvise | never)
# Default is usually "madvise" or "always" depending on distro; many
# databases prefer "never" (latency spikes during defrag)
# Check vendor docs: MongoDB, Redis, PostgreSQL often recommend: never
# See the Quick Reference and Common Pitfalls for the echo commands and a
# persistent systemd unit

# ===== KERNEL HARDENING =====

# Core dumps of setuid binaries (0 = off; good for security). Disable dumps
# entirely with ulimit -c 0 or systemd LimitCORE=0
fs.suid_dumpable = 0

# Restrict ptrace (2 = only processes with CAP_SYS_PTRACE may attach)
kernel.yama.ptrace_scope = 2

# Restrict eBPF (prevent user eBPF; only root/admin)
kernel.unprivileged_bpf_disabled = 1

# ===== MISCELLANEOUS =====

# Max memory maps per process (default 65530)
# Increase for JVM/large memory apps
vm.max_map_count = 262144

# TCP performance tweaks (sysctl files do not allow trailing comments)
# Don't reset cwnd after idle (helps throughput on persistent connections)
net.ipv4.tcp_slow_start_after_idle = 0
# Google BBR congestion control (requires the tcp_bbr module)
net.ipv4.tcp_congestion_control = bbr

Apply safely:

# Copy to sysctl.d/
sudo cp 99-production.conf /etc/sysctl.d/

# There is no built-in dry-run; spot-check current values before applying
# (a fuller comparison sketch follows the verify commands below)
sysctl net.core.somaxconn

# Apply
sudo sysctl -p /etc/sysctl.d/99-production.conf

# Or reload all
sudo sysctl --system

# Verify
sudo sysctl -a | grep somaxconn
sysctl net.core.somaxconn
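
A minimal manual "dry-run" sketch, assuming the conf file contains only key = value lines and comments: print each key's current value and diff by eye before applying:

grep -Ev '^[[:space:]]*(#|$)' /etc/sysctl.d/99-production.conf | \
  while IFS='=' read -r key _; do
    sysctl "$(echo "$key" | xargs)"
  done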

Scenario-Specific Tuning

High-traffic web server (nginx, Apache):

# Maximize listening socket backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# Many ephemeral ports (don't worry about running out)
net.ipv4.ip_local_port_range = 1024 65535

# Fast TIME_WAIT reuse for outbound connections to upstreams
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Nagle (TCP_NODELAY) is a per-socket option, not a sysctl; nginx enables
# it via its tcp_nodelay directive (on by default)

# Writeback tuning (batching for throughput)
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

Low-latency service (trading API, real-time apps):

# Conservative memory tuning (avoid swaps)
vm.swappiness = 1

# Disable THP (latency spikes); THP is set via sysfs, not sysctl:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Reclaim earlier to avoid direct-reclaim latency stalls
vm.watermark_scale_factor = 100

# Disable TIME_WAIT reuse (safer, no reuse of stale connections)
net.ipv4.tcp_tw_reuse = 0

# Tune for latency, not throughput
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2

Database server (PostgreSQL, MySQL):

# Strongly discourage swapping (note: 0 does not disable swap and can
# invite the OOM killer under pressure; 1 is a safer floor)
vm.swappiness = 1

# Disable THP (most databases recommend); THP is set via sysfs, not sysctl:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Conservative memory overcommit
vm.overcommit_memory = 2

# Increase max memory maps (shared buffers, connections)
vm.max_map_count = 262144

# Faster writeback (minimize dirty pages in memory)
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2

# Many connections (ephemeral ports)
net.ipv4.ip_local_port_range = 10000 65535

# Backlog for many connections
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 32768
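
Scenario overrides can live in their own fragment; sysctl.d files apply in lexical order, so a file whose name sorts after 99-production.conf wins on conflicting keys (the filename below is illustrative):

sudo tee /etc/sysctl.d/99-zz-database.conf > /dev/null << 'DBTUNE'
vm.swappiness = 1
vm.overcommit_memory = 2
DBTUNE
sudo sysctl --system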

2. Resource Limits (ulimits)

Why ulimits Matter

Problem: Process hits limit (e.g., max 1024 open files) → app crashes
Solution: Increase limits based on workload
Trade-off: More resources per process vs. system-wide quota

Critical Limits

What they do:

  • nofile = max open files (file descriptors)
  • nproc = max processes/threads per user
  • memlock = max locked memory (for real-time, databases using hugepages)
  • msgqueue = max POSIX message queue size
  • as = max virtual memory per process
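
Limits can also be inspected and raised on a live process without a restart; a short sketch using prlimit from util-linux (PID is a placeholder):

prlimit --pid PID --nofile                    # show current soft/hard nofile
sudo prlimit --pid PID --nofile=65536:65536   # raise both at runtime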

Set globally (/etc/security/limits.conf):

# /etc/security/limits.conf
# Format: domain type item value

# ===== FOR ALL USERS =====

# Soft limit: the enforced value; a process may raise it up to the hard limit
# Hard limit: absolute ceiling; only root can raise it
# (keep comments on their own lines; pam_limits parses whole-line fields)

# Open files
*   soft    nofile      65536
*   hard    nofile      65536

# Processes per user
*   soft    nproc       32768
*   hard    nproc       32768

# Locked memory (for hugepages)
*   soft    memlock     unlimited
*   hard    memlock     unlimited

# POSIX message queue size (bytes)
*   soft    msgqueue    819200
*   hard    msgqueue    819200

# Pending signals
*   soft    sigpending  32768
*   hard    sigpending  32768

# ===== FOR SPECIFIC USERS =====

# Database user
postgres    soft    nofile      65536
postgres    hard    nofile      65536
postgres    soft    nproc       32768
postgres    hard    nproc       32768
postgres    soft    memlock     unlimited
postgres    hard    memlock     unlimited

# App user
appuser     soft    nofile      16384
appuser     hard    nofile      32768
appuser     soft    nproc       4096
appuser     hard    nproc       8192

# Root (be careful!)
root        soft    nofile      65536
root        hard    nofile      unlimited
root        soft    nproc       unlimited
root        hard    nproc       unlimited

Apply:

# Method 1: Edit /etc/security/limits.conf (system-wide)
sudo tee -a /etc/security/limits.conf > /dev/null << 'LIMITS'
*   soft    nofile      65536
*   hard    nofile      65536
LIMITS

# Method 2: Per-service (systemd)
sudo mkdir -p /etc/systemd/system/myapp.service.d
sudo tee /etc/systemd/system/myapp.service.d/limits.conf > /dev/null << 'SLIMITS'
[Service]
LimitNOFILE=65536
LimitNPROC=32768
LimitMEMLOCK=infinity
SLIMITS

# Reload
sudo systemctl daemon-reload
sudo systemctl restart myapp

# Verify (as the user)
ulimit -a
# or
cat /proc/PID/limits

Verify limits are applied:

# Check current limits (for current user)
ulimit -a

# Check specific limit
ulimit -n          # Open files
ulimit -u          # Processes

# Check for running process
cat /proc/PID/limits
# Output:
# Limit                     Soft Limit           Hard Limit           Units
# Max cpu time              unlimited            unlimited            seconds
# Max file size             unlimited            unlimited            bytes
# Max data size             unlimited            unlimited            bytes
# Max stack size            8388608              unlimited            bytes
# Max core file size        0                    unlimited            bytes
# Max resident set          unlimited            unlimited            bytes
# Max processes             32768                32768                processes
# Max open files            65536                65536                files
# Max locked memory         unlimited            unlimited            bytes

Scenario-Specific Limits

Web server (nginx, Apache):

www-data    soft    nofile      65536
www-data    hard    nofile      65536
www-data    soft    nproc       32768
www-data    hard    nproc       32768

Database (PostgreSQL, MySQL):

postgres    soft    nofile      65536
postgres    hard    nofile      65536
postgres    soft    nproc       32768
postgres    hard    nproc       32768
# memlock unlimited: for shared buffers / hugepages
postgres    soft    memlock     unlimited
postgres    hard    memlock     unlimited

JVM application:

appuser     soft    nofile      65536
appuser     hard    nofile      65536
appuser     soft    nproc       32768
appuser     hard    nproc       32768
appuser     soft    memlock     unlimited
appuser     hard    memlock     unlimited
# as = address space (virtual memory) ceiling, for the JVM heap
appuser     soft    as          unlimited
appuser     hard    as          unlimited

3. CPU Governor & Frequency Scaling

Why CPU Governor Matters

Problem: Default “powersave” governor underperforms on servers
Solution: Use “performance” governor for consistent latency
Trade-off: Max performance vs. power consumption

Check Current Governor

# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: powersave, powersave, ...

# Check available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Output: performance powersave

# Check CPU frequencies
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Output: 1200000  (in kHz)

# More detail: cpupower tool
cpupower frequency-info

Set Performance Governor (Persistent)

Method 1: cpupower (runtime only; does not persist across reboots):

# Install
apt install linux-tools-generic
# or
dnf install kernel-tools

# Set performance on all cores
sudo cpupower frequency-set -g performance

# Verify
cpupower frequency-info

# Set back to powersave
sudo cpupower frequency-set -g powersave

Method 2: GRUB bootloader (persistent):

# Edit GRUB
sudo vi /etc/default/grub

# Add to the kernel command line (kernel 5.9+ accepts a default governor)
GRUB_CMDLINE_LINUX="... cpufreq.default_governor=performance"
# Optionally switch the vendor driver to passive mode so the generic cpufreq
# governors take over: intel_pstate=passive (Intel) or amd_pstate=passive (AMD)

# Update GRUB (on Debian/Ubuntu: sudo update-grub)
sudo grub-mkconfig -o /boot/grub/grub.cfg

# Reboot
sudo reboot

Method 3: systemd (modern, recommended):

# Create service
sudo tee /etc/systemd/system/cpu-perf-governor.service > /dev/null << 'CPUPERF'
[Unit]
Description=Set CPU Governor to Performance
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
CPUPERF

# Enable & start
sudo systemctl daemon-reload
sudo systemctl enable cpu-perf-governor
sudo systemctl start cpu-perf-governor

# Verify
cpupower frequency-info

Intel vs. AMD Frequency Scaling

Intel (modern):

  • Default: intel_pstate driver (hardware-assisted)
  • For consistent latency, consider disabling turbo boost; passive mode (intel_pstate=passive) hands control to the generic cpufreq governors
  • Check: cat /sys/devices/system/cpu/intel_pstate/status

AMD (modern):

  • Default: amd_pstate driver (newer, more efficient)
  • For max performance: set to performance mode
  • Check: cat /sys/devices/system/cpu/amd_pstate/status

Fallback (older systems):

  • Use cpufreq-set (if available)
  • Or set via sysfs: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
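
To confirm every core landed on the same governor afterwards, aggregate the sysfs values (standard cpufreq paths, no extra tooling assumed):

cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor | sort | uniq -c
# e.g. "64 performance" = all 64 logical CPUs agree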

4. I/O Scheduler (NVMe/SSD Tuning)

Why I/O Scheduler Matters

Problem: Wrong scheduler adds latency, reduces throughput
Solution: Match scheduler to device type
Trade-off: Latency vs. throughput optimization

Check Current Scheduler

# Show all block devices
lsblk

# Check scheduler for /dev/sda
cat /sys/block/sda/queue/scheduler
# Output (pre-5.0 kernel): [noop] deadline cfq
# Output (modern kernel): [mq-deadline] kyber bfq none

# Brackets = current scheduler

# Check for NVMe
cat /sys/block/nvme0n1/queue/scheduler
# Output: [none] mq-deadline
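
A small sketch to survey the active scheduler on every block device at once:

for q in /sys/block/*/queue/scheduler; do
  printf '%-12s %s\n' "$(basename "${q%/queue/scheduler}")" "$(cat "$q")"
done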

Scheduler Selection

| Device Type           | Scheduler                      | Reason                                          |
|-----------------------|--------------------------------|-------------------------------------------------|
| NVMe                  | none or mq-deadline            | Device handles scheduling; minimal overhead     |
| SSD                   | noop or mq-deadline            | Fast random access; no seek optimization needed |
| Spinning disk         | mq-deadline / deadline         | Read prioritization; seek time mitigation       |
| Legacy (< 5.0 kernel) | CFQ (Completely Fair Queueing) | Good for mixed workloads                        |

Set I/O Scheduler

Method 1: Temporary (runtime):

# Set for /dev/sda
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler

# Verify
cat /sys/block/sda/queue/scheduler

Method 2: Persistent (GRUB):

# Edit GRUB
sudo vi /etc/default/grub

# Add (note: elevator= stopped working in kernel 5.0; on modern kernels use
# the udev rule in Method 3 instead)
GRUB_CMDLINE_LINUX="... elevator=mq-deadline"

# Update & reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot

Method 3: udev rule (persistent):

# Create rule
sudo tee /etc/udev/rules.d/60-iosched.rules > /dev/null << 'UDEVRULE'
# Set scheduler for NVMe
SUBSYSTEM=="block", KERNEL=="nvme*", ATTR{queue/scheduler}="mq-deadline"

# Set scheduler for SSD (/dev/sda, /dev/sdb)
SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"

# Set for virtio (KVM/QEMU)
SUBSYSTEM=="block", KERNEL=="vd*", ATTR{queue/scheduler}="mq-deadline"
UDEVRULE

# Reload rules
sudo udevadm control --reload-rules
sudo udevadm trigger

5. NUMA Awareness (Big Systems)

Why NUMA Matters

Problem: Multi-socket systems (e.g., 2 × 64 cores) have local & remote RAM

  • Local access is fast; remote (cross-socket) access is typically 1.5-2x slower
  • Solution: Bind processes/memory to local NUMA nodes
  • Benefit: Better cache locality, predictable latency

Check NUMA Configuration

# List NUMA nodes
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 ... 31
# node 0 size: 256000 MB
# node 0 free: 100000 MB
# node 1 cpus: 32 33 34 35 ... 63
# node 1 size: 256000 MB
# node 1 free: 150000 MB

# Or use lscpu
lscpu | grep -i numa
# NUMA node0 CPU(s): 0-31
# NUMA node1 CPU(s): 32-63

Bind Process to NUMA Node

Method 1: numactl (one-time):

# Bind to node 0 (CPUs 0-31)
numactl --cpunodebind=0 --membind=0 /path/to/app

# Bind to node 1
numactl --cpunodebind=1 --membind=1 /path/to/app

# Verify (in another terminal)
numastat -p PID                # per-node memory breakdown for the process
ps -o pid,psr,cmd -p PID       # PSR = current CPU (maps to a NUMA node)

Method 2: systemd service:

[Service]
ExecStart=/path/to/app

# Bind to NUMA node 0
CPUAffinity=0-31
NUMAPolicy=bind
NUMAMask=0

# Or, NUMA node 1
# CPUAffinity=32-63
# NUMAMask=1

Method 3: Container (Docker/Kubernetes):

# Docker: pin CPUs and memory to NUMA node 0
docker run --cpuset-cpus=0-31 --cpuset-mems=0 myapp

# Kubernetes has no per-pod NUMA flag: enable the kubelet CPU Manager
# (static policy) and Topology Manager, then give the pod Guaranteed QoS
# with integer CPU requests to get exclusive, node-aligned cores

NUMA-Aware Application Design

For big apps (databases, data processing):

  1. Start N worker threads = N NUMA nodes
  2. Bind each thread to local node (CPUs + memory)
  3. Minimize cross-node memory access
  4. Example: PostgreSQL with parallel_workers, each bound to a local node (see the launch sketch below)
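
A hedged launch sketch for the pattern above: one worker per NUMA node, each bound to its local CPUs and memory (/path/to/worker and its --id flag are placeholders):

nodes=$(numactl --hardware | awk '/^available:/ {print $2}')
for node in $(seq 0 $((nodes - 1))); do
  numactl --cpunodebind="$node" --membind="$node" /path/to/worker --id "$node" &
done
wait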

Check memory locality (after running app):

# Show memory by NUMA node
numastat
# Per-node and system-wide stats

# Show per-process memory distribution
cat /proc/PID/numa_maps
# Tells you which nodes have app's memory

Performance Baseline Checklist

Pre-Deployment

  • sysctl tuned for workload (web, DB, real-time)
  • /etc/sysctl.d/99-production.conf deployed & verified
  • ulimits set (nofile, nproc, memlock) per-service
  • CPU governor set to performance (if server)
  • I/O scheduler matched to device type (NVMe → none, SSD → mq-deadline)
  • THP disabled if database vendor recommends
  • NUMA binding tested (if multi-socket system)
  • Baseline performance measured (latency, throughput)

Post-Deployment

  • sysctl applied: sysctl --system succeeds
  • ulimits verified: ulimit -a shows expected values
  • CPU governor active: cpupower frequency-info shows performance
  • I/O scheduler confirmed: cat /sys/block/*/queue/scheduler
  • Performance matches baseline (no regressions)
  • No swapping (if swappiness tuned): vmstat 1 5 → si/so near 0
  • NUMA memory local: numastat shows > 95% local access

Ongoing Monitoring

  • Weekly: Check CPU throttling (cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
  • Weekly: Monitor disk I/O (iostat -x 1 5 → await, aqu-sz, %util)
  • Monthly: Measure application latency (p99, p95)
  • Quarterly: Review sysctl settings (kernel upgrades can change defaults or deprecate keys)

Quick Reference: Production Defaults

# ===== sysctl (essential) =====
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 32768
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 10000 65535
fs.inotify.max_user_watches = 524288
vm.swappiness = 10
vm.max_map_count = 262144

# ===== ulimits (essential) =====
* soft nofile 65536
* hard nofile 65536
* soft nproc 32768
* hard nproc 32768
* soft memlock unlimited
* hard memlock unlimited

# ===== CPU Governor =====
sudo cpupower frequency-set -g performance

# ===== I/O Scheduler (NVMe) =====
echo "mq-deadline" | sudo tee /sys/block/nvme0n1/queue/scheduler

# ===== Disable THP (for databases) =====
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# ===== NUMA Binding =====
numactl --cpunodebind=0 --membind=0 /path/to/app

Common Pitfalls

Pitfall 1: Excessive TCP TIME_WAIT Reuse

Problem: tcp_tw_reuse=1 + data in flight → packet reordering, connection resets
Fix: Only enable for services making many outbound connections (load balancers, proxies); test thoroughly first

Pitfall 2: THP Defrag Causing Latency Spikes

Problem: Defragmentation pauses app for milliseconds
Fix: Disable THP for latency-sensitive workloads: echo never > /sys/kernel/mm/transparent_hugepage/enabled
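
The echo above does not survive a reboot; a oneshot unit in the same style as the governor service in section 3 makes it persistent:

sudo tee /etc/systemd/system/disable-thp.service > /dev/null << 'THPOFF'
[Unit]
Description=Disable Transparent Hugepages

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
THPOFF

sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp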

Pitfall 3: I/O Scheduler Causing Jitter

Problem: CFQ on NVMe/SSD = added latency due to sorting overhead
Fix: Use noop or mq-deadline for NVMe/SSD

Pitfall 4: ulimits Not Applied to Service

Problem: systemd service still has default 1024 open files
Fix: Set via /etc/systemd/system/SERVICE.service.d/limits.conf (not /etc/security/limits.conf)

Pitfall 5: NUMA Remote Memory Access

Problem: App threads on node 0, but memory allocated from node 1 → slow
Fix: Use numactl --membind=0 when starting the app; vm.zone_reclaim_mode=1 can keep allocations local but often hurts databases, so test before enabling


Measurement & Validation

Measure Baseline Performance

Before tuning:

# Latency
ping -c 100 localhost | grep 'min/avg/max'
# Network throughput (start a server first: iperf3 -s)
iperf3 -c 127.0.0.1 -t 10
# Disk I/O
fio --name=random-read --ioengine=libaio --direct=1 --rw=randread --bs=4k --size=1G --numjobs=4 --runtime=60 --group_reporting
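
Capturing system state alongside each benchmark run keeps before/after comparisons honest; a minimal sketch (the directory name is arbitrary):

ts=$(date +%Y%m%d-%H%M%S)
mkdir -p "baseline-$ts"
sysctl -a > "baseline-$ts/sysctl.txt" 2>/dev/null
ulimit -a > "baseline-$ts/ulimit.txt"
grep . /sys/block/*/queue/scheduler > "baseline-$ts/iosched.txt" 2>/dev/null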

After tuning:

# Compare same test; expect 5-30% improvement (depends on tuning)
# Examples:
# - somaxconn increase: 10-20% more concurrent connections
# - I/O scheduler change: 5-15% lower latency on random I/O
# - CPU governor: 5-10% faster requests (no frequency scaling)
# - THP disable: 50%+ lower latency spikes (for DBs)

Further Reading