Most “performance tuning” you find online is a list of sysctl values copied from a 2014 forum post, applied without measurement, on hardware and kernels that no longer resemble the original. That is not tuning — it is cargo culting, and it regresses as often as it helps. Real tuning is a loop: measure a baseline, form a hypothesis about the bottleneck, change one class of parameter, measure again, keep it only if it helped. This article walks that loop end to end on a modern distribution (kernel 5.15+ or 6.x, tuned 2.18+), covering the knobs that actually move latency and throughput: tuned profiles, the network stack, virtual memory, CPU power management, NUMA locality, and block I/O schedulers.
The unifying discipline is the USE method (Brendan Gregg): for every resource — CPU, memory, network, disk — check Utilization, Saturation, and Errors before you touch a single tunable.
1. Establish a baseline with the USE method
You cannot claim a win without a number to beat. Before changing anything, capture the system under representative load and write the numbers down.
First, identify the saturation signals. Saturation is queueing — work waiting because a resource is full.
# CPU + run-queue saturation. Watch the 'r' (runnable) and 'b' (blocked) columns,
# plus 'wa' (iowait) under 'cpu'. r consistently > nr_cpus means CPU saturation.
vmstat 1 10
# Per-CPU utilization and steal time (steal = hypervisor took your cycles)
mpstat -P ALL 1 5
# Per-device disk: %util (utilization), aqu-sz (avg queue depth = saturation),
# r_await / w_await (latency, ms). High await with modest %util points at the device.
iostat -xz 1 5
# Pressure Stall Information: the single best saturation signal on modern kernels.
# 'some' = at least one task stalled; 'full' = all non-idle tasks stalled.
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory
PSI (/proc/pressure/*, kernel 4.20+) is the most honest saturation metric available. A rising io full avg10 means the whole machine is blocked on storage; cpu some avg10 near 100 means tasks are perpetually waiting for a core. Treat these as your north-star numbers.
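To record those north-star numbers rather than eyeball them, the avg10 fields can be pulled out of each PSI file. The following is a minimal sketch; the helper names `psi_avg10` and `psi_snapshot` are hypothetical, and it assumes the standard `some`/`full` line layout of `/proc/pressure/*`.

```shell
# psi_avg10 KIND FILE: print the avg10 field of the "some" or "full" line
# in a PSI file such as /proc/pressure/io. (Hypothetical helper name.)
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$2"
}

# psi_snapshot: one line per resource, skipping files this kernel lacks.
psi_snapshot() {
    for res in cpu io memory; do
        f="/proc/pressure/$res"
        [ -r "$f" ] || continue
        printf '%s some=%s full=%s\n' "$res" \
            "$(psi_avg10 some "$f")" "$(psi_avg10 full "$f")"
    done
}

psi_snapshot
```

Appending the output of `psi_snapshot` to a log before and after each change gives a like-for-like saturation record.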
For latency-sensitive work, capture a wall-clock latency distribution of your actual workload rather than a synthetic one wherever you can. For storage and network microbenchmarks I trust fio and a closed-loop request generator, respectively:
# Storage baseline: 4k random read, direct I/O, queue depth 32, 60s.
# Report p99 latency, not just IOPS — averages hide tail pain.
fio --name=baseline --filename=/data/fiotest --direct=1 --rw=randread \
--bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
--ioengine=libaio --group_reporting --percentile_list=50:95:99:99.9
Record: workload, kernel (uname -r), current tuned profile (tuned-adm active), and the headline numbers (IOPS, p99 latency, PSI averages). Everything that follows is judged against this.
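One way to make the record repeatable is to stamp every measurement with the same header. This is a sketch under assumptions: `baseline_header` is a hypothetical helper, the workload label is free text you choose, and the profile line falls back to a placeholder when `tuned-adm` is unavailable.

```shell
# baseline_header WORKLOAD: print key=value metadata to prepend to a
# measurement log. (Hypothetical helper; extend with your own fields.)
baseline_header() {
    profile=$(tuned-adm active 2>/dev/null) || profile="unknown (tuned-adm unavailable)"
    # tuned-adm prints "Current active profile: <name>"; keep only the name.
    profile=${profile#Current active profile: }
    printf 'date=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'workload=%s\n' "$1"
    printf 'kernel=%s\n' "$(uname -r)"
    printf 'tuned_profile=%s\n' "$profile"
}

baseline_header "fio-4k-randread-qd32"
```

Redirect the output into the same file that receives the fio and PSI numbers so each run carries its own context.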
2. Select and customize a tuned profile
tuned is the right starting layer because it sets dozens of coherent knobs — sysctl, CPU governor, disk scheduler, transparent hugepages — as a single named, revertible bundle. Hand-editing /etc/sysctl.conf in isolation fights whatever the active profile already applied.
tuned-adm list # show available profiles
tuned-adm active # what is applied now
tuned-adm recommend # what tuned thinks fits this machine
The stock profiles map cleanly to workload classes:
| Profile | Intended use | Notable behavior |
|---|---|---|
| throughput-performance | Bulk compute, batch, databases | CPU governor performance, larger VM dirty ratios, no THP defrag stalls |
| latency-performance | Low-latency services | Disables most C-states via force_latency, governor performance |
| network-latency | Trading, RPC fan-out | Builds on latency-performance, disables THP, tunes busy_read/busy_poll |
| network-throughput | Bulk transfer, streaming | Large TCP buffers and backlogs |
| virtual-guest | VMs | Higher vm.dirty_ratio, sane defaults under a hypervisor |
Never edit a stock profile in place — it is owned by the package and will be overwritten on upgrade. Instead create a child profile that inherits with include= and overrides only the deltas you have measured a reason for:
# /etc/tuned/kv-db/tuned.conf
[main]
summary=Postgres on NVMe, derived from throughput-performance
include=throughput-performance
[sysctl]
vm.dirty_background_ratio=5
vm.dirty_ratio=15
vm.swappiness=1
[vm]
transparent_hugepages=never
[disk]
# devices= matches block devices by kernel name (glob); scheduler and readahead apply to that class
devices=nvme*n*
elevator=none
readahead=256
Apply and confirm it took effect:
tuned-adm profile kv-db
tuned-adm verify # asserts every setting in the profile is actually live
tuned-adm verify is the step almost everyone skips. It re-reads each knob and tells you if something else on the box (a competing unit, a stale /etc/sysctl.d/ drop-in) overrode your profile. If verify fails, fix the conflict before you trust any later measurement.
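When verify points at a sysctl that does not hold its value, a quick search of the drop-in directories usually finds the competing definition. A minimal sketch follows; `sysctl_definitions` is a hypothetical helper, and it only recognizes the dotted key spelling (not the slash form such as vm/swappiness).

```shell
# sysctl_definitions KEY FILE...: list "file:line" for every uncommented
# assignment of KEY in the given files. (Hypothetical helper name.)
sysctl_definitions() {
    key=$1
    shift
    # Escape the dots so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # /dev/null forces grep to print the filename even for a single file.
    grep -Es "^[[:space:]]*${pattern}[[:space:]]*=" "$@" /dev/null
}

sysctl_definitions vm.swappiness \
    /etc/sysctl.conf /etc/sysctl.d/*.conf \
    /run/sysctl.d/*.conf /usr/lib/sysctl.d/*.conf \
    || echo "no static definition found"
```

More than one hit for the same key means the value depends on file ordering, which is exactly the kind of conflict to remove before measuring.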
3. Tune the network stack
Network tuning is where copied values do the most damage, because the right buffer size is a function of bandwidth-delay product, not a magic constant. Compute, do not guess: BDP (bytes) = link bandwidth (bytes/s) x RTT (seconds). A 10 GbE link at 10 ms RTT needs roughly 12.5 MB of in-flight buffer per stream.
# Inspect current limits
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_congestion_control
# List available congestion-control algorithms (BBR may need a module)
sysctl net.ipv4.tcp_available_congestion_control
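The bandwidth-delay arithmetic is easy to script so that buffer sizes come from your own link numbers. A small sketch, with hypothetical helper names `bdp_bytes` and `next_pow2`; rounding up to a power of two is a convenience chosen here, not a kernel requirement.

```shell
# bdp_bytes MBIT_PER_S RTT_MS: bandwidth-delay product in bytes.
# 1 Mbit/s = 125000 bytes/s and 1 ms = 1/1000 s, so the factor is 125.
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}

# next_pow2 N: smallest power of two >= N, a convenient buffer ceiling.
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do p=$(( p * 2 )); done
    echo "$p"
}

bdp=$(bdp_bytes 10000 10)   # 10 GbE at 10 ms RTT
echo "BDP: $bdp bytes, suggested max buffer: $(next_pow2 "$bdp")"
# prints: BDP: 12500000 bytes, suggested max buffer: 16777216
```

Rerun it with the measured RTT of the path you care about; a LAN at 0.2 ms needs far less than a cross-region link.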
A defensible high-bandwidth profile, sized for ~16 MB of autotuned TCP window:
# /etc/sysctl.d/90-net-throughput.conf
# Max socket buffer the kernel will grant (bytes)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP autotuning: min / default / max. Leave default modest; let it grow to max.
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
# Backlog of fully-established sockets waiting for accept()
net.core.somaxconn = 4096
# Backlog of half-open (SYN) connections
net.ipv4.tcp_max_syn_backlog = 8192
# Per-CPU packet ingress queue when packets arrive faster than the stack drains
net.core.netdev_max_backlog = 16384
# Reuse TIME_WAIT sockets for new outbound connections (safe; NOT tcp_tw_recycle)
net.ipv4.tcp_tw_reuse = 1
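On congestion control, bbr is the modern option worth testing for fat, lossy, or long-RTT paths because it does not collapse throughput on sporadic loss the way loss-based cubic does. It depends on packet pacing: since kernel 4.13 TCP can pace internally, so the fq qdisc is no longer strictly required, but fq remains the recommended pairing. Note that default_qdisc only affects qdiscs created after it is set, so existing interfaces keep their current qdisc until they are reconfigured: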
# Enable BBR (module may auto-load); pair with fq packet scheduling
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
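Before switching, it is worth confirming that the algorithm is actually offered by the running kernel. A minimal sketch; `has_cc` is a hypothetical helper that does a whole-word match against the space-separated list the sysctl returns.

```shell
# has_cc ALGO LIST: succeed if ALGO appears as a whole word in LIST,
# where LIST is the value of net.ipv4.tcp_available_congestion_control.
has_cc() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

available=$(sysctl -n net.ipv4.tcp_available_congestion_control 2>/dev/null || true)
if has_cc bbr "$available"; then
    echo "bbr is available"
else
    echo "bbr not listed; load tcp_bbr and check again"
fi
```

The list only shows algorithms that are built in or already loaded, so a missing entry does not by itself mean the module is absent.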
A critical correction to the old internet folklore: never enable net.ipv4.tcp_tw_recycle. It was removed entirely in kernel 4.12 because it broke connections from clients behind NAT. The safe sibling is tcp_tw_reuse, shown above. If a guide tells you to set tcp_tw_recycle, the rest of that guide is suspect.
Apply with sysctl --system and confirm somaxconn is actually large enough — many app servers also cap their own listen backlog, so the kernel value is a ceiling, not a guarantee.
4. Virtual memory: dirty ratios, swappiness, hugepages
The VM subsystem decides when dirty pages flush to disk and how aggressively the kernel reclaims. Defaults assume a desktop; servers with fast storage and large RAM want different behavior.
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled
The two dirty knobs control writeback. vm.dirty_background_ratio is the percentage of available memory at which the kernel starts flushing in the background; vm.dirty_ratio is the hard ceiling at which writing processes are blocked until flushed. On a box with 256 GB RAM, the default 20% ceiling means 50+ GB can go dirty before a synchronous stall — a latency cliff. Lower it, and prefer the _bytes variants on large-memory machines for a fixed, predictable threshold:
# /etc/sysctl.d/91-vm.conf
# Fixed-byte thresholds are clearer than ratios on big-RAM hosts.
# Start background flush at 256 MB dirty; hard-block writers at 1 GB.
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
# Database hosts: keep the working set in RAM, swap only under true duress.
vm.swappiness = 1
# Reclaim pressure on dentry/inode caches relative to pagecache; 100 is the default.
# Lower values keep metadata cached longer, higher values reclaim it sooner.
vm.vfs_cache_pressure = 100
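The scale problem with ratios is easiest to see as arithmetic. This sketch uses a hypothetical helper, `dirty_limit_bytes`, and treats the ratio as a percentage of a memory figure in kB; the kernel applies it to free plus reclaimable memory, so total RAM is an upper-bound approximation.

```shell
# dirty_limit_bytes MEM_KB RATIO: bytes that RATIO percent of MEM_KB allows.
dirty_limit_bytes() {
    echo $(( $1 * 1024 * $2 / 100 ))
}

# Worked example: 256 GiB expressed in kB, at the 20% ratio.
dirty_limit_bytes $(( 256 * 1024 * 1024 )) 20
# prints: 54975581388  (about 51 GiB of dirty pages before writers block)

# Same host at the fixed 1 GiB ceiling used above, for comparison.
echo $(( 1024 * 1024 * 1024 ))
```

The fixed-byte ceiling stays the same whether the host has 64 GB or 1 TB, which is why it is the more predictable choice on large machines.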
A few hard-won rules:
- vm.swappiness=0 does not disable swap. It makes the kernel avoid swapping until reclaim is nearly impossible, which can trigger the OOM killer sooner. Use 1 for “avoid but allow” on databases; reserve 0 for hosts where you genuinely never want swap pressure and understand the OOM tradeoff.
- Transparent Hugepages (THP) help some workloads and badly hurt others. Databases with their own buffer management (PostgreSQL, MySQL, Redis, MongoDB) and many JVM-heavy apps routinely recommend never because THP’s background defrag (khugepaged) and allocation stalls cause latency spikes. Set it explicitly rather than inheriting the distribution default (always or madvise) by accident:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Make it durable through tuned ([vm] transparent_hugepages=never) or a kernel cmdline arg (transparent_hugepage=never), not a one-shot echo that evaporates on reboot.
5. CPU governors, C-states, and IRQ affinity
For latency-sensitive services, the enemy is the kernel saving power. Frequency scaling and deep C-states add tens to hundreds of microseconds of wakeup jitter.
# Which driver and governor are active?
cpupower frequency-info
# Per-core current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Set the performance governor to pin cores at max frequency (the cleanest way is via tuned’s latency-performance, but cpupower does it directly):
cpupower frequency-set -g performance
C-states are the bigger latency lever. When a core idles into a deep C-state (C6), the exit latency to wake it can dwarf your service’s processing time. The robust, supported way to cap idle depth is the PM QoS interface that tuned’s force_latency drives — request a maximum tolerable CPU wakeup latency in microseconds:
# Hold /dev/cpu_dma_latency open requesting <=10us wake latency.
# The kernel then refuses C-states whose exit latency exceeds the request.
# Keep the process running; closing the fd releases the constraint.
exec 4<>/dev/cpu_dma_latency
printf '\x0a\x00\x00\x00' >&4 # 10 (us) as a 32-bit little-endian int
In practice you let tuned own this with force_latency=10 in a [cpu] section rather than scripting the device by hand. For boot-time guarantees on dedicated latency nodes, the kernel cmdline arg intel_idle.max_cstate=1 (Intel) or processor.max_cstate=1 is the blunt, permanent instrument.
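If you do script the device by hand, the four bytes are just the requested latency as a 32-bit little-endian integer. The sketch below generalizes the hard-coded value; `le32_escape` is a hypothetical helper that emits the escape string for a given number of microseconds.

```shell
# le32_escape N: print N as four \xHH escapes in little-endian byte order,
# suitable as a format string for bash's builtin printf.
le32_escape() {
    printf '\\x%02x\\x%02x\\x%02x\\x%02x' \
        $(( $1 & 0xff )) $(( ($1 >> 8) & 0xff )) \
        $(( ($1 >> 16) & 0xff )) $(( ($1 >> 24) & 0xff ))
}

le32_escape 10; echo    # prints: \x0a\x00\x00\x00
le32_escape 300; echo   # prints: \x2c\x01\x00\x00
```

With the descriptor from the snippet above still open, `printf "$(le32_escape 10)" >&4` writes the request; this relies on bash's printf understanding `\x` escapes, as the original one-liner does.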
IRQ affinity matters once a single NIC queue’s softirq saturates one core. Spread interrupts off your application cores:
# See which CPUs handle each IRQ and the per-CPU interrupt counts
cat /proc/interrupts | less
# Let irqbalance distribute, OR pin a NIC queue's IRQ to a specific CPU mask.
# A running irqbalance may rewrite manual pins; exclude the IRQ from it or stop the daemon first.
# Example: pin IRQ 142 to CPU2 (mask 0x4).
echo 4 > /proc/irq/142/smp_affinity
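The mask is a bitmap with bit N set for CPU N, which is simple to compute instead of working out by hand. A minimal sketch with a hypothetical helper, `cpu_mask`; it is limited to CPU numbers 0 through 62 because it relies on 64-bit shell arithmetic.

```shell
# cpu_mask CPU...: hex smp_affinity mask with one bit set per listed CPU.
# Only valid for CPU numbers 0-62 (64-bit signed shell arithmetic).
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 2          # prints: 4   (CPU2 alone, as in the example above)
cpu_mask 0 1 2 3    # prints: f   (first four CPUs)
```

On larger machines, writing a CPU list such as `2` or `4-7` to `/proc/irq/<n>/smp_affinity_list` avoids the mask arithmetic entirely.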
The principle: keep NIC interrupt handling and kernel softirq processing on cores that share an L3 cache with the application threads they feed, but not on the very cores doing latency-critical user work. On NUMA boxes that means the same socket, which is the next section.
6. NUMA awareness and memory locality
On multi-socket servers, memory attached to a remote socket costs 1.5x-2x the access latency of local memory. A process scheduled on socket 1 reaching into socket 0’s DIMMs pays that tax on every miss. This is the single most common invisible regression on large machines.
# Topology: nodes, the CPUs in each, and the inter-node distance matrix
numactl --hardware
# Per-node hit/miss counters. numa_miss / numa_foreign climbing = remote-memory pain.
# (Plain numastat prints these counters; numastat -m prints per-node meminfo instead.)
numastat
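Raw counters are hard to compare across nodes, so a ratio helps. The sketch below reads the per-node counter files under sysfs; `numa_miss_pct` is a hypothetical helper, and the percentage is the share of allocations that landed on a node although another node was preferred.

```shell
# numa_miss_pct FILE: numa_miss as a percentage of numa_hit + numa_miss,
# for a file in the format of /sys/devices/system/node/nodeN/numastat.
numa_miss_pct() {
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END {
             total = hit + miss
             if (total > 0) printf "%.2f\n", 100 * miss / total
             else           print "0.00"
         }' "$1"
}

for f in /sys/devices/system/node/node*/numastat; do
    [ -r "$f" ] || continue
    printf '%s: %s%% numa_miss\n' \
        "$(basename "$(dirname "$f")")" "$(numa_miss_pct "$f")"
done
```

The counters are cumulative since boot, so sample twice under load and compare the difference rather than reading a single value.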
For a single dominant process (a database, a JVM, a packet processor), pin both its CPUs and its memory to one node so allocations stay local:
# Run on node 0's CPUs, allocate only from node 0's memory
numactl --cpunodebind=0 --membind=0 /usr/lib/postgresql/16/bin/postgres -D /data/pg
# Inspect a running process's NUMA placement
numastat -p $(pgrep -f postgres | head -1)
For services that legitimately span sockets, --interleave=all spreads allocations evenly so no single node’s memory bandwidth becomes the bottleneck — the right call for big in-memory caches:
numactl --interleave=all /usr/bin/redis-server /etc/redis/redis.conf
Beware vm.zone_reclaim_mode. On older kernels it defaulted to on for some NUMA topologies, causing the kernel to aggressively reclaim local pagecache rather than allocate one remote page, which is devastating for file-cache-heavy workloads. Confirm it is 0 (sysctl vm.zone_reclaim_mode); leave it off unless you have a specific, measured reason.
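The systemd-native way to bind a managed service is cleaner than wrapping ExecStart in numactl. NUMAPolicy= and NUMAMask= (systemd 243+) set only the memory policy, so add CPUAffinity= if the CPUs must be pinned as well: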
# /etc/systemd/system/postgresql.service.d/numa.conf
[Service]
NUMAPolicy=bind
NUMAMask=0
7. Block I/O schedulers and queue depth per device class
The scheduler that is right for a spinning disk is wrong for NVMe, and vice versa. Modern kernels ship the multi-queue block layer (blk-mq) with three relevant schedulers:
| Scheduler | Best for | Why |
|---|---|---|
| none | NVMe / fast SSD | The device has deep internal queues and reorders better than the OS; software scheduling only adds latency |
| mq-deadline | SATA SSD, mixed, predictable latency | Lightweight, bounds worst-case latency with read/write deadlines |
| bfq | Desktops, latency-fairness across competing apps | Proportional-share fairness; CPU cost too high for high-IOPS server paths |
Check and set per device — and know that /sys shows the active choice in brackets:
# Current scheduler for nvme0n1 (active one is in [brackets])
cat /sys/block/nvme0n1/queue/scheduler
# Set mq-deadline on a SATA SSD at runtime
echo mq-deadline > /sys/block/sda/queue/scheduler
# Set none on NVMe
echo none > /sys/block/nvme0n1/queue/scheduler
Runtime echoes do not survive reboot. Make the choice durable and device-class-aware with a udev rule, so a node with mixed media gets the right scheduler on each:
# /etc/udev/rules.d/60-ioscheduler.rules
# NVMe -> none
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
# Non-rotational SATA/SAS (SSD) -> mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# Rotational (HDD) -> mq-deadline as well; bfq only if latency-fairness matters
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
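After a reboot, the rules can be checked against what the kernel actually selected. This is a sketch under assumptions: `active_choice` and `expected_scheduler` are hypothetical helpers, and the expected policy simply mirrors the rules above (none for NVMe, mq-deadline for every sd device).

```shell
# active_choice FILE: print the token shown in [brackets], as used by
# /sys/block/<dev>/queue/scheduler.
active_choice() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# expected_scheduler NAME: the policy from the udev rules above.
expected_scheduler() {
    case "$1" in
        nvme*) echo none ;;
        *)     echo mq-deadline ;;   # SATA/SAS, rotational or not
    esac
}

for dev in /sys/block/sd* /sys/block/nvme*; do
    [ -r "$dev/queue/scheduler" ] || continue
    name=$(basename "$dev")
    want=$(expected_scheduler "$name")
    have=$(active_choice "$dev/queue/scheduler")
    [ "$want" = "$have" ] || echo "MISMATCH $name: want $want, have $have"
done
```

No output means every device matches; a mismatch usually points at a rule that did not fire or a profile that set the elevator afterwards.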
Two supporting knobs per device:
- nr_requests: the depth of the OS-side request queue. For high-IOPS NVMe under heavy parallelism, raising it (e.g. to 1024) can reduce queue-full backpressure: echo 1024 > /sys/block/nvme0n1/queue/nr_requests. With the none scheduler the value cannot exceed the device's hardware queue depth, so the kernel may reject a larger number.
- read_ahead_kb: the readahead window. Large sequential workloads (analytics scans, backups) benefit from raising it; small-random OLTP benefits from lowering it to avoid wasted I/O. There is no universal value, which makes this exactly the kind of knob to A/B with fio against your real access pattern.
echo 256 > /sys/block/nvme0n1/queue/read_ahead_kb # OLTP-leaning
echo 4096 > /sys/block/sdb/queue/read_ahead_kb # sequential-scan-leaning
Verify
Tuning without re-measurement is just hoping. After each change, re-run the same baseline and compare deltas, not absolute feelings.
# 1. Confirm the profile and every knob it sets are actually live
tuned-adm active
tuned-adm verify
# 2. Spot-check the individual sysctls you changed
sysctl vm.dirty_bytes vm.swappiness net.ipv4.tcp_congestion_control
# 3. Confirm per-device scheduler stuck after reboot
for d in /sys/block/{sd,nvme}*; do
printf '%s: %s\n' "$(basename "$d")" "$(cat "$d"/queue/scheduler)"
done
# 4. Re-run the identical fio baseline and diff p99 against the recorded number
fio --name=verify --filename=/data/fiotest --direct=1 --rw=randread \
--bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
--ioengine=libaio --group_reporting --percentile_list=50:95:99:99.9
# 5. Watch PSI under the SAME load: io 'full avg10' and cpu 'some avg10' should not have risen
#    (system-wide cpu 'full' is not meaningful; the kernel reports it as zero)
watch -n1 'cat /proc/pressure/io /proc/pressure/cpu'
A change that improves throughput but raises p99 latency or PSI saturation is usually a regression in disguise for an interactive service. Decide which dimension you are optimizing before you look at the numbers, or you will rationalize whatever you got.
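Comparing deltas is less error-prone when the threshold is fixed in advance. A minimal sketch with hypothetical helpers `pct_change` and `is_regression`; it assumes a lower-is-better metric such as p99 latency, a nonzero baseline, and a tolerance you pick to cover run-to-run noise.

```shell
# pct_change BASELINE NEW: signed percentage change from BASELINE to NEW.
pct_change() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", 100 * (new - old) / old }'
}

# is_regression BASELINE NEW TOLERANCE_PCT: succeed when a lower-is-better
# metric rose by more than TOLERANCE_PCT percent.
is_regression() {
    awk -v old="$1" -v new="$2" -v tol="$3" \
        'BEGIN { exit !((new - old) / old * 100 > tol) }'
}

pct_change 200 150    # prints: -25.00 (p99 fell from 200 us to 150 us)
if is_regression 200 230 5; then
    echo "p99 regressed beyond 5% tolerance: revert the change"
fi
```

Run the benchmark several times per configuration first; the tolerance should sit above the spread you see between identical runs.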
Enterprise scenario
A payments platform team ran their core PostgreSQL fleet on dual-socket EPYC servers with NVMe and 512 GB RAM. After a hardware refresh, p99 query latency got worse despite faster CPUs and disks. The on-call narrative was “the new NVMe firmware is bad.”
The USE method told a different story. iostat showed NVMe %util under 20% with sub-millisecond await: storage was idle. But the per-node numastat counters showed numa_foreign climbing into the millions. The scheduler was spreading backends across both sockets while shared_buffers lived on node 0, so half the connections were doing every buffer access across the inter-socket link. The “slow disk” was actually remote-memory latency, multiplied by a higher core count that spread work wider across NUMA nodes than the old box.
The fix was locality, not faster hardware. They bound each postgres instance to a single NUMA node and let the second socket host a second instance, turning one cross-NUMA database into two node-local ones:
# /etc/systemd/system/postgresql@.service.d/numa.conf
# Templated unit: postgresql@0 binds node 0, postgresql@1 binds node 1.
[Service]
NUMAPolicy=bind
NUMAMask=%i
They also set vm.zone_reclaim_mode=0 (it had silently re-enabled on the new kernel for this topology) and confirmed THP never. p99 dropped 38% and, crucially, the variance collapsed — the tail got predictable. No firmware was changed. The lesson the team wrote into their runbook: on multi-socket hardware, prove memory locality before blaming any other resource.