Methodical Linux Performance Tuning: tuned, sysctl, and I/O Schedulers

Most “performance tuning” you find online is a list of sysctl values copied from a 2014 forum post, applied without measurement, on hardware and kernels that no longer resemble the original. That is not tuning — it is cargo culting, and it regresses as often as it helps. Real tuning is a loop: measure a baseline, form a hypothesis about the bottleneck, change one class of parameter, measure again, keep it only if it helped. This article walks that loop end to end on a modern distribution (kernel 5.15+ or 6.x, tuned 2.18+), covering the knobs that actually move latency and throughput: tuned profiles, the network stack, virtual memory, CPU power management, NUMA locality, and block I/O schedulers.

The unifying discipline is the USE method (Brendan Gregg): for every resource — CPU, memory, network, disk — check Utilization, Saturation, and Errors before you touch a single tunable.

1. Establish a baseline with the USE method

You cannot claim a win without a number to beat. Before changing anything, capture the system under representative load and write the numbers down.

First, identify the saturation signals. Saturation is queueing — work waiting because a resource is full.

# CPU + run-queue saturation. Watch the 'r' (runnable) and 'b' (blocked) columns,
# plus 'wa' (iowait) under 'cpu'. r consistently > nr_cpus means CPU saturation.
vmstat 1 10

# Per-CPU utilization and steal time (steal = hypervisor took your cycles)
mpstat -P ALL 1 5

# Per-device disk: %util (utilization), aqu-sz (avg queue depth = saturation),
# r_await / w_await (latency, ms). High await with modest %util points at the device.
iostat -xz 1 5

# Pressure Stall Information: the single best saturation signal on modern kernels.
# 'some' = at least one task stalled; 'full' = all non-idle tasks stalled.
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory

PSI (/proc/pressure/*, kernel 4.20+) is the most honest saturation metric available. A rising io full avg10 means the whole machine is blocked on storage; cpu some avg10 near 100 means tasks are perpetually waiting for a core. Treat these as your north-star numbers.
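
To log these numbers rather than eyeball them, a small helper can pull one field out of a PSI file. This is a minimal sketch: psi_avg10 is a hypothetical helper name, and the optional file argument is only there so the function can be pointed at a saved sample.

```shell
# psi_avg10 KIND [FILE]
# Print the avg10 value from the 'some' or 'full' line of a PSI file.
# FILE defaults to /proc/pressure/io. The helper name is illustrative.
psi_avg10() {
  awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' \
    "${2:-/proc/pressure/io}"
}

# Print the 'some' pressure for each resource if PSI is available here.
for res in cpu io memory; do
  if [ -r "/proc/pressure/$res" ]; then
    printf '%s some avg10=%s\n' "$res" "$(psi_avg10 some "/proc/pressure/$res")"
  fi
done
```

Sampling this once a minute during the baseline run gives a series to compare against after each change.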

For latency-sensitive work, capture a wall-clock latency distribution of your actual workload rather than a synthetic one wherever you can. For storage and network microbenchmarks I trust fio and a closed-loop request generator, respectively:

# Storage baseline: 4k random read, direct I/O, queue depth 32, 60s.
# Report p99 latency, not just IOPS — averages hide tail pain.
fio --name=baseline --filename=/data/fiotest --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
    --ioengine=libaio --group_reporting --percentile_list=50:95:99:99.9

Record: workload, kernel (uname -r), current tuned profile (tuned-adm active), and the headline numbers (IOPS, p99 latency, PSI averages). Everything that follows is judged against this.
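
One way to make the record repeatable is to emit it as key=value lines that can be saved next to the fio output. The sketch below is an assumption about how you might structure it: record_baseline and its field names are hypothetical, and the headline IOPS and p99 numbers still have to be added from the benchmark output.

```shell
# record_baseline LABEL
# Emit key=value lines describing host state for one measurement run.
# Function and field names are illustrative, not a standard format.
record_baseline() {
  printf 'label=%s\n' "$1"
  printf 'date=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'kernel=%s\n' "$(uname -r)"
  if command -v tuned-adm >/dev/null 2>&1; then
    printf 'tuned=%s\n' "$(tuned-adm active 2>/dev/null)"
  fi
  for res in cpu io memory; do
    if [ -r "/proc/pressure/$res" ]; then
      # Flatten the PSI lines into one record per resource.
      printf 'psi_%s=%s\n' "$res" "$(tr '\n' ';' < "/proc/pressure/$res")"
    fi
  done
}

# Example: record_baseline before-tuning > baseline.txt
```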

2. Select and customize a tuned profile

tuned is the right starting layer because it sets dozens of coherent knobs — sysctl, CPU governor, disk scheduler, transparent hugepages — as a single named, revertible bundle. Hand-editing /etc/sysctl.conf in isolation fights whatever the active profile already applied.

tuned-adm list          # show available profiles
tuned-adm active        # what is applied now
tuned-adm recommend     # what tuned thinks fits this machine

The stock profiles map cleanly to workload classes:

Profile | Intended use | Notable behavior
throughput-performance | Bulk compute, batch, databases | CPU governor performance, larger VM dirty ratios, no THP defrag stalls
latency-performance | Low-latency services | Disables most C-states via force_latency, governor performance
network-latency | Trading, RPC fan-out | Builds on latency-performance, disables THP, tunes busy_read/busy_poll
network-throughput | Bulk transfer, streaming | Large TCP buffers and backlogs
virtual-guest | VMs | Higher vm.dirty_ratio, sane defaults under a hypervisor

Never edit a stock profile in place — it is owned by the package and will be overwritten on upgrade. Instead create a child profile that inherits with include= and overrides only the deltas you have measured a reason for:

# /etc/tuned/kv-db/tuned.conf
[main]
summary=Postgres on NVMe, derived from throughput-performance
include=throughput-performance

[sysctl]
vm.dirty_background_ratio=5
vm.dirty_ratio=15
vm.swappiness=1

[vm]
transparent_hugepages=never

[disk]
# devices= matches kernel device names with a glob (devices_udev_regex matches
# udev properties instead); set scheduler + readahead per class
devices=nvme*n*
elevator=none
readahead=256

Apply and confirm it took effect:

tuned-adm profile kv-db
tuned-adm verify           # asserts every setting in the profile is actually live

tuned-adm verify is the step almost everyone skips. It re-reads each knob and tells you if something else on the box (a competing unit, a stale /etc/sysctl.d/ drop-in) overrode your profile. If verify fails, fix the conflict before you trust any later measurement.
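
When verify reports a mismatch on a sysctl, the next question is which file is setting it. A rough sketch of that search follows; find_sysctl_setters is a hypothetical helper, and the default directory list is an assumption about where your distribution keeps sysctl drop-ins.

```shell
# find_sysctl_setters KEY [PATH...]
# List every line under the given files/directories that assigns KEY.
# The default search paths are an assumption; adjust for your distribution.
find_sysctl_setters() {
  key=$1; shift
  [ "$#" -gt 0 ] || set -- /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d
  # Escape dots so 'vm.swappiness' is matched literally.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  grep -rsHE "^[[:space:]]*${pattern}[[:space:]]*=" "$@"
}

# Example: find_sysctl_setters vm.swappiness
```

Any hit outside your tuned profile is a candidate for the conflict that verify flagged.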

3. Tune the network stack

Network tuning is where copied values do the most damage, because the right buffer size is a function of bandwidth-delay product, not a magic constant. Compute, do not guess: BDP (bytes) = link bandwidth (bytes/s) x RTT (seconds). A 10 GbE link at 10 ms RTT needs roughly 12.5 MB of in-flight buffer per stream.
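
The arithmetic is simple enough to keep as a shell function. This is a minimal sketch; bdp_bytes is a hypothetical name, and it takes the link speed in Mbit/s and the RTT in milliseconds so that integer arithmetic is enough.

```shell
# bdp_bytes LINK_MBIT RTT_MS
# Bandwidth-delay product in bytes, using integer arithmetic:
#   (Mbit/s * 1,000,000 / 8) bytes/s * (ms / 1000) s = Mbit/s * ms * 125
bdp_bytes() {
  echo $(( $1 * $2 * 125 ))
}

bdp_bytes 10000 10    # 10 GbE at 10 ms RTT, prints 12500000
```

Measure the RTT on the real path (for example with ping) rather than assuming it, since the result scales linearly with it.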

# Inspect current limits
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_congestion_control

# List available congestion-control algorithms (BBR may need a module)
sysctl net.ipv4.tcp_available_congestion_control

A defensible high-bandwidth profile, sized for ~16 MB of autotuned TCP window:

# /etc/sysctl.d/90-net-throughput.conf
# Ceiling for buffers an application requests explicitly with
# setsockopt(SO_RCVBUF/SO_SNDBUF), in bytes; autotuning is capped by tcp_rmem/tcp_wmem
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# TCP autotuning: min / default / max. Leave default modest; let it grow to max.
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384  16777216

# Backlog of fully-established sockets waiting for accept()
net.core.somaxconn = 4096
# Backlog of half-open (SYN) connections
net.ipv4.tcp_max_syn_backlog = 8192
# Per-CPU packet ingress queue when packets arrive faster than the stack drains
net.core.netdev_max_backlog = 16384

# Reuse TIME_WAIT sockets for new outbound connections (safe; NOT tcp_tw_recycle)
net.ipv4.tcp_tw_reuse = 1

On congestion control, bbr is the modern alternative worth testing on fat, lossy, or long-RTT paths because it does not collapse throughput on sporadic loss the way loss-based cubic (the usual default) does. It paces best with the fq qdisc; kernels since 4.13 can fall back to TCP-internal pacing, but fq remains the recommended pairing:

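# Enable BBR (module may auto-load); pair with fq packet scheduling
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
# default_qdisc only affects qdiscs created after this point; persist both
# settings under /etc/sysctl.d/ so they apply from boot
sysctl -w net.core.default_qdisc=fq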

A critical correction to the old internet folklore: never enable net.ipv4.tcp_tw_recycle. It was removed entirely in kernel 4.12 because it broke connections from clients behind NAT. The safe sibling is tcp_tw_reuse, shown above. If a guide tells you to set tcp_tw_recycle, the rest of that guide is suspect.

Apply with sysctl --system and confirm somaxconn is actually large enough — many app servers also cap their own listen backlog, so the kernel value is a ceiling, not a guarantee.
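
For listening sockets, ss reports the current accept-queue length in Recv-Q and the effective backlog limit in Send-Q, so a full queue is easy to spot. The filter below is a sketch under that assumption; full_listen_queues is a hypothetical name, and it reads ss output on stdin so it can also be run against a saved capture.

```shell
# full_listen_queues
# Read `ss -ltn` output on stdin and print listeners whose accept queue
# (Recv-Q) has reached the effective backlog limit (Send-Q).
full_listen_queues() {
  awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 {
         print $4, "queued=" $2, "limit=" $3
       }'
}

# Typical use on a live host:
#   ss -ltn | full_listen_queues
```

A listener whose limit is far below net.core.somaxconn is being capped by the application's own listen() backlog, not by the kernel.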

4. Virtual memory: dirty ratios, swappiness, hugepages

The VM subsystem decides when dirty pages flush to disk and how aggressively the kernel reclaims. Defaults assume a desktop; servers with fast storage and large RAM want different behavior.

sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled

The two dirty knobs control writeback. vm.dirty_background_ratio is the percentage of available memory at which the kernel starts flushing in the background; vm.dirty_ratio is the hard ceiling at which writing processes are blocked until flushed. On a box with 256 GB RAM, the default 20% ceiling means 50+ GB can go dirty before a synchronous stall — a latency cliff. Lower it, and prefer the _bytes variants on large-memory machines for a fixed, predictable threshold:

# /etc/sysctl.d/91-vm.conf
# Fixed-byte thresholds are clearer than ratios on big-RAM hosts.
# Start background flush at 256 MB dirty; hard-block writers at 1 GB.
vm.dirty_background_bytes = 268435456
vm.dirty_bytes           = 1073741824

# Database hosts: keep the working set in RAM, swap only under true duress.
vm.swappiness = 1

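# Tendency to reclaim dentry/inode caches relative to pagecache (100 = default).
# Lower it to retain metadata on inode-heavy NFS/file servers; raise it only if
# those caches are crowding out pagecache.
vm.vfs_cache_pressure = 100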
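
To see what a ratio means in bytes on a given host before replacing it, you can approximate the threshold from /proc/meminfo. This is a rough sketch: dirty_limit_bytes is a hypothetical helper, and it uses MemAvailable as a stand-in for the kernel's internal notion of available memory, so treat the result as an estimate.

```shell
# dirty_limit_bytes PCT [MEMINFO]
# Approximate the byte threshold implied by a vm.dirty_*ratio percentage.
# Assumption: MemAvailable (kB) approximates the kernel's "available memory".
dirty_limit_bytes() {
  awk -v pct="$1" '$1 == "MemAvailable:" { printf "%.0f\n", $2 * 1024 * pct / 100 }' \
    "${2:-/proc/meminfo}"
}

# Example: what the current ratios translate to on this host.
#   dirty_limit_bytes "$(sysctl -n vm.dirty_ratio)"
#   dirty_limit_bytes "$(sysctl -n vm.dirty_background_ratio)"
```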

One hard-won rule for database hosts: turn transparent hugepages off, both the feature and its defrag behavior.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Make it durable through tuned ([vm] transparent_hugepages=never) or a kernel cmdline arg (transparent_hugepage=never), not a one-shot echo that evaporates on reboot.

5. CPU governors, C-states, and IRQ affinity

For latency-sensitive services, the enemy is the kernel saving power. Frequency scaling and deep C-states add tens to hundreds of microseconds of wakeup jitter.

# Which driver and governor are active?
cpupower frequency-info
# Per-core current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set the performance governor to pin cores at max frequency (the cleanest way is via tuned’s latency-performance, but cpupower does it directly):

cpupower frequency-set -g performance

C-states are the bigger latency lever. When a core idles into a deep C-state (C6), the exit latency to wake it can dwarf your service’s processing time. The robust, supported way to cap idle depth is the PM QoS interface that tuned’s force_latency drives — request a maximum tolerable CPU wakeup latency in microseconds:

# Hold /dev/cpu_dma_latency open requesting <=10us wake latency.
# The kernel then refuses C-states whose exit latency exceeds the request.
# Keep the process running; closing the fd releases the constraint.
exec 4<>/dev/cpu_dma_latency
printf '\x0a\x00\x00\x00' >&4   # 10 (us) as a 32-bit little-endian int

In practice you let tuned own this with force_latency=10 in a [cpu] section rather than scripting the device by hand. For boot-time guarantees on dedicated latency nodes, the kernel cmdline arg intel_idle.max_cstate=1 (Intel) or processor.max_cstate=1 is the blunt, permanent instrument.
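
Before choosing a force_latency value it helps to see which idle states would actually be excluded. The kernel exposes each state's name and exit latency (in microseconds) under the cpuidle sysfs directory; the sketch below reads them for one CPU. cstates_over_budget is a hypothetical helper name.

```shell
# cstates_over_budget BUDGET_US [CPUIDLE_DIR]
# Print idle states whose exit latency exceeds BUDGET_US microseconds.
# CPUIDLE_DIR defaults to cpu0; the helper name is illustrative.
cstates_over_budget() {
  for state in "${2:-/sys/devices/system/cpu/cpu0/cpuidle}"/state*; do
    [ -r "$state/latency" ] || continue
    lat=$(cat "$state/latency")
    if [ "$lat" -gt "$1" ]; then
      printf '%s exit_latency_us=%s\n' "$(cat "$state/name")" "$lat"
    fi
  done
}

# States listed here are the ones a 10 us latency request would rule out.
cstates_over_budget 10
```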

IRQ affinity matters once a single NIC queue’s softirq saturates one core. Spread interrupts off your application cores:

# See which CPUs handle each IRQ and the per-CPU interrupt counts
less /proc/interrupts

# Let irqbalance distribute, OR pin a NIC queue's IRQ to a specific CPU mask.
# If irqbalance is running, exclude the IRQ (for example with its --banirq
# option) or it may rewrite the mask. Example: pin IRQ 142 to CPU2 (mask 0x4).
echo 4 > /proc/irq/142/smp_affinity

The principle: keep NIC interrupt handling, the kernel softirq, and the application thread on cores that share an L3 cache, but keep the interrupt work itself off the specific cores running latency-critical user code. On NUMA boxes that means same-socket, which is the next section.
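
Affinity masks are bitmaps with one bit per CPU, which is easy to get wrong by hand. The sketch below builds the hexadecimal mask from a list of CPU numbers; cpu_mask is a hypothetical helper, and because it relies on shell integer arithmetic it only covers CPUs 0 to 62.

```shell
# cpu_mask CPU...
# Print the hexadecimal smp_affinity mask for the given CPU numbers.
# Sketch only: shell arithmetic limits this to CPUs 0-62.
cpu_mask() {
  mask=0
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

cpu_mask 2          # prints 4
cpu_mask 0 1 2 3    # prints f
```

On larger machines it is usually simpler to write a CPU list such as 0-3 to /proc/irq/<n>/smp_affinity_list and skip the mask arithmetic entirely.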

6. NUMA awareness and memory locality

On multi-socket servers, memory attached to a remote socket costs 1.5x-2x the access latency of local memory. A process scheduled on socket 1 reaching into socket 0’s DIMMs pays that tax on every miss. This is the single most common invisible regression on large machines.

# Topology: nodes, the CPUs in each, and the inter-node distance matrix
numactl --hardware

# Per-node hit/miss stats. numa_miss / numa_foreign climbing = remote-memory pain.
# (Plain numastat prints these counters; numastat -m prints per-node meminfo instead.)
numastat
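
The numa_hit and numa_miss counters are also exposed per node in sysfs, which makes it easy to turn them into a ratio. The following is a sketch; numa_miss_pct is a hypothetical helper. The counters are cumulative since boot, so for a live diagnosis compare two samples taken under load rather than reading the absolute value.

```shell
# numa_miss_pct [NUMASTAT_FILE]
# Share of page allocations satisfied on this node that were intended for
# another node: numa_miss / (numa_hit + numa_miss) * 100
numa_miss_pct() {
  awk '$1 == "numa_hit"  { hit = $2 }
       $1 == "numa_miss" { miss = $2 }
       END { if (hit + miss > 0) printf "%.2f\n", 100 * miss / (hit + miss) }' \
    "${1:-/sys/devices/system/node/node0/numastat}"
}

for f in /sys/devices/system/node/node*/numastat; do
  if [ -r "$f" ]; then
    printf '%s miss_pct=%s\n' "$f" "$(numa_miss_pct "$f")"
  fi
done
```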

For a single dominant process (a database, a JVM, a packet processor), pin both its CPUs and its memory to one node so allocations stay local:

# Run on node 0's CPUs, allocate only from node 0's memory
numactl --cpunodebind=0 --membind=0 /usr/lib/postgresql/16/bin/postgres -D /data/pg

# Inspect a running process's NUMA placement
numastat -p $(pgrep -f postgres | head -1)

For services that legitimately span sockets, --interleave=all spreads allocations evenly so no single node’s memory bandwidth becomes the bottleneck — the right call for big in-memory caches:

numactl --interleave=all /usr/bin/redis-server /etc/redis/redis.conf

Beware vm.zone_reclaim_mode. On older kernels it defaulted to on for some NUMA topologies, causing the kernel to aggressively reclaim local pagecache rather than allocate one remote page — devastating for file-cache-heavy workloads. Confirm it is 0 (sysctl vm.zone_reclaim_mode); leave it off unless you have a specific, measured reason.

The systemd-native way to bind a managed service is cleaner than wrapping ExecStart in numactl:

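# /etc/systemd/system/postgresql.service.d/numa.conf
[Service]
# NUMAPolicy/NUMAMask set the memory policy only
NUMAPolicy=bind
NUMAMask=0
# Derive the CPU mask from NUMAMask (needs a systemd release that supports it)
CPUAffinity=numa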

7. Block I/O schedulers and queue depth per device class

The scheduler that is right for a spinning disk is wrong for NVMe, and vice versa. Modern kernels ship the multi-queue block layer (blk-mq) with three relevant schedulers:

Scheduler | Best for | Why
none | NVMe / fast SSD | The device has deep internal queues and reorders better than the OS; software scheduling only adds latency
mq-deadline | SATA SSD, mixed, predictable latency | Lightweight, bounds worst-case latency with read/write deadlines
bfq | Desktops, latency-fairness across competing apps | Proportional-share fairness; CPU cost too high for high-IOPS server paths

Check and set per device — and know that /sys shows the active choice in brackets:

# Current scheduler for nvme0n1 (active one is in [brackets])
cat /sys/block/nvme0n1/queue/scheduler

# Set mq-deadline on a SATA SSD at runtime
echo mq-deadline > /sys/block/sda/queue/scheduler
# Set none on NVMe
echo none > /sys/block/nvme0n1/queue/scheduler

Runtime echoes do not survive reboot. Make the choice durable and device-class-aware with a udev rule, so a node with mixed media gets the right scheduler on each:

# /etc/udev/rules.d/60-ioscheduler.rules
# DEVTYPE=="disk" skips partitions; sd[a-z]* also covers sdaa and beyond.
# NVMe -> none
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
# Non-rotational SATA/SAS (SSD) -> mq-deadline
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# Rotational (HDD) -> mq-deadline as well; bfq only if latency-fairness matters
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
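
A quick audit can confirm that every device ended up with the scheduler the rules above intend: none for NVMe, mq-deadline for sd devices. This is a sketch of such a check; check_schedulers is a hypothetical helper, and the optional argument lets it run against a copy of the sysfs tree.

```shell
# check_schedulers [SYSFS_BLOCK_ROOT]
# Print one line per device whose active scheduler (the [bracketed] entry)
# differs from the intended one. Silent when everything matches.
check_schedulers() {
  for q in "${1:-/sys/block}"/*/queue; do
    [ -r "$q/scheduler" ] || continue
    dev=$(basename "$(dirname "$q")")
    case "$dev" in
      nvme*) want=none ;;
      sd*)   want=mq-deadline ;;
      *)     continue ;;
    esac
    have=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$q/scheduler")
    [ "$have" = "$want" ] || printf '%s: have=%s want=%s\n' "$dev" "$have" "$want"
  done
}

check_schedulers
```

Run it after a reboot, or after reloading the rules with udevadm control --reload and re-triggering block devices with udevadm trigger, to confirm the rule file is the thing setting the value.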

Two supporting knobs per device:

echo 256  > /sys/block/nvme0n1/queue/read_ahead_kb   # OLTP-leaning
echo 4096 > /sys/block/sdb/queue/read_ahead_kb       # sequential-scan-leaning

Verify

Tuning without re-measurement is just hoping. After each change, re-run the same baseline and compare deltas, not absolute feelings.

# 1. Confirm the profile and every knob it sets are actually live
tuned-adm active
tuned-adm verify

# 2. Spot-check the individual sysctls you changed
sysctl vm.dirty_bytes vm.swappiness net.ipv4.tcp_congestion_control

# 3. Confirm per-device scheduler stuck after reboot
for d in /sys/block/{sd,nvme}*; do
  [ -r "$d/queue/scheduler" ] || continue   # skip patterns that matched nothing
  printf '%s: %s\n' "$(basename "$d")" "$(cat "$d"/queue/scheduler)"
done

# 4. Re-run the identical fio baseline and diff p99 against the recorded number
fio --name=verify --filename=/data/fiotest --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
    --ioengine=libaio --group_reporting --percentile_list=50:95:99:99.9

# 5. Watch PSI under the SAME load: io 'full avg10' and cpu 'some avg10' should not
# have risen (system-wide cpu 'full' is not meaningful and reads as zero on current kernels)
watch -n1 'cat /proc/pressure/io /proc/pressure/cpu'

A change that improves throughput but raises p99 latency or PSI saturation is usually a regression in disguise for an interactive service. Decide which dimension you are optimizing before you look at the numbers, or you will rationalize whatever you got.
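
Comparing deltas is easier when the change is expressed as a signed percentage against the recorded baseline. A minimal sketch follows; pct_change is a hypothetical helper and the sample values are made up for illustration.

```shell
# pct_change BASELINE NEW
# Signed percentage change of NEW relative to BASELINE.
pct_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%+.1f\n", (new - base) / base * 100 }'
}

pct_change 1.25 1.00   # illustrative p99 in ms, before and after; prints -20.0
```

For latency a negative number is an improvement and for IOPS a positive one is, so label each metric with the direction you want before comparing.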

Enterprise scenario

A payments platform team ran their core PostgreSQL fleet on dual-socket Epyc servers with NVMe and 512 GB RAM. After a hardware refresh, p99 query latency got worse despite faster CPUs and disks. The on-call narrative was “the new NVMe firmware is bad.”

The USE method told a different story. iostat showed NVMe %util under 20% with sub-millisecond await: storage was idle. But the system-wide numastat counters showed numa_foreign climbing into the millions. The scheduler was spreading backends across both sockets while shared_buffers lived on node 0, so half the connections were doing every buffer access across the inter-socket link. The “slow disk” was actually remote-memory latency, multiplied by a higher core count that spread work wider across NUMA nodes than the old box.

The fix was locality, not faster hardware. They bound each postgres instance to a single NUMA node and let the second socket host a second instance, turning one cross-NUMA database into two node-local ones:

# /etc/systemd/system/postgresql@.service.d/numa.conf
# Templated unit: postgresql@0 binds node 0, postgresql@1 binds node 1.
[Service]
NUMAPolicy=bind
NUMAMask=%i

They also set vm.zone_reclaim_mode=0 (it had silently re-enabled on the new kernel for this topology) and confirmed THP never. p99 dropped 38% and, crucially, the variance collapsed — the tail got predictable. No firmware was changed. The lesson the team wrote into their runbook: on multi-socket hardware, prove memory locality before blaming any other resource.
