Methodical Linux Performance Tuning: tuned, sysctl, and I/O Schedulers

Most “performance tuning” you find online is a list of sysctl values copied from a 2014 forum post, applied without measurement, on hardware and kernels that no longer resemble the original. That is not tuning — it is cargo culting, and it regresses as often as it helps. Half of those pasted lines set knobs whose defaults are already correct on a modern kernel; a good fraction set knobs that were removed years ago (tcp_tw_recycle is the canonical landmine); and almost none of them were ever measured against a baseline on the machine they now run on. Real tuning is a loop: measure a baseline, form a hypothesis about the bottleneck, change one class of parameter, measure again, keep it only if it helped, and revert cleanly if it did not. Everything in this article is in service of that loop.

This is an Expert-level reference for tuning a modern distribution — kernel 5.15+ or 6.x, tuned 2.18+, util-linux, ethtool, and the bcc/bpftrace toolchain — end to end. It covers the layers that actually determine latency and throughput on a real server: tuned as the coherent orchestration layer; the virtual-memory subsystem (dirty ratios, swappiness, THP, hugepages); the CPU power and topology story (governors, C-states, IRQ affinity, isolcpus/nohz_full, NUMA locality); the block I/O layer (blk-mq schedulers per device class, queue depth, readahead); the network stack (socket buffers sized from bandwidth-delay product, backlogs, congestion control, NIC offloads, RSS/RPS, ring buffers); and the observability toolkit (perf, bpftrace, sar, PSI) that tells you the truth. Each layer is enumerated exhaustively — every knob, its default, when to change it, the trade-off, and the gotcha — in scannable tables, backed by real commands you can paste.

The unifying discipline is the USE method (Brendan Gregg): for every resource — CPU, memory, network, disk — check Utilization, Saturation, and Errors before you touch a single tunable. By the end you will localise a slow service to one resource with vmstat, iostat, mpstat, sar and /proc/pressure, change one coherent bundle through a tuned child profile, run tuned-adm verify, re-run the identical benchmark, and keep the change only if the number you decided to optimise — p99 latency, IOPS, throughput, or PSI saturation — actually moved.

What problem this solves

A server is “slow” and everyone has a theory. The database team blames the disk; the network team blames the app; someone opens a ticket for a “bad NVMe firmware.” Meanwhile the actual cause — remote-NUMA memory access, a NIC dumping every interrupt on CPU 0, transparent hugepage defrag stalling a latency-critical thread, or a TCP window three orders of magnitude too small for a 10 GbE fat pipe — sits there, perfectly measurable, unmeasured. The pain is real and expensive: tail latency that trips SLOs, throughput that plateaus below line rate, p99 that spikes without pattern, and a fleet that behaves differently after a hardware refresh because the tuning was implicit in the old topology.

What breaks without a method: engineers paste a “10 sysctls every server needs” list, restart, and either see no change (the defaults were already right) or a regression they cannot attribute (they changed twelve things at once). They set vm.swappiness=0 believing it disables swap (it does not) and get OOM-killed sooner. They try to enable net.ipv4.tcp_tw_recycle from a stale guide: on old kernels it broke clients behind a NAT, and since its removal in kernel 4.12 the line simply fails to apply. They set a spinning-disk I/O scheduler on NVMe and add latency the device would never have had. Without a baseline none of this is caught: the change is judged by feeling, and feeling rationalises whatever it got.

Who hits this: anyone running latency-sensitive services (databases, RPC fan-out, trading, real-time media), high-throughput pipes (storage backends, streaming, bulk transfer), or multi-socket hardware where NUMA locality is invisible until it bites. It hits hardest on large-memory hosts (the default 20% dirty ratio becomes a 50 GB writeback cliff), on multi-queue NICs at 25/100 GbE (a single queue saturates one core), and on modern NVMe (where the OS scheduler is pure overhead). The fix is almost never “add hardware” — it is “measure, find the ceiling you actually hit, change the one class of knob that governs it, and prove the change with the same benchmark.”

To frame the whole field before the deep dive, here is every resource this article tunes, the USE signal that tells you it is the bottleneck, the primary knob layer, and the first tool to reach for:

Resource	Utilization signal	Saturation signal	The knob layer	First tool
CPU	`%usr`+`%sys` per core (`mpstat`)	run-queue `r > nr_cpus` (`vmstat`), `cpu some` PSI	Governor, C-states, IRQ/`isolcpus`, NUMA	`mpstat -P ALL`, `/proc/pressure/cpu`
Memory	`used`/`available` (`free`)	`si/so` swap-in/out, `mem some` PSI, reclaim	swappiness, dirty ratios, THP, hugepages	`free -m`, `/proc/pressure/memory`, `sar -B`
Disk / block I/O	`%util` per device (`iostat`)	`aqu-sz` queue depth, `io full` PSI	Scheduler, `nr_requests`, `read_ahead_kb`	`iostat -xz`, `/proc/pressure/io`
Network	throughput vs line rate	drops, backlog, retransmits	Socket buffers, backlogs, offloads, RSS/RPS	`ss`, `ethtool -S`, `sar -n DEV`
Interrupts / softirq	per-CPU IRQ rate	`%soft` pinned on one core; `NET_RX` drops	IRQ affinity, RSS/RPS, ring buffers	`/proc/interrupts`, `mpstat -I SCPU`

Learning objectives

By the end of this article you can:

Establish a defensible baseline with the USE method — utilization, saturation (PSI, run-queue, queue depth), and errors per resource — and record the numbers every later change is judged against.
Drive tuned as the orchestration layer: list/recommend/activate profiles, read what each stock profile actually changes, author a custom child profile with include=, apply it, and prove it took with tuned-adm verify.
Choose and set the right CPU posture: performance vs schedutil governors, capping C-states via the PM-QoS force_latency mechanism, spreading IRQ affinity, and isolating cores with isolcpus/nohz_full/rcu_nocbs for latency nodes.
Prove and fix NUMA locality with numactl, numastat and lstopo — bind, interleave, or zone_reclaim_mode-fix a cross-socket regression.
Size the network stack correctly: compute socket buffers from bandwidth-delay product, tune somaxconn/tcp_max_syn_backlog/netdev_max_backlog, pick a congestion-control algorithm (bbr vs cubic), and manage NIC offloads, RSS/RPS and ring buffers with ethtool.
Tune the virtual-memory subsystem: dirty ratios vs _bytes on large-RAM hosts, vm.swappiness (and why 0 ≠ off), THP (never/madvise/always), and reserved hugepages for databases.
Select the correct block I/O scheduler per device class (none/mq-deadline/bfq/kyber) and make it durable with a udev rule; tune nr_requests and read_ahead_kb to the access pattern.
Use perf, bpftrace and sar to attribute a bottleneck to a specific function, syscall, or off-CPU stall — and to validate that a change moved the metric you chose to optimise.

Prerequisites & where this fits

You should be comfortable as a Linux administrator: systemd units and drop-ins, /proc and /sys layout, editing files under /etc/sysctl.d/ and /etc/tuned/, and reading dmesg/journalctl. You should know basic hardware terms — cores vs threads (SMT/Hyper-Threading), sockets, NUMA nodes, NVMe vs SATA — and basic networking (TCP handshake, RTT, MTU). Root or sudo is assumed; several knobs live in /sys and /proc/sys and require it. A test machine you can push under representative load, and a benchmark that resembles your real workload (fio for storage, a closed-loop generator for RPC), are essential — you cannot tune what you cannot measure.

This sits in the Servers / Administration track and is downstream of the platform basics. It pairs tightly with Mastering systemd: Units, Timers, Resource Control, and Service Hardening — because systemd cgroups (CPUAffinity, NUMAPolicy, IOSchedulingClass) are increasingly the right place to express per-service tuning. It complements Modern Linux Networking: Bonding, VLANs, and Firewalls with nftables and firewalld for the interface layer beneath the socket tuning here, and Building a Linux Audit Trail with auditd and eBPF Runtime Visibility for the eBPF tooling that also powers bpftrace observability. For the architectural framing of why you tune — right-sizing, caching, load testing — see Well-Architected Performance Efficiency Pillar: Right-Sizing, Caching, and Load Testing, and for the metrics discipline behind the numbers, Monitoring and Observability Basics: Logs, Metrics, and Traces.

A quick map of who owns which layer during a performance incident, so you localise fast and call the right person:

Layer	What lives here	Who usually owns it	Regressions it causes
Hardware / firmware	CPU, DIMMs, NIC, NVMe, BIOS power profile	Platform / DC	BIOS power-save throttling, wrong NUMA interleave
Kernel / cmdline	`isolcpus`, `intel_idle.max_cstate`, `hugepages`, `iommu`	OS / platform	Boot-time posture; irreversible without reboot
tuned profile	Coherent sysctl+CPU+disk+THP bundle	OS / platform	A stock profile overriding your intent silently
sysctl drop-ins	VM, net, fs kernel parameters	OS / app	Copied constants; conflicts between drop-ins
NIC / ethtool	Offloads, RSS/RPS, ring buffers, IRQ coalescing	Network / OS	Interrupts on one core; RX drops from small rings
Application	Threads, connection reuse, buffer sizes, NUMA-awareness	App / dev	Per-request sockets, cross-NUMA threads

Core concepts

Six mental models make every later decision obvious.

You cannot claim a win without a number to beat. Tuning is a control loop, not a checklist. Every change is a hypothesis (“the dirty ratio is causing writeback stalls”) tested against a recorded baseline (workload, kernel, active profile, p99, IOPS, PSI) by changing exactly one class of knob and re-running the identical benchmark. If you change three things at once and the result gets faster, you have learned nothing: you cannot tell which change helped, so you cannot revert the two that hurt while keeping the one that helped. One class of knob per iteration is the entire discipline.

The USE method localises the bottleneck before you touch a knob. For each resource ask, in order: is it highly Utilized (busy), is it Saturated (work queueing because it is full), are there Errors? Saturation is the strongest signal — a resource can be 100% utilized and fine (a batch job wants that), but saturation means demand exceeds capacity and something is waiting. On modern kernels, Pressure Stall Information (PSI) in /proc/pressure/{cpu,io,memory} is the most honest saturation metric: some avg10 is the percent of time at least one task stalled in the last 10 s; full avg10 is the percent of time every non-idle task stalled. A rising io full avg10 means the whole machine is blocked on storage — no interpretation required.

tuned is the right orchestration layer; hand-editing single knobs fights it. tuned sets dozens of coherent knobs — sysctl, CPU governor, disk scheduler, THP, IRQ posture — as a single named, revertible bundle chosen for a workload class. Hand-edit /etc/sysctl.conf while a tuned profile is active and you fight whatever the profile already applied; the two can silently disagree. The correct pattern is to let tuned own the bundle and express your measured deltas as a child profile that include=s a stock one and overrides only what you have a reason to change — then tuned-adm verify to prove nothing else overrode you.

Power management trades latency for watts, and defaults favour watts. A modern CPU scales frequency down (cpufreq governors) and drops idle cores into deep sleep (C-states). Both add wakeup jitter — tens to hundreds of microseconds to bring a core back to full speed from a deep C-state. Free efficiency for a batch job; a tail-latency generator for a latency-critical service. The performance governor pins frequency high; capping C-states (via the PM-QoS /dev/cpu_dma_latency interface that tuned’s force_latency drives, or the blunt intel_idle.max_cstate=1 cmdline) bounds wakeup latency. You buy predictable latency with power and heat.

On multi-socket hardware, memory has a locality tax. Remote-NUMA memory costs roughly 1.5×–2× the access latency of local memory, at lower bandwidth. A thread on socket 1 reaching into socket 0’s DIMMs pays that tax on every cache miss. This is the most common invisible regression on large machines — it shows up not as high disk or CPU utilization but as latency that a higher core count makes worse by spreading work wider. numactl binds CPU and memory to a node; numastat proves remote-access pain (numa_foreign/numa_miss climbing).

The right knob for one device is wrong for another. There is no universal I/O scheduler, readahead value, or TCP buffer size — the correct value is a function of the device (NVMe vs SATA SSD vs HDD) and the workload (random OLTP vs sequential scan; short vs long-RTT pipe). Modern kernels use the multi-queue block layer (blk-mq) where none fits NVMe (the device reorders better than the OS) and mq-deadline fits SATA. TCP buffers must be sized from bandwidth-delay product (BDP = bandwidth × RTT), not a magic constant. Per-device, per-workload, measured — never one number for the whole fleet.
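To make the BDP rule concrete, here is a minimal sketch of the arithmetic in shell. The function name `bdp_bytes` and the two example links are illustrative assumptions, not values from any particular deployment; the result is the number of bytes that must be in flight to fill the pipe, which is the starting point for sizing the maximum socket buffer.

```shell
# bdp_bytes <bandwidth_mbit_per_s> <rtt_ms>
# BDP = bandwidth x RTT, converted from bits to bytes:
#   (Mbit/s * 1e6 / 8) bytes/s * (rtt_ms / 1000) s = Mbit/s * rtt_ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

bdp_bytes 10000 50    # 10 Gbit/s at 50 ms RTT  -> 62500000 (about 62.5 MB)
bdp_bytes 1000 2      # 1 Gbit/s at 2 ms RTT    -> 250000   (about 250 KB)
```

The two results differ by a factor of 250, which is why a single fleet-wide buffer constant is either wasteful on the short path or starving on the long one.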

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this is the mental model side by side.

Concept	One-line definition	Where it lives	Why it matters to performance
USE method	Check Utilization, Saturation, Errors per resource	Methodology	Localises the bottleneck before you tune
PSI	Pressure Stall Information: `some`/`full` stall time	`/proc/pressure/{cpu,io,memory}`	The most honest saturation signal (kernel 4.20+)
tuned	Orchestrator setting coherent knob bundles by profile	`/etc/tuned`, `tuned-adm`	Revertible, named, verifiable tuning
sysctl	Runtime kernel parameters (`net.`, `vm.`, `fs.*`)	`/proc/sys`, `/etc/sysctl.d`	The individual knobs tuned and you set
cpufreq governor	Policy choosing CPU frequency	`/sys/.../cpufreq/scaling_governor`	`performance` = no frequency jitter
C-state	CPU idle sleep depth (C0 active … C6 deep)	`/sys/.../cpuidle`, `intel_idle`	Deep C-state exit latency = wakeup jitter
IRQ affinity	Which CPUs service a device’s interrupts	`/proc/irq/<n>/smp_affinity`	Keeps softirq off latency-critical cores
NUMA node	A socket’s local CPUs + memory	`numactl --hardware`	Remote memory = 1.5–2× latency tax
isolcpus / nohz_full	Cores removed from the scheduler / tickless	Kernel cmdline	Dedicated cores for latency-critical threads
blk-mq scheduler	Multi-queue block I/O scheduler	`/sys/block/<dev>/queue/scheduler`	`none` for NVMe, `mq-deadline` for SATA
THP	Transparent Hugepages (auto 2 MB pages)	`/sys/kernel/mm/transparent_hugepage`	Defrag stalls hurt latency-sensitive DBs
Hugepages	Explicitly reserved large pages (2 MB / 1 GB)	`vm.nr_hugepages`, `hugetlbfs`	Fewer TLB misses for big-memory DBs
BDP	Bandwidth-Delay Product = bandwidth × RTT	Calculation	The correct in-flight TCP buffer size
Congestion control	Algorithm pacing TCP send rate	`net.ipv4.tcp_congestion_control`	`bbr` models bandwidth and RTT, so it tolerates random loss; `cubic` backs off on every loss
RSS / RPS	Receive-Side Scaling (HW) / Packet Steering (SW)	NIC / `rps_cpus`	Spreads packet processing across cores
Ring buffer	NIC RX/TX descriptor queue	`ethtool -g`	Too small → RX drops under burst
PM QoS	Latency tolerance request to the kernel	`/dev/cpu_dma_latency`	Caps C-state depth (tuned `force_latency`)

1. Establish a baseline with the USE method

Before changing anything, capture the system under representative load and write the numbers down — workload, kernel version, active tuned profile, and headline metrics. First, the saturation signals (queueing — work waiting because a resource is full):

# CPU + run-queue saturation. Watch 'r' (runnable) and 'b' (blocked) columns,
# plus 'wa' (iowait) under 'cpu'. r consistently > nr_cpus means CPU saturation;
# high 'b' + 'wa' means tasks blocked on I/O.
vmstat 1 10

# Per-CPU utilization and steal time (steal = a hypervisor took your cycles).
# %idle near 0 on one CPU while others idle = a single-core bottleneck (softirq/IRQ).
mpstat -P ALL 1 5

# Per-device disk: %util (utilization), aqu-sz (avg queue depth = saturation),
# r_await / w_await (latency in ms). High await with modest %util points at the device;
# high %util + high aqu-sz means the queue is the bottleneck.
iostat -xz 1 5

# Memory pressure and reclaim: pgscank/pgscand (kswapd/direct reclaim scanning) rising
# = memory pressure. Swap traffic (si/so in vmstat, or sar -W) should be ~0 on a healthy server.
sar -B 1 5
free -m

# Network throughput, drops and errors per interface
sar -n DEV 1 5
sar -n EDEV 1 5   # errors and drops specifically

Then read PSI — the north-star saturation metric on any kernel 4.20+:

# 'some' = at least one task stalled; 'full' = ALL non-idle tasks stalled.
# avg10/avg60/avg300 are % of time stalled over 10s/60s/5min windows; 'total' is a counter (us).
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory

Treat the PSI averages as your headline numbers. For latency-sensitive work, capture a wall-clock distribution of your actual workload if you can; for storage microbenchmarks, fio with a percentile list is the reference:

# Storage baseline: 4k random read, direct I/O (bypass pagecache), queue depth 32, 60s.
# Report p99/p99.9 latency, not just IOPS — averages hide tail pain.
fio --name=baseline --filename=/data/fiotest --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --group_reporting --runtime=60 --time_based \
    --ioengine=io_uring --percentile_list=50:95:99:99.9

# Record the headline numbers and the environment:
uname -r                    # kernel
tuned-adm active            # current profile
nproc; free -g              # cores and RAM
numactl --hardware | head   # NUMA topology
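Once the baseline is written down, the keep-or-revert decision can be made mechanical rather than left to feeling. The sketch below assumes a lower-is-better metric such as p99 latency; the function names `pct_change` and `verdict`, the sample numbers, and the 5% noise threshold are illustrative assumptions, and a realistic threshold should come from the run-to-run variance of your own benchmark.

```shell
# pct_change <baseline> <after>: signed percent change relative to the baseline.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# verdict <baseline> <after> <min_gain_pct>
# For a lower-is-better metric: keep the change only if it improved by at
# least min_gain_pct; anything smaller is treated as noise and reverted.
verdict() {
    awk -v b="$1" -v a="$2" -v t="$3" \
        'BEGIN { if ((b - a) / b * 100 >= t) print "keep"; else print "revert" }'
}

pct_change 4.20 3.10    # hypothetical p99 in ms, before and after
verdict 4.20 3.10 5     # clear improvement
verdict 4.20 4.15 5     # within the noise threshold
```

For a higher-is-better metric such as IOPS, swap the two value arguments so that an increase counts as the gain.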

The metric that matters is not “faster” but which dimension you decided to optimise. Decide before you look. Here are the USE signals per resource, the exact command, and what “bad” looks like — the reference you scan at the start of every investigation:

Resource	Utilization command	Saturation command	Error command	“Bad” looks like
CPU	`mpstat -P ALL 1` (`%idle`)	`vmstat 1` (`r` col); `/proc/pressure/cpu`	`mpstat` `%steal` (hypervisor)	`r` > nr_cpus sustained; `cpu some avg10` high
Memory	`free -m`; `sar -r 1`	`vmstat 1` (`si`/`so`); `/proc/pressure/memory`	`dmesg` OOM; `/proc/vmstat` `oom_kill`	`si`/`so` > 0; `mem full avg10` > 0
Disk	`iostat -xz 1` (`%util`)	`iostat` (`aqu-sz`); `/proc/pressure/io`	`iostat` errors; `smartctl`	`aqu-sz` high + `await` climbing; `io full`
Network	`sar -n DEV 1` (rx/tx)	`ss -ti` (cwnd, retrans); backlog	`sar -n EDEV 1`; `ethtool -S` drops	rxdrop/txdrop > 0; retransmits climbing
File descriptors	`cat /proc/sys/fs/file-nr`	(near `fs.file-max`)	`EMFILE`/`ENFILE` in app logs	allocated near `file-max`
Sockets	`ss -s` (summary)	TIME_WAIT count; ephemeral range	`netstat -s` (listen drops)	listen overflows; port range exhausted

PSI as a first-class metric

PSI changes how you triage. Before PSI, distinguishing “the CPU is busy doing useful work” from “tasks are stalled waiting” required correlating run-queue length, iowait, and reclaim counters by hand; PSI collapses that into one honest number. The full line is the killer signal: full avg10 = 40 on /proc/pressure/io means that for 40% of the last ten seconds every non-idle process was blocked on I/O — a machine on its knees, no matter what “the disk is only 60% utilized” suggests. Wire PSI into monitoring and alerting; it is a leading indicator that a ceiling is being hit, often before utilization looks alarming.

PSI file	`some avgN` means	`full avgN` means	Alert threshold (starting point)
`/proc/pressure/cpu`	≥1 runnable task waiting for a CPU	(n/a system-wide; kernels 5.13+ print the line as zeros)	`some avg60 > 20` sustained
`/proc/pressure/io`	≥1 task blocked on I/O	every non-idle task blocked on I/O	`full avg60 > 5` = investigate
`/proc/pressure/memory`	≥1 task stalled in reclaim	every non-idle task stalled reclaiming	`full avg60 > 0` = memory-bound
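Wiring these thresholds into a check is mostly a matter of pulling one field out of a PSI file. The following is a small sketch under stated assumptions: the helper names `psi_avg` and `psi_over` are hypothetical, and the threshold of 5 simply reuses the starting point from the table above.

```shell
# psi_avg <some|full> <avg10|avg60|avg300> [file]
# Prints the requested average from a PSI file (stdin when no file is given).
psi_avg() {
    awk -v kind="$1" -v key="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }' "${3:--}"
}

# psi_over <value> <threshold>: succeeds when value exceeds threshold.
psi_over() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Example check, only on kernels that expose PSI.
if [ -r /proc/pressure/io ]; then
    full60=$(psi_avg full avg60 /proc/pressure/io)
    if psi_over "$full60" 5; then
        echo "io full avg60=$full60: investigate storage"
    fi
fi
```

The same two helpers work for the cpu and memory files; only the file path and the threshold change.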

2. tuned: profiles, custom profiles, and tuned-adm

tuned is the right starting layer because it sets dozens of coherent knobs — sysctl, CPU governor, disk scheduler, transparent hugepages, IRQ posture — as a single named, revertible bundle. Hand-editing /etc/sysctl.conf in isolation fights whatever the active profile already applied, and the two disagree silently.

tuned-adm list          # show available profiles (stock + any custom)
tuned-adm active        # what is applied now
tuned-adm recommend     # what tuned thinks fits this machine (VM? bare metal? laptop?)
tuned-adm profile_info  # summary of the active profile

The stock profiles map cleanly to workload classes. The -performance family gives up power saving in exchange for lower latency; the -throughput family sizes buffers up; the virtual-* profiles account for running as a guest under a hypervisor or as the host itself:

Profile	Intended use	Notable behaviour
`balanced`	General desktop/server default	Moderate power saving; `schedutil`/`powersave` governor
`throughput-performance`	Bulk compute, batch, databases	Governor `performance`, larger VM dirty ratios, `mq-deadline`/`none` disk, THP-friendly
`latency-performance`	Low-latency services	`force_latency=cstate.id_no_zero:1` caps C-states, governor `performance`; does not touch THP (`network-latency` disables it)
`network-latency`	Trading, RPC fan-out	Builds on latency-performance; disables THP, sets `busy_read`/`busy_poll`, lowers `net.core` coalescing
`network-throughput`	Bulk transfer, streaming	Large TCP buffers/backlogs; `net.core.rmem_max`/`wmem_max` raised
`virtual-guest`	VMs (guest side)	Higher `vm.dirty_ratio`, `elevator` sensible under a hypervisor
`virtual-host`	Hypervisor host (KVM)	THP on for guest backing, `sched_migration_cost` raised
`powersave`	Battery / density	Aggressive C-states and frequency scaling
`hpc-compute`	HPC nodes	`performance` governor, `numa_balancing` off, hugepages-friendly

To see exactly what a stock profile changes — never guess, read the source:

# Every stock profile lives here; read the one you are about to inherit
cat /usr/lib/tuned/latency-performance/tuned.conf
cat /usr/lib/tuned/throughput-performance/tuned.conf

# What plugins/units a profile touches (sysctl, cpu, disk, vm, sysfs, bootloader)
tuned-adm profile_info throughput-performance

Authoring a custom child profile

Never edit a stock profile in place — it is owned by the package and will be overwritten on upgrade. Instead create a child profile in /etc/tuned/<name>/tuned.conf that inherits with include= and overrides only the deltas you have measured a reason for. The child profile is where all your tuning lives, version-controlled and revertible:

# /etc/tuned/kv-postgres/tuned.conf
[main]
summary=PostgreSQL 16 on NVMe, dual-socket, derived from throughput-performance
include=throughput-performance

[sysctl]
# Writeback: fixed-byte thresholds on a large-RAM host (see VM section)
vm.dirty_background_bytes=268435456   # start background flush at 256 MB dirty
vm.dirty_bytes=1073741824             # hard-block writers at 1 GB dirty
vm.swappiness=1                       # avoid swap but allow under true duress
vm.zone_reclaim_mode=0                # never reclaim local pagecache to avoid remote alloc
# Network for a busy DB accepting many short-lived connections
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=8192

[vm]
transparent_hugepages=never           # Postgres manages its own buffers; THP defrag stalls hurt

[cpu]
governor=performance
# Cap C-states for predictable wakeup latency (leave C1, disallow deep sleep)
force_latency=cstate.id_no_zero:1

[disk]
# devices= takes shell-style globs on the kernel device name; set scheduler + readahead per class
devices=nvme*n*
elevator=none
readahead=256

[bootloader]
# Also disable THP from early boot (kernel cmdline changes need a reboot to take effect)
cmdline=transparent_hugepage=never

Apply and confirm it took effect:

tuned-adm profile kv-postgres
tuned-adm verify           # asserts EVERY setting in the profile is actually live

tuned-adm verify is the step almost everyone skips. It re-reads each knob and tells you if something else on the box — a competing systemd unit, a stale /etc/sysctl.d/ drop-in, an elevator= kernel arg — overrode your profile. If verify fails it names the setting; fix the conflict before you trust any later measurement. The [main] include= chain is transitive, so kv-postgres → throughput-performance → (base); you only write the leaf deltas.
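The apply-then-verify habit can be wrapped so that a failed verification restores the previous profile automatically. This is a sketch, not part of tuned itself: the function names are hypothetical, it assumes `tuned-adm active` prints a line of the form `Current active profile: <name>`, and it assumes `tuned-adm verify` exits non-zero when a setting does not match. Check both assumptions against your tuned version before relying on it.

```shell
# TUNED_ADM can be overridden (for example with a stub) for dry runs.
TUNED_ADM="${TUNED_ADM:-tuned-adm}"

# current_profile: name(s) of the active profile, parsed from the
# "Current active profile: <name>" line (assumed output format).
current_profile() {
    "$TUNED_ADM" active | sed -n 's/^Current active profile: //p'
}

# apply_verified <profile>: apply, verify, and roll back on a failed verify.
apply_verified() {
    prev=$(current_profile)
    "$TUNED_ADM" profile "$1" || return 1
    if "$TUNED_ADM" verify; then
        echo "applied $1"
    else
        echo "verify failed; restoring ${prev:-<none>}" >&2
        # $prev is left unquoted on purpose: a merged setup lists several names.
        [ -n "$prev" ] && "$TUNED_ADM" profile $prev
        return 1
    fi
}
```

Typical use would be `apply_verified kv-postgres` followed by the benchmark run; a non-zero return means the box is back on its earlier profile and the conflict still needs fixing.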

The tuned plugin sections you will actually use, and what each controls:

Section	Controls	Example key(s)	Notes
`[main]`	Profile metadata + inheritance	`summary`, `include`	`include` is the whole point — inherit, don’t copy
`[sysctl]`	Any `/proc/sys` parameter	`vm.swappiness`, `net.core.somaxconn`	Applied via the sysctl plugin; verifiable
`[vm]`	Transparent hugepages	`transparent_hugepages=never`	Cleaner than echoing to `/sys`
`[cpu]`	Governor, C-state latency, energy-perf-bias	`governor`, `force_latency`, `energy_perf_bias`	`force_latency` drives PM-QoS
`[disk]`	Per-device scheduler + readahead + spindown	`devices`, `elevator`, `readahead`	`devices=` takes a glob that targets a device class
`[sysfs]`	Arbitrary `/sys` writes	`/sys/kernel/mm/...=...`	Escape hatch for knobs without a plugin
`[bootloader]`	Kernel cmdline args	`cmdline=isolcpus=... nohz_full=...`	Requires reboot; use for isolcpus/hugepages/cstate
`[net]`	NIC-level tuning via ethtool	`features`, `channels`, `coalesce`	Ring buffers, offloads, RSS channels
`[scheduler]`	CPU affinity / IRQ isolation	`isolated_cores`, `default_irq_smp_affinity`	Pin app and IRQs off isolated cores

The tuned-adm verbs you will use, and exactly what each does:

Command	What it does	When to run it
`tuned-adm active`	Prints the currently applied profile	First thing, always — know your starting posture
`tuned-adm list`	Lists all available profiles	Discovering options
`tuned-adm recommend`	Suggests a profile from hardware/virt detection	Fresh box; sanity-check the default
`tuned-adm profile <name>`	Applies a profile (persists across reboot)	Activating your custom profile
`tuned-adm profile <a> <b>`	Merge multiple profiles (later wins on conflict)	Combining e.g. `latency-performance mssql`
`tuned-adm verify`	Asserts every setting in the profile is live	After every apply — catches conflicts
`tuned-adm off`	Unloads the active profile and rolls back the settings tuned applied	Isolating whether tuned is the variable
`tuned-adm profile_info`	Shows the active profile’s summary/plugins	Understanding what a profile touches

3. Virtual memory: dirty ratios, swappiness, THP, hugepages

The VM subsystem decides when dirty (modified, not-yet-written) pages flush to disk and how aggressively the kernel reclaims memory under pressure. Defaults assume a general-purpose machine; servers with fast storage and large RAM want different behaviour.

sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes
sysctl vm.swappiness vm.vfs_cache_pressure vm.min_free_kbytes
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

Dirty ratios and writeback

vm.dirty_background_ratio is the percentage of available memory at which the kernel starts flushing dirty pages in the background (asynchronously, via the writeback threads); vm.dirty_ratio is the hard ceiling at which a process that dirties a page is blocked synchronously until enough is flushed. On a box with 256 GB RAM, the default 20% ceiling means 50+ GB can go dirty before a synchronous stall — a latency cliff where every writer freezes while a huge writeback drains. Lower the thresholds, and on large-memory machines prefer the _bytes variants for a fixed, predictable trigger (setting a _bytes value zeroes the corresponding _ratio):

# /etc/sysctl.d/91-vm.conf
# Fixed-byte thresholds are clearer than ratios on big-RAM hosts.
vm.dirty_background_bytes = 268435456   # start background flush at 256 MB dirty
vm.dirty_bytes           = 1073741824   # hard-block writers at 1 GB dirty

# Database hosts: keep the working set in RAM, swap only under true duress
vm.swappiness = 1

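# Reclaim pressure on dentry/inode caches relative to pagecache (100 = default).
# Values below 100 retain metadata caches longer, which usually suits inode-heavy NFS/file servers.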
vm.vfs_cache_pressure = 100

# Emergency reserve so atomic allocations and network RX don't fail under pressure.
# Raise on high-throughput/large-RAM hosts (but not absurdly — wastes RAM).
vm.min_free_kbytes = 1048576

The six writeback knobs, side by side:

Knob	Controls	Default	Change when	Trade-off / gotcha
`vm.dirty_background_ratio`	% dirty to start async flush	10	Prefer `_bytes` on large RAM	Ratio of available, not total, memory
`vm.dirty_ratio`	% dirty to block writers	20	Lower on large RAM to cap stall size	Hitting it freezes all writers synchronously
`vm.dirty_background_bytes`	Bytes dirty to start async flush	0 (uses ratio)	Large-RAM host wanting a fixed trigger	Setting it zeroes `dirty_background_ratio`
`vm.dirty_bytes`	Bytes dirty to block writers	0 (uses ratio)	Large-RAM host, predictable ceiling	Setting it zeroes `dirty_ratio`
`vm.dirty_expire_centisecs`	Age (1/100 s) before a dirty page is eligible	3000 (30 s)	Lower for durability-sensitive writeback	Too low = constant small writes
`vm.dirty_writeback_centisecs`	How often writeback threads wake	500 (5 s)	Rarely; 0 disables periodic writeback	0 risks large data loss on crash
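To see what a ratio means in absolute terms on a given host, the arithmetic can be sketched as follows. The function name `dirty_limit_bytes` is hypothetical, and the 256 GiB figure is the example host from the text; the kernel's own notion of available memory is not identical to any single /proc/meminfo field, so treat the result as an estimate.

```shell
# dirty_limit_bytes <ratio_percent> <available_kb>
# Approximate bytes that may be dirty before the given ratio is reached.
dirty_limit_bytes() {
    echo $(( $2 * 1024 * $1 / 100 ))
}

# Example host: 256 GiB available, default ratios.
avail_kb=$(( 256 * 1024 * 1024 ))
dirty_limit_bytes 10 "$avail_kb"    # background flush starts (roughly 27 GB)
dirty_limit_bytes 20 "$avail_kb"    # writers block (roughly 55 GB)
```

On a live machine, a reasonable stand-in for the second argument is the MemAvailable value from /proc/meminfo, for example `awk '/^MemAvailable:/ {print $2}' /proc/meminfo`. Comparing these figures with the 256 MB and 1 GB byte thresholds above shows how much smaller the fixed triggers are.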

Swappiness — and why 0 is a trap

vm.swappiness (0–200 on modern kernels, default 60) biases the kernel between reclaiming pagecache and swapping out anonymous memory. Lower means “prefer dropping pagecache over swapping.” A few hard-won rules:

vm.swappiness=0 does not disable swap. It makes the kernel avoid swapping until reclaim is nearly impossible, which can trigger the OOM killer sooner because it will kill a process rather than swap. Use 1 for “avoid but allow” on databases; reserve 0 only when you genuinely never want swap pressure and accept the earlier-OOM trade-off. To truly disable swap, swapoff -a and remove it from /etc/fstab — a decision, not a swappiness value.
On cgroup-v2 systems, per-cgroup memory.swap.max and memory.high give finer control than the global knob — prefer them for isolating a service’s memory behaviour (see the systemd resource-control article).

`vm.swappiness`	Behaviour	Use for	Risk
`0`	Avoid swap until reclaim nearly impossible	Never-swap intent, understanding OOM	OOM-kills sooner instead of swapping
`1`	Swap only under true memory duress	Databases, latency-critical services	Minimal; the sane “avoid” value
`10`	Swap reluctantly	General servers preferring cache	—
`60`	Balanced (default)	General-purpose / desktop	May swap active pages under cache pressure
`100`+	Swap eagerly, favour pagecache	File servers with cold anon memory	Latency if hot pages get swapped
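Because swappiness does not tell you whether swap exists at all, it helps to check the two places that do. The sketch below uses hypothetical helper names and only reads the standard /etc/fstab and /proc/swaps formats; it does not cover swap activated by other means (for example a systemd swap unit), which is an assumption to verify on your system.

```shell
# fstab_swap_entries [file]: print non-comment fstab lines whose type is swap.
# Pass "-" to read from stdin.
fstab_swap_entries() {
    awk '$1 !~ /^#/ && $3 == "swap"' "${1:-/etc/fstab}"
}

# active_swap_count [file]: number of active swap areas in /proc/swaps format
# (the first line is a header and is skipped). Pass "-" to read from stdin.
active_swap_count() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}
```

After `swapoff -a`, `active_swap_count` should print 0; if `fstab_swap_entries` still prints a line, swap will come back at the next boot.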

Transparent Hugepages (THP)

Transparent Hugepages let the kernel automatically back memory with 2 MB pages instead of 4 KB, reducing TLB misses: a real win for some workloads and a real hurt for others. Databases and JVMs that manage their own memory (PostgreSQL, MySQL/InnoDB, Redis, MongoDB, Oracle, Java heaps) routinely recommend never, because THP’s background collapsing of pages (khugepaged) and synchronous compaction stalls at allocation time cause unpredictable latency spikes. Set it explicitly rather than inheriting your distribution’s default (always or madvise, depending on how the kernel was built) by accident:

# Runtime (evaporates on reboot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Confirm which mode is active — the [bracketed] value is live
cat /sys/kernel/mm/transparent_hugepage/enabled   # always [madvise] never

Make it durable through tuned ([vm] transparent_hugepages=never) or the kernel cmdline (transparent_hugepage=never), not a one-shot echo. The three THP modes:

THP mode	Behaviour	Best for	The gotcha
`always`	Kernel uses huge pages everywhere it can	HPC, big scientific arrays, some analytics	`khugepaged` defrag + alloc stalls hurt latency
`madvise`	Only where the app calls `madvise(MADV_HUGEPAGE)`	Apps that opt in deliberately	Silent default; may still stall on collapse
`never`	Disabled; only explicit hugepages used	PostgreSQL, MySQL, Redis, Mongo, JVM heaps	None for self-managing apps — the safe DB default
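Because the live THP mode is the bracketed token, a check script has to extract it rather than grep for a word that is always present in the file. A minimal sketch, assuming a helper named active_choice (a hypothetical name, not part of any package); the same parsing applies to other sysfs "choice" files such as queue/scheduler:

```shell
# active_choice FILE: print the bracketed (live) value from a sysfs choice file,
# e.g. "always [madvise] never" -> "madvise". Hypothetical helper for illustration.
active_choice() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Report the current THP mode if this kernel exposes the file.
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
    echo "THP mode: $(active_choice "$thp")"
fi
```

Run it before and after applying the tuned profile; if the second run does not print never, the durable setting did not take effect.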

Explicit hugepages for databases

Distinct from THP, explicit hugepages are pages you reserve (2 MB or 1 GB) that an application maps deliberately via hugetlbfs or MAP_HUGETLB. A database with a large shared buffer (PostgreSQL shared_buffers, Oracle SGA) benefits: fewer, larger page-table entries mean fewer TLB misses on the hot buffer, and the memory is pinned (never swapped or split). Reserve them at boot for reliability (fragmentation makes runtime reservation of large counts unreliable):

# Inspect current hugepage state
grep Huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Reserve 8192 × 2 MB pages = 16 GB (runtime; may partially fail if fragmented)
sysctl -w vm.nr_hugepages=8192

# 1 GB hugepages must be reserved on the kernel cmdline (rarely available at runtime):
#   default_hugepagesz=1G hugepagesz=1G hugepages=16   -> 16 GB in 1 GB pages

For durable reservation, set vm.nr_hugepages in /etc/sysctl.d/ (early boot) or the cmdline. Grant the DB’s group access via vm.hugetlb_shm_group and configure the app to use them (huge_pages = on in postgresql.conf). Explicit hugepages vs THP, decision-wise:

Aspect	Explicit hugepages	Transparent Hugepages (THP)
How allocated	Reserved by admin, mapped by app deliberately	Kernel promotes 4K pages automatically
Page sizes	2 MB and 1 GB	2 MB only
Swappable	No — pinned	Can be split/reclaimed
Latency behaviour	Predictable (no runtime defrag)	Unpredictable stalls from `khugepaged`
Best for	DB shared buffers (Postgres, Oracle SGA)	General apps that opt in via `madvise`
Set via	`vm.nr_hugepages` / cmdline + app config	`/sys/.../transparent_hugepage` / tuned
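The reservation count is plain arithmetic, and it is worth scripting so the number in sysctl.d tracks the buffer size you configured. A minimal sketch, assuming a hypothetical helper hugepages_needed that takes the target size in MB and the hugepage size in kB; treat the result as a floor, since an application's shared segment is usually somewhat larger than its headline buffer setting:

```shell
# hugepages_needed SIZE_MB [HUGEPAGE_KB]: pages required to back SIZE_MB, rounded up.
# HUGEPAGE_KB defaults to 2048 (2 MB pages). Hypothetical helper for illustration.
hugepages_needed() {
    local size_mb="$1" page_kb="${2:-2048}"
    echo $(( (size_mb * 1024 + page_kb - 1) / page_kb ))
}

hugepages_needed 16384            # 16 GB in 2 MB pages -> 8192
hugepages_needed 16384 1048576    # 16 GB in 1 GB pages -> 16
```

After reserving, compare HugePages_Total in /proc/meminfo against the requested count; a shortfall means the runtime reservation hit fragmentation.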

4. CPU: governors, C-states, IRQ affinity, isolcpus, nohz_full

For latency-sensitive services, the enemy is the kernel saving power. Frequency scaling and deep C-states add tens to hundreds of microseconds of wakeup jitter that shows up directly in your p99.

# Which cpufreq driver and governor are active?
cpupower frequency-info

# Per-core current governor and available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Which idle driver and C-states exist, and their exit latencies (us)
cpupower idle-info

cpufreq governors

The governor is the policy that chooses core frequency. On modern Intel/AMD the driver is often intel_pstate/amd-pstate (which exposes only performance and powersave), or the generic acpi-cpufreq (which exposes the full set including schedutil, ondemand, conservative). For latency, performance pins every core at max frequency — no ramp-up delay on a wakeup:

# Pin all cores to max frequency (cleanest via tuned latency-performance, but direct works)
cpupower frequency-set -g performance

# Verify it stuck
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u

Governor	Behaviour	Best for	Trade-off
`performance`	Max frequency always	Latency-critical, HPC, databases	Highest power/heat; no frequency jitter
`powersave` (pstate)	Lowest frequency, ramps under load	Density, battery	Ramp-up latency on bursty load
`schedutil`	Scheduler-driven, modern default	General servers	Good balance; slight ramp latency
`ondemand`	Legacy load-based scaling	Older kernels/`acpi-cpufreq`	Coarser than schedutil
`conservative`	Like ondemand but gradual	Slowly-varying load	Slow to reach max on spikes

C-states — the bigger latency lever

C-states are a bigger lever, and a bigger trap, than governors. When a core idles into a deep C-state (C6), the exit latency to wake it can be tens to hundreds of microseconds, which can dwarf the service time of a request that takes 20 µs to process. The robust, supported way to cap idle depth is the PM-QoS interface: request a maximum tolerable CPU wakeup latency in microseconds, and the kernel refuses any C-state whose exit latency exceeds it.

# Read each C-state's exit latency (us) and residency
cpupower idle-info
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

# Hold /dev/cpu_dma_latency open requesting <=5us wake latency. The kernel then
# refuses C-states whose exit latency exceeds 5us. Keep the fd open; closing releases it.
exec 4<>/dev/cpu_dma_latency
printf '\x05\x00\x00\x00' >&4   # 5 (us) as a 32-bit little-endian int
# ... constraint holds while fd 4 stays open ...

In practice you let tuned own this with force_latency in a [cpu] section rather than scripting the device by hand — latency-performance sets force_latency=cstate.id_no_zero:1, meaning “allow only C-states with id ≤ 1 (C1), never deep sleep.” For boot-time guarantees on dedicated latency nodes, the kernel cmdline arg intel_idle.max_cstate=1 (Intel) or processor.max_cstate=1 idle=poll is the blunt, permanent instrument — but idle=poll burns 100% CPU on idle cores (extreme, only for the most latency-sensitive isolated cores).

C-state control	Mechanism	Effect	When to use
PM QoS (`force_latency`)	tuned `[cpu]` / `/dev/cpu_dma_latency`	Cap wakeup latency dynamically	Default, reversible way to bound C-states
`intel_idle.max_cstate=N`	Kernel cmdline (Intel)	Hard-cap idle depth at boot	Dedicated latency nodes, permanent
`processor.max_cstate=N`	Kernel cmdline (generic)	Hard-cap idle depth at boot	Non-Intel or as a fallback
`idle=poll`	Kernel cmdline	Never idle; spin instead	Extreme low-latency; wastes power/heat
`cpupower idle-set -d N`	Runtime	Disable a specific C-state	Testing which C-state costs you
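Before choosing a latency budget, it helps to see which idle states the budget would actually exclude on this hardware. A minimal sketch, assuming a hypothetical helper deep_cstates that reads the name and latency files the commands above already inspect; the directory is a parameter so the same function can be pointed at any CPU:

```shell
# deep_cstates DIR BUDGET_US: list idle states under DIR (a cpuidle directory such as
# /sys/devices/system/cpu/cpu0/cpuidle) whose exit latency exceeds BUDGET_US microseconds.
# Hypothetical helper for illustration.
deep_cstates() {
    local dir="$1" budget="$2" s lat
    for s in "$dir"/state*; do
        [ -r "$s/latency" ] || continue
        lat=$(cat "$s/latency")
        if [ "$lat" -gt "$budget" ]; then
            echo "$(cat "$s/name") ${lat}us"
        fi
    done
}

# States a 5 us PM-QoS request would rule out on CPU 0 (prints nothing if none exist).
deep_cstates /sys/devices/system/cpu/cpu0/cpuidle 5
```

If the list is empty at your budget, capping C-states will not change anything and the jitter is coming from somewhere else.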

IRQ affinity and softirq

An interrupt (IRQ) fires on whichever CPU the kernel or irqbalance assigned it; the follow-on softirq (e.g. NET_RX for received packets) runs there too. On a busy NIC, a single queue’s softirq can saturate one core to 100% %soft while others idle — a classic single-core bottleneck that mpstat -P ALL reveals instantly. Spread interrupts off your application cores:

# See which CPUs handle each IRQ and per-CPU interrupt counts (NIC IRQs named e.g. eth0-rx-0)
grep -E 'CPU|eth0' /proc/interrupts

# Per-CPU softirq breakdown — a lone CPU high in %soft is the tell
mpstat -I SCPU -P ALL 1 3

# Pin IRQ 142 to CPU2 (mask 0x4). Masks are hex bitmasks of CPUs (bit0=cpu0).
echo 4 > /proc/irq/142/smp_affinity
# Or as a CPU list:
echo 2 > /proc/irq/142/smp_affinity_list

# irqbalance redistributes IRQs automatically and can undo manual pins. Check whether it
# is running; if you pin by hand, stop it with: systemctl disable --now irqbalance
systemctl status irqbalance

The principle: keep NIC interrupt handling, the kernel softirq, and the application thread on cores that share an L3 cache, and away from cores doing latency-critical user work. On NUMA boxes that means same-socket as the NIC’s PCIe root. Many multi-queue NIC drivers ship a set_irq_affinity.sh script that spreads each queue’s IRQ to a distinct core — the right starting point.
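Hand-computing hex masks is where most affinity mistakes come from. A minimal sketch, assuming a hypothetical helper cpus_to_mask that converts a CPU list into the single-word hex mask smp_affinity accepts; it deliberately refuses CPUs above 31, because wider masks are written as comma-separated 32-bit groups and the smp_affinity_list form is simpler there:

```shell
# cpus_to_mask LIST: convert "2", "0-7" or "0,2,4-5" into a hex bitmask (bit 0 = CPU 0).
# Hypothetical helper for illustration; limited to CPUs 0-31.
cpus_to_mask() {
    local list="$1" part lo hi cpu mask=0
    local IFS=,
    for part in $list; do
        lo="${part%-*}"; hi="${part#*-}"     # "4-5" -> 4 and 5; "2" -> 2 and 2
        if [ "$hi" -gt 31 ]; then
            echo "cpus_to_mask: CPU $hi is above 31; use smp_affinity_list" >&2
            return 1
        fi
        cpu="$lo"
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 2      # -> 4
cpus_to_mask 0-7    # -> ff
```

The first result matches the mask used to pin IRQ 142 to CPU2 above.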

isolcpus, nohz_full, rcu_nocbs — dedicated cores

For the hardest latency requirements (packet processing, trading, real-time), you carve cores out of the general scheduler entirely so nothing runs on them except the thread you pin. This is boot-time configuration via the kernel cmdline:

# Kernel cmdline (via tuned [bootloader] or grub) for a 16-core box dedicating cores 8-15:
#   isolcpus=8-15 nohz_full=8-15 rcu_nocbs=8-15 irqaffinity=0-7
# - isolcpus: remove 8-15 from the scheduler's load balancing (only explicit pins run there)
# - nohz_full: make 8-15 tickless (no periodic scheduler tick) when running one task
# - rcu_nocbs: offload RCU callbacks off 8-15 to the housekeeping cores
# - irqaffinity: default IRQs to the housekeeping cores 0-7

# Pin an application thread to an isolated core at runtime
taskset -c 8 ./my-latency-app
# Or via a systemd unit:  [Service] CPUAffinity=8-15

The tuned [scheduler] plugin expresses the runtime half of this with isolated_cores=8-15, which moves processes, kernel threads, and IRQs off those cores; the cmdline parameters themselves are added through a [bootloader] section (the cpu-partitioning profile combines both). The cost is real: isolated cores are unavailable to everything else, so you trade fleet-wide capacity for a few cores of near-jitter-free execution. Only do it when the USE method proves scheduler/tick jitter is your bottleneck.

CPU-isolation knob	What it removes from the core	Set where	Cost
`isolcpus=`	Scheduler load-balancing (no auto-placed tasks)	cmdline / tuned `[scheduler]`	Core idle unless explicitly pinned
`nohz_full=`	The periodic scheduler tick (when 1 task)	cmdline	Needs `rcu_nocbs`; timekeeping caveats
`rcu_nocbs=`	RCU callback processing	cmdline	Offloads work to housekeeping cores
`irqaffinity=`	Default IRQ delivery	cmdline	Concentrates IRQs on housekeeping cores
`taskset -c` / `CPUAffinity`	(pins a process TO cores)	runtime / systemd	Manual placement responsibility

5. NUMA awareness and memory locality

On multi-socket servers, memory attached to a remote socket costs 1.5×–2× the access latency of local memory, and remote bandwidth is lower. A process scheduled on socket 1 reaching into socket 0’s DIMMs pays that tax on every cache miss. This is the single most common invisible regression on large machines, and it gets worse with more cores because work spreads wider across nodes.

# Topology: nodes, the CPUs in each node, memory per node, and the inter-node distance matrix
numactl --hardware

# Per-node hit/miss counters. numa_miss / numa_foreign climbing = remote-memory pain.
numastat
# Per-node memory breakdown (a meminfo-style view: free, file cache, anon, hugepages)
numastat -m

# Rich topology (cores, caches, PCIe/NIC attachment per node) — invaluable for IRQ+NUMA
lstopo-no-graphics

numactl --hardware prints a distance matrix where local is 10 and remote is typically 20+ (a relative cost, not nanoseconds). numastat gives the per-node counters that prove pain: numa_hit (allocations satisfied on the intended node), numa_miss (pages allocated on this node because the preferred node could not supply them), numa_foreign (pages intended for this node but allocated elsewhere), and other_node (pages allocated on this node by a process running on another). The counters are cumulative since boot, so watch the rate: climbing numa_miss or numa_foreign is the smoking gun.
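Raw counters are hard to compare across nodes of different sizes, so a ratio is easier to track. A minimal sketch, assuming a hypothetical helper numa_miss_pct that reads one per-node numastat file from sysfs (the same counters the numastat command prints); the values are cumulative since boot, so sample twice under load and compare rather than trusting a single reading:

```shell
# numa_miss_pct FILE: numa_miss as a percentage of (numa_hit + numa_miss) for one node.
# FILE is e.g. /sys/devices/system/node/node0/numastat. Hypothetical helper for illustration.
numa_miss_pct() {
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END {
             if (hit + miss > 0) printf "%.2f\n", 100 * miss / (hit + miss)
             else                print "0.00"
         }' "$1"
}

# One line per NUMA node on this machine (prints nothing on systems without the files).
for f in /sys/devices/system/node/node*/numastat; do
    [ -r "$f" ] || continue
    echo "$(basename "$(dirname "$f")"): $(numa_miss_pct "$f")% miss"
done
```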

Binding vs interleaving

For a single dominant process (a database, a JVM, a packet processor), pin both its CPUs and its memory to one node so allocations stay local:

# Run on node 0's CPUs, allocate only from node 0's memory
numactl --cpunodebind=0 --membind=0 /usr/lib/postgresql/16/bin/postgres -D /data/pg

# Inspect a running process's NUMA placement and per-node memory
numastat -p $(pgrep -f 'postgres: checkpointer')
head /proc/$(pgrep -of postgres)/numa_maps

For services that legitimately span sockets (a big in-memory cache where any core may touch any key), --interleave=all spreads allocations round-robin so no single node’s memory bandwidth becomes the bottleneck:

numactl --interleave=all /usr/bin/redis-server /etc/redis/redis.conf

The systemd-native way to bind a managed service is cleaner than wrapping ExecStart in numactl:

# /etc/systemd/system/postgresql@.service.d/numa.conf
# Templated unit: postgresql@0 binds node 0, postgresql@1 binds node 1
[Service]
NUMAPolicy=bind
NUMAMask=%i

NUMA policy	Behaviour	Best for	Set via
`bind` (`--membind`)	Allocate only from listed node(s); fail/OOM if full	Single dominant process (DB, JVM)	`numactl --membind` / systemd `NUMAPolicy=bind`
`preferred`	Prefer a node, fall back to others if full	Mostly-local process tolerating overflow	`numactl --preferred` / `NUMAPolicy=preferred`
`interleave`	Round-robin across nodes	Bandwidth-bound spanning caches	`numactl --interleave=all` / `NUMAPolicy=interleave`
`local` (default)	Allocate on the node of the running CPU	General; works if the scheduler keeps it local	default; `NUMAPolicy=default`
`cpunodebind`	Restrict CPUs to a node (pairs with membind)	Pinning both CPU and memory	`numactl --cpunodebind`

zone_reclaim_mode — the silent killer

Beware vm.zone_reclaim_mode. On some NUMA topologies older kernels defaulted it to on, causing the kernel to aggressively reclaim local pagecache rather than allocate one page from a remote node — devastating for file-cache-heavy workloads, which lose their cache to avoid a cheap remote allocation. Confirm it is 0 and keep it off unless you have a specific measured reason:

sysctl vm.zone_reclaim_mode   # want: 0
# If not zero:
sysctl -w vm.zone_reclaim_mode=0

Also relevant is kernel.numa_balancing (Automatic NUMA Balancing) — the kernel periodically unmaps pages and migrates them toward the node accessing them. Helpful for workloads that move between nodes, but its page-fault overhead can hurt a process you have already pinned. If you bind explicitly, disable it (sysctl -w kernel.numa_balancing=0, or numa_balancing=disable on cmdline). The NUMA-related sysctls:

Knob	Default	Effect	Set to
`vm.zone_reclaim_mode`	0 (modern)	Reclaim local pagecache before remote alloc	`0` — almost always
`kernel.numa_balancing`	1	Auto-migrate pages toward accessing node	`0` if you pin explicitly
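`kernel.numa_balancing_scan_delay_ms`	1000	How soon after fork to start scanning	Leave unless tuning balancing; on kernels 5.13+ it lives in debugfs (`/sys/kernel/debug/sched/numa_balancing/scan_delay_ms`), not sysctl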

6. Block I/O schedulers and queue depth per device class

The scheduler that is right for a spinning disk is wrong for NVMe, and vice versa. Modern kernels ship the multi-queue block layer (blk-mq) with these relevant schedulers:

# Which schedulers exist and which is active (the [bracketed] one)
cat /sys/block/nvme0n1/queue/scheduler   # e.g. [none] mq-deadline kyber bfq

# Is the device rotational? 0 = SSD/NVMe, 1 = HDD
cat /sys/block/sda/queue/rotational

Scheduler	Best for	Why
`none`	NVMe / fast SSD	The device has deep internal queues and reorders better than the OS; software scheduling only adds latency
`mq-deadline`	SATA SSD, HDD, mixed, predictable latency	Lightweight; bounds worst-case latency with per-read/write deadlines; prevents starvation
`kyber`	High-IOPS SSD wanting latency targets	Token-based; throttles to hit configurable read/write latency targets; low overhead
`bfq`	Desktops, interactive, latency-fairness across competing apps	Proportional-share fairness per process/cgroup; CPU cost too high for high-IOPS server paths

Set per device at runtime, then make it durable:

# Set none on NVMe (runtime)
echo none > /sys/block/nvme0n1/queue/scheduler
# Set mq-deadline on a SATA SSD (runtime)
echo mq-deadline > /sys/block/sda/queue/scheduler

Runtime echoes do not survive reboot. Make the choice durable and device-class-aware with a udev rule, so a node with mixed media gets the right scheduler on each device automatically:

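# /etc/udev/rules.d/60-ioscheduler.rules
# ENV{DEVTYPE}=="disk" restricts each rule to whole devices (partitions have no queue/ directory);
# sd[a-z]* also covers names beyond sdz (sdaa, sdab, ...).
# NVMe -> none (device queues + reordering beat the OS scheduler)
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
# Non-rotational SATA/SAS (SSD) -> mq-deadline
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# Rotational (HDD) -> mq-deadline (bfq only if you need latency-fairness on a shared disk)
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"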

Reload rules and re-trigger without a reboot: udevadm control --reload && udevadm trigger --subsystem-match=block. The tuned [disk] plugin (devices=nvme*n*, elevator=none) is the alternative that keeps the choice inside your profile — pick one mechanism, not both, or tuned-adm verify will fight the udev rule.
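Whichever mechanism you pick, verify the result on every device rather than assuming the rule matched. A minimal sketch, assuming two hypothetical helpers: expected_scheduler encodes the policy described above (NVMe gets none, sd* gets mq-deadline), and audit_schedulers compares it with the bracketed live value under a sysfs block directory:

```shell
# expected_scheduler NAME: scheduler this article's policy assigns to a device name.
# Hypothetical helper for illustration; other device types get no opinion.
expected_scheduler() {
    case "$1" in
        nvme*) echo none ;;
        sd*)   echo mq-deadline ;;
        *)     echo "" ;;
    esac
}

# audit_schedulers [SYSFS_BLOCK_DIR]: print OK/DIFF per device; defaults to /sys/block.
audit_schedulers() {
    local root="${1:-/sys/block}" q dev want have
    for q in "$root"/*/queue/scheduler; do
        [ -r "$q" ] || continue
        dev="${q#"$root"/}"; dev="${dev%%/*}"          # .../sda/queue/scheduler -> sda
        want=$(expected_scheduler "$dev")
        [ -n "$want" ] || continue
        have=$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$q")  # the [bracketed] entry is live
        if [ "$have" = "$want" ]; then
            echo "OK   $dev $have"
        else
            echo "DIFF $dev have=$have want=$want"
        fi
    done
}

audit_schedulers
```

Any DIFF line after udevadm trigger means the rule did not match that device, or another mechanism (for example a tuned [disk] section) overrode it.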

Queue depth and readahead

Two supporting knobs per device fine-tune the OS-side queue and prefetch behaviour:

nr_requests — the depth of the OS-side request queue. For high-IOPS NVMe under heavy parallelism, raising it can reduce queue-full backpressure; too high wastes memory and can hurt latency by letting a huge queue build.
read_ahead_kb — the readahead window. Large sequential workloads (analytics scans, backups, media) benefit from a large window (prefetching pays off); small-random OLTP benefits from a small window (prefetch is wasted I/O). There is no universal value — this is exactly the knob to A/B with fio against your real access pattern.

# Deeper OS queue for high-parallelism NVMe (default often 256 on NVMe)
echo 1023 > /sys/block/nvme0n1/queue/nr_requests

# Readahead tuned to access pattern
echo 128  > /sys/block/nvme0n1/queue/read_ahead_kb   # small-random OLTP: minimal prefetch
echo 4096 > /sys/block/sdb/queue/read_ahead_kb       # sequential scans/backups: aggressive prefetch

The complete per-device block-queue knob set:

Knob (`/sys/block/<dev>/queue/`)	Controls	Typical default	Change for
`scheduler`	Active blk-mq scheduler	`none` (NVMe) / `mq-deadline` (SATA)	Match device class (see table above)
`nr_requests`	OS-side request queue depth	256 (NVMe) / 128	Raise for high-IOPS parallel NVMe
`read_ahead_kb`	Sequential readahead window	128	Raise for sequential, lower for random
`rotational`	Hint: 1=HDD, 0=SSD	Auto-detected	Correct only if mis-detected
`add_random`	Contribute to entropy pool	1 (HDD) / 0 (SSD)	Set 0 on SSD (avoids contention)
`rq_affinity`	Complete I/O on issuing CPU/socket	1	2 = strict same-CPU completion (cache locality)
`write_cache`	write-back vs write-through	device-set	Match to power-loss-protection reality
`max_sectors_kb`	Max I/O size per request	device max	Rarely; align to device/RAID stripe
`nomerges`	Disable request merging	0	1 or 2 to test merge overhead on fast NVMe

For mq-deadline and kyber, per-scheduler tunables live in /sys/block/<dev>/queue/iosched/:

Scheduler tunable	Scheduler	Controls	Default
`read_expire`	mq-deadline	Read deadline (ms) before prioritised	500
`write_expire`	mq-deadline	Write deadline (ms)	5000
`writes_starved`	mq-deadline	Reads served before a starved write	2
`fifo_batch`	mq-deadline	Requests dispatched per batch	16
`read_lat_nsec`	kyber	Target read latency (ns)	2000000
`write_lat_nsec`	kyber	Target write latency (ns)	10000000

7. The network stack: sockets, backlogs, congestion control

Network tuning is where copied values do the most damage, because the right buffer size is a function of the bandwidth-delay product, not a constant. Compute, do not guess: BDP (bytes) = bandwidth (bytes/s) × RTT (seconds). A 10 GbE link (1.25 GB/s) at 10 ms RTT needs ~12.5 MB of in-flight buffer per stream; a 1 GbE LAN at 0.2 ms needs ~25 KB. Sizing a LAN server’s buffers at 16 MB wastes memory; sizing a WAN transfer’s at 256 KB caps you far below line rate.
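The BDP arithmetic is simple enough to script, which removes the temptation to reuse someone else's constant. A minimal sketch, assuming a hypothetical helper bdp_bytes that takes link speed in Mbit/s and RTT in microseconds (chosen so the calculation stays in integers: Mbit/s multiplied by microseconds gives bits):

```shell
# bdp_bytes MBIT_PER_S RTT_US: bandwidth-delay product in bytes.
# (Mbit/s x microseconds) = bits in flight; divide by 8 for bytes.
# Hypothetical helper for illustration.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

bdp_bytes 10000 10000   # 10 GbE at 10 ms RTT  -> 12500000 (~12.5 MB)
bdp_bytes 1000 200      # 1 GbE at 0.2 ms RTT  -> 25000    (~25 KB)
```

Feed it the RTT you measure on the real path (ping, or the rtt field in ss -tin), not a guess, and use the worst-case result for the max fields.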

# Inspect current limits
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc

# Available congestion-control algorithms (bbr may need a module)
sysctl net.ipv4.tcp_available_congestion_control

# Live per-socket internals: cwnd, rtt, retransmits, send/recv queue
ss -tin

Socket buffers sized from BDP

TCP autotuning grows the window between the min and max you set. Leave default modest and let it grow to max; size max from your worst-case BDP:

# /etc/sysctl.d/90-net-throughput.conf
# Max socket buffer the kernel will grant (bytes). Size for worst-case BDP.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# TCP autotuning: min / default / max. Let it grow to max under load.
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384  16777216

# Autotuning ON (default; do not disable)
net.ipv4.tcp_moderate_rcvbuf = 1

Knob	Controls	Default (approx)	Size from	Gotcha
`net.core.rmem_max`	Max receive socket buffer	212992	Worst-case BDP	Ceiling for `SO_RCVBUF`; apps can request up to this
`net.core.wmem_max`	Max send socket buffer	212992	Worst-case BDP	Ceiling for `SO_SNDBUF`
`net.ipv4.tcp_rmem`	min/default/max RX autotune	4096/131072/6291456	Set max = BDP	Middle value is per-socket default
`net.ipv4.tcp_wmem`	min/default/max TX autotune	4096/16384/4194304	Set max = BDP	Autotuning grows toward max under load
`net.ipv4.tcp_mem`	Global TCP memory pages (min/pressure/max)	auto-sized to RAM	Rarely — kernel sizes it	In pages, not bytes; total across all sockets
`net.core.optmem_max`	Max ancillary buffer per socket	20480	Rarely	Affects cmsg/control data

Backlogs and connection acceptance

Three separate queues govern connection acceptance and packet ingress; conflating them is a classic mistake:

# /etc/sysctl.d/90-net-conn.conf
# Backlog of FULLY-ESTABLISHED sockets waiting for the app to accept() them
net.core.somaxconn = 4096
# Backlog of HALF-OPEN (SYN received, not yet ACKed) connections
net.ipv4.tcp_max_syn_backlog = 8192
# Per-CPU packet ingress queue when packets arrive faster than the stack drains them
net.core.netdev_max_backlog = 16384
# Reuse TIME_WAIT sockets for new OUTBOUND connections (safe; NOT tcp_tw_recycle)
net.ipv4.tcp_tw_reuse = 1

The critical distinction: somaxconn is the accept queue depth — but it is a ceiling, not a guarantee. The application must also request a large listen() backlog (nginx backlog=, or the runtime default), or the kernel caps at whatever the app asked for. If a busy server drops connections under a burst, check netstat -s | grep -i 'listen' for “times the listen queue of a socket overflowed.”

Backlog knob	Queue it sizes	Default	Symptom when too small	Confirm overflow with
`net.core.somaxconn`	Accept queue (established, awaiting `accept()`)	4096 (recent) / 128 (old)	Connection resets/timeouts under burst	`ss -lnt` (Recv-Q on listener); `nstat` `TcpExtListenOverflows`
`net.ipv4.tcp_max_syn_backlog`	SYN queue (half-open)	1024–4096	SYN drops during SYN floods/bursts	`nstat TcpExtTCPReqQFullDrop`
`net.core.netdev_max_backlog`	Per-CPU ingress queue (softirq)	1000	RX drops when packets outpace stack	`/proc/net/softnet_stat` col 2 (drops)
`net.ipv4.tcp_tw_reuse`	(reuse TIME_WAIT for new outbound)	2 (loopback only) / 0	Ephemeral port exhaustion on chatty clients	`ss -s` TIME_WAIT count
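Two of the overflow signals in the table are awkward to read by eye: softnet_stat is hexadecimal with one row per CPU, and /proc/net/netstat stores names and values on separate lines. A minimal sketch, assuming two hypothetical helpers that turn each into a single number you can sample before and after a burst; the file arguments exist only so the functions can be pointed at saved copies:

```shell
# softnet_drops [FILE]: packets dropped at the per-CPU ingress queue, summed over CPUs
# (column 2 of softnet_stat, hexadecimal). Hypothetical helper for illustration.
softnet_drops() {
    local file="${1:-/proc/net/softnet_stat}" total=0 _processed dropped _rest
    while read -r _processed dropped _rest; do
        total=$(( total + 0x$dropped ))
    done < "$file"
    echo "$total"
}

# tcpext_counter NAME [FILE]: value of one TcpExt counter, e.g. ListenOverflows.
# The first "TcpExt:" line holds field names, the second holds the values.
tcpext_counter() {
    awk -v want="$1" '
        $1 == "TcpExt:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "TcpExt:"          { if (want in col) print $(col[want]); exit }
    ' "${2:-/proc/net/netstat}"
}

# Usage on a live host:
#   softnet_drops
#   tcpext_counter ListenOverflows
```

A counter that is non-zero but not increasing under load is history, not a current problem; only a rising value justifies raising the corresponding backlog.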

A critical correction to old internet folklore: never enable net.ipv4.tcp_tw_recycle. It was removed entirely in kernel 4.12 because it broke connections from clients behind NAT (it dropped packets whose timestamps looked out of order across NATed sources). The safe sibling is tcp_tw_reuse (reuse TIME_WAIT sockets for outbound connections only), shown above. If a guide tells you to set tcp_tw_recycle, treat the rest of that guide as suspect.

Congestion control and qdisc

The congestion-control algorithm decides how fast TCP sends. cubic (the default) is loss-based: it treats any packet loss as congestion and cuts its window hard — which collapses throughput on fat, lossy, or long-RTT paths where loss is often not congestion. bbr models the path’s bandwidth and RTT instead, so sporadic loss does not tank it — the modern choice worth testing for WAN, wireless, or high-BDP links. BBR pairs with the fq qdisc for correct pacing:

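# Enable BBR (load the module explicitly if it is not built in) and the fq qdisc for pacing
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
# default_qdisc only applies to qdiscs created afterwards; existing interfaces keep their
# current one, so check with: tc qdisc show dev eth0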

# Confirm on a live connection
ss -tin | grep -o 'bbr.*' | head

Congestion control	Model	Best for	Trade-off
`cubic`	Loss-based (default)	LAN, low-loss paths	Collapses throughput on non-congestive loss
`bbr`	Bandwidth+RTT model	WAN, high-BDP, lossy/wireless	Needs `fq` qdisc; can be aggressive to `cubic` neighbours
`reno`	Classic loss-based	Compatibility/reference	Conservative; rarely chosen deliberately
`dctcp`	ECN-based (datacenter)	Low-latency DC fabrics with ECN	Requires ECN end to end

Other TCP knobs worth knowing (change deliberately, not reflexively):

Knob	Controls	Default	Note
`net.ipv4.tcp_fin_timeout`	Seconds in FIN-WAIT-2	60	Lower reclaims sockets faster; rarely needed
`net.ipv4.tcp_keepalive_time`	Idle secs before keepalive probe	7200	Lower for faster dead-peer detection
`net.ipv4.ip_local_port_range`	Ephemeral port range	32768–60999	Widen for very chatty outbound clients
`net.ipv4.tcp_slow_start_after_idle`	Reset cwnd after idle	1	Set 0 for long-lived bursty connections
`net.ipv4.tcp_mtu_probing`	PMTU black-hole detection	0	Set 1 where PMTUD is broken (tunnels)
`net.ipv4.tcp_notsent_lowat`	Bytes unsent before app notified	-1	Lower for latency-sensitive senders

8. The NIC layer: offloads, ring buffers, RSS/RPS, coalescing

Below the socket sits the NIC, and its defaults are tuned for general use, not your workload. ethtool is the tool for all of it.

# Driver, link speed, and feature (offload) state
ethtool eth0
ethtool -k eth0          # offloads: GRO, GSO, TSO, LRO, checksum, RSS
ethtool -g eth0          # ring buffer sizes (current vs max)
ethtool -l eth0          # channel (queue) counts for RSS
ethtool -c eth0          # interrupt coalescing
ethtool -S eth0          # per-queue stats and drop counters (the truth source)

Ring buffers

The NIC ring is the descriptor queue between the card and the kernel. Too small and a burst overflows it, dropping packets before the stack even sees them (ethtool -S shows rx_dropped/rx_no_buffer/rx_fifo_errors). Raise RX/TX rings toward the device max for bursty high-throughput workloads:

# Current: RX 512, max 4096 — raise RX to 4096 for burst tolerance
ethtool -G eth0 rx 4096 tx 4096

The trade-off: larger rings add a little latency (packets can sit longer) and use more memory, but for throughput and burst tolerance they are usually the right move. Confirm the drop counter stops climbing after the change under the same load.

Offloads

Offloads move per-packet work (segmentation, checksums, coalescing) from the CPU to the NIC or into batched kernel paths, raising throughput and cutting CPU. They are on by default and usually want to stay on for throughput — but a few (notably LRO) merge packets in ways that break forwarding/routing and some latency-sensitive paths:

Offload	What it does	Keep on for	Turn off for
GRO (Generic Receive Offload)	Coalesces RX packets in software before the stack	Throughput on end hosts	Very latency-sensitive RX (adds small delay)
LRO (Large Receive Offload)	Coalesces RX in hardware (discards per-packet header detail)	Pure end-host throughput	Any router/bridge/forwarder (merged packets break forwarding)
GSO (Generic Segmentation Offload)	Defers TX segmentation to just before the NIC	Throughput (default on)	Rarely off
TSO (TCP Segmentation Offload)	NIC segments large TCP sends	Throughput, lower CPU	Debugging; some virt paths
RX/TX checksum	NIC computes checksums	Almost always	Rarely (debugging corruption)
RSS (Receive-Side Scaling)	HW hashes flows across RX queues/CPUs	Multi-core throughput	Never off; tune queue count instead

# Turn LRO off on a forwarding host (routers/bridges must not coalesce)
ethtool -K eth0 lro off
# Turn GRO off only if you measure latency benefit and can afford the CPU
ethtool -K eth0 gro off

RSS and RPS — spreading packet processing

A single RX queue’s softirq processes on one CPU; at high packet rates that one core saturates while others idle. RSS (Receive-Side Scaling) is the hardware solution: the NIC hashes each flow to one of several RX queues, each with its own IRQ pinned to a different CPU. RPS (Receive Packet Steering) is the software fallback for NICs with too few hardware queues — the kernel redistributes received packets across CPUs after the IRQ:

# RSS: how many hardware queues (channels) does the NIC expose, and set them
ethtool -l eth0
ethtool -L eth0 combined 8      # 8 RX/TX queues -> spread IRQs across 8 cores

# RPS (software): steer RX from queue 0 to CPUs 0-7 (hex bitmask)
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
# RFS (flow steering) so a flow lands on the CPU running its app thread
sysctl -w net.core.rps_sock_flow_entries=32768
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

Mechanism	Layer	Spreads	Use when
RSS	Hardware	Flows across HW RX queues → IRQs → CPUs	NIC has enough HW queues (modern 10G+)
RPS	Software (kernel)	Packets across CPUs after IRQ	NIC has too few HW queues
RFS	Software	Flows to the CPU running the consuming app	Cache locality between softirq and app
XPS (Transmit Packet Steering)	Software	TX across queues by CPU	Multi-queue TX locality
aRFS	Hardware-accelerated RFS	HW steers flow to app’s CPU	NIC + driver support aRFS

Interrupt coalescing

Coalescing batches interrupts — the NIC waits a few microseconds or a few packets before interrupting, trading a little latency for far fewer interrupts (and thus more throughput and less CPU). Lower it for latency, raise it for throughput:

# Adaptive coalescing (driver picks) is a good default; or set explicit values:
ethtool -C eth0 adaptive-rx on adaptive-tx on
ethtool -C eth0 rx-usecs 50 rx-frames 64     # throughput-leaning
ethtool -C eth0 rx-usecs 5  rx-frames 8      # latency-leaning

Coalescing knob	Controls	Latency-leaning	Throughput-leaning
`rx-usecs`	µs to wait before RX interrupt	Low (5)	Higher (50–100)
`rx-frames`	Packets to batch before RX interrupt	Low (8)	Higher (64)
`adaptive-rx`	Driver auto-tunes coalescing	on (good default)	on
`tx-usecs`/`tx-frames`	Same for TX	Low	Higher

9. Observability: perf, bpftrace, and sar

The coarse USE tools (vmstat, iostat, mpstat, PSI) tell you which resource is the bottleneck. To attribute it to a specific function, syscall, or off-CPU wait — the difference between “the CPU is busy” and “the CPU is busy in pg_checksum_page because you enabled data checksums” — you need perf and bpftrace. And to see trends over time rather than a live snapshot, you need sar.

sar — historical system activity

sar (from sysstat) records system metrics on a cron/timer schedule to /var/log/sa/ (RHEL-family; Debian and Ubuntu use /var/log/sysstat/) and lets you replay any window, which is indispensable for “it was slow at 03:00” investigations where you were asleep. Enable collection (systemctl enable --now sysstat), then query historically:

# CPU utilization every 10 min for today
sar -u
# Memory + swap; -B for paging (pgscan/pgsteal = reclaim pressure)
sar -r ; sar -B ; sar -S
# Block device throughput and await, per device
sar -d -p
# Network throughput and errors per interface
sar -n DEV ; sar -n EDEV
# Replay a specific past day's file and window
sar -u -f /var/log/sa/sa15 -s 03:00:00 -e 04:00:00

`sar` flag	Reports	Bottleneck it reveals
`-u`	CPU %user/%system/%iowait/%steal (all-CPU average; add `-P ALL` for per-CPU)	CPU saturation, iowait, hypervisor steal
`-r` / `-S`	Memory / swap utilization	Memory pressure, swap usage
`-B`	Paging: pgscan, pgsteal, majflt	Reclaim pressure, thrashing
`-d -p`	Per-device tps, await, %util	Disk latency and saturation over time
`-n DEV` / `-n EDEV`	Interface throughput / errors+drops	Network saturation and drops
`-q`	Run-queue length and load average	CPU saturation trend
`-w`	Context switches + task creation	Scheduler churn

perf — sampling profiler and event counter

perf samples the CPU (or hardware counters) to build a statistical picture of where cycles go. The flow is record → report, and a flame graph turns the report into a visual where width = time spent. Use it to find the hot function, cache-miss hotspots, or scheduler behaviour:

# System-wide CPU profile for 30s, then a plain-text report (drop --stdio for the interactive TUI)
perf record -F 99 -a -g -- sleep 30
perf report --stdio | head -40

# Hardware counters: cache misses, branch misses, IPC for a command
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./myapp

# Off-CPU / scheduler latency: why a thread is NOT running
perf sched record -- sleep 10 ; perf sched latency

# Flame graph (needs Brendan Gregg's FlameGraph scripts)
perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

`perf` subcommand	What it shows	Use for
`perf record`/`report`	Sampled call stacks (on-CPU)	Finding the hot function; flame graphs
`perf stat`	Hardware/software event counts (IPC, misses)	Cache/branch behaviour, IPC per workload
`perf top`	Live top functions by samples	Watching a hotspot in real time
`perf sched`	Scheduler latency, off-CPU time	Why a thread waits/doesn’t run
`perf trace`	strace-like syscall summary (lower overhead)	Syscall latency and counts
`perf mem`/`perf c2c`	Memory access + cache-line contention	NUMA/false-sharing hotspots

bpftrace — ad-hoc kernel tracing with eBPF

bpftrace runs verifier-checked eBPF programs to answer questions the counters can’t (per-syscall latency histograms, off-CPU stacks, block-I/O size distributions, TCP retransmit sources) with low overhead for targeted probes and no reboot or kernel module. It is the sharpest instrument for “why is this slow” once USE has pointed at a resource:

# Distribution of block I/O size (bytes) issued to the device
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'

# read() syscall latency histogram (microseconds) by process
bpftrace -e 'tracepoint:syscalls:sys_enter_read { @s[tid] = nsecs; }
             tracepoint:syscalls:sys_exit_read /@s[tid]/ {
               @us[comm] = hist((nsecs - @s[tid]) / 1000); delete(@s[tid]); }'

# Off-CPU hint: count context switches by kernel stack (shows where threads block, not for how long; use offcputime for durations)
bpftrace -e 'kprobe:finish_task_switch { @[kstack] = count(); }'

# Which process is issuing the most block I/O
bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'

The BCC toolkit ships ready-made tools built on the same machinery — reach for these before writing bespoke bpftrace:

Tool	Answers	Layer
`biolatency` / `biosnoop`	Block I/O latency distribution / per-I/O trace	Disk
`execsnoop` / `opensnoop`	New processes / file opens system-wide	Process/FS
`tcplife` / `tcpretrans`	TCP session summaries / retransmit sources	Network
`runqlat` / `runqlen`	Scheduler run-queue latency / length	CPU
`cachestat`	Pagecache hit/miss ratio	Memory
`offcputime` / `profile`	Off-CPU stack time / sampled on-CPU stacks	CPU/blocking
`funclatency`	Latency of a specific kernel/user function	Any

The observability discipline mirrors the tuning discipline: use these tools to attribute a bottleneck before changing a knob, and to validate that the change moved the specific metric — a flame graph before and after, or a biolatency histogram that shifted left. Remember that perf and eBPF are privileged (see Security notes); use targeted probes on production, not always-on broad tracing.

Architecture at a glance

There is no single diagram for performance tuning because there is no single “system” — there is a stack of resources, each with its own bottleneck, its own USE signal, and its own knob layer, and the whole point of the method is to walk that stack top to bottom rather than reach for one favourite knob. Picture the machine as five layers a request or a byte passes through, and picture the diagnostic loop wrapped around all of them.

Start at the top with the application — a thread that wants CPU, memory, a socket, and a block device. Beneath it sits the kernel scheduler and CPU layer: which core the thread runs on (at full frequency or crawling out of a deep C-state), whether an interrupt storm on the same core is stealing its cycles, and — on a multi-socket box — whether the memory it touches is local or a remote-NUMA round trip. Below that is the virtual-memory layer: hugepages vs 4 KB pages, THP defrag stalling the thread, a wall of dirty pages about to block every writer synchronously. Below that, two parallel I/O paths: the block layer (right or wrong scheduler, a queue full or starved, a readahead window matched or mismatched to the access pattern) and the network stack (a socket buffer sized to the bandwidth-delay product or three orders of magnitude off, a backlog that absorbs a burst or drops it, a NIC ring that overflows, interrupts piled on one core or spread across eight). At the very bottom sits the hardware and its firmware — the NVMe device, the DIMMs on each node, the NIC and its offload engines, the BIOS power profile that may be quietly throttling everything above it.

Now wrap the loop around it. The USE method is the entry point into this stack: vmstat, mpstat and /proc/pressure/cpu point you at the CPU layer; free, sar -B and /proc/pressure/memory at the VM layer; iostat and /proc/pressure/io at the block layer; ss, sar -n DEV and ethtool -S at the network layer. The saturation signal (a run queue longer than the core count, a PSI full line climbing, a NIC drop counter ticking, a queue depth pinned high) tells you which layer to open, so you tune that layer and only that layer. tuned is the coherent bundle you apply across the layer; tuned-adm verify proves it took; perf and bpftrace attribute the stall to a specific function or off-CPU wait when the coarse tools are ambiguous; and the re-run of the identical baseline closes the loop. The architecture in one sentence: measure the stack from the top, localise to one layer by its saturation signal, change one coherent bundle there, and prove it against the same benchmark before moving down.

Real-world scenario

A payments platform team ran their core PostgreSQL 16 fleet on dual-socket AMD EPYC servers with NVMe and 512 GB RAM, on kernel 6.1 with tuned throughput-performance. After a hardware refresh (more cores, faster NVMe, same architecture), p99 query latency got worse — from a steady 4 ms to a spiky 7–11 ms — despite the faster hardware. The on-call narrative was “the new NVMe firmware is bad,” and someone had already opened a vendor ticket.

The USE method told a different story, layer by layer. iostat -xz 1 showed NVMe %util under 20% with sub-millisecond w_await and a shallow queue: storage was idle, not the bottleneck, and /proc/pressure/io full avg10 sat at 0. mpstat -P ALL 1 showed no single-core saturation and negligible %steal. free -g showed plenty of headroom and vmstat 1 showed si/so at zero, so the box was not memory-starved. But numastat -p on the postgres backends showed their shared memory resident on node 0, and the system-wide numastat counters (numa_miss and numa_foreign) were climbing by millions per minute: the Linux scheduler was spreading connection backends across both sockets while shared_buffers lived on node 0, so roughly half the connections were doing every buffer access across the inter-socket link. The “slow disk” was actually remote-NUMA memory latency, and the refresh made it worse precisely because more cores let the scheduler spread work wider across the two nodes than the old, smaller box ever had.

Two secondary findings compounded it. sysctl vm.zone_reclaim_mode returned 1 — it had silently re-enabled on the new kernel for this NUMA topology, so the kernel was reclaiming local pagecache rather than making a cheap remote allocation, throwing away hot cache. And cat /sys/kernel/mm/transparent_hugepage/enabled showed [madvise] (not never) — the refresh had reset it, reintroducing occasional khugepaged defrag stalls that showed up as the p99 spikes.

The fix was locality and posture, not faster hardware. They stopped fighting the scheduler and instead turned one cross-NUMA database into two node-local ones: a templated systemd unit bound postgresql@0 to node 0 and postgresql@1 to node 1, each with its own shared_buffers local to its node.

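# /etc/systemd/system/postgresql@.service.d/numa.conf
# postgresql@0 binds node 0; postgresql@1 binds node 1
[Service]
# Memory: allocate only from the node named by the instance (%i)
NUMAPolicy=bind
NUMAMask=%i
# CPU: derive the affinity mask from NUMAMask so threads stay on the same node
CPUAffinity=numa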

They folded the other two fixes into the tuned child profile so they would survive the next kernel bump: [sysctl] vm.zone_reclaim_mode=0 and [vm] transparent_hugepages=never, then tuned-adm profile kv-postgres && tuned-adm verify to prove both settings were live (the NUMA binding lives in the systemd unit, outside tuned’s view). They changed one class of knob at a time, NUMA binding first (re-measured), then zone_reclaim_mode (re-measured), then THP (re-measured), so each delta was attributable.

The result: p99 dropped from the spiky 7–11 ms to a steady 4.3 ms, and — the part the team cared about most — the variance collapsed; the tail became predictable. No firmware was changed and the vendor ticket was closed. The runbook line they wrote: “On multi-socket hardware, prove memory locality with numastat before blaming any other resource — and re-pin your tuned posture after every kernel upgrade, because the defaults will drift back.”

Advantages and disadvantages

Methodical, measurement-driven tuning through tuned + sysctl + per-device knobs is powerful, but it is not free — the discipline itself has costs, and so does each knob. Weigh it honestly:

Advantages (why the method + these knobs help)	Disadvantages (why they bite)
The USE method + PSI localise the bottleneck to one resource fast — no guessing which layer	Requires a representative benchmark and a recorded baseline; without them you cannot judge a change
tuned applies coherent, named, revertible bundles — one command to apply, one to revert	A stock profile can silently override a hand-set sysctl; you must `verify` and know inheritance
Per-device I/O schedulers and per-node NUMA binding fix the specific device/topology, not the fleet	Per-device/per-node tuning is fiddly and must be re-checked after hardware or kernel changes
Sizing TCP buffers from BDP keeps fat pipes full without wasting RAM on LAN servers	Wrong BDP either caps throughput (too small) or wastes memory (too large) — you must compute it
Capping C-states / pinning `performance` governor removes tail-latency jitter for latency-critical nodes	Costs real watts and heat; wrong on a batch/density fleet where efficiency matters more
`isolcpus`/`nohz_full` give near-jitter-free cores for the hardest latency needs	Isolated cores are lost to everything else — a capacity trade-off, and boot-time (needs reboot)
perf/bpftrace attribute a stall to a function or off-CPU wait — no more hand-waving	Requires kernel symbols/tracepoints and skill to read; misread flame graphs mislead
Explicit hugepages cut TLB misses and pin DB buffers predictably	Reserved at boot; wrong count wastes RAM or starves the app; fragmentation blocks runtime reservation
Defaults are safe but generic — tuning wins real latency/throughput when a workload deviates	Many defaults are already right on modern kernels; over-tuning adds risk with no gain

The method is right whenever a workload deviates from “general purpose”: latency-critical services, high-BDP or high-throughput pipes, multi-socket hardware, big-memory databases. It is wrong to apply reflexively — modern defaults are sensible, and changing a dozen knobs “because a guide said so” adds regression risk with no measured gain. Every knob here can hurt if set without a baseline to prove it helped, which is exactly why the loop — baseline, one change, re-measure, keep-or-revert — is non-negotiable.

Hands-on lab

This lab establishes a baseline, applies a tuned child profile that changes one coherent bundle (VM writeback + THP + I/O scheduler), verifies it, and re-measures — the full loop, on a single test box. It is safe to run on a disposable VM or lab server; every change is reverted at the end. You need sudo, tuned, fio, sysstat (for sar/iostat/mpstat), and a scratch filesystem at /data (adjust paths). Do not run destructive fio against a device with real data.

Step 1 — Record the environment and the active profile.

uname -r
tuned-adm active
nproc; free -g
lsblk -d -o NAME,ROTA,SIZE,MODEL    # ROTA 0 = SSD/NVMe, 1 = HDD
cat /sys/block/nvme0n1/queue/scheduler   # note the [active] one
sysctl vm.dirty_ratio vm.swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled

Expected: prints your kernel (e.g. 6.1.x), the current profile (often balanced or throughput-performance), core/RAM counts, and the current I/O scheduler in brackets. Write these down — they are the baseline environment.
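Both the scheduler file and the THP file list every option on one line and mark the active one with square brackets. A minimal sketch for pulling out just the active value, so it can be written into the baseline notes mechanically, might look like this. The helper name active_choice is hypothetical, and it assumes the single-bracket format those two files use.

```shell
# active_choice: print the bracketed (active) token from a sysfs "choice" line,
# e.g. "[none] mq-deadline kyber" -> "none". Reads the line from stdin.
# The function name is illustrative and not part of any tool.
active_choice() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example use with the paths from Step 1 (adjust the device name):
#   active_choice < /sys/block/nvme0n1/queue/scheduler
#   active_choice < /sys/kernel/mm/transparent_hugepage/enabled
```

Recording the extracted value next to uname -r and the active profile makes the Step 5 comparison a plain string check.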

Step 2 — Capture a storage baseline with fio.

# 4k random read, direct I/O, QD32, 4 jobs, 30s. Report p99/p99.9.
sudo fio --name=base --filename=/data/fiotest --size=2G --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --group_reporting --runtime=30 --time_based \
    --ioengine=io_uring --percentile_list=50:95:99:99.9 | tee /tmp/fio-baseline.txt
grep -E 'IOPS|99.00th|99.90th' /tmp/fio-baseline.txt

Expected: an IOPS number and latency percentiles, e.g. IOPS=180k, 99.00th=[ 210], 99.90th=[ 450] (microseconds). Record IOPS, p99, p99.9 — these are the numbers every later change is judged against.

Step 3 — Watch PSI under the same load (in a second terminal). While Step 2 runs, in another shell:

watch -n1 'cat /proc/pressure/io /proc/pressure/cpu'

Expected: io some avg10 rises during the run; io full avg10 should stay low if storage is not the ceiling. Note the peak io some/full — a saturation baseline.
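Reading the peak off a watch screen by eye is error-prone. As a rough sketch, the two helpers below extract one avg10 value from a PSI file and keep the maximum of a stream of samples. The names psi_avg10 and psi_peak are made up for this example, and the parsing assumes the standard "some avg10=... avg60=..." line layout.

```shell
# psi_avg10 KIND [FILE]: print the avg10 value of the "some" or "full" line.
# FILE defaults to /proc/pressure/io. Helper names are illustrative.
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' \
        "${2:-/proc/pressure/io}"
}

# psi_peak: read one number per line on stdin, print the largest seen.
psi_peak() {
    awk 'BEGIN { max = 0 } ($1 + 0) > max { max = $1 + 0 } END { print max }'
}

# Example: sample io "full" once a second for the 30 s fio run, keep the peak.
#   for i in $(seq 30); do psi_avg10 full; sleep 1; done | psi_peak
```

The peak, rather than the last sample, is the number worth writing down, because avg10 decays quickly once the load stops.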

Step 4 — Author and apply a tuned child profile (one coherent bundle).

sudo mkdir -p /etc/tuned/lab-storage
sudo tee /etc/tuned/lab-storage/tuned.conf >/dev/null <<'EOF'
[main]
summary=Lab: throughput-performance + explicit VM writeback, THP off, NVMe none
include=throughput-performance

[sysctl]
vm.dirty_background_bytes=268435456
vm.dirty_bytes=1073741824
vm.swappiness=1

[vm]
transparent_hugepages=never

[disk]
devices=nvme*n*
elevator=none
readahead=128
EOF

sudo tuned-adm profile lab-storage

Expected: no error; tuned-adm active now shows lab-storage.

Step 5 — Verify the profile actually took (the step everyone skips).

sudo tuned-adm verify
sysctl vm.dirty_bytes vm.swappiness             # expect 1073741824 and 1
cat /sys/kernel/mm/transparent_hugepage/enabled # expect [never]
cat /sys/block/nvme0n1/queue/scheduler          # expect [none]

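Expected: tuned-adm verify prints “Verification succeeded, current system settings match the preset profile.” (older tuned releases misspell it “Verfication”). If it reports a mismatch, a stale /etc/sysctl.d/ drop-in or a udev rule that sets the scheduler is fighting the profile; resolve that before trusting the re-measure. The old elevator= boot parameter is ignored by current blk-mq kernels, so it is not the culprit on 5.15+.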

Step 6 — Re-run the identical fio baseline and diff.

sudo fio --name=tuned --filename=/data/fiotest --size=2G --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --group_reporting --runtime=30 --time_based \
    --ioengine=io_uring --percentile_list=50:95:99:99.9 | tee /tmp/fio-tuned.txt
echo "=== BASELINE ==="; grep -E 'IOPS|99.00th|99.90th' /tmp/fio-baseline.txt
echo "=== TUNED    ==="; grep -E 'IOPS|99.00th|99.90th' /tmp/fio-tuned.txt

Expected: on an NVMe device that was already on none, the storage numbers may be unchanged, which is a valid, honest result proving the scheduler was not your bottleneck. The VM/THP changes matter under a write-heavy or large-dirty workload; re-run with --rw=randwrite --direct=0 to see the dirty-limit effect. The buffered mode matters here: with --direct=1 the writes bypass the page cache, so the dirty thresholds never come into play. The lesson is the method, not a guaranteed win: you changed one bundle, verified it, and re-measured against a recorded number.
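Eyeballing two grep outputs invites wishful reading. A hedged sketch of turning the comparison into a number follows; fio_p99 and pct_change are hypothetical helper names, and the parser assumes fio's human-readable percentile layout as quoted in Step 2 (99.00th=[ 210]) with the same latency unit in both runs.

```shell
# fio_p99 [FILE]: print the first 99.00th percentile value found in fio's
# text output (FILE, or stdin when no file is given). Assumes the
# "99.00th=[  210]" layout; the unit is whatever fio printed in the
# "clat percentiles" header, so compare only runs that share a unit.
fio_p99() {
    sed -n 's/.*99\.00th=\[ *\([0-9][0-9]*\)].*/\1/p' "$@" | head -n 1
}

# pct_change BASE NEW: signed percentage change from BASE to NEW.
# For latency a negative result is an improvement.
pct_change() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (n - b) / b * 100 }'
}

# Example with the files written in Steps 2 and 6:
#   pct_change "$(fio_p99 /tmp/fio-baseline.txt)" "$(fio_p99 /tmp/fio-tuned.txt)"
```

A change within the run-to-run noise of the benchmark should be treated as no change; repeating the baseline a few times first gives a feel for that noise.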

Step 7 — (Optional) Prove a NUMA/CPU observation with a quick tool.

numactl --hardware | head          # topology and distance matrix
cat /proc/pressure/cpu             # cpu saturation right now
mpstat -P ALL 1 3                  # per-core; look for a lone saturated core

Step 8 — Teardown (revert everything).

# Revert to the profile you recorded in Step 1 (e.g. throughput-performance or balanced)
sudo tuned-adm profile throughput-performance
sudo tuned-adm verify
sudo rm -rf /etc/tuned/lab-storage
sudo rm -f /data/fiotest /tmp/fio-baseline.txt /tmp/fio-tuned.txt
tuned-adm active                   # confirm you are back to the original profile

Expected: active profile is back to the original; the custom profile directory and test files are gone. Nothing you changed here persists.

Common mistakes & troubleshooting

Performance work goes wrong in predictable ways. This is the symptom → root cause → confirm → fix playbook; scan for your symptom and jump to the row:

#	Symptom	Root cause	Confirm (exact command)	Fix
1	“Tuning” changed nothing	Default was already correct on this kernel	Diff baseline vs after; both identical	Accept it; that knob was not your bottleneck
2	Change helped but you can’t say which	Changed multiple knob classes at once	(no single-variable measurement exists)	Revert all; change one class per iteration
3	Set a sysctl but it’s not live	tuned profile overrides your `/etc/sysctl.d/` drop-in	`tuned-adm verify` (reports mismatch); `sysctl <key>`	Move the delta into the tuned child profile
4	OOM-killer fires under memory pressure	`vm.swappiness=0` prevents swapping, kills instead	`dmesg -T \| grep -i oom`; `sysctl vm.swappiness`	Use `swappiness=1` (avoid-but-allow), not `0`
5	Clients behind NAT fail intermittently	`tcp_tw_recycle` set (removed in 4.12; broke NAT)	`sysctl net.ipv4.tcp_tw_recycle` (errors on 4.12+)	Remove it; use `tcp_tw_reuse=1` for outbound
6	Throughput plateaus far below line rate	TCP buffers too small for the BDP	`ss -tin` (small cwnd/rcv-space); compute BDP	Raise `rmem_max`/`wmem_max`/`tcp_rmem` to BDP
7	p99 latency spikes with no disk/CPU cause	THP `khugepaged` defrag stalls	`cat /sys/kernel/mm/transparent_hugepage/enabled` = `[always]`/`[madvise]`	Set `never` for self-managing DBs/JVMs, durably
8	Writers freeze periodically	`vm.dirty_ratio` too high on large RAM → huge sync writeback	`sar -B 1` (pgpgout spikes); `vm.dirty_ratio` = 20	Set `vm.dirty_bytes`/`dirty_background_bytes`
9	One CPU pinned at 100% `%soft`, others idle	All NIC IRQs/softirq on one core (no RSS/RPS)	`mpstat -I SCPU -P ALL 1`; `cat /proc/interrupts`	RSS (`ethtool -L eth0 combined N`) / RPS / IRQ affinity
10	Latency worse after adding cores/sockets	Remote-NUMA memory access (work spread wider)	`numastat -p <pid>` (memory on one node, threads on both); `numastat` (`numa_miss` climbing)	Bind process CPU+memory to one node (`numactl`)
11	File-cache-heavy app slow, memory “free”	`vm.zone_reclaim_mode=1` reclaims local cache	`sysctl vm.zone_reclaim_mode` = 1	Set `vm.zone_reclaim_mode=0`
12	NVMe latency higher than expected	Software scheduler (`bfq`/`mq-deadline`) adding per-request overhead	`cat /sys/block/nvme0n1/queue/scheduler` = `[bfq]`/`[mq-deadline]`	Set `none`; make durable via udev rule
13	Sequential reads slow, disk under-utilized	`read_ahead_kb` too small for the scan pattern	`cat /sys/block/<d>/queue/read_ahead_kb`	Raise `read_ahead_kb` (e.g. 4096) for sequential
14	Connections dropped/reset under burst	`somaxconn` / app `listen()` backlog too small	`nstat -az \| grep ListenOverflow`; `ss -lnt` (Recv-Q)	Raise `somaxconn` and the app’s backlog
15	RX packets dropped under load	NIC ring buffer overflow	`ethtool -S eth0 \| grep -i drop`; `-g` shows small ring	Raise rings: `ethtool -G eth0 rx 4096 tx 4096`
16	Router/bridge corrupts or drops flows	LRO coalescing on a forwarding host	`ethtool -k eth0 \| grep large-receive` = on	`ethtool -K eth0 lro off` on forwarders
17	Latency jitter on an otherwise idle box	Deep C-state exit latency on wakeup	`cpupower idle-info` (high C6 exit latency)	`force_latency` / `intel_idle.max_cstate=1`
18	Frequency never reaches max under load	`powersave`/`schedutil` ramp latency	`cat .../scaling_governor`; `turbostat` (low MHz)	`performance` governor (tuned latency-performance)
19	`%steal` high in `mpstat`	Hypervisor is taking your CPU cycles	`mpstat -P ALL 1` (`%steal` > 0)	Not a guest tuning issue — raise with the host/provider
20	Config lost after reboot	Runtime `echo`/`sysctl -w` not persisted	Value differs before/after reboot	Persist via tuned profile / `/etc/sysctl.d/` / cmdline
21	`numa_balancing` thrash on a pinned process	Auto NUMA balancing fighting your explicit pin	`sysctl kernel.numa_balancing` = 1 while pinned	Set `kernel.numa_balancing=0` when binding manually

Two meta-mistakes invalidate everything: changing more than one class of knob per iteration (you can never attribute the result), and never re-running the identical baseline (you judge by feeling). Catch either and reset — an unattributable “win” is a future regression waiting to happen.
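Row 17 confirms deep C-state exit latency with cpupower idle-info. Where cpupower is not installed, the same figures can usually be read from the cpuidle sysfs directory. The sketch below assumes a cpuidle driver is loaded and exposes stateN/name and stateN/latency (exit latency in microseconds); the function name cstate_latencies is illustrative.

```shell
# cstate_latencies [DIR]: print "<state name> <exit latency in us>" for each
# idle state under DIR (default: cpu0's cpuidle directory).
# Runs in a subshell so its variables do not leak into the caller.
cstate_latencies() (
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip silently if the glob matched nothing or a file is unreadable
        [ -r "$state/name" ] && [ -r "$state/latency" ] || continue
        printf '%s %s\n' "$(cat "$state/name")" "$(cat "$state/latency")"
    done
)

# Example: cstate_latencies
```

A state whose exit latency is comparable to your latency budget is the one worth capping; states with single-digit microsecond exits rarely explain a p99 problem.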

Best practices

Baseline before you touch anything. Record workload, uname -r, active tuned profile, and headline metrics (IOPS, p99, PSI). No baseline, no tuning — only guessing.
Change one class of knob per iteration, then re-measure. VM, then network, then CPU, then I/O — never all at once. Keep a change only if the dimension you chose to optimise moved.
Use the USE method as the entry point. Utilization, saturation (PSI, run-queue, queue depth), errors — per resource — localise the bottleneck before reaching for a favourite knob.
Own tuning through a tuned child profile. include= a stock profile, override only measured deltas, never edit stock profiles in place, and run tuned-adm verify after every apply.
Size TCP buffers from the bandwidth-delay product, not a copied constant. LAN and WAN want wildly different values.
Never set tcp_tw_recycle (removed in 4.12; it broke clients behind NAT). Use tcp_tw_reuse=1 for outbound reuse. Treat any guide that recommends tcp_tw_recycle as untrustworthy.
Set THP explicitly. never for self-managing databases and JVM heaps; make it durable via tuned or cmdline, not a one-shot echo.
Use vm.dirty_bytes/dirty_background_bytes on large-RAM hosts for a fixed, predictable writeback ceiling instead of a percentage that becomes tens of GB.
Pick the I/O scheduler per device class (none for NVMe, mq-deadline for SATA/HDD) and make it durable with a udev rule or the tuned [disk] plugin — not both.
Prove NUMA locality on multi-socket hardware with numastat before blaming disk or CPU; bind or interleave deliberately and keep zone_reclaim_mode=0.
Spread NIC interrupts (RSS/RPS/IRQ affinity) so no single core saturates on softirq; keep IRQ, softirq, and app thread L3-local.
Re-verify posture after every kernel or hardware change. Defaults drift back (THP, zone_reclaim_mode, governor); a tuned profile + verify catches the drift.
Persist everything. Runtime echo/sysctl -w evaporate on reboot — commit to a tuned profile, /etc/sysctl.d/, a udev rule, or the kernel cmdline.
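The BDP rule in the list above is plain arithmetic, and it is worth scripting so nobody reaches for a copied constant. A minimal sketch, with a hypothetical helper name bdp_bytes and inputs in Mbit/s and milliseconds:

```shell
# bdp_bytes LINK_MBIT RTT_MS: bandwidth-delay product in bytes.
#   bytes = (Mbit/s * 1e6 / 8) * (RTT in ms / 1000)
bdp_bytes() {
    awk -v mbit="$1" -v rtt_ms="$2" \
        'BEGIN { printf "%.0f\n", mbit * 1000000 / 8 * rtt_ms / 1000 }'
}

# Examples:
#   bdp_bytes 10000 10    # 10 GbE at 10 ms RTT
#   bdp_bytes 1000 0.2    # 1 GbE LAN at 0.2 ms RTT
```

The first example yields 12500000 bytes (about 12.5 MB) and the second 25000 bytes (about 25 KB). The result is a floor for net.core.rmem_max/wmem_max and the third field of tcp_rmem/tcp_wmem on hosts that must fill that path with a single stream, not a value to apply fleet-wide.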

Security notes

Performance tuning and security intersect more than people expect — several knobs and mitigations trade one for the other, and the observability tooling is privileged.

CPU vulnerability mitigations cost performance, and disabling them is a security decision. Spectre/Meltdown/MDS/Retbleed mitigations (mitigations=auto by default) add overhead, especially on syscall-heavy or context-switch-heavy workloads. mitigations=off on the kernel cmdline recovers it — but only do so on trusted, non-multi-tenant, isolated hosts where you accept the risk. Read the current state with grep . /sys/devices/system/cpu/vulnerabilities/* before deciding.
bpftrace/perf/eBPF are privileged and powerful. They can read kernel memory, arguments, and syscalls system-wide. Restrict who can run them: kernel.perf_event_paranoid (raise it to limit unprivileged perf), kernel.unprivileged_bpf_disabled=1 to require capabilities for BPF, and grant CAP_BPF/CAP_PERFMON narrowly rather than running everything as root. See the eBPF security discussion in Building a Linux Audit Trail with auditd and eBPF Runtime Visibility.
kernel.kptr_restrict and kernel.dmesg_restrict hide kernel pointers and the ring buffer from unprivileged users; keep them enabled (1) — leaking kernel addresses aids exploitation, and performance work does not require exposing them.
Do not disable memory protections for speed. randomize_va_space (ASLR) has negligible performance cost; never set it to 0 for “performance.” The same goes for stack protectors.
Isolated/tickless cores are an availability surface. isolcpus/nohz_full cores run only what you pin; a misconfiguration can starve system daemons. Treat the cmdline as security-relevant config under change control.
net.ipv4.tcp_syncookies=1 (default on) protects the SYN backlog against SYN-flood DoS — a performance-adjacent knob that is also a defence. Do not disable it while “tuning” the SYN backlog.
NIC offloads and observability tooling can leak. Promiscuous mode (from packet-capture tuning) and some offloads expose traffic; audit who can put an interface in promisc mode. Restrict CAP_NET_ADMIN.
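Several of the knobs above are easy to flip during a tuning session and forget. As a rough sketch, the helper below prints the current value of each one by reading the files under /proc/sys directly; the name posture_report is hypothetical, and the key list is simply the sysctls named in this section.

```shell
# posture_report [ROOT]: print the security-relevant sysctls from this section.
# ROOT defaults to /proc/sys; keys are given as paths and shown in dotted form.
# Runs in a subshell so its variables do not leak into the caller.
posture_report() (
    root="${1:-/proc/sys}"
    for key in kernel/perf_event_paranoid kernel/unprivileged_bpf_disabled \
               kernel/kptr_restrict kernel/dmesg_restrict \
               kernel/randomize_va_space net/ipv4/tcp_syncookies; do
        name=$(echo "$key" | tr / .)
        if [ -r "$root/$key" ]; then
            printf '%s = %s\n' "$name" "$(cat "$root/$key")"
        else
            printf '%s = (not present)\n' "$name"
        fi
    done
)

# Example: posture_report
```

Capturing this output before and after a tuning session makes an accidental security regression visible in the same diff as the performance change.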

Cost & sizing

Performance tuning’s cost is mostly opportunity and risk, not licence fees — the software (tuned, sysstat, bpftrace, ethtool) is free and in-distro. But the choices you make have real hardware and power implications:

Right-sizing beats over-provisioning. The biggest cost win is discovering, via the USE method, that a workload is bound by remote-NUMA latency or a small TCP window — not capacity — so you don’t buy more hardware. The payments-platform scenario ended on the same hardware; a naive team would have thrown money at “faster NVMe” that was already idle.
The performance governor and idle=poll cost watts and heat. Pinning cores at max frequency and forbidding idle states raises power draw measurably — fine on a handful of latency-critical nodes, wasteful across a large density fleet. Match the posture to the node’s job: performance for latency nodes, schedutil/balanced for the rest.
isolcpus trades capacity for jitter-free cores. Dedicating 8 of 16 cores to latency-critical threads halves the general-purpose capacity of that box — a real cost you pay for a real latency guarantee. Size the isolation to the actual latency-critical thread count.
Explicit hugepages reserve RAM up front. 16 GB of reserved hugepages is 16 GB unavailable to anything else, whether the DB uses it or not. Size vm.nr_hugepages to shared_buffers, not “round up generously.”
Larger NIC rings and socket buffers use memory. Rarely a cost driver at server scale, but 16 MB max socket buffers × tens of thousands of connections is real RAM — which is why you size buffers from BDP (LAN connections need KB, not MB) rather than blanket-maxing them.
Observability has a runtime cost. perf record and heavy bpftrace probes add overhead (usually low single-digit %); use them for diagnosis, not permanent always-on tracing, and prefer targeted probes over broad ones on production hosts.
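The hugepage sizing point above reduces to a division with rounding up. A minimal sketch, assuming 2 MiB pages unless told otherwise (check Hugepagesize in /proc/meminfo on the target host) and a hypothetical helper name hugepages_needed:

```shell
# hugepages_needed SIZE_MIB [PAGE_KIB]: pages needed to cover SIZE_MIB,
# rounded up. PAGE_KIB defaults to 2048 (2 MiB pages), which is an assumption.
hugepages_needed() {
    awk -v mib="$1" -v page_kib="${2:-2048}" 'BEGIN {
        pages = mib * 1024 / page_kib
        if (pages > int(pages)) pages = int(pages) + 1   # round up
        printf "%d\n", pages
    }'
}

# Example: a 16 GiB buffer pool on 2 MiB pages
#   hugepages_needed 16384
```

For the 16 GiB example this gives 8192 pages. A database typically maps somewhat more shared memory than the buffer pool alone, so treat the result as a lower bound and confirm against HugePages_Free after the service starts.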

Rough sizing guidance: a latency-critical database node justifies performance governor + capped C-states + NUMA binding + explicit hugepages (accepting the watts); a batch/analytics fleet is better on throughput-performance (or balanced) with default power saving, where efficiency matters more than the tail. The cost model is hardware not bought (the win) versus watts, heat, and reserved RAM (the price of aggressive postures) — apply the expensive postures only where measurement proves they pay.

Interview & exam questions

1. Explain the USE method and why saturation is the strongest signal. USE = for each resource (CPU, memory, disk, network) check Utilization, Saturation, and Errors. Saturation — work queueing because the resource is full — is the strongest signal because a resource can be 100% utilized yet fine (a batch job wants that), but saturation means demand exceeds capacity and something is waiting. On modern kernels, PSI (/proc/pressure/*) is the most honest saturation metric.

2. What does PSI full avg10 on /proc/pressure/io mean? It is the percentage of the last 10 seconds during which every non-idle task was stalled waiting on I/O. Unlike the some line (≥1 task stalled), full means the whole machine was blocked — a definitive “storage is the bottleneck” signal, regardless of what device utilization suggests.

3. Why should you never edit a stock tuned profile in place, and what do you do instead? Stock profiles under /usr/lib/tuned are package-owned and overwritten on upgrade, so your edits vanish. Instead create a child profile in /etc/tuned/<name>/tuned.conf that include=s the stock profile and overrides only your measured deltas; apply with tuned-adm profile and prove it with tuned-adm verify.

4. Does vm.swappiness=0 disable swap? What should a database use? No — 0 makes the kernel avoid swapping until reclaim is nearly impossible, which can trigger the OOM killer sooner (it kills rather than swaps). Databases should use 1 (“avoid but allow”). To truly disable swap you swapoff -a and remove it from fstab.

5. Why is tcp_tw_recycle dangerous and what replaces it? It was removed in kernel 4.12 because it dropped packets from clients behind NAT (it rejected connections whose TCP timestamps looked out of order across different sources sharing a NAT). The safe replacement is tcp_tw_reuse=1, which reuses TIME_WAIT sockets for new outbound connections only, without the NAT hazard.

6. How do you size TCP socket buffers correctly? From the bandwidth-delay product: BDP = link bandwidth × RTT. Set net.core.rmem_max/wmem_max and the max value of tcp_rmem/tcp_wmem to at least the worst-case BDP so a single stream can keep the pipe full. A 10 GbE link at 10 ms RTT needs ~12.5 MB; a 1 GbE LAN at 0.2 ms needs ~25 KB — which is why one constant for the whole fleet is wrong.

7. When is none the right I/O scheduler and why? For NVMe and fast SSDs. Those devices have deep internal queues and reorder requests better than the OS can, so a software scheduler only adds latency and CPU. mq-deadline suits SATA SSD/HDD (bounds worst-case latency, prevents starvation); bfq suits interactive desktops needing per-process fairness but is too CPU-costly for high-IOPS server paths.

8. What is remote-NUMA latency and how do you prove and fix it? On multi-socket servers, memory on a remote node costs ~1.5–2× the local access latency. A process spread across sockets while its memory lives on one node pays that tax constantly, and more cores make it worse by spreading work wider. Prove it with numastat -p <pid> (the process’s memory concentrated on one node while its threads run on both) plus the system-wide numastat counters (climbing numa_miss/numa_foreign); fix it by binding CPU+memory to one node (numactl --cpunodebind=N --membind=N or systemd NUMAPolicy=bind), and keep vm.zone_reclaim_mode=0.

9. Why does THP hurt some databases, and what should they set? Transparent Hugepages trigger background collapse work (khugepaged) and synchronous compaction when a huge page is allocated, causing unpredictable latency spikes for apps that manage their own memory (PostgreSQL, MySQL, Redis, Mongo, JVM heaps). Those should set THP to never (durably, via tuned or cmdline) and, if they want large pages, use explicit reserved hugepages instead.

10. What causes one CPU to sit at 100% %soft while others idle, and how do you fix it? All of a NIC’s interrupts/softirq are landing on one core (no RSS, or IRQs pinned to one CPU). Confirm with mpstat -I SCPU -P ALL 1 and /proc/interrupts. Fix by enabling multiple hardware queues (ethtool -L eth0 combined N) so RSS spreads flows across cores, or use RPS/IRQ affinity to distribute, keeping IRQ, softirq, and the app thread L3-local.

11. Difference between isolcpus, nohz_full, and IRQ affinity? isolcpus removes cores from the scheduler’s load balancing (only explicitly pinned tasks run there); nohz_full makes a core tickless (no periodic scheduler tick) when running a single task, cutting jitter; IRQ affinity (smp_affinity) controls which CPUs service a device’s interrupts. Together they carve near-jitter-free cores: isolate them, make them tickless, and steer IRQs to the housekeeping cores instead.

12. What is the difference between C-states and P-states (governors), and which matters more for latency? P-states (cpufreq governors) set the frequency of an active core; C-states set how deeply an idle core sleeps. Governors add ramp-up latency; C-states add wakeup (exit) latency that can be far larger (tens–hundreds of µs for C6). For latency-critical work, C-states are usually the bigger lever — cap them via PM-QoS (force_latency) or intel_idle.max_cstate=1, and pin the performance governor to remove frequency jitter too.

These map to vendor-neutral Linux Foundation LFCS/LFCE and RHCE/RHCSA performance-tuning objectives, and the eBPF/observability portions align with Linux Foundation eBPF and general SRE competency. A compact revision map:

Question theme	Cert / competency	Objective area
USE method, PSI, baselining	SRE / LFCE	Performance analysis and capacity
tuned profiles + verify	RHCSA/RHCE, LFCS	System tuning and configuration
sysctl (vm/net/fs)	LFCE, RHCE	Kernel runtime parameters
NUMA, CPU governors, C-states	LFCE / SRE	Hardware-aware tuning
I/O schedulers, block queue	LFCE, RHCE	Storage performance
Network stack, BDP, congestion	LFCE / SRE	Network performance
perf / bpftrace / eBPF	LF eBPF / SRE	Observability and tracing

Quick check

You change five sysctls and a scheduler at once, reboot, and latency improves. What is wrong with concluding your tuning “worked”?
/proc/pressure/io shows full avg10 = 35 under load, but iostat shows the NVMe device at only 40% %util. What do you conclude and which metric do you trust?
Your database on a dual-socket box got slower after a refresh that added cores. Name the most likely cause and the one command that confirms it.
A guide tells you to set net.ipv4.tcp_tw_recycle=1 and vm.swappiness=0. What is wrong with each?
You need to make an NVMe-none scheduler choice survive reboot on a mixed-media host without hand-editing every device. What mechanism do you use?

Answers

You changed more than one class of knob at once, so the improvement is unattributable — you cannot tell which change helped, and one of them may be hurting while another over-compensates. Revert everything and change one class per iteration, re-measuring each. An unattributable win is a future regression.
Trust PSI. io full avg10 = 35 means that for roughly 35% of the last 10 seconds every non-idle task was blocked on I/O, so the workload is I/O-bound regardless of the %util figure. %util only reports the share of time at least one request was in flight, which says little about saturation on a multi-queue device that services many requests in parallel, or when latency rather than busy time is the ceiling. Storage is your bottleneck; investigate device latency (await) and queue depth (aqu-sz).
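Remote-NUMA memory latency: more cores let the scheduler spread the process’s threads wider across both sockets while its memory sits on one node, so more accesses cross the inter-socket link. Confirm with numastat -p <pid>, which shows how the process’s memory is split across nodes: memory concentrated on one node while its threads run on both is the smoking gun, and climbing numa_miss/numa_foreign counters in plain (system-wide) numastat corroborate it. Fix by binding CPU and memory to one node.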
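tcp_tw_recycle was removed in kernel 4.12 because it breaks clients behind NAT (out-of-order timestamps across NATed sources get dropped), so the sysctl no longer exists on any kernel this article targets; use tcp_tw_reuse=1 for outbound connections instead. vm.swappiness=0 does not disable swap; it tells the kernel to avoid swapping anonymous pages until file-backed memory is nearly exhausted, which makes an OOM kill likelier than a swap-out under pressure. Use 1 on databases for “avoid but allow.”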
A udev rule keyed on device class (KERNEL=="nvme[0-9]*n[0-9]*" → ATTR{queue/scheduler}="none"; KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0" → mq-deadline) in /etc/udev/rules.d/, or the tuned [disk] plugin (devices=nvme*n*, elevator=none) — one mechanism, not both, so tuned-adm verify does not fight the udev rule.
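Whichever mechanism from the last answer you choose, it is worth confirming after a reboot that it took effect. The sketch below assumes the standard /sys/block/<dev>/queue layout, where the kernel marks the active scheduler with square brackets; the function name show_schedulers and its optional root argument are illustrative.

```shell
#!/bin/sh
# Print "<device> <active scheduler> rotational=<0|1>" for every block device.
# $1 is the sysfs block root; it defaults to /sys/block and is overridable for testing.
show_schedulers() {
    root=${1:-/sys/block}
    for q in "$root"/*/queue; do
        [ -r "$q/scheduler" ] || continue
        dev=$(basename "$(dirname "$q")")
        # The active scheduler is the bracketed entry, e.g. "[none] mq-deadline kyber bfq".
        active=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$q/scheduler")
        # Some devices list a single unbracketed value; fall back to the raw content.
        printf '%s %s rotational=%s\n' "$dev" "${active:-$(cat "$q/scheduler")}" "$(cat "$q/rotational")"
    done
}

# Example call (not executed here): show_schedulers
```

On a mixed-media host you would expect NVMe namespaces to report none and SATA devices to report mq-deadline; any other result suggests the rule or profile did not match the device.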

Glossary

USE method — Brendan Gregg’s checklist: for each resource, check Utilization, Saturation, and Errors before tuning. Localises the bottleneck.
PSI (Pressure Stall Information) — kernel counters at /proc/pressure/{cpu,io,memory}; some = ≥1 task stalled, full = all non-idle tasks stalled, over avg10/avg60/avg300 windows. The most honest saturation metric (kernel 4.20+).
tuned — the orchestration daemon that applies coherent, named, revertible knob bundles (profiles) covering sysctl, CPU, disk, THP and IRQ posture; controlled with tuned-adm.
tuned child profile — a profile in /etc/tuned/<name>/tuned.conf that include=s a stock profile and overrides only measured deltas; the correct place for all your tuning.
tuned-adm verify — asserts every setting in the active profile is actually live, catching conflicts from stale sysctl drop-ins or cmdline args.
sysctl — runtime kernel parameters under /proc/sys (vm.*, net.*, fs.*), set live with sysctl -w and persisted in /etc/sysctl.d/ or a tuned profile.
cpufreq governor — the policy choosing active-core frequency (performance, schedutil, powersave); performance pins max frequency, removing ramp-up jitter.
C-state — a CPU idle sleep depth (C0 active … C6 deep); deeper states save power but have larger exit latency, adding wakeup jitter to latency-critical work.
PM QoS / force_latency — a request (via /dev/cpu_dma_latency, driven by tuned’s force_latency) for a maximum tolerable CPU wakeup latency; the kernel then refuses deeper C-states.
IRQ affinity — which CPUs service a device’s hardware interrupts, set via /proc/irq/<n>/smp_affinity; used to keep NIC softirq off latency-critical cores.
NUMA (Non-Uniform Memory Access) — multi-socket architecture where each socket has local memory; remote-node access costs ~1.5–2× local latency. Managed with numactl/numastat.
vm.zone_reclaim_mode — when non-zero, reclaims local pagecache rather than allocating a remote NUMA page; usually harmful — keep 0.
isolcpus / nohz_full / rcu_nocbs — kernel cmdline options that remove cores from scheduler balancing, make them tickless, and offload RCU callbacks, respectively — for dedicated near-jitter-free cores.
blk-mq scheduler — the multi-queue block I/O scheduler: none (NVMe), mq-deadline (SATA/HDD), kyber (latency-target SSD), bfq (fairness/interactive).
nr_requests / read_ahead_kb — per-device block-queue knobs for OS-side queue depth and sequential prefetch window respectively.
THP (Transparent Hugepages) — automatic 2 MB pages (always/madvise/never); defrag stalls hurt self-managing databases, which prefer never.
Explicit hugepages — admin-reserved 2 MB/1 GB pages (vm.nr_hugepages) an app maps deliberately; pinned and predictable, ideal for DB shared buffers.
vm.dirty_ratio / dirty_background_ratio (and _bytes) — thresholds at which the kernel blocks writers / starts async writeback; use _bytes on large-RAM hosts to cap stall size.
vm.swappiness — 0–200 bias toward reclaiming pagecache vs swapping anon memory; 0 ≠ disabled (OOM-kills sooner); 1 = avoid-but-allow.
BDP (Bandwidth-Delay Product) — bandwidth × RTT; the correct in-flight TCP buffer size, and why one socket-buffer constant cannot fit both LAN and WAN.
Congestion control (cubic/bbr) — the algorithm pacing TCP sends; bbr models bandwidth+RTT (survives non-congestive loss), cubic is loss-based (collapses on it). bbr pairs with the fq qdisc.
somaxconn / tcp_max_syn_backlog / netdev_max_backlog — the accept queue, SYN (half-open) queue, and per-CPU ingress queue depths respectively; conflating them is a classic error.
RSS / RPS / RFS — Receive-Side Scaling (hardware flow-to-queue hashing), Receive Packet Steering (software), and Receive Flow Steering (to the app’s CPU) — spread packet processing across cores.
Ring buffer (NIC) — the RX/TX descriptor queue (ethtool -g); too small overflows under burst, dropping packets before the stack sees them.
Offloads (GRO/LRO/GSO/TSO) — features moving per-packet work to the NIC/batched kernel paths; LRO must be off on forwarders/routers.
perf / bpftrace — the CPU profiler and eBPF tracing front-end used to attribute a stall to a specific function, syscall, or off-CPU wait.
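The BDP entry above reduces to one multiplication, and working it through shows why a single buffer constant cannot suit both LAN and WAN. This is a minimal sketch in integer shell arithmetic; the function name bdp_bytes, the choice of units, and the two example links are illustrative assumptions.

```shell
#!/bin/sh
# bdp_bytes <link speed in Mbit/s> <round-trip time in microseconds>
# BDP = bandwidth x RTT. Mbit/s multiplied by microseconds gives bits,
# so dividing by 8 gives bytes.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

# 10 Gbit/s WAN path with an 80 ms RTT:
bdp_bytes 10000 80000    # prints 100000000 (about 100 MB in flight)

# 1 Gbit/s LAN path with a 0.2 ms RTT:
bdp_bytes 1000 200       # prints 25000 (about 25 KB in flight)
```

The two results differ by a factor of 4000. When sizing socket buffers, the figure to compare against is the third (maximum) field of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem, measured on the path that actually matters to your workload.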

Next steps

You can now run the tuning loop end to end: baseline with USE, localise to one resource, change one coherent bundle through a tuned profile, verify it, and re-measure. Build outward:

Next: Mastering systemd: Units, Timers, Resource Control, and Service Hardening — express per-service CPU affinity, NUMA policy, I/O class and cgroup limits the modern way, instead of wrapping ExecStart in numactl/taskset.
Related: Modern Linux Networking: Bonding, VLANs, and Firewalls with nftables and firewalld — the interface, bonding and firewall layer that sits beneath the socket and NIC tuning here.
Related: Building a Linux Audit Trail with auditd and eBPF Runtime Visibility — go deeper on the eBPF toolchain that also powers bpftrace, and lock down who can run privileged tracing.
Related: Well-Architected Performance Efficiency Pillar: Right-Sizing, Caching, and Load Testing — the architectural framing of why you tune: measure, right-size, cache, and load-test before scaling.
Related: Monitoring and Observability Basics: Logs, Metrics, and Traces — turn the ad-hoc vmstat/iostat/PSI numbers here into continuous metrics, dashboards and alerts.