Ubuntu Server Performance Tuning Guide – Deep Technical Focus
This guide emphasizes in-depth kernel, subsystem, and workload-specific optimizations for Ubuntu Server 24.04 LTS and newer point releases. It assumes you are already familiar with basic tools (tuned, sysctl, cpupower, perf) and prioritizes understanding underlying mechanisms over copy-paste snippets.
1. Scheduler & Preemption Behavior
The default PREEMPT_VOLUNTARY kernel balances responsiveness and throughput. For latency-critical workloads (databases, message queues, soft real-time), switch to full preemption:
- Kernel command line: preempt=full
- Effect: voluntary preemption points are supplemented with forced preemption at nearly every scheduling opportunity
- Trade-off: ~2–8% higher context-switch rate, measurable increase in scheduler overhead under very low load
- Complementary: threadirqs (run hardirq handlers in kernel threads so interrupt work becomes schedulable and can be prioritized) reduces tail latency spikes during network or disk bursts
On large-core-count systems (AMD EPYC, Intel Sapphire Rapids / Emerald Rapids), also evaluate:
- skew_tick=1 → desynchronizes timer ticks → reduces cacheline bouncing in idle polling paths
- nohz_full=1-63 (example for cores 1–63) → completely tickless operation on isolated cores
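The options above all land on the kernel command line. A sketch of a GRUB drop-in combining them — the filename and the 1–63 CPU list are illustrative assumptions; adjust the list to your topology and confirm your kernel was built with PREEMPT_DYNAMIC before relying on preempt=full:

```shell
# /etc/default/grub.d/99-latency.cfg (illustrative; requires PREEMPT_DYNAMIC
# for preempt=full to take effect at boot)
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT preempt=full threadirqs skew_tick=1 nohz_full=1-63"
```

Apply with sudo update-grub, reboot, and verify via cat /proc/cmdline.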
2. Transparent Huge Pages (THP) Strategy
THP behavior directly affects TLB miss rate and memory allocation latency.
Three modes:
- always Kernel aggressively collapses 4 KiB pages into 2 MiB huge pages (THP does not provide 1 GiB pages; those require explicit hugetlbfs reservation). Best for: large sequential access patterns, in-memory databases (PostgreSQL shared_buffers, Redis), JVM heaps with G1/ZGC/Shenandoah. Risk: allocation stalls during collapse, especially under memory pressure.
- madvise (the Ubuntu default, increasingly recommended in enterprise distros) Huge pages are used only where the application explicitly requests them via madvise(MADV_HUGEPAGE). PostgreSQL ≥15, Redis (with THP defrag disabled), and many Java runtimes already do this correctly. Safest compromise for mixed workloads.
- never Disable completely. Use only when measuring consistent regression with madvise/always.
Practical tuning path:
- Set transparent_hugepage=madvise at boot
- For applications that benefit: ensure they set MADV_HUGEPAGE (most do)
- Monitor: grep -i AnonHugePages /proc/meminfo and the thp_fault_alloc / thp_collapse_alloc counters in /proc/vmstat
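The tuning path above can be verified with a few reads of standard kernel interfaces (paths are standard on Ubuntu kernels):

```shell
# Active THP mode is shown in brackets, e.g.: always [madvise] never
cat /sys/kernel/mm/transparent_hugepage/enabled
# Anonymous memory currently backed by 2 MiB huge pages
grep -i AnonHugePages /proc/meminfo
# Fault-time vs. khugepaged-collapse allocations; thp_collapse_alloc rising
# fast under memory pressure hints at collapse-induced stalls
grep -E 'thp_fault_alloc|thp_collapse_alloc' /proc/vmstat
```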
3. Memory Reclaim & Dirty Page Writeback Dynamics
Server-oriented memory pressure handling differs significantly from desktop defaults.
Key tunables and their mechanical effects:
- vm.swappiness = 1–10 Controls how aggressively the kernel swaps anonymous memory vs. page cache reclaim. Very low values protect application working sets at the cost of potentially evicting more cache.
- vm.dirty_background_ratio / vm.dirty_ratio Percentage of total memory that may be dirty before background writeback (per-device flusher threads; the old pdflush daemon is long gone) or blocking direct writeback starts. Lower values → more frequent but smaller write bursts → better tail latency on flash. Higher values → larger coalesced writes → better sequential write throughput.
- vm.dirty_expire_centisecs & vm.dirty_writeback_centisecs Control aging of dirty pages. Tightening these reduces risk of sudden I/O storms when memory pressure triggers massive writeback.
Modern recommendation for SSD/NVMe servers with ≥64 GiB RAM:
- swappiness = 5–10
- dirty_background_ratio = 2–5
- dirty_ratio = 8–15
- dirty_expire_centisecs = 1000–2000
- dirty_writeback_centisecs = 500
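The ranges above can be pinned via a sysctl drop-in; a sketch with mid-range values, assuming local NVMe, ≥64 GiB RAM, and no conflicting vendor guidance (filename and exact numbers are illustrative):

```shell
# /etc/sysctl.d/90-writeback.conf (illustrative starting point; validate
# under production-representative load before committing)
vm.swappiness = 5
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
```

Apply with sudo sysctl --system and re-measure tail latency before and after.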
4. TCP Stack Deep Tuning (High Concurrency + Bandwidth-Delay Product)
Critical subsystems that determine sustained connections and goodput:
- Congestion control: BBR (mainline since kernel 4.9; BBRv2/BBRv3 remain out-of-tree) is dominant for most internet-facing and intra-DC traffic. Advantages over CUBIC: faster recovery from random loss, lower queueing delay, good throughput even with shallow drop-tail buffers
- Queue disciplines: fq (Fair Queue) with pacing prevents bufferbloat and provides per-flow fairness even without ECN
- SYN backlog & listen queue: tcp_max_syn_backlog should be sized alongside somaxconn, which caps every listener's accept() backlog. When the accept() queue fills, the kernel by default silently drops the handshake-completing ACKs (no RST unless tcp_abort_on_overflow=1), so clients see timeouts rather than refusals
- Ephemeral port exhaustion avoidance: widen ip_local_port_range, enable tcp_tw_reuse (affects outbound connections only), and lower tcp_fin_timeout
- Memory pressure handling: autotuning (tcp_rmem / tcp_wmem) works well up to ~16–32 MiB per socket; beyond that, raise the static maximum (third) values so high-BDP flows don't stall mid-connection
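As a sanity check on socket-memory ceilings, the bandwidth-delay product sets the floor: a 10 Gbit/s path at 50 ms RTT carries 10⁹ × 0.05 / 8 ≈ 62.5 MB in flight, so ~64 MiB maxima cover most single-flow cases. A hedged sysctl sketch pulling the bullets above together (filename and every value are illustrative, not universal):

```shell
# /etc/sysctl.d/91-tcp.conf (illustrative high-concurrency frontend profile;
# the 64 MiB ceilings assume long fat pipes — shrink them on memory-tight hosts)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 32768
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 131072 67108864
```

Remember that listening applications must also request a large backlog in listen(); somaxconn only sets the upper bound.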
5. Block Layer & NVMe Specifics
Modern NVMe behavior is governed by:
- mq-deadline → kyber → none progression (none usually wins on pure NVMe since ~5.15)
- Very high nr_requests (1024–4096) allows deeper internal command queues
- Read-ahead tuning: lower (32–128 KiB) for random I/O dominant workloads, higher (512–2048 KiB) for streaming
- I/O priority & cgroup v2 io.weight (blkio.weight on legacy cgroup v1): increasingly relevant in container-dense environments
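A persistent way to apply the per-device settings above is a udev rule; a sketch, assuming your namespaces match nvme*n* and that none is the right scheduler for your drives (filename and values are illustrative):

```shell
# /etc/udev/rules.d/60-nvme-io.rules (illustrative)
# "none" bypasses the elevator entirely on fast NVMe; switch to
# "mq-deadline" or "kyber" if you need elevator-level queue shaping,
# since nr_requests is only meaningful with an elevator active
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", \
  ATTR{queue/scheduler}="none", \
  ATTR{queue/read_ahead_kb}="128"
```

Reload with sudo udevadm control --reload && sudo udevadm trigger so the rule applies without a reboot.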
6. Workload-Tailored Patterns (Summary)
- Web frontends (NGINX / Envoy / Traefik): epoll, multi_accept, large accept queue, BBR + fq, THP madvise
- Relational databases (PostgreSQL, MariaDB): THP always or madvise with application huge page support, very low swappiness, O_DIRECT where supported (e.g. InnoDB) plus generous dirty ratios for remaining buffered I/O, tuned wal_buffers & checkpoint_completion_target
- In-memory stores (Redis, Memcached, KeyDB): THP never/madvise with transparent defrag disabled, overcommit_memory=1, very low swappiness
- Container / Kubernetes nodes: cgroup v2 memory + cpu + io controllers, br_netfilter bridge sysctls enabled (bridge-nf-call-iptables=1), hugepages-2Mi allocation via kubelet
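For Kubernetes nodes specifically, note that the bridge-netfilter sysctls must be enabled: kube-proxy and most CNI plugins expect bridged pod traffic to traverse iptables. A typical node drop-in (filename illustrative):

```shell
# /etc/sysctl.d/92-k8s.conf — requires the br_netfilter module to be loaded
# first (e.g. listed in a file under /etc/modules-load.d/)
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
```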
Measurement & Validation Layers
Use layered observability:
- perf stat -d -a sleep 10 → broad counters
- bpftrace scripts for softirq latency, TCP retransmit, block I/O completion histograms
- sysdig, bcc tools (biolatency, runqlat, offcputime)
- Application-specific: pg_stat_statements, Redis INFO latency, NGINX stub_status + Prometheus
Apply one category of change at a time, run production-representative load tests (wrk2, locust, sysbench, fio --randrepeat=0), compare p50/p95/p99 latencies, throughput, and CPU efficiency.
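When comparing runs, extracting percentiles from raw latency samples avoids the distortions of averaging. A minimal nearest-rank sketch, assuming one latency value per line in the input file (the helper name and file path are illustrative):

```shell
# Print p50/p95/p99 from a file containing one latency sample per line.
# awk buffers the sorted samples and indexes the nearest-rank positions.
percentiles() {
    sort -n "$1" | awk '{ a[NR] = $1 }
        END { printf "p50=%s p95=%s p99=%s\n",
              a[int(NR*0.50+0.5)], a[int(NR*0.95+0.5)], a[int(NR*0.99+0.5)] }'
}

# Demo with a synthetic 1..100 distribution:
seq 1 100 > /tmp/lat.txt
percentiles /tmp/lat.txt   # p50=50 p95=95 p99=99
```

Run this against the before and after sample files from the same load generator so the comparison is apples to apples.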
If you can describe your dominant workload, CPU generation, memory size, storage type (local NVMe vs networked), and primary bottleneck metric (latency tail, throughput plateau, CPU saturation, etc.), significantly more precise recommendations become possible.