Monitoring and Troubleshooting Debian System Performance Issues
This guide focuses on conceptual understanding and systematic reasoning for diagnosing performance problems on Debian servers (Debian 13 “Trixie” and the kernels current as of 2026). The goal is to teach you how to think about performance rather than just run a fixed set of commands.
1. The Fundamental Performance Layers (Where Time Is Spent)
Every performance issue eventually maps to one or more of these layers:
| Layer | Primary Resource(s) | Typical Symptoms | First Observability Tools | Kernel / Subsystem Involved |
|---|---|---|---|---|
| CPU | Cores, scheduler | High %user, %sys, high context switches | top/htop, mpstat, perf, pidstat | scheduler (EEVDF), softirq, steal time |
| Memory | RAM, swap, page cache | High swap usage, kswapd CPU, OOM killer | free -h, vmstat, sar -r, /proc/meminfo | mm subsystem, reclaim, compaction |
| I/O (block) | Disk/SSD/NVMe | High iowait, long service times, queue depth | iostat -x, iotop, blktrace, sar -d | block layer, I/O scheduler (mq-deadline/none) |
| Network stack | NIC, TCP/IP, sockets | Retransmits, drops, high softirq, queue full | ss -s, nstat -az, sar -n DEV/ETCP, ethtool -S | netfilter, tcp_congestion, qdisc |
| Application / userspace | Threads, locks, syscalls | High latency despite low system usage | strace -c, perf record, bpftrace, slow logs | syscall interface, futex, epoll |
Golden rule: Never tune or blame one layer until you have confirmed it is the actual bottleneck using layered observability.
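Before going deeper, the golden rule's very first check (is load actually above capacity?) can be sketched as a tiny POSIX-sh helper. The function name and the bare load-versus-cores comparison are illustrative simplifications, since load alone cannot tell CPU saturation from D-state I/O waiters:

```shell
#!/bin/sh
# classify_load LOAD1 NCORES -> prints a rough verdict.
# Load > cores means runnable (or D-state) tasks exceed capacity;
# it does NOT say which layer is responsible -- use the table's tools.
classify_load() {
    load=$1; cores=$2
    # awk does the floating-point comparison that plain sh cannot
    if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
        echo "saturated"
    else
        echo "headroom"
    fi
}

# Live check: current 1-minute load vs. online cores
classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Treat the verdict only as a cue to start the layered triage below, never as a conclusion.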
2. Observability Hierarchy (Start Broad → Go Deep)
Follow this decision tree every time:
- Quick system-wide snapshot (30 seconds)
- uptime → load average vs number of cores
- vmstat 1 10 → procs (r/b), cpu (%us/sy/id/wa/st), memory (si/so), system (in/cs)
- iostat -x 1 5 → %util, r_await/w_await, aqu-sz, r/s, w/s, rrqm/s, wrqm/s (svctm was removed from modern sysstat releases; ignore it in older docs)
- mpstat 1 5 → per-core breakdown (%usr, %soft, %idle, %irq, %steal)
- free -h + cat /proc/meminfo | grep -i commit → Committed_AS vs total RAM
- Which resource is saturated? → pick the dominant layer
- Which process / kernel thread is responsible?
- top -c -1 / htop (press F2 → Columns → add DELAY, SWAP, etc.)
- pidstat -dulh 1 (I/O, CPU, memory per process)
- perf top (live kernel/userspace samples)
- Why is that process/kernel spending time there?
- CPU: perf record -g -p PID -- sleep 10, then perf report
- I/O: iotop, blktrace, bpftrace 'tracepoint:block:* { @[comm] = count(); }'
- Network: tcpdump, ss -m (memory usage per socket), nstat -az | grep -i drop
- Syscall level: strace -c -p PID or bpftrace 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
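The Committed_AS check from the snapshot step can be made concrete with a short sh/awk sketch; the helper name is made up, but the field names are the real /proc/meminfo keys:

```shell
#!/bin/sh
# commit_ratio: reads /proc/meminfo-formatted text on stdin and prints
# Committed_AS / MemTotal. A ratio well above 1.0 means heavy
# overcommit and a higher risk of the OOM killer firing under pressure.
commit_ratio() {
    awk '/^MemTotal:/     { total = $2 }
         /^Committed_AS:/ { committed = $2 }
         END              { printf "%.2f\n", committed / total }'
}

# Live usage against the running kernel:
commit_ratio < /proc/meminfo
```

Taking input on stdin keeps the parsing logic testable against canned samples, separate from the live /proc read.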
3. Common Real-World Patterns (2026 Era)
| Symptom | Most Likely Root Causes (ordered by frequency) | Key Diagnostic Commands / Indicators |
|---|---|---|
| High load, but mostly idle CPU | Uninterruptible sleep (D state) due to slow I/O, NFS hang, bad multipath, LVM snapshot, mdadm resync | ps -eo state,pid,comm \| grep '^D' |
| High %softirq / ksoftirqd | Network flood, high packet rate, bad driver, misconfigured qdisc, iptables/nftables rules in hot path | mpstat -P ALL 1, cat /proc/softirqs, nstat -az |
| High iowait / await > 10–20 ms | Saturated SSD/NVMe (queue depth too high), write amplification, too many small random writes, no write cache | iostat -xdz 1, iotop -o, fio --name=test --rw=randwrite |
| Memory pressure (kswapd0 high CPU) | Overcommitment, huge anonymous memory (Java, databases), too aggressive swappiness, no zswap/zram | vmstat 1, sar -r, grep Anon /proc/meminfo, swapon -s |
| TCP retransmits / connection timeouts | Buffer exhaustion, congestion control mismatch, middlebox issues, asymmetric routing, SYN flood remnants | ss -s, nstat -az \| grep -i retrans |
| High context switches (> 10k/sec/core) | Too many threads, short-lived processes, excessive epoll_wait/polling, futex contention | vmstat 1 (cs column), perf record -e context-switches … |
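For the first row of the table (high load, mostly idle CPU), a minimal sketch of the D-state check, written as a testable filter over `ps` output (the function name is made up):

```shell
#!/bin/sh
# filter_d_state: keep only tasks in uninterruptible sleep (state D)
# from `ps -eo state,pid,comm` output. A persistent D-state population
# alongside idle CPUs points at slow I/O, NFS hangs, or array resyncs
# rather than CPU load.
filter_d_state() {
    awk 'NR > 1 && $1 ~ /^D/'   # skip the header, keep D (and D+) rows
}

# Live usage: sustained non-empty output here explains the symptom
# without touching any tuning knob.
ps -eo state,pid,comm | filter_d_state
```

A one-off hit is normal (e.g. a journal flush); it becomes diagnostic only when the same tasks stay in D state across repeated samples.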
4. Modern Observability Stack Recommendations (Debian 13+)
Minimal effective stack (low overhead):
- sar (sysstat package) → historical data (enable in /etc/default/sysstat)
- prometheus-node-exporter + textfile collector → metrics every 15–60 s
- bpftrace or bcc-tools → dynamic tracing without recompilation
- perf → sampling profiler (kernel + userspace)
- journalctl -u <service> + systemd-analyze blame → service startup/boot latency
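As a sketch of the sar setup mentioned above, assuming the stock Debian sysstat package layout (the /etc/default/sysstat toggle and the saDD file naming may differ across releases):

```shell
# Enable periodic sadc collection so `sar` has history to query
sudo sed -i 's/^ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Example query: yesterday's per-device I/O history
sar -d -f /var/log/sysstat/sa"$(date -d yesterday +%d)"
```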
For serious production environments:
- node_exporter + blackbox_exporter + process-exporter
- Grafana + Loki (logs) + Tempo (traces)
- eBPF-based tools: pixie, deepflow, cilium Hubble (if using Kubernetes)
5. Troubleshooting Discipline (Checklist Mindset)
- Reproduce & baseline — can you reliably trigger the issue? Capture metrics during good vs bad periods.
- Eliminate noise — disable cron jobs, backups, log rotation during testing.
- Confirm saturation — is any resource >80–90% utilized for sustained periods?
- Correlate time series — did CPU spike first, or I/O, or memory pressure?
- Look at tail latency — averages hide the problem (use histograms: perf, bpftrace, application metrics)
- Change one variable — test sysctl, mount options, application config — measure before/after
- Consider kernel version — newer kernels (backports) often fix scheduler, TCP, NVMe issues
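The “change one variable” step can be sketched as a before/after wrapper around a single metric. `retrans_segs` reads the kernel's cumulative TCP retransmit counter; the commented-out sysctl is only an illustrative knob, not a recommendation:

```shell
#!/bin/sh
# retrans_segs: cumulative RetransSegs from /proc/net/snmp (the second
# Tcp: line holds the values; field 13 is RetransSegs).
retrans_segs() {
    awk '/^Tcp:/ { v = $13 } END { print v }' /proc/net/snmp
}

# delta BEFORE AFTER -> difference over the measurement window
delta() {
    echo $(( $2 - $1 ))
}

before=$(retrans_segs)
# ... apply exactly ONE change here, e.g.:
# sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sleep 2
after=$(retrans_segs)
echo "retransmits during window: $(delta "$before" "$after")"
```

The same skeleton works for any cumulative counter (drops, context switches, major faults): sample, change one thing, sample again, compare deltas from matched windows.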
Final Mental Model
Performance work is rarely about “finding the magic tunable”. It is about locating where latency accumulates in the stack and removing or mitigating the dominant contributor.
In 2026, most Debian performance issues are still:
- I/O saturation on database / log-heavy workloads
- Network stack pressure from high connection churn
- Memory pressure from JVM / in-memory caches
- Scheduler thrashing from too many threads or misbehaving cgroups
Learn to read vmstat, iostat -x, mpstat, sar, and perf fluently — these five views cover ~90% of real incidents. Once you see the pattern in the numbers, the root cause usually becomes obvious.
Monitor continuously, troubleshoot methodically, tune sparingly.