Monitoring and Troubleshooting Debian System Performance Issues
This guide focuses on conceptual understanding and systematic reasoning for diagnosing performance problems on Debian servers (Debian 13 “Trixie” and the kernels current as of 2026). The goal is to teach you how to think about performance rather than just run a fixed set of commands.
1. The Fundamental Performance Layers (Where Time Is Spent)
Every performance issue eventually maps to one or more of these layers:
| Layer | Primary Resource(s) | Typical Symptoms | First Observability Tools | Kernel / Subsystem Involved |
|---|---|---|---|---|
| CPU | Cores, scheduler | High %user, %sys, high context switches | top/htop, mpstat, perf, pidstat | scheduler (EEVDF), softirq, steal time |
| Memory | RAM, swap, page cache | High swap usage, kswapd CPU, OOM killer | free -h, vmstat, sar -r, /proc/meminfo | mm subsystem, reclaim, compaction |
| I/O (block) | Disk/SSD/NVMe | High iowait, long service times, queue depth | iostat -x, iotop, blktrace, sar -d | block layer, I/O scheduler (mq-deadline/none) |
| Network stack | NIC, TCP/IP, sockets | Retransmits, drops, high softirq, queue full | ss -s, nstat -az, sar -n DEV/ETCP, ethtool -S | netfilter, tcp_congestion, qdisc |
| Application / userspace | Threads, locks, syscalls | High latency despite low system usage | strace -c, perf record, bpftrace, slow logs | syscall interface, futex, epoll |
Golden rule: Never tune or blame one layer until you have confirmed it is the actual bottleneck using layered observability.
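Before going deeper, the golden rule's very first check (is load actually above capacity?) can be sketched as a tiny POSIX-sh helper. The function name and the bare load-versus-cores comparison are illustrative simplifications, since load alone cannot tell CPU saturation from D-state I/O waiters:

```shell
#!/bin/sh
# classify_load LOAD1 NCORES -> prints a rough verdict.
# Load > cores means runnable (or D-state) tasks exceed capacity;
# it does NOT say which layer is responsible -- use the table's tools.
classify_load() {
    load=$1; cores=$2
    # awk does the floating-point comparison that plain sh cannot
    if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
        echo "saturated"
    else
        echo "headroom"
    fi
}

# Live check: current 1-minute load vs. online cores
classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Treat the verdict only as a cue to start the layered triage below, never as a conclusion.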
2. Observability Hierarchy (Start Broad → Go Deep)
Follow this decision tree every time:
- Quick system-wide snapshot (30 seconds)
- uptime → load average vs number of cores
- vmstat 1 10 → procs (r/b), cpu (%us/sy/id/wa/st), memory (si/so), system (in/cs)
- iostat -x 1 5 → %util, r_await/w_await, aqu-sz, r/s, w/s, rrqm/s, wrqm/s (svctm was removed from modern sysstat releases; ignore it in older docs)
- mpstat 1 5 → per-core breakdown (%usr, %soft, %idle, %irq, %steal)
- free -h + cat /proc/meminfo | grep -i commit → Committed_AS vs total RAM
- Which resource is saturated? → pick the dominant layer
- Which process / kernel thread is responsible?
- top -c -1 / htop (press F2 → Columns → add DELAY, SWAP, etc.)
- pidstat -dulh 1 (I/O, CPU, memory per process)
- perf top (live kernel/userspace samples)
- Why is that process/kernel spending time there?
- CPU: perf record -g -p PID -- sleep 10, then perf report
- I/O: iotop, blktrace, bpftrace 'tracepoint:block:* { @[comm] = count(); }'
- Network: tcpdump, ss -m (memory usage per socket), nstat -az | grep -i drop
- Syscall level: strace -c -p PID or bpftrace 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
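The Committed_AS check from the snapshot step can be made concrete with a short sh/awk sketch; the helper name is made up, but the field names are the real /proc/meminfo keys:

```shell
#!/bin/sh
# commit_ratio: reads /proc/meminfo-formatted text on stdin and prints
# Committed_AS / MemTotal. A ratio well above 1.0 means heavy
# overcommit and a higher risk of the OOM killer firing under pressure.
commit_ratio() {
    awk '/^MemTotal:/     { total = $2 }
         /^Committed_AS:/ { committed = $2 }
         END              { printf "%.2f\n", committed / total }'
}

# Live usage against the running kernel:
commit_ratio < /proc/meminfo
```

Taking input on stdin keeps the parsing logic testable against canned samples, separate from the live /proc read.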
3. Common Real-World Patterns (2026 Era)
| Symptom | Most Likely Root Causes (ordered by frequency) | Key Diagnostic Commands / Indicators |
|---|---|---|
| High load, but mostly idle CPU | Uninterruptible sleep (D state) due to slow I/O, NFS hang, bad multipath, LVM snapshot, mdadm resync | ps -eo state,pid,comm \| grep '^D' |
| High %softirq / ksoftirqd | Network flood, high packet rate, bad driver, misconfigured qdisc, iptables/nftables rules in hot path | mpstat -P ALL 1, cat /proc/softirqs, nstat -az |
| High iowait / await > 10–20 ms | Saturated SSD/NVMe (queue depth too high), write amplification, too many small random writes, no write cache | iostat -xdz 1, iotop -o, fio --name=test --rw=randwrite |
| Memory pressure (kswapd0 high CPU) | Overcommitment, huge anonymous memory (Java, databases), too aggressive swappiness, no zswap/zram | vmstat 1, sar -r, grep Anon /proc/meminfo, swapon -s |
| TCP retransmits / connection timeouts | Buffer exhaustion, congestion control mismatch, middlebox issues, asymmetric routing, SYN flood remnants | ss -s, nstat -az \| grep -i retrans |
| High context switches (> 10k/sec/core) | Too many threads, short-lived processes, excessive epoll_wait/polling, futex contention | vmstat 1 (cs column), perf record -e context-switches … |
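For the first row of the table (high load, mostly idle CPU), a minimal sketch of the D-state check, written as a testable filter over `ps` output (the function name is made up):

```shell
#!/bin/sh
# filter_d_state: keep only tasks in uninterruptible sleep (state D)
# from `ps -eo state,pid,comm` output. A persistent D-state population
# alongside idle CPUs points at slow I/O, NFS hangs, or array resyncs
# rather than CPU load.
filter_d_state() {
    awk 'NR > 1 && $1 ~ /^D/'   # skip the header, keep D (and D+) rows
}

# Live usage: sustained non-empty output here explains the symptom
# without touching any tuning knob.
ps -eo state,pid,comm | filter_d_state
```

A one-off hit is normal (e.g. a journal flush); it becomes diagnostic only when the same tasks stay in D state across repeated samples.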
4. Modern Observability Stack Recommendations (Debian 13+)
Minimal effective stack (low overhead):
- sar (sysstat package) → historical data (enable in /etc/default/sysstat)
- prometheus-node-exporter + textfile collector → metrics every 15–60 s
- bpftrace or bcc-tools → dynamic tracing without recompilation
- perf → sampling profiler (kernel + userspace)
- journalctl -u <service> + systemd-analyze blame → service startup/boot latency
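As a sketch of the sar setup mentioned above, assuming the stock Debian sysstat package layout (the /etc/default/sysstat toggle and the saDD file naming may differ across releases):

```shell
# Enable periodic sadc collection so `sar` has history to query
sudo sed -i 's/^ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Example query: yesterday's per-device I/O history
sar -d -f /var/log/sysstat/sa"$(date -d yesterday +%d)"
```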
For serious production environments:
- node_exporter + blackbox_exporter + process-exporter
- Grafana + Loki (logs) + Tempo (traces)
- eBPF-based tools: pixie, deepflow, cilium Hubble (if using Kubernetes)
5. Troubleshooting Discipline (Checklist Mindset)
- Reproduce & baseline — can you reliably trigger the issue? Capture metrics during good vs bad periods.
- Eliminate noise — disable cron jobs, backups, log rotation during testing.
- Confirm saturation — is any resource >80–90% utilized for sustained periods?
- Correlate time series — did CPU spike first, or I/O, or memory pressure?
- Look at tail latency — averages hide the problem (use histograms: perf, bpftrace, application metrics)
- Change one variable — test sysctl, mount options, application config — measure before/after
- Consider kernel version — newer kernels (backports) often fix scheduler, TCP, NVMe issues
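The “change one variable” step can be sketched as a before/after wrapper around a single metric. `retrans_segs` reads the kernel's cumulative TCP retransmit counter; the commented-out sysctl is only an illustrative knob, not a recommendation:

```shell
#!/bin/sh
# retrans_segs: cumulative RetransSegs from /proc/net/snmp (the second
# Tcp: line holds the values; field 13 is RetransSegs).
retrans_segs() {
    awk '/^Tcp:/ { v = $13 } END { print v }' /proc/net/snmp
}

# delta BEFORE AFTER -> difference over the measurement window
delta() {
    echo $(( $2 - $1 ))
}

before=$(retrans_segs)
# ... apply exactly ONE change here, e.g.:
# sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sleep 2
after=$(retrans_segs)
echo "retransmits during window: $(delta "$before" "$after")"
```

The same skeleton works for any cumulative counter (drops, context switches, major faults): sample, change one thing, sample again, compare deltas from matched windows.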
Final Mental Model
Performance work is rarely about “finding the magic tunable”. It is about locating where latency accumulates in the stack and removing or mitigating the dominant contributor.
In 2026, most Debian performance issues are still:
- I/O saturation on database / log-heavy workloads
- Network stack pressure from high connection churn
- Memory pressure from JVM / in-memory caches
- Scheduler thrashing from too many threads or misbehaving cgroups
Learn to read vmstat, iostat -x, mpstat, sar, and perf fluently — these five views cover ~90% of real incidents. Once you see the pattern in the numbers, the root cause usually becomes obvious.
Monitor continuously, troubleshoot methodically, tune sparingly.