Monitoring and Troubleshooting Debian System Performance Issues

This guide focuses on conceptual understanding and systematic reasoning for diagnosing performance problems on Debian servers (Debian 13 “Trixie” and newer, with the kernels shipping in 2026). The goal is to teach you how to think about performance rather than just run a fixed set of commands.

1. The Fundamental Performance Layers (Where Time Is Spent)

Every performance issue eventually maps to one or more of these layers:

| Layer | Primary Resource(s) | Typical Symptoms | First Observability Tools | Kernel / Subsystem Involved |
|---|---|---|---|---|
| CPU | Cores, scheduler | High %user, %sys, high context switches | top/htop, mpstat, perf, pidstat | scheduler (EEVDF), softirq, steal time |
| Memory | RAM, swap, page cache | High swap usage, kswapd CPU, OOM killer | free -h, vmstat, sar -r, /proc/meminfo | mm subsystem, reclaim, compaction |
| I/O (block) | Disk/SSD/NVMe | High iowait, long service times, queue depth | iostat -x, iotop, blktrace, sar -d | block layer, I/O scheduler (mq-deadline/none) |
| Network stack | NIC, TCP/IP, sockets | Retransmits, drops, high softirq, queue full | ss -s, nstat -az, sar -n DEV/ETCP, ethtool -S | netfilter, tcp_congestion, qdisc |
| Application / userspace | Threads, locks, syscalls | High latency despite low system usage | strace -c, perf record, bpftrace, slow logs | syscall interface, futex, epoll |

Golden rule: Never tune or blame one layer until you have confirmed it is the actual bottleneck using layered observability.
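A quick, layer-agnostic way to confirm which layer is actually stalled is the kernel's pressure stall information (PSI), available on the kernels Debian 13 ships. A minimal sketch, assuming PSI has not been disabled with `psi=0` on the kernel command line:

```shell
#!/bin/sh
# Pressure Stall Information: the fraction of time tasks stalled
# waiting on each resource. A "some" avg10 above a few percent
# (or any nonzero "full") points at the bottlenecked layer.
for res in cpu memory io; do
    f=/proc/pressure/$res
    [ -r "$f" ] || continue   # PSI compiled out or disabled
    echo "== $res =="
    cat "$f"
done
```

Each file reports `some` (at least one task stalled) and, for memory and I/O, `full` (all non-idle tasks stalled), which is the more alarming signal.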

2. Observability Hierarchy (Start Broad → Go Deep)

Follow this decision tree every time:

  1. Quick system-wide snapshot (30 seconds)
    • uptime → load average vs number of cores
    • vmstat 1 10 → procs (r/b), cpu (%us/sy/id/wa/st), memory (si/so), system (in/cs)
    • iostat -x 1 5 → %util, r_await/w_await, aqu-sz, r/s, w/s, rrqm/s, wrqm/s (svctm was dropped from modern sysstat)
    • mpstat 1 5 → per-core breakdown (%usr, %soft, %idle, %irq, %steal)
    • free -h + grep -i commit /proc/meminfo → Committed_AS vs CommitLimit
  2. Which resource is saturated? → pick the dominant layer
  3. Which process / kernel thread is responsible?
    • top -c -1 / htop (press F2 → Columns → add DELAY, SWAP, etc.)
    • pidstat -u -r -d -h 1 (CPU, memory, I/O per process)
    • perf top (live kernel/userspace samples)
  4. Why is that process/kernel spending time there?
    • CPU: perf record -g -p PID -- sleep 10, then perf report
    • I/O: iotop, blktrace, bpftrace 'tracepoint:block:* { @[comm] = count(); }'
    • Network: tcpdump, ss -m (memory usage per socket), nstat -az | grep -i drop
    • Syscall level: strace -c -p PID or bpftrace 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
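The step-1 snapshot can be scripted. The sketch below uses only POSIX shell and the /proc interface, so it works even before sysstat is installed; vmstat, iostat, and mpstat report the same counters with nicer formatting:

```shell
#!/bin/sh
# Minimal 30-second snapshot from /proc alone.
cores=$(nproc)
read load1 load5 load15 _rest < /proc/loadavg
echo "load: 1m=$load1 5m=$load5 15m=$load15 on $cores core(s)"

# Memory headroom and overcommit: Committed_AS far above MemTotal
# (or approaching CommitLimit) signals overcommit risk.
awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree|Committed_AS|CommitLimit):/' /proc/meminfo

# Context switches: sample the since-boot counter twice, one second
# apart, to turn it into a rate (the same number vmstat's cs shows).
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```

A 1m load well above the core count combined with swap-in activity or a high context-switch rate tells you which layer to drill into next.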

3. Common Real-World Patterns (2026 Era)

| Symptom | Most Likely Root Causes (ordered by frequency) | Key Diagnostic Commands / Indicators |
|---|---|---|
| High load, but mostly idle CPU | Uninterruptible sleep (D state) due to slow I/O, NFS hang, bad multipath, LVM snapshot, mdadm resync | `ps -eo state,comm,pid \| grep ^D` |
| High %softirq / ksoftirqd | Network flood, high packet rate, bad driver, misconfigured qdisc, iptables/nftables rules in hot path | mpstat -P ALL 1, cat /proc/softirqs, nstat -az |
| High iowait / await > 10–20 ms | Saturated SSD/NVMe (queue depth too high), write amplification, too many small random writes, no write cache | iostat -xdz 1, iotop -o, fio --name=test --rw=randwrite --bs=4k --size=256M |
| Memory pressure (kswapd0 high CPU) | Overcommit, huge anonymous memory (Java, databases), too-aggressive swappiness, no zswap/zram | vmstat 1, sar -r, grep Anon /proc/meminfo, swapon --show |
| TCP retransmits / connection timeouts | Buffer exhaustion, congestion control mismatch, middlebox issues, asymmetric routing, SYN flood remnants | ss -s, `nstat -az \| grep -i retrans` |
| High context switches (> 10k/sec/core) | Too many threads, short-lived processes, excessive epoll_wait/polling, futex contention | vmstat 1 (cs column), perf record -e context-switches … |
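The first pattern above — load driven by D-state tasks — can also be checked without ps by reading task state straight from /proc. A sketch, assuming a standard Linux /proc layout:

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D). The state letter is
# the field right after the ") " that closes the comm field, which
# sidesteps process names containing spaces or parentheses.
found=0
for d in /proc/[0-9]*; do
    line=$(cat "$d/stat" 2>/dev/null) || continue   # task may have exited
    state=${line##*) }      # strip everything up to the last ") "
    state=${state%% *}      # first remaining field = state letter
    if [ "$state" = "D" ]; then
        found=$((found + 1))
        echo "D-state: pid=${d#/proc/} comm=$(cat "$d/comm" 2>/dev/null)"
        # With root, the kernel stack shows WHERE the task is blocked:
        cat "$d/stack" 2>/dev/null
    fi
done
echo "$found task(s) in uninterruptible sleep"
```

A handful of D-state tasks that all share a stack in the same filesystem or block driver is a strong hint that the hang, not the CPU, is inflating the load average.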

4. Modern Observability Stack Recommendations (Debian 13+)

Minimal effective stack (low overhead):

  • sar (sysstat package) → historical data (enable in /etc/default/sysstat)
  • prometheus-node-exporter + textfile collector → metrics every 15–60 s
  • bpftrace or bcc-tools → dynamic tracing without recompilation
  • perf → sampling profiler (kernel + userspace)
  • journalctl -u <service> + systemd-analyze blame → service startup/boot latency
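The textfile collector mentioned above lets any cron job or script feed custom metrics into prometheus-node-exporter. A sketch: the directory is a common Debian location (confirm the `--collector.textfile.directory` flag in /etc/default/prometheus-node-exporter), and the metric name is a hypothetical example:

```shell
#!/bin/sh
# Export one custom gauge through node-exporter's textfile collector.
# TEXTFILE_DIR is an assumed Debian path; node_dirty_pages_kilobytes
# is a made-up metric name for illustration.
TEXTFILE_DIR=${TEXTFILE_DIR:-/var/lib/prometheus/node-exporter}
{ [ -d "$TEXTFILE_DIR" ] && [ -w "$TEXTFILE_DIR" ]; } || TEXTFILE_DIR=$(mktemp -d)  # demo fallback

dirty_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)

# Write to a temp file, then rename: the collector must never read
# a half-written .prom file.
tmp=$(mktemp)
cat > "$tmp" <<EOF
# HELP node_dirty_pages_kilobytes Dirty page cache awaiting writeback.
# TYPE node_dirty_pages_kilobytes gauge
node_dirty_pages_kilobytes $dirty_kb
EOF
mv "$tmp" "$TEXTFILE_DIR/dirty_pages.prom"
echo "wrote $TEXTFILE_DIR/dirty_pages.prom"
```

The atomic rename is the important part of the pattern; everything else is just formatting the Prometheus text exposition format.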

For serious production environments:

  • node_exporter + blackbox_exporter + process-exporter
  • Grafana + Loki (logs) + Tempo (traces)
  • eBPF-based tools: pixie, deepflow, cilium Hubble (if using Kubernetes)

5. Troubleshooting Discipline (Checklist Mindset)

  1. Reproduce & baseline — can you reliably trigger the issue? Capture metrics during good vs bad periods.
  2. Eliminate noise — disable cron jobs, backups, log rotation during testing.
  3. Confirm saturation — is any resource >80–90% utilized for sustained periods?
  4. Correlate time series — did CPU spike first, or I/O, or memory pressure?
  5. Look at tail latency — averages hide the problem (use histograms: perf, bpftrace, application metrics)
  6. Change one variable — test sysctl, mount options, application config — measure before/after
  7. Consider kernel version — newer kernels (backports) often fix scheduler, TCP, NVMe issues
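Steps 1 and 6 — baseline, change one variable, measure again — can be as simple as snapshotting the same counters before and after a change and diffing the two files. A minimal sketch (file names and the chosen counters are arbitrary):

```shell
#!/bin/sh
# snapshot.sh — capture comparable metrics around ONE config change.
# Usage: sh snapshot.sh before   ...apply the change...   sh snapshot.sh after
label=${1:-before}
out="perf-$label.txt"
{
    date -u
    echo "loadavg: $(cat /proc/loadavg)"
    grep -E '^(Committed_AS|Dirty|SwapFree):' /proc/meminfo
    awk '/^ctxt/ {print "ctxt", $2}' /proc/stat
} > "$out"
echo "wrote $out"
# Afterwards: diff perf-before.txt perf-after.txt
```

Keeping the snapshot identical on both sides is what makes the diff meaningful; change the counters you capture, and you are back to comparing apples with oranges.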

Final Mental Model

Performance work is rarely about finding a single magic tunable. It is about locating where latency accumulates in the stack and removing or mitigating the dominant contributor.

In 2026, most Debian performance issues are still:

  • I/O saturation on database / log-heavy workloads
  • Network stack pressure from high connection churn
  • Memory pressure from JVM / in-memory caches
  • Scheduler thrashing from too many threads or misbehaving cgroups

Learn to read vmstat, iostat -x, mpstat, sar, and perf fluently — these five views cover ~90% of real incidents. Once you see the pattern in the numbers, the root cause usually becomes obvious.

Monitor continuously, troubleshoot methodically, tune sparingly.
