
Ubuntu Server Network Troubleshooting – Deep Technical Focus
Network issues on Ubuntu Server often stem from subtle interactions between the kernel networking stack, systemd-resolved, Netplan's declarative model, the systemd-networkd backend, and modern hardening defaults. This guide prioritizes conceptual understanding and diagnostic reasoning over configuration snippets, targeting experienced administrators working on 24.04 LTS and later releases.
1. Understanding the Modern Ubuntu Networking Stack Layers
- Kernel netdev layer: physical/link state, carrier detection, ethtool-negotiated speed/duplex, offload features (TSO, GSO, GRO, checksum), RSS/indirection table, interrupt coalescing and NAPI weight. Misbehavior here usually manifests as NO-CARRIER, link flapping, or extremely poor throughput despite correct IP configuration.
- systemd-networkd: manages link configuration, the DHCPv4/v6 client, static addressing, routes (including policy routing), neighbor tables, and link-local addressing. It operates asynchronously and can fail silently if carrier never appears or the DHCP server is unreachable during critical boot phases.
- Netplan: purely a frontend; it translates YAML into backend-specific files (/run/systemd/network/*.network, *.netdev). Critical behaviors:
  – renderer mismatch (networkd vs NetworkManager)
  – optional: true vs false impacting systemd-networkd-wait-online
  – match: clauses using predictable interface names (enpXsY, ens3) vs legacy eth0
  – activation-mode (manual, off) controlling when links are brought up
- systemd-resolved: local stub resolver (127.0.0.53:53), per-link DNS configuration, DNSSEC validation, split DNS via search domains and routing domains, and DNS-over-TLS (DoT) support. Most name-resolution failures trace back to this daemon rather than /etc/resolv.conf (which is now a controlled symlink). A quick per-layer inventory is sketched after this list.
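The following sketch walks those layers top to bottom on a networkd-rendered machine; enp1s0 is a placeholder interface name, so substitute your own.

```bash
# Per-layer inventory sketch (enp1s0 is a placeholder interface name)

# Kernel/link layer: carrier, negotiated speed/duplex, offload flags
ip -br link show
ethtool enp1s0
ethtool -k enp1s0 | grep -E 'tcp-segmentation|generic-(segmentation|receive)'

# systemd-networkd's view of the link: state, DHCP lease, DNS it pushed
networkctl status enp1s0

# What Netplan actually generated for the backend
sudo netplan get
ls -l /run/systemd/network/

# systemd-resolved: per-link DNS servers, DNSSEC and DoT state
resolvectl status
```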
2. Systematic Layered Diagnosis Approach
Layer 1 – Physical & Data-Link
- Carrier sense failure: check kernel link transitions, e.g. dmesg | grep -i 'link is'
- Speed/duplex negotiation problems: ethtool reports different speed/duplex on the two link partners → autoneg disabled on one side or cable category mismatch
- Offload conflicts: TSO/GSO/GRO bugs with certain NIC drivers (igc, r8169, mlx5) → disable via ethtool -K
- Multi-queue / RSS imbalance: single flow pinned to one queue → poor multi-core scaling (triage sketch below)
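A minimal Layer 1 triage sketch, again assuming enp1s0; note that the ethtool -K change is live-only and does not persist across reboots.

```bash
# Carrier state and kernel link transitions
ip -br link show enp1s0          # NO-CARRIER vs. LOWER_UP
dmesg | grep -i 'link is'        # driver "Link is Up/Down" messages

# Negotiated speed/duplex and autonegotiation state
ethtool enp1s0

# Rule out offload bugs by disabling TSO/GSO/GRO temporarily
sudo ethtool -K enp1s0 tso off gso off gro off

# Queue layout and IRQ spread across CPUs (RSS imbalance shows up here)
ethtool -l enp1s0
grep enp1s0 /proc/interrupts
```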
Layer 2/3 – Addressing & Routing
- DHCP transaction visibility: journalctl -u systemd-networkd -g DHCP. Look for the DISCOVER → OFFER → REQUEST → ACK sequence, lease-renewal failures, and NAK responses (sketch after this list)
- Static IP misapplication: conflicting addresses from cloud-init, old ifupdown configs, or duplicate netplan files
- Route priority & metric conflicts: multiple default routes with equal metrics → unpredictable forwarding
- Neighbor (ARP/ND) resolution failures: INCOMPLETE or FAILED entries in ip neigh show → L2 reachability problems, proxy-ARP misconfiguration, or a firewall dropping ARP/ND traffic
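The addressing and routing checks condense into a short sketch; paths shown are examples.

```bash
# DHCP conversation as networkd saw it this boot (DISCOVER/OFFER/REQUEST/ACK, NAKs, renewals)
journalctl -b -u systemd-networkd -g DHCP

# Addresses actually applied vs. what you expected
ip -c addr show

# Competing default routes and their metrics
ip route show default
ip -6 route show default

# Neighbor table: INCOMPLETE/FAILED entries point at L2 or firewall problems
ip neigh show

# Overlapping configuration sources (cloud-init, leftover ifupdown, duplicate netplan files)
ls -l /etc/netplan/ /run/netplan/ 2>/dev/null
```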
Layer 3.5 – Firewall & NAT
- nftables (default since 22.04) vs legacy iptables: rules may drop packets early in the input/forward chains
- ufw status verbose shows the effective policy but not the actual nftables rules; nft list ruleset is authoritative (see the sketch below)
- conntrack table exhaustion under high connection rate → nf_conntrack_count near nf_conntrack_max
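A short firewall/conntrack sketch; the sysctl value at the end is an illustrative number, not a recommendation.

```bash
# The kernel's authoritative ruleset is nftables, regardless of what ufw prints
sudo nft list ruleset | less

# conntrack pressure (these files appear once the nf_conntrack module is loaded)
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# If count keeps brushing against max, raise the ceiling (example value)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
```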
Layer 4+ – Transport & Application
- TCP congestion control mismatch (BBR vs CUBIC) on high BDP paths
- ECN blackholing → fallback to non-ECN slows recovery
- SYN cookies triggered → indicates listen backlog overflow or spoofed traffic
- Socket buffer pressure → autotuning hits its ceiling → poor goodput (see the sketch below)
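A transport-layer sketch; 192.0.2.10 is a placeholder peer address.

```bash
# Congestion control in use and what the kernel has available
sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_available_congestion_control

# Per-connection TCP internals: cwnd, rtt, retransmits, pacing rate
ss -tin dst 192.0.2.10

# Kernel-wide counters: SYN cookies, retransmissions, ECN behaviour
nstat -az | grep -Ei 'syncookie|retrans|ecn'

# Socket buffer autotuning ceilings
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max
```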
3. High-Impact Diagnostic Techniques
- Socket-level visibility without packet capture: ss -t dst 8.8.8.8 → list TCP sockets talking to a given destination (note that ss -K forcibly closes matching sockets, so don't use it for inspection); ss -s → summary of socket states (many TIME_WAIT = ephemeral-port exhaustion risk)
- Per-interface statistics deep dive: ip -s -s link show dev enp1s0 → RX/TX errors, drops, overruns, collisions; ethtool -S enp1s0 | grep -i drop → driver-level drops (very common on virtio-net)
- DNS debugging without dig/nslookup: resolvectl query --cache=no google.com → bypass the cache; resolvectl flush-caches and resolvectl statistics → cache hit rate and upstream failures; systemctl kill --signal=SIGUSR1 systemd-resolved → dump the caches and per-server feature state to the journal
- Boot-time network ordering issues: systemd-analyze critical-chain systemd-networkd-wait-online.service. Long delays almost always trace to the wait-online timeout when an interface that may legitimately stay down is not marked optional: true (sketch below)
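For the boot-ordering case specifically, a minimal sketch:

```bash
# Which unit is holding the boot up, and for how long
systemd-analyze critical-chain systemd-networkd-wait-online.service

# Which links are still "configuring" (those are what wait-online blocks on)
networkctl list

# How wait-online is invoked (timeout, --any, interface restrictions)
systemctl cat systemd-networkd-wait-online.service

# Netplan's "optional: true" sets RequiredForOnline in the generated .network files
grep -r RequiredForOnline /run/systemd/network/ 2>/dev/null
```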
4. Frequent Root Causes in Production (2025–2026 Observations)
- Cloud-init + Netplan race → stale /etc/netplan/50-cloud-init.yaml overrides user config (see the sketch after this list)
- systemd-resolved DoT fallback failure on captive portals or broken upstream resolvers
- Predictable interface naming mismatch → configs reference eth0 while udev names the device enpXsY (or the reverse after net.ifnames=0 is added or removed)
- Kernel driver regression after point release (common with igc, r8125, ixgbe)
- MTU mismatch on VXLAN/Geneve/GRE tunnels or jumbo-frame enabled switches
- nf_conntrack_buckets too small for NAT-heavy workloads → long hash chains → lookup overhead, plus dropped new connections once nf_conntrack_max is reached
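Two of these have well-known mitigations; a sketch, assuming a cloud image (the disable-file name follows cloud-init's documented convention, and 198.51.100.1 is a placeholder remote for the MTU probe).

```bash
# Stop cloud-init from regenerating 50-cloud-init.yaml on every boot,
# then manage addressing in your own netplan file
sudo tee /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg <<'EOF'
network: {config: disabled}
EOF
sudo netplan get        # confirm which configuration now wins

# MTU sanity check across a tunnel/jumbo path:
# 1472 bytes of payload + 28 bytes of ICMP/IP headers = 1500; lower -s until it passes
ping -M do -s 1472 -c 3 198.51.100.1
```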
5. Resolution Patterns
- Prefer netplan try over netplan apply during debugging
- Use a match: stanza keyed on macaddress: in netplan (or PermanentMACAddress= in a systemd .link file) when renaming interfaces
- For DNS issues: set global fallback DNS in resolved.conf, disable DNSSEC if upstream strips signatures
- For carrier detection problems: shorten the systemd-networkd-wait-online timeout or mark the link RequiredForOnline=no (optional: true in netplan)
- When performance is poor but connectivity exists: switch congestion control to bbr, raise net.core.somaxconn, and lift the tcp_rmem/tcp_wmem autotuning ceilings (sketches below)
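Minimal sketches for these patterns; file names, DNS servers, and sysctl values are illustrative choices, not recommendations.

```bash
# Test a netplan change with automatic rollback (reverts after 120 s unless confirmed)
sudo netplan try

# Global fallback DNS plus relaxed DNSSEC for upstreams that strip signatures
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo tee /etc/systemd/resolved.conf.d/90-fallback.conf <<'EOF'
[Resolve]
FallbackDNS=9.9.9.9 1.1.1.1
DNSSEC=allow-downgrade
EOF
sudo systemctl restart systemd-resolved

# BBR plus a larger accept backlog and higher buffer ceilings
sudo tee /etc/sysctl.d/90-tcp-tuning.conf <<'EOF'
net.ipv4.tcp_congestion_control = bbr
net.core.somaxconn = 4096
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
EOF
sudo sysctl --system
```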
Start diagnosis at the lowest layer that shows abnormality (link state → addressing → routing → resolution → transport). Once you identify which layer fails, 80% of problems become trivial to resolve.
If you can describe the exact symptom pattern (e.g. “DHCP never completes”, “DNS resolves intermittently”, “link up but no traffic passes”, “boot hangs 2 minutes on network”) and share the output of ip -c addr, resolvectl status, and journalctl -u systemd-networkd -b, along with your environment type (bare metal, KVM, cloud provider), far more targeted reasoning can be applied.