Debugging PMTU blackholes in MariaDB replication

TCP window collapse over WireGuard interfaces

Environment: Ubuntu 22.04.3 LTS, Kernel 5.15.0-88-generic. Hardware: Two bare-metal instances, Intel Xeon E-2388G, 64GB ECC RAM. Network: 1Gbps private switched network. Application Layer: Nginx 1.22.1, PHP-FPM 8.1, MariaDB 10.6.12. Topology: Frontend web node communicating with a dedicated database node via a WireGuard tunnel (wg0).

Routine inspection of the MariaDB processlist on the database node revealed a recurring pattern. Specific UPDATE threads were stalling in the "query end" state for exactly 30 seconds before the server terminated them when net_read_timeout (30 seconds by default) expired.

The frontend node was hosting an event registration portal running the Livesay - Event & Conference WordPress Theme. The application functioned normally for standard page loads, queries returning small result sets, and standard form submissions. The stalled threads exclusively correlated with administrative operations saving complex schedule matrices or speaker profile metadata.

I examined the php-fpm slow log on the frontend node. The stack trace consistently pointed to the mysqli_query execution block. CPU utilization on both nodes remained below 5%. Memory usage was static. There was no I/O wait contention on the NVMe storage subsystem. The issue was isolated to the network transport layer between the frontend and the database.

To diagnose the state of the TCP connections, I utilized ss (socket statistics) on the frontend node, filtering for connections established with the database node's WireGuard IP (10.0.0.2) on port 3306.

# ss -ntpei dst 10.0.0.2:3306

State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port  Process
ESTAB  0       14480   10.0.0.1:45192       10.0.0.2:3306      users:(("php-fpm",pid=1142,fd=14))
     ino:458122 sk:1001 cgroup:/system.slice/php8.1-fpm.service <->
     skmem:(r0,rb87380,t0,tb212992,f0,w14480,o0,bl0,d0) ts sack bbr wscale:7,7 rto:210 rtt:1.2/0.4
     ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:4202 bytes_received:3118
     segs_out:15 segs_in:12 data_segs_out:8 data_segs_in:5 bbr:(bw:11.2Mbps,mrtt:1.1,pacing_gain:2.88,cwnd_gain:2.88)
     send 96.5Mbps lastsnd:12000 lastrcv:12010 lastack:12010 pacing_rate 31.9Mbps delivery_rate 11.2Mbps
     busy:120ms unacked:10 retrans:0/4 lost:0 rcv_space:14480 rcv_ssthresh:64089 minrtt:1.1

The output contains several critical metrics indicating a transport stall. The Send-Q (send queue) holds 14,480 bytes. This means the application has written that data to the socket buffer, but the kernel cannot release it until it receives TCP acknowledgments (ACKs) from the remote peer.

The unacked:10 value indicates 10 in-flight TCP segments awaiting acknowledgment. In retrans:0/4, the first figure is the number of segments currently being retransmitted and the second is the total number of retransmissions on the connection: four so far. lastsnd:12000 indicates 12,000 ms have elapsed since the kernel last transmitted on this socket. The connection is effectively deadlocked.
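A quick consistency check on those figures: a fully stalled send queue should hold exactly unacked × mss bytes, which is precisely what ss reports here.

```shell
# Cross-check the stall arithmetic from the ss output above:
# a dead send queue holds exactly unacked * mss bytes.
mss=1448       # from "mss:1448"
unacked=10     # from "unacked:10"
echo $((mss * unacked))   # prints 14480 -- matching Send-Q
```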

I initiated a packet capture on both the physical interface (eth1) and the virtual interface (wg0) of the frontend node using tcpdump. I triggered the application event to reproduce the stall.

# tcpdump -i wg0 -nn -S -vvv host 10.0.0.2 and port 3306 -w /tmp/wg0_trace.pcap
# tcpdump -r /tmp/wg0_trace.pcap

14:01:23.100102 IP (tos 0x0, ttl 64, id 54321, offset 0, flags [DF], proto TCP (6), length 1500)
    10.0.0.1.45192 > 10.0.0.2.3306: Flags [P.], cksum 0x1a2b (correct), seq 1000000000:1000001448, ack 2000000000, win 502, options [nop,nop,TS val 123456 ecr 654321], length 1448
14:01:23.310150 IP (tos 0x0, ttl 64, id 54322, offset 0, flags [DF], proto TCP (6), length 1500)
    10.0.0.1.45192 > 10.0.0.2.3306: Flags [P.], cksum 0x1a2b (correct), seq 1000000000:1000001448, ack 2000000000, win 502, options [nop,nop,TS val 123666 ecr 654321], length 1500
14:01:23.730210 IP (tos 0x0, ttl 64, id 54323, offset 0, flags [DF], proto TCP (6), length 1500)
    10.0.0.1.45192 > 10.0.0.2.3306: Flags [P.], cksum 0x1a2b (correct), seq 1000000000:1000001448, ack 2000000000, win 502, options [nop,nop,TS val 124086 ecr 654321], length 1448

The trace on wg0 shows the frontend node (10.0.0.1) transmitting a TCP segment carrying 1448 bytes of payload, with the DF (Don't Fragment) bit set in the IP header. The node receives no ACK. Instead, it enters an exponential-backoff retransmission cycle: the interval between attempts doubles (≈210 ms, ≈420 ms, then 840 ms, ...), consistent with the rto:210 reported by ss.
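The backoff schedule follows directly from that initial RTO; a minimal sketch of the doubling the kernel performs on each unanswered retransmission:

```shell
# Exponential backoff, seeded with the rto:210 value from ss:
# each failed retransmission doubles the wait before the next attempt.
rto=210
for attempt in 1 2 3 4 5; do
  echo "retransmission $attempt after ${rto} ms"
  rto=$((rto * 2))
done
```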

I simultaneously analyzed the capture on the physical interface (eth1), which transports the encrypted UDP packets representing the WireGuard tunnel.

# tcpdump -i eth1 -nn -vvv udp port 51820

14:01:23.100150 IP (tos 0x0, ttl 64, id 12345, offset 0, flags [DF], proto UDP (17), length 1560)
    192.168.1.10.51820 > 192.168.1.11.51820: UDP, length 1532

The outer IP packet is 1560 bytes: the 1500-byte inner packet plus 32 bytes of WireGuard framing, an 8-byte UDP header, and a 20-byte outer IPv4 header. This exceeds the standard Ethernet Maximum Transmission Unit (MTU) of 1500 bytes. Because the outer IP header also has the DF flag set, the physical switch or router between the two bare-metal nodes drops the packet. It does not forward it, and it does not fragment it.

According to RFC 1191, a router dropping a packet with the DF bit set must send an ICMP Type 3 Code 4 message ("Destination Unreachable, Fragmentation Needed and Don't Fragment was Set") back to the sender. This mechanism is Path MTU Discovery (PMTUD).

I checked the eth1 capture for any inbound ICMP traffic. There was none. I reviewed the iptables ruleset on the frontend node.

# iptables -L INPUT -v -n
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 154K   12M ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
 200K   18M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
   50  3200 ACCEPT     tcp  --  eth1   *       0.0.0.0/0            0.0.0.0/0            tcp dpt:22
  10K  600K ACCEPT     tcp  --  eth1   *       0.0.0.0/0            0.0.0.0/0            tcp dpt:80
  25K 1500K ACCEPT     tcp  --  eth1   *       0.0.0.0/0            0.0.0.0/0            tcp dpt:443
   15  2100 ACCEPT     udp  --  eth1   *       0.0.0.0/0            0.0.0.0/0            udp dpt:51820

The INPUT chain policy is DROP, and no rule explicitly permits ICMP traffic. The RELATED,ESTABLISHED rule would normally admit such errors: conntrack classifies an ICMP Fragmentation Needed message as RELATED by inspecting the embedded header of the original packet, even though the message originates from an intermediate router's address rather than the peer's. Here, however, the capture showed that no ICMP message was ever generated; a Layer 2 switch silently discards oversized frames and originates no ICMP at all. The default-DROP policy with no explicit ICMP rule simply guarantees that any PMTUD signal arriving by an unexpected path would also be lost.

No ICMP Fragmentation Needed message ever reached the kernel's network stack, so it never learned that packets exceeding the physical path MTU of 1500 bytes were being discarded. The TCP stack continued to use the MSS derived from the wg0 interface configuration.

Let us examine the interface configuration.

# ip addr show wg0
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none
    inet 10.0.0.1/24 scope global wg0
       valid_lft forever preferred_lft forever

The MTU of wg0 is set to 1500, identical to the physical interface. This is an administrative error: it leaves no headroom for the encapsulation overhead.

To understand why this breaks the connection, we must dissect the packet encapsulation overhead mathematically. When the application writes data to the socket, the kernel constructs an sk_buff (socket buffer).

  1. Application Payload: The application sends a MySQL protocol packet. If the packet is large, the kernel segments it.
  2. TCP Header: The kernel prepends a 20-byte TCP header.
  3. Inner IP Header: The kernel prepends a 20-byte IPv4 header. The destination is 10.0.0.2.
  4. If the wg0 MTU is 1500, the kernel allows the combined size of (Payload + TCP Header + Inner IP Header) to reach 1500 bytes. This means the TCP Maximum Segment Size (MSS) is calculated as 1500 - 20 - 20 = 1460 bytes. (In my ss output, it was 1448 due to the 12-byte TCP Timestamps option being enabled).
  5. WireGuard Encapsulation: The packet enters the WireGuard routing process. WireGuard encrypts the 1500-byte inner packet and adds its own framing: a 4-byte message type, a 4-byte receiver index, an 8-byte counter (the nonce), and a 16-byte Poly1305 authentication tag. Total WireGuard overhead: 32 bytes.
  6. UDP Header: WireGuard sends this payload over UDP. It prepends an 8-byte UDP header.
  7. Outer IP Header: The kernel prepends a 20-byte outer IPv4 header. The destination is 192.168.1.11.

The total size of the frame sent to the physical NIC driver is: Inner MTU (1500) + WireGuard (32) + UDP (8) + Outer IP (20) = 1560 bytes.
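The arithmetic in the steps above can be verified directly, using the figures from this setup:

```shell
# Encapsulation overhead per the breakdown above: an inner packet that
# fills the (misconfigured) wg0 MTU grows to 1560 bytes on the wire.
inner_mtu=1500     # wg0 MTU
wg_overhead=32     # 4 type + 4 receiver index + 8 counter + 16 tag
udp_header=8
outer_ip=20
echo $((inner_mtu + wg_overhead + udp_header + outer_ip))   # prints 1560
```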

The physical interface eth1 has an MTU of 1500. The packet is too large.

If this configuration is broken, why did standard application traffic succeed?

When a user browses the frontend, the database queries are brief. A SELECT query requesting a post ID or a boolean flag requires perhaps 60 bytes of payload. 60 (payload) + 52 (TCP header with timestamps plus IP header) + 60 (outer encapsulation) = 172 bytes, which passes through the 1500-byte physical MTU limit without issue.

The problem only manifests when the application transmits a payload that forces the TCP stack to utilize the full MSS negotiated during the handshake. This frequently occurs when administrators save theme settings: WordPress serializes complex configuration arrays and inline option data into a single UPDATE statement whose payload easily exceeds 2,000 bytes.

The kernel segments the stream at the 1448-byte MSS boundary. The resulting 1560-byte outer UDP packet dies on the physical segment: the switch drops it silently, no ICMP error ever reaches the sender (and the firewall would have discarded one anyway), and the TCP state machine hangs in ESTAB while retransmitting endlessly.

This is a PMTUD blackhole.

The immediate fix is to set the wg0 MTU to account for the encapsulation overhead: physical MTU (1500) - outer IPv4 header (20) - UDP header (8) - WireGuard overhead (32) = 1440. (The wg-quick default of 1420 is more conservative, leaving headroom for a 40-byte IPv6 outer header.)
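A quick check of the corrected value, again using the figures from this setup:

```shell
# With wg0 at 1440, a maximally sized inner packet encapsulates to
# exactly the 1500-byte physical MTU, and the effective TCP payload
# per segment (with timestamps enabled) shrinks accordingly.
wg_mtu=1440
echo $((wg_mtu + 32 + 8 + 20))    # prints 1500 -- fits eth1 exactly
echo $((wg_mtu - 20 - 20 - 12))   # prints 1388 -- minus IP, TCP, timestamps
```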

However, correcting the interface MTU alone can be insufficient when traffic traverses networks with varying MTUs, particularly if PMTUD depends on ICMP from intermediate nodes. A secondary, robust measure is TCP MSS clamping in the netfilter framework: the kernel inspects SYN packets during the initial three-way handshake and rewrites the MSS option in the TCP header to a safe value (with --clamp-mss-to-pmtu, the path MTU minus 40 bytes for the IPv4 and TCP headers, i.e. 1400 here), ensuring neither side ever attempts to send a segment larger than the path can carry.

# ip link set dev wg0 mtu 1440
# iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o wg0 -j TCPMSS --clamp-mss-to-pmtu
# iptables -A INPUT -p icmp --icmp-type 3/4 -m state --state ESTABLISHED,RELATED -j ACCEPT

These changes do not survive a reboot: persist the MTU in the WireGuard configuration (the MTU key for wg-quick, or MTUBytes in systemd-networkd) and the firewall rules via iptables-persistent or the distribution's equivalent.
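To verify the fix end-to-end, a DF-flagged ping sized to the new tunnel MTU should now succeed. This is a sketch assuming the tunnel peer address from this setup; the payload size accounts for the ICMP and IP headers inside the 1440-byte wg0 MTU:

```shell
# Largest ICMP echo payload that fits the corrected wg0 MTU:
# 1440 - 20 (IP header) - 8 (ICMP header) = 1412 bytes.
echo $((1440 - 20 - 8))   # prints 1412

# With DF forced, this should succeed after the fix, while a
# 1413-byte payload should fail locally with "message too long":
# ping -M do -s 1412 -c 3 10.0.0.2
```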
