loke.dev

What Nobody Tells You About Path MTU Discovery: Why Your 'Healthy' Network Is Silently Dropping Large Payloads

Stop debugging your application code for random connection hangs; you are likely a victim of the ICMP black hole where large packets are discarded without a trace.


You’re logged into a remote server via SSH. You run ls, and the directory listing appears instantly. You run uptime, and it's snappy. Then, you try to cat a large configuration file or run a git pull, and the terminal simply freezes. No error message, no "connection lost"—just a blinking cursor that refuses to move. You check your ping; it's 20ms. You check your bandwidth; you have 1Gbps. Your network looks perfectly healthy, yet it is silently swallowing your data.

This isn't a fluke. You are likely a victim of an ICMP Black Hole, a direct result of a failed Path MTU Discovery (PMTUD) process.

The 1500-Byte Lie

Almost every Ethernet network on the planet defaults to an MTU (Maximum Transmission Unit) of 1500 bytes. This is the largest IP packet a network interface will send without breaking it into pieces. When your application sends data, the TCP stack chunks it into segments. Subtract the 20-byte IPv4 header and the 20-byte TCP header, and you're left with a Maximum Segment Size (MSS) of 1460 bytes.

As long as every hop between you and your destination supports 1500 bytes, life is good. But the modern internet is a mess of tunnels, VPNs, and virtualized overlays.

Imagine your packet travels through a GRE tunnel or an IPsec VPN. These protocols need their own headers. To fit a 1500-byte packet inside a tunnel, the router has to add, say, 24 bytes of overhead. Now your 1500-byte packet becomes 1524 bytes. If the physical link only supports 1500, that packet can't pass.
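If the header arithmetic feels fiddly, here's a minimal sketch of it. The tunnel overheads below are typical values, not guarantees — real numbers vary with options, ciphers, and IPv4 vs IPv6 outer headers:

```python
# Effective MSS for a given path MTU: subtract the 20-byte IPv4 header
# and the 20-byte TCP header.
def mss_for_mtu(mtu: int) -> int:
    return mtu - 20 - 20

# Typical encapsulation overheads (assumptions; check your own config):
TUNNEL_OVERHEAD = {
    "gre": 24,        # 20-byte outer IPv4 header + 4-byte GRE header
    "wireguard": 60,  # outer IPv4 + UDP + WireGuard framing (commonly cited)
}

# Largest inner packet that fits a link once the tunnel's overhead is added.
def inner_mtu(link_mtu: int, tunnel: str) -> int:
    return link_mtu - TUNNEL_OVERHEAD[tunnel]

print(mss_for_mtu(1500))                     # 1460
print(inner_mtu(1500, "gre"))                # 1476
print(mss_for_mtu(inner_mtu(1500, "gre")))   # 1436
```

In other words: inside a GRE tunnel on a 1500-byte link, only 1476-byte packets fit, which means TCP should really be using an MSS of 1436 — not the 1460 both endpoints negotiated.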

How PMTUD Is Supposed to Work

The IP protocol has a mechanism to handle this: the DF (Don't Fragment) bit. Most modern TCP stacks set the DF bit on all outgoing packets. Why? Because fragmentation is computationally expensive for routers and can lead to reassembly errors.

When a router receives a packet that is too large for the next hop's MTU and the DF bit is set, it is supposed to:
1. Drop the packet.
2. Send an ICMP "Destination Unreachable" message (Type 3, Code 4: "Fragmentation Needed and Don't Fragment was Set") back to the sender, as defined in RFC 792.
3. Include the MTU of that next hop in the ICMP message — the field that RFC 1191, the PMTUD specification, added for exactly this purpose.

The sender receives this ICMP packet, realizes the path is narrower than expected, reduces its MTU for that specific destination, and resends the data. This is Path MTU Discovery.

The Birth of the ICMP Black Hole

The system breaks because many network administrators view ICMP as a security risk. They follow a "block all ICMP" policy on their firewalls, thinking they are hardening the network against ping sweeps or Smurf attacks.

When a firewall drops ICMP Type 3 Code 4 messages, the sender never gets the memo that the packet was too large. The sender's TCP stack just knows it sent a packet and never received an ACK. It waits, then retransmits. And retransmits. And retransmits.

This is the "Black Hole." Small packets (like the TCP handshake or a small ls command) pass through because they are under the MTU limit. Large packets (the payload of your file transfer) are dropped silently. The connection stays "open" in the eyes of the OS, but no data moves.
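The dynamic is easier to see as a toy simulation — no real sockets here, just the control flow of a sender facing a hypothetical 1400-byte bottleneck, once with ICMP feedback flowing and once with it firewalled:

```python
def send_over_path(segment_size, bottleneck_mtu, icmp_allowed):
    """Returns ('delivered', size), ('icmp', next_hop_mtu), or ('black_hole', None)."""
    if segment_size <= bottleneck_mtu:
        return ("delivered", segment_size)
    if icmp_allowed:
        # Router sends Type 3 Code 4 carrying the next-hop MTU.
        return ("icmp", bottleneck_mtu)
    # Firewall eats the ICMP: packet dropped with zero feedback.
    return ("black_hole", None)

def transfer(path_mtu_guess, bottleneck, icmp_allowed, max_retries=3):
    size = path_mtu_guess
    for attempt in range(max_retries + 1):
        result, info = send_over_path(size, bottleneck, icmp_allowed)
        if result == "delivered":
            return f"delivered at {info} bytes after {attempt} drops"
        if result == "icmp":
            size = info  # PMTUD: shrink to the advertised MTU and resend
    return "hang: retransmitting forever, connection looks open but moves no data"

print(transfer(1500, 1400, icmp_allowed=True))   # recovers after one drop
print(transfer(1500, 1400, icmp_allowed=False))  # hangs
```

With ICMP allowed, the sender converges on the real path MTU after a single drop. With it blocked, every retransmission is identical — and identically doomed.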

Visualizing the Failure with tcpdump

If you suspect PMTUD issues, the first thing you should do is look at the traffic. Here is how I usually catch this in the wild. Run this on the client machine while attempting a large transfer:

# Capture ICMP plus anything big enough to hit a typical tunnel bottleneck
tcpdump -ni eth0 'icmp or greater 1400'

If you see "Fragmentation needed" ICMP messages arriving but your transfer still stalls, a local firewall is likely dropping them before your TCP stack can react. Worse, if you see the outgoing large packets but *zero* ICMP responses, you've found your black hole.

You can also use ping to manually find the MTU of a path. On Linux, use the -M do flag to set the DF bit and -s to specify the size (note: -s is the payload size, so add 28 bytes for headers):

# Try to send a packet that results in a 1500 byte total size
ping -M do -s 1472 1.1.1.1

If you get Frag needed and DF set, keep lowering the size until it passes. If it simply times out at 1472 but works at 1400, someone is dropping your ICMP messages.
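Manually bisecting payload sizes gets old fast, so I usually script it. Here's a sketch that wraps the same Linux `ping -M do` probe in a binary search. The probe function is injected so the search logic can be exercised without a network (the host name below is just an example):

```python
import subprocess

def icmp_probe(host, payload):
    """True if a DF-flagged ping with this payload size gets through (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", "-M", "do", "-s", str(payload), host],
        capture_output=True,
    )
    return result.returncode == 0

def find_path_mtu(host, probe=icmp_probe, lo=548, hi=1472):
    """Binary-search the largest payload that passes, then add the 28 header bytes."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(host, mid):
            lo = mid   # mid fits: the path MTU is at least this big
        else:
            hi = mid - 1
    return lo + 28     # payload + 8-byte ICMP header + 20-byte IP header

# Demo with a fake probe simulating a 1400-byte bottleneck:
fake = lambda host, payload: payload + 28 <= 1400
print(find_path_mtu("example.net", probe=fake))  # 1400
```

On a real box you'd drop the fake and let `icmp_probe` do the work — about ten pings instead of an afternoon of trial and error.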

The "Dirty" Fix: TCP MSS Clamping

Sometimes you can't control the broken firewall in the middle of the internet. If you are a network administrator or you control the gateway/router, you can use a technique called TCP MSS Clamping.

MSS Clamping is a "hack" that intercepts the initial TCP 3-way handshake (the SYN packets). The router modifies the MSS value inside the SYN packet to a lower value, forcing both ends of the connection to agree on a smaller maximum segment size from the start, regardless of what their local MTU is.

If you are using iptables on a Linux-based router, this is the magic command:

# Clamp MSS to PMTU (automatically detects the MTU of the outgoing interface)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
         -j TCPMSS --clamp-mss-to-pmtu

Or, if you want to be explicit (e.g., for a WireGuard tunnel with a 1420 MTU):

# 1420 MTU - 40 bytes (IP+TCP headers) = 1380 MSS
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
         -j TCPMSS --set-mss 1380

On a Cisco router, the equivalent would be:

interface GigabitEthernet0/1
 ip tcp adjust-mss 1360

Proactive Discovery in Code

If you're writing a high-performance networking application in Python or Go, you shouldn't just pray that PMTUD works. You can actually query the kernel for its discovered MTU for a specific destination.

In Linux, the kernel maintains a routing cache that includes the learned PMTU. Here’s how you can peek at it using Python:

import socket

def get_pmtu(destination_ip):
    # Create a UDP socket (we don't even need to send data)
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((destination_ip, 80))
        
        # IP_MTU (value 14 in the Linux headers) retrieves the MTU currently
        # known for this path; it isn't exposed by Python's socket module,
        # so define it manually. Linux-only.
        IP_MTU = 14
        pmtu = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
        return pmtu
    except Exception as e:
        return f"Could not determine PMTU: {e}"
    finally:
        s.close()

print(f"Current Path MTU to 8.8.8.8: {get_pmtu('8.8.8.8')}")

This is incredibly useful for logging. If a user reports "hangs," your application can log the PMTU the kernel is seeing. If you see a value like 1492 or 1280, you know there's encapsulation happening.

Why the Cloud Makes This Worse

In the "old days" of physical data centers, you controlled the switches. In the cloud, you are wrapped in layers of abstraction.

1. VPC Overlays: AWS VPCs and Google Cloud Networks use encapsulation (like Geneve or VXLAN). While they usually present a 1500-byte MTU to your instance, the actual physical underlying network is often running jumbo frames (9000 bytes) to accommodate the overhead.
2. VPN Gateways: If you connect your on-premise data center to AWS via a Transit Gateway or a Site-to-Site VPN, the MTU on that tunnel is well below 1500 — typically somewhere around 1400, depending on the encapsulation.
3. Load Balancers: AWS NLBs (Network Load Balancers) can sometimes preserve the source IP but modify the path characteristics, leading to subtle PMTUD failures if your security groups are too restrictive.

The Golden Rule for Cloud Security Groups: Always allow ICMP Type 3, Code 4. If you use a "blanket deny" for ICMP on your AWS Security Group, you are intentionally breaking your own network's ability to recover from packet size mismatches.
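In boto3 terms, the rule is a single permission entry. A quirk worth knowing: for ICMP, the "port" fields carry the type and code. The helper below just builds the permission dict (the security-group ID in the commented-out call is a placeholder):

```python
def allow_frag_needed(cidr="0.0.0.0/0"):
    """Build the ingress permission for ICMP Type 3, Code 4 (Fragmentation Needed)."""
    # For IpProtocol="icmp", FromPort is the ICMP type and ToPort is the code.
    return {
        "IpProtocol": "icmp",
        "FromPort": 3,   # Destination Unreachable
        "ToPort": 4,     # Fragmentation Needed and DF set
        "IpRanges": [{"CidrIp": cidr, "Description": "PMTUD: allow Frag Needed"}],
    }

# Applying it (requires boto3 and credentials; "sg-xxxxxxxx" is a placeholder):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId="sg-xxxxxxxx", IpPermissions=[allow_frag_needed()]
# )
```

That one rule costs you nothing security-wise and keeps PMTUD alive across your VPC boundary.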

Linux Kernel Tuning for PMTUD

Sometimes the kernel's behavior regarding PMTUD needs a nudge. You can inspect and change these via sysctl.

# Check whether PMTUD is disabled (0 = PMTUD enabled, which is the default)
sysctl net.ipv4.ip_no_pmtu_disc

# If for some reason you need to disable it (rare, but used in some weird ISP setups)
# sysctl -w net.ipv4.ip_no_pmtu_disc=1

There is also a setting for TCP MTU Probing (RFC 4821). This is a more robust way to find the MTU that doesn't rely on ICMP. It works by sending actual TCP data packets of varying sizes and seeing if they get ACKed.

# 0: Disabled
# 1: Disabled by default, enabled when ICMP black hole is detected
# 2: Always enabled
sysctl -w net.ipv4.tcp_mtu_probing=1

Setting tcp_mtu_probing=1 is a great safety net. If the kernel notices that TCP segments are timing out and not being acknowledged, it will automatically try smaller MSS values to "probe" for the working MTU, bypassing the need for ICMP "Fragmentation Needed" messages entirely.
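If you're scripting fleet health checks, decoding that value is trivial. This little helper assumes the standard Linux /proc path and falls back gracefully elsewhere:

```python
from pathlib import Path

# Meanings of net.ipv4.tcp_mtu_probing values (see sysctl docs).
PROBING_MODES = {
    "0": "disabled",
    "1": "enabled only after a black hole is suspected",
    "2": "always enabled",
}

def mtu_probing_mode(raw):
    """Map the raw sysctl value string to a human-readable description."""
    value = raw.strip()
    return PROBING_MODES.get(value, f"unknown value: {value}")

# On a Linux host, read the live value; skip silently on other platforms.
sysctl_file = Path("/proc/sys/net/ipv4/tcp_mtu_probing")
if sysctl_file.exists():
    print("tcp_mtu_probing:", mtu_probing_mode(sysctl_file.read_text()))
```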

The "Don't Do This" Fix: Lowering Interface MTU

I often see people fix this by just setting their server's eth0 MTU to 1400.

# The "Quick and Dirty" that I don't recommend
ip link set dev eth0 mtu 1400

While this works, it’s a sledgehammer approach. You are reducing performance for *all* traffic, including traffic within your local network or VPC that could have easily handled 1500 or 9000 bytes. It's much better to fix the ICMP filtering or use MSS clamping at the bottleneck than to globally cripple your server's throughput.
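To put numbers on that "sledgehammer": here's a back-of-the-envelope calculation of how many segments (and how many bytes of pure header overhead) it takes to move 1 MB at different MTUs, assuming the usual 40 bytes of IPv4+TCP headers per segment:

```python
def packets_and_overhead(payload_bytes, mtu, header=40):
    """Segments needed and total header bytes to move a payload at a given MTU."""
    mss = mtu - header
    segments = -(-payload_bytes // mss)  # ceiling division
    return segments, segments * header

# Compare jumbo frames, standard Ethernet, and a globally lowered MTU:
for mtu in (9000, 1500, 1400):
    segs, hdr = packets_and_overhead(1_000_000, mtu)
    print(f"MTU {mtu}: {segs} segments, {hdr} header bytes")
```

Dropping from 1500 to 1400 adds dozens of extra segments per megabyte on every flow, including the intra-VPC traffic that never needed the help — and you give up the jumbo-frame gains entirely.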

Summary Checklist for Troubleshooting

If you have a connection that hangs on large payloads:

1. Test with `ping`: ping -M do -s 1472 <target>. If it fails, keep reducing 1472 until it works.
2. Check ICMP: Ensure your firewalls and security groups allow ICMP Type 3, Code 4.
3. Monitor with `tcpdump`: Look for "Length" in your packet captures and see where they stop being ACKed.
4. Try MTU Probing: Enable net.ipv4.tcp_mtu_probing=1 on the sending server.
5. Last Resort: Use MSS Clamping on the router/gateway to force a lower MSS during the TCP handshake.

The internet is not a uniform pipe. It is a jagged series of tunnels and legacy hardware. Understanding Path MTU Discovery is the difference between an application that "works on my machine" and one that survives the chaotic reality of modern networking. Stop blaming your code for the "random" hangs; start looking at your packet sizes.