
How to Saturate a 10Gbps Link Without Bypassing the Kernel

An engineering guide to overcoming the 16-bit TCP receive window limit and achieving line-rate throughput on high-bandwidth, high-latency networks.


I spent an entire afternoon staring at an iperf3 window that refused to budge past 1.2 Gbps on a brand-new 10Gbps cross-connect. The hardware was high-end, the fiber was clean, and the CPU was idling, yet the throughput felt like it was stuck in 2012. It’s a common frustration that leads many engineers to reach for DPDK or XDP to bypass the kernel entirely, but for 90% of use cases, the Linux networking stack is perfectly capable of hitting line rate—you just have to stop treating it like a black box.

The primary culprit is rarely the hardware. It’s usually a combination of the Bandwidth-Delay Product (BDP) and the fact that default Linux kernel parameters are tuned for the "average" internet user, not for high-throughput data center workloads.

The Math of the 64KB Ceiling

Before touching a single config file, we have to talk about the 16-bit limit. In the original TCP specification, the "Receive Window" field is only 16 bits wide. This means the largest window a receiver could ever advertise was 65,535 bytes (64KB).

On a local network with sub-millisecond latency, this isn't a problem. But as soon as you add distance, the Bandwidth-Delay Product bites you. BDP is the amount of data "in flight" required to fill the pipe.

BDP = Bandwidth (bits/sec) * Round Trip Time (seconds)

If you have a 10Gbps link with a 20ms RTT (roughly New York to Chicago), your BDP is:
10,000,000,000 * 0.020 = 200,000,000 bits, or about 25 Megabytes.

If your TCP window is stuck at 64KB, the sender can only have 64KB unacknowledged at any moment: it transmits one window, then waits out the 20ms RTT for acknowledgments before sending more. That caps you at roughly 26 Mbps (one 64KB window per RTT), regardless of your 10G hardware.
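That back-of-the-envelope math is worth sanity-checking. Here it is as a few lines of Python, using the example 20ms RTT from above:

```python
# Bandwidth-Delay Product: the data that must be in flight to fill the pipe.
bandwidth_bps = 10_000_000_000   # 10 Gbps
rtt_s = 0.020                    # 20 ms, roughly New York to Chicago

bdp_bytes = int(bandwidth_bps * rtt_s / 8)
print(f"BDP: {bdp_bytes / 1_000_000:.0f} MB")   # BDP: 25 MB

# With an un-scaled 64KB window, only one window can be unacknowledged per RTT:
window_bytes = 65_535
max_throughput_bps = window_bytes * 8 / rtt_s
print(f"Window-limited: {max_throughput_bps / 1_000_000:.1f} Mbps")   # ~26.2 Mbps
```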

Scaling the Window

RFC 1323 (now RFC 7323) introduced Window Scaling. This uses a "shift count" in the TCP header options to multiply the 16-bit window value by a power of two. Most modern Linux distributions have this enabled by default, but it’s the first thing you should verify.

Check it with:

sysctl net.ipv4.tcp_window_scaling

If it’s 0, your 10G link is effectively a 10M link for anything over a few milliseconds of latency. Set it to 1.
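For a sense of the headroom scaling buys: RFC 7323 caps the shift count at 14, which takes the maximum advertisable window from 64KB to just under 1GiB. A quick sketch:

```python
# The window scale option multiplies the 16-bit window field by 2^shift.
# RFC 7323 caps the shift count at 14 so sequence-number arithmetic stays safe.
base_window = 65_535   # largest value of the 16-bit Receive Window field
max_shift = 14

max_window = base_window << max_shift
print(f"Max scaled window: {max_window:,} bytes")   # 1,073,725,440 bytes, just under 1 GiB
```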

Pushing the Kernel's Memory Limits

Even with scaling enabled, the Linux kernel won't just hand out 25MB of buffer space to every connection. It has strict limits to prevent a single rogue process from eating all the system memory. To saturate a 10G link, we need to raise the ceiling on both the core network buffers and the TCP-specific buffers.

The rmem (receive memory) and wmem (send memory) values are crucial. There are two sets: the core limits (the absolute max the kernel allows) and the ipv4.tcp limits (which allow for "autotuning").

Add these to your /etc/sysctl.conf:

# Increase the maximum total buffer space for all protocols
net.core.rmem_max = 134217728 
net.core.wmem_max = 134217728

# TCP memory tuning: [min, default, max] in bytes
# Here we allow TCP to scale up to 64MB per socket if needed
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Increase the maximum number of packets in the receive queue
net.core.netdev_max_backlog = 10000

Why these numbers? 64MB (67108864) is usually enough to cover the BDP of most transcontinental 10G links. If you're doing 10G over satellite or trans-Pacific (150ms+ RTT), you might need to push that to 128MB or 256MB.

After editing, apply with sysctl -p.
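Rather than memorizing magic numbers, you can derive the max value from the BDP. Here is a small helper (the function name is mine, not a standard API) that rounds the BDP up to the next power of two, matching the convention of the sysctl values above:

```python
def recommended_buffer_bytes(bandwidth_bps: int, rtt_s: float) -> int:
    """Smallest power-of-two buffer that covers the link's BDP."""
    bdp_bytes = bandwidth_bps * rtt_s / 8
    size = 4096
    while size < bdp_bytes:
        size *= 2
    return size

# 10 Gbps at 50 ms (transcontinental): 62.5 MB BDP -> 64 MB, the tcp_rmem max above
print(recommended_buffer_bytes(10_000_000_000, 0.050))   # 67108864
# 10 Gbps at 150 ms (trans-Pacific): 187.5 MB BDP -> 256 MB
print(recommended_buffer_bytes(10_000_000_000, 0.150))   # 268435456
```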

BBR: The Congestion Control Game Changer

The default TCP congestion control algorithm in most kernels is CUBIC. It’s fine, but it’s loss-based. CUBIC interprets a dropped packet as a sign of network congestion and immediately slashes its sending rate. On a 10G link, a tiny bit of random packet loss (which happens on long-haul fiber) will cause CUBIC to "sawtooth," preventing it from ever reaching line rate.

BBR (Bottleneck Bandwidth and Round-trip propagation time), developed by Google, focuses on actual throughput and RTT rather than just packet loss. It is significantly more aggressive and efficient for high-bandwidth links.

To enable BBR, change the queuing discipline (qdisc) to Fair Queuing (fq) as well. On kernels before 4.13, BBR strictly requires fq for packet pacing; newer kernels can pace in the TCP layer itself, but fq remains the recommended pairing.

# Enable BBR
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

To make it permanent, add them to /etc/sysctl.conf. You can verify the change with sysctl net.ipv4.tcp_congestion_control. If you’re running a kernel older than 4.9, you're out of luck here—but if you're trying to hit 10Gbps, you shouldn't be on a kernel that old anyway.
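If you'd rather check from code than from the shell, the same knobs are exposed under /proc/sys. A small Linux-only sketch (it returns None elsewhere) that reports whether bbr is even compiled in or loaded as a module:

```python
from pathlib import Path

def congestion_control_status():
    """Report the active and available TCP congestion control algorithms (Linux only)."""
    base = Path("/proc/sys/net/ipv4")
    try:
        active = (base / "tcp_congestion_control").read_text().strip()
        available = (base / "tcp_available_congestion_control").read_text().split()
    except OSError:
        return None   # not Linux, or /proc not mounted
    return {"active": active, "available": available, "bbr_ready": "bbr" in available}

print(congestion_control_status())
```

If "bbr" is missing from the available list, `modprobe tcp_bbr` should load it on any distribution kernel that ships the module.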

Dealing with the Hardware: NIC Offloading

If your CPU is spiking while trying to hit 10G, the kernel is likely spending too much time processing individual packets. Every packet requires an interrupt, and at 10Gbps, that's a lot of interrupts.

First, check if your NIC is offloading tasks like checksum calculation and segmentation to the hardware.

# View offload settings
ethtool -k eth0

Look for generic-segmentation-offload (GSO), tcp-segmentation-offload (TSO), and generic-receive-offload (GRO). If they are off, turn them on.

ethtool -K eth0 gso on tso on gro on

Jumbo Frames: If you control the entire network path (e.g., inside a single data center), increasing the MTU from 1500 to 9000 is the single biggest "easy win" for 10G. It reduces the number of packets the kernel has to process by a factor of six.

ip link set dev eth0 mtu 9000

*Warning: If any switch or hop in the path doesn't support 9000, your packets will be dropped or fragmented, destroying performance.*
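The "factor of six" claim is easy to verify with rough arithmetic (payload-only, ignoring per-frame header overhead):

```python
# Rough packets-per-second at 10 Gbps line rate for each MTU.
line_rate_bps = 10_000_000_000

pps_1500 = line_rate_bps / 8 / 1500   # ~833,000 packets/sec
pps_9000 = line_rate_bps / 8 / 9000   # ~139,000 packets/sec

print(f"MTU 1500: {pps_1500:,.0f} pps")
print(f"MTU 9000: {pps_9000:,.0f} pps ({pps_1500 / pps_9000:.0f}x fewer)")
```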

Coding for Throughput: The Application Layer

Even with a perfectly tuned kernel, your application can still be the bottleneck. A common mistake is using the default buffer sizes in your language's socket implementation.

Here is a Python example showing how to explicitly set the socket buffer sizes to match our kernel tuning. Even though the kernel does "autotuning," manually setting SO_SNDBUF or SO_RCVBUF can be necessary for certain high-performance binary protocols.

import socket

def create_high_performance_socket():
    # Create a TCP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    
    # 32MB buffer size
    buffer_size = 32 * 1024 * 1024
    
    # Set the Send and Receive buffers
    # Note: The kernel will actually double the value you set to allow for metadata overhead
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_size)
    
    # Enable TCP_NODELAY to disable Nagle's algorithm if sending small messages
    # Though for 10G throughput, Nagle usually isn't the primary bottleneck
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    
    # Check what the kernel actually gave us
    actual_send_buf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    print(f"Actual Send Buffer: {actual_send_buf / 1024 / 1024:.2f} MB")
    
    return sock

In C, it looks very similar:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

int setup_socket(int port) {
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket");
        return -1;
    }

    int window_size = 32 * 1024 * 1024; // 32MB

    // Set send buffer
    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &window_size, sizeof(window_size)) < 0) {
        perror("Error setting SO_SNDBUF");
    }

    // Set receive buffer (must happen before listen() to affect the window)
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &window_size, sizeof(window_size)) < 0) {
        perror("Error setting SO_RCVBUF");
    }

    // Bind to the requested port
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons((unsigned short)port);
    if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
    }

    return sockfd;
}

One major caveat: If you manually set SO_SNDBUF or SO_RCVBUF in your code, Linux disables TCP autotuning for that socket. This means the window won't shrink if the network gets congested or expand if more bandwidth becomes available. Generally, it's better to let the kernel autotune and simply raise the global net.ipv4.tcp_rmem and wmem limits instead of hardcoding values in your app.
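You can watch this behavior from user space. The sketch below assumes Linux semantics: the kernel stores double the value you request (to account for bookkeeping overhead), and setting the option at all locks out autotuning for that socket. It requests a modest buffer and reads back what the kernel actually allocated:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

requested = 64 * 1024   # small enough to stay under the default rmem_max cap
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# On Linux this prints 131072: double the requested value. From this point on,
# autotuning is disabled for this socket, even if the network conditions change.
actual = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {requested}, kernel allocated {actual}")
sock.close()
```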

Interrupt Coalescing and RSS

On a multi-core system, you might find that one CPU core is pegged at 100% (handling interrupts) while the others are doing nothing. This is a classic bottleneck.

Receive Side Scaling (RSS) distributes the processing of network packets across multiple CPU cores. Most 10G NICs support this. You can check the interrupt distribution with:

grep eth0 /proc/interrupts

If all the interrupts are hitting CPU0, you might need to run irqbalance or manually set the smp_affinity for your NIC’s IRQs.

Another lever is Interrupt Coalescing. This tells the NIC to wait a few microseconds or until it has a few packets before firing an interrupt.

# See current settings
ethtool -c eth0

# Set rx-usecs to 100
ethtool -C eth0 rx-usecs 100

Lower rx-usecs means lower latency; higher rx-usecs means higher throughput and lower CPU usage. For 10Gbps line rate, you usually want to bump this up to give the CPU some breathing room.
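A back-of-the-envelope calculation shows why this matters at 10G. This is a worst-case sketch: in practice NAPI polling already batches packets, so real interrupt rates sit below the naive figure even without coalescing:

```python
# Worst-case interrupt arithmetic at 10 Gbps line rate with MTU 1500.
line_rate_pps = 10_000_000_000 / 8 / 1500   # ~833,000 packets/sec

# One interrupt per packet (no coalescing, no NAPI batching):
irq_per_sec_naive = line_rate_pps

# With rx-usecs=100 the NIC fires at most once per 100 microseconds:
irq_per_sec_coalesced = 1_000_000 / 100     # 10,000 interrupts/sec

print(f"{irq_per_sec_naive:,.0f} -> {irq_per_sec_coalesced:,.0f} interrupts/sec")
```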

Monitoring the Bottleneck

To see if your tuning is actually working, don't just look at the throughput. Look at the TCP state. The ss tool (socket statistics) is much better for this than the older netstat.

# Look at the internal TCP stats for a specific connection
ss -ti '( dport = :8080 )'

Output will look like this:

cubic wscale:7,7 rto:201 rtt:0.12/0.03 mss:1448 pmtu:1500 cwnd:10 ... rcv_space:65536 ...

Look for cwnd (Congestion Window). If cwnd multiplied by mss (Maximum Segment Size) is significantly lower than your BDP, the kernel is still holding back. If rcv_space is small, your receive buffers are the bottleneck.
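As a worked example of that check (the cwnd value here is illustrative, not from a real capture):

```python
# Compare in-flight capacity (cwnd * mss) against the link's BDP.
mss = 1448            # bytes per segment, from the ss output
cwnd = 3000           # congestion window in segments (illustrative)
rtt_s = 0.020
bandwidth_bps = 10_000_000_000

in_flight_bytes = cwnd * mss              # ~4.3 MB
bdp_bytes = bandwidth_bps * rtt_s / 8     # 25 MB

print(f"In flight: {in_flight_bytes / 1e6:.1f} MB vs BDP: {bdp_bytes / 1e6:.0f} MB")
if in_flight_bytes < bdp_bytes:
    print("cwnd is the bottleneck: this connection cannot fill the pipe")
```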

Summary Checklist for 10G

If you are struggling to hit 10Gbps, follow this order of operations:

1. Check BDP: Is your RTT > 1ms? If so, you must increase buffer sizes.
2. Verify Window Scaling: sysctl net.ipv4.tcp_window_scaling must be 1.
3. Raise Memory Ceilings: Set rmem_max and wmem_max to at least 64MB.
4. Switch to BBR: Change the qdisc to fq and congestion control to bbr.
5. Enable NIC Offloads: Use ethtool to ensure TSO, GSO, and GRO are on.
6. MTU 9000: If you control the hardware end-to-end, use Jumbo Frames.
7. Check Interrupts: Ensure irqbalance is running or affinity is set so one core isn't dying while others sleep.

The Linux kernel is an incredible piece of engineering, but it is built to be "safe." For high-performance networking, you have to move past the safe defaults and tell the kernel exactly how much memory you're willing to trade for speed. Bypass the kernel if you're building a world-class load balancer or a high-frequency trading platform—but for almost everything else, just tuning the stack is more than enough.