
A Narrow Window for the 10Gbps Stream

Your high-speed connection is physically capped by the product of latency and buffer size, regardless of how much bandwidth you pay for.


You are paying for a 10Gbps connection, but your single-stream data transfers are likely peaking at 300Mbps, and there isn't a single thing your ISP can do to fix it. This isn't a marketing scam or a "best effort" service clause. It is a fundamental law of networking physics. Your high-speed connection is physically capped by the product of latency and buffer size, a concept known as the Bandwidth-Delay Product (BDP). If your software isn't tuned to handle the distance, your expensive fiber line is essentially a massive firehose being emptied through a cocktail straw.

I’ve spent years watching developers pull their hair out because a 100GB database migration between cloud regions is taking twelve hours despite having "massive" interconnects. They check the CPU, they check the disk I/O, they check the NIC—everything looks idle. The culprit is almost always a tiny, invisible window that governs how much data can be "in flight" at any given time.

The Mathematical Wall: Bandwidth-Delay Product

TCP is a chatty protocol. It doesn't just scream data into the void; it sends a chunk, waits for the receiver to say "I got it," and then sends more. The amount of data you can send before you must stop and wait for an acknowledgment (ACK) is called the TCP Window Size.

The Bandwidth-Delay Product (BDP) tells you exactly how big that window needs to be to keep the pipe full. The formula is deceptively simple:

$$\text{BDP (bits)} = \text{Bandwidth (bits/sec)} \times \text{RTT (seconds)}$$

Let's look at a real-world scenario. You are transferring data from New York to London.
- Bandwidth: 10 Gbps (10,000,000,000 bits per second)
- RTT (Latency): 75ms (0.075 seconds)

$$BDP = 10,000,000,000 \times 0.075 = 750,000,000 \text{ bits}$$
$$750{,}000{,}000 \text{ bits} / 8 = 93{,}750{,}000 \text{ bytes} = 93.75 \text{ MB}$$

This means that to utilize your full 10Gbps link, your system needs to keep 93.75 MB of data in flight at all times. If your TCP window size (your buffer) is capped at the old default of 64KB or even a modern 2MB, you will never, ever hit 10Gbps. You will hit a hard ceiling regardless of how "fast" your internet is.
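The arithmetic is easy to script if you want to plug in your own links. A minimal sketch (the bandwidth and RTT figures are the illustrative NYC-London numbers from above):

```python
def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Return the Bandwidth-Delay Product in megabytes (decimal MB)."""
    bits_in_flight = bandwidth_gbps * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8 / 1e6  # bits -> bytes -> MB

# NYC -> London: 10 Gbps at 75ms RTT
print(f"{bdp_bytes(10, 75):.2f} MB")  # 93.75 MB
```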

The 64KB Ghost

Back in the early days of the internet, the TCP header reserved only 16 bits for the window size, capping it at 65,535 bytes (64KB). In 1988, 64KB was a massive amount of data. Today, on a 10Gbps link with 75ms of latency, a 64KB window limits your throughput to roughly 7 Mbps.
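You can verify that ceiling with the same window-over-RTT arithmetic, inverted: instead of asking how big the window must be, ask how fast a fixed window lets you go.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Max throughput in Mbps when capped by a fixed TCP window."""
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1e6

# Classic 16-bit window (65,535 bytes) on a 75ms link
print(f"{window_limited_mbps(65535, 75):.1f} Mbps")  # 7.0 Mbps
```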

Modern systems use TCP Window Scaling (RFC 1323) to get around this, allowing windows up to 1GB. But here is the catch: just because your OS *can* scale the window doesn't mean it *will*. The kernel is often conservative to save memory, and many applications don't know how to ask for more.
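How does an application "ask for more"? On a plain BSD-style socket it is a setsockopt call made before connecting, so the window scale factor is negotiated with room to grow. A minimal Python sketch; the 32MB figure is illustrative, and the kernel may still clamp the request to net.core.rmem_max:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Request 32MB receive/send buffers *before* connect()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32 * 1024 * 1024)

# Inspect what the kernel actually granted (Linux doubles the
# requested value for bookkeeping and caps it at rmem_max)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```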

Checking your actual throughput capacity

If you want to see what your theoretical cap is based on your current buffers, you can use this simple Python script.

def calculate_max_throughput(window_size_mb, rtt_ms):
    """
    Calculates the maximum theoretical throughput for a TCP stream.
    window_size_mb: The TCP receive window in Megabytes
    rtt_ms: The Round Trip Time in milliseconds
    """
    window_bits = window_size_mb * 1024 * 1024 * 8
    rtt_seconds = rtt_ms / 1000.0
    
    # Throughput = WindowSize / Latency
    bps = window_bits / rtt_seconds
    gbps = bps / 1_000_000_000
    
    return gbps

# Example: Default Linux window (approx 4MB) on a 100ms cross-country link
print(f"Max throughput: {calculate_max_throughput(4, 100):.2f} Gbps")
# Result: 0.34 Gbps (Nowhere near 10Gbps!)

Tuning the Linux Kernel for High Speed

If you are running a server that needs to push high-speed data over long distances, the default Linux networking stack is probably holding you back. You need to tell the kernel that it’s okay to use more memory for buffers.

We do this via sysctl. Here are the settings I typically apply when I need to saturate a 10Gbps link over a long-haul connection:

# Increase the maximum amount of memory for TCP buffers
# format: min, default, max (values in bytes)

# Read buffers
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 134217728'

# Write buffers
sudo sysctl -w net.ipv4.tcp_wmem='4096 65536 134217728'

# Max OS-level receive/send buffer sizes
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_max=134217728

# Enable TCP Window Scaling (should be on by default, but verify)
sudo sysctl -w net.ipv4.tcp_window_scaling=1

Why these numbers? 134,217,728 bytes is 128MB. Remember our math earlier? NYC to London needed ~94MB for a 10Gbps link. By setting the max to 128MB, we give the TCP stack enough room to breathe and scale the window as needed.
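A quick sanity check is to compare the max field of tcp_rmem against your computed BDP. Here is a sketch that parses the sysctl's "min default max" format; on a real box you would feed it the contents of /proc/sys/net/ipv4/tcp_rmem (the sample values below are illustrative):

```python
def rmem_covers_bdp(tcp_rmem: str, bandwidth_gbps: float, rtt_ms: float) -> bool:
    """True if the max TCP read buffer is at least the link's BDP."""
    _min, _default, max_bytes = (int(v) for v in tcp_rmem.split())
    bdp = bandwidth_gbps * 1e9 * (rtt_ms / 1000.0) / 8  # bytes
    return max_bytes >= bdp

# 128MB max vs the ~94MB NYC-London BDP from earlier
print(rmem_covers_bdp("4096 87380 134217728", 10, 75))  # True
# A stock 6MB max falls far short
print(rmem_covers_bdp("4096 131072 6291456", 10, 75))   # False
```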

The Congestion Control Factor: BBR vs. CUBIC

Having a big enough window is only half the battle. You also have to worry about how the kernel reacts when it loses a packet.

The traditional algorithm, CUBIC, is pessimistic. It treats packet loss as a sign of congestion and sharply cuts its sending rate (classic Reno halves the congestion window; CUBIC drops it by roughly 30%). On a 10Gbps link, slashing your speed because of a tiny bit of line noise is catastrophic, and it takes a long time for CUBIC to ramp back up to full speed.

Enter BBR (Bottleneck Bandwidth and Round-trip propagation time). Developed by Google, BBR doesn't look at packet loss as the primary indicator of congestion. Instead, it measures how fast the pipe actually is and tries to keep it full.

I’ve seen BBR turn a "broken" 20Mbps cross-continental link into an 800Mbps link with a single command:

# Check current congestion control
sysctl net.ipv4.tcp_congestion_control

# Switch to BBR
sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

If you are moving data over the public internet, BBR is almost always the right choice. It is much more aggressive about reclaiming that "Narrow Window."

Measuring the Bottleneck with iperf3

Don't guess. Measure. To see if you are hitting a BDP limit, you need iperf3. It is the gold standard for network testing because it allows you to manipulate the window size on the fly.

First, run a standard test:
iperf3 -c remote_server_ip

If it’s slow, try increasing the window size manually in the client:

# -w 32M sets the window size to 32 Megabytes
iperf3 -c remote_server_ip -w 32M

If the speed jumps significantly when you increase -w, you have a BDP/Buffer issue. If it stays slow, you might actually have a physical line issue, CPU bottleneck, or your ISP is indeed throttling you.

Parallel Streams: The Developer's Cheat Code

Sometimes, you can't tune the kernel. Maybe you’re running in a locked-down container or a restricted environment. In that case, the solution is Parallelism.

If one TCP stream is capped at 500Mbps because of the window size, open twenty streams. Total throughput is cumulative. This is why tools like aria2, rclone (with --transfers), and even s5cmd for S3 are so much faster than standard cp or scp.

Here is a quick example of how you might implement multi-stream logic in a Go application to bypass BDP limits:

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func downloadChunk(url string, start, end int64, wg *sync.WaitGroup) {
	defer wg.Done()

	client := &http.Client{}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	// Using Range headers to download a specific part of the file
	rangeHeader := fmt.Sprintf("bytes=%d-%d", start, end)
	req.Header.Add("Range", rangeHeader)

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer resp.Body.Close()

	// In a real app, you'd write this to a specific offset in a file
	_, _ = io.ReadAll(resp.Body)
	fmt.Printf("Downloaded chunk %d-%d\n", start, end)
}

func main() {
	var wg sync.WaitGroup
	fileSize := int64(100 * 1024 * 1024) // 100MB
	chunks := int64(4)
	chunkSize := fileSize / chunks

	url := "http://example.com/largefile"

	for i := int64(0); i < chunks; i++ {
		wg.Add(1)
		start := i * chunkSize
		end := start + chunkSize - 1
		if i == chunks-1 {
			end = fileSize - 1
		}
		go downloadChunk(url, start, end, &wg)
	}
	wg.Wait()
}

By splitting the file into four chunks and downloading them simultaneously, you are effectively quadrupling your window size without touching a single sysctl setting.

When Low Latency is Not Possible

We often talk about "low latency" as the goal for gaming or voice calls. But for high-throughput data, low latency is a *requirement* for efficiency.

If you have a 10Gbps link but your RTT is 200ms (say, Singapore to London), your BDP is a staggering 250MB. Finding a system that will reliably allocate a 250MB buffer for a single socket is rare. In these "Long Fat Networks" (LFNs), TCP starts to break down.

This is where protocols like QUIC (HTTP/3) or UDP-based transfer tools (Aspera, Signiant) come in. They don't use the standard TCP windowing logic. QUIC, for example, handles packet loss much more gracefully and doesn't suffer from "Head-of-Line blocking" in the same way. If you are building a modern stack, moving toward HTTP/3 isn't just about speed—it's about surviving the physics of long-distance networking.

The "Gotchas" of Large Buffers

Before you go and set every server to 512MB buffers, there are consequences:

1. Memory Consumption: Every TCP connection will reserve some memory. If you have 10,000 concurrent users and you've set a massive max buffer, you can OOM (Out Of Memory) your server very quickly. Large buffers are for *fat streams*, not for *many streams*.
2. Bufferbloat: If your buffers are too large at the router level, packets can sit in the queue for a long time instead of being dropped. This increases latency and makes interactive sessions (SSH) feel like you're typing through molasses.
3. CPU Overhead: Managing massive windows and reassembling large chunks of data out-of-order requires more CPU cycles.
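The first gotcha is worth quantifying. A worst-case estimate, assuming every connection actually fills its maximum buffer (real usage is normally far lower, since Linux autotunes per-socket memory, but the ceiling is what can OOM you):

```python
def worst_case_buffer_gb(connections: int, max_buffer_mb: int) -> float:
    """Worst-case memory if every socket fills its receive buffer."""
    return connections * max_buffer_mb / 1024

# 10,000 concurrent clients, each allowed the 128MB max from earlier
print(f"{worst_case_buffer_gb(10_000, 128):.0f} GB")  # 1250 GB
```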

Summary Checklist

If you find yourself with a 10Gbps pipe that feels like a 1Gbps pipe:

1. Calculate your BDP: $Bandwidth \times RTT$. Is your window size (usually net.ipv4.tcp_rmem max) larger than that number?
2. Check for Scaling: Ensure net.ipv4.tcp_window_scaling is set to 1.
3. Switch to BBR: Get away from CUBIC for long-haul transfers.
4. Go Parallel: Use tools that support multiple concurrent streams.
5. Use iperf3: Prove where the bottleneck is before changing code.

Bandwidth is just the width of the road. Latency is the speed limit. And the TCP window is the size of the truck you're allowed to drive. If you want to move 10Gbps, you don't just need a wide road; you need a massive truck or a fleet of small ones. Don't let your configuration keep your data in the slow lane.