loke.dev

What Nobody Tells You About TCP Slow Start Restart: Why Your 'Warm' Connections Are Still Latency-Prone

Your open sockets are lying to you; unless you disable the kernel's idle reset, your most critical request bursts are being throttled by a congestion window that 'forgot' how fast your network actually is.


I spent three nights in 2019 staring at a Grafana dashboard, trying to figure out why our internal microservices were reporting 200ms p99 latency for a 2KB payload over a "warm" persistent connection. The connection was established. The handshakes were long gone. The CPU was idling. Yet, every time a service sat quiet for more than a few seconds, the very next request behaved as if it were traversing a dial-up modem. It turned out the kernel was essentially "forgetting" the speed of the wire because it was trying to be polite.

That politeness is a feature called TCP Slow Start Restart (SSR).

If you’re building high-performance systems, you’ve likely been told that persistent connections (keep-alive) are the holy grail of low latency. You pay the three-way handshake tax once, and then you're in the clear. But that's a half-truth. Your open sockets are lying to you; unless you tweak how the kernel handles idle periods, your most critical request bursts are being throttled by a congestion window that has reverted to its infancy.

The Congestion Window (CWND) Amnesia

To understand why your warm connections are slow, we have to look at how TCP thinks. TCP doesn't know how much bandwidth is available between Point A and Point B. It has to probe for it. This is Slow Start.

When a connection begins, the kernel sets a cwnd (Congestion Window) — usually around 10 segments (MSS). It sends 10 packets, waits for ACKs, and then doubles the window. This exponential growth continues until it hits the ssthresh (slow start threshold) or starts losing packets.
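The doubling is easy to see on paper. Here is a back-of-the-envelope sketch (the function name and the 10-segment starting window are illustrative, matching the modern Linux default):

```python
def slow_start_growth(initial_cwnd=10, rounds=5):
    """Trace cwnd growth per RTT during slow start (no loss, no ssthresh)."""
    cwnd = initial_cwnd
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        cwnd *= 2  # a full window of ACKs doubles the window
    return trace

print(slow_start_growth())  # [10, 20, 40, 80, 160]
```

Five round trips take you from 10 segments to 160 — which is exactly the ramp you lose every time the window is reset.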

Here is the kicker: TCP assumes the network is dynamic. If a connection goes idle for a period of time (defined by the retransmission timeout, or RTO), the kernel worries that the network conditions might have changed. Maybe a router along the path got congested? Maybe the link state changed?

To be "safe," the Linux kernel follows RFC 5681 (Section 4.1, a behavior that goes back to RFC 2861) and triggers a Slow Start Restart: for each RTO that elapses while the connection sits idle, it halves your cwnd, decaying it back toward the initial value (often 10). Your massive 100Gbps backbone pipe doesn't matter; the kernel is going to force that next burst of data through a straw until it "re-proves" the path is clear.

Watching the Reset in Real-Time

You can see this happening on your own machine. We don’t need complex probes; the ss (socket statistics) tool in Linux is enough.

Let’s look at a persistent connection. First, check your current system default for SSR:

sysctl net.ipv4.tcp_slow_start_after_idle

If it returns 1, you are currently being throttled on idle connections.
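If you would rather check this from code than from a shell, the same toggle is exposed under /proc (a minimal sketch; the path parameter exists only so the function can be pointed at a test file):

```python
def ssr_enabled(path="/proc/sys/net/ipv4/tcp_slow_start_after_idle"):
    """Return True if the kernel will shrink cwnd on idle connections."""
    with open(path) as f:
        return f.read().strip() == "1"
```

On most distributions this returns True out of the box.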

Now, let's watch a socket's internals. Open a terminal and run a command to monitor a specific connection to a remote server (like a database or an API):

# Watch the congestion window (cwnd) for a specific destination
watch -n 0.1 "ss -tin 'dst 192.168.1.50'"

In the output, look for cwnd:. You’ll see it jump up to 50, 100, or higher during a large data transfer. Now, stop the transfer and wait. After a few seconds of silence, you’ll see cwnd snap back down to 10 (or whatever your initcwnd is).

That "snap" is the sound of your tail latency increasing.
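If you want to log that snap instead of eyeballing it, the cwnd field can be scraped out of ss -ti output with a regex (a sketch; parse_cwnd and the sample line are illustrative):

```python
import re

def parse_cwnd(ss_line):
    """Extract the cwnd value from a line of `ss -ti` output, or None."""
    m = re.search(r"\bcwnd:(\d+)", ss_line)
    return int(m.group(1)) if m else None

sample = "cubic wscale:7,7 rto:204 rtt:0.5/0.25 mss:1448 cwnd:10 bytes_acked:1"
print(parse_cwnd(sample))  # 10
```

Run ss in a loop, feed each line through this, and you can graph the window collapsing in your own metrics pipeline.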

Why "Keep-Alive" Isn't Enough

Many developers think that sending an occasional TCP Keep-alive packet (the built-in kernel feature) prevents this. It doesn't.

TCP Keep-alives are tiny probes sent when the connection is idle to ensure the other end is still there. They do not carry data, and more importantly, they do not satisfy the kernel's requirement for "activity" to maintain the congestion window. As far as the congestion control algorithm is concerned, if you aren't sending *payload*, the network path is unverified.

If your application sends a heartbeat every 30 seconds, but the connection's RTO is 200ms, the kernel starts decaying your cwnd just 200ms into each silent gap — by the time the next heartbeat fires, the window has long since collapsed back to its initial value.
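The window only survives an idle gap if payload flows again before the RTO fires, which is easy to sanity-check (a sketch; stays_warm and the numbers are illustrative):

```python
def stays_warm(heartbeat_interval_ms, rto_ms):
    """A cwnd only survives if payload crosses the wire more often than the RTO."""
    return heartbeat_interval_ms < rto_ms

print(stays_warm(30_000, 200))  # False: a 30s heartbeat can't outrun a 200ms RTO
print(stays_warm(100, 200))     # True: sub-RTO traffic keeps the window open
```

That is why the fix is either sub-RTO traffic or turning the reset off, never a "reasonable" 30-second heartbeat.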

The Performance Cost of "Safe" Defaults

Imagine an API call that returns a 100KB JSON blob. With an MSS (Maximum Segment Size) of 1460 bytes, that's roughly 70 packets.

1. Scenario A (Warm CWND): Your cwnd is at 100. You send all 70 packets in one burst. Total time: ~1 RTT.
2. Scenario B (SSR Reset): Your cwnd reset to 10.
* Send 10 packets. (Wait for ACK)
* Send 20 packets. (Wait for ACK)
* Send 40 packets. (Wait for ACK)
* Total time: 3 RTTs.

On a cross-region connection with 60ms latency, SSR just added 120ms of pure, unadulterated overhead to a connection you thought was "warm." This is exactly why "first request" syndrome exists in microservices.
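The two scenarios generalize: count how many send rounds slow start needs to push N segments, and multiply the extra rounds by the RTT (a sketch; the function names are mine, the numbers mirror the example above):

```python
def rounds_to_send(segments, cwnd):
    """Send rounds needed to push `segments` packets with exponential growth."""
    rounds = 0
    while segments > 0:
        segments -= cwnd
        cwnd *= 2  # window doubles after each fully ACKed round
        rounds += 1
    return rounds

def ssr_penalty_ms(segments, warm_cwnd, cold_cwnd, rtt_ms):
    """Extra latency a reset window adds versus a warm one."""
    return (rounds_to_send(segments, cold_cwnd)
            - rounds_to_send(segments, warm_cwnd)) * rtt_ms

print(rounds_to_send(70, 10))           # 3  (bursts of 10, 20, 40)
print(ssr_penalty_ms(70, 100, 10, 60))  # 120
```

Plug in your own payload size and RTT and you have a quick estimate of what SSR is costing each "first request."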

Fix #1: Killing SSR at the Kernel Level

The most direct way to fix this is to tell the Linux kernel to stop being so paranoid. If you trust your network path (e.g., inside a data center or VPC), you can disable Slow Start Restart globally.

# Disable SSR immediately
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# To make it permanent, add this to /etc/sysctl.conf
echo "net.ipv4.tcp_slow_start_after_idle = 0" | sudo tee -a /etc/sysctl.conf

Should you always do this? No. If your traffic is going over the public internet to mobile devices on flaky 4G/5G connections, SSR is actually your friend. It prevents you from blasting a massive burst of packets into a congested cell tower that can no longer handle the rate you achieved five minutes ago.

But for service-to-service communication, SSR is almost always a net negative.

Fix #2: The BPF Approach (Fine-Grained Tracing)

If you aren't sure if SSR is hitting you, you can use bpftrace to verify. This is much more accurate than ss because it captures the exact moment the reset happens.

Here is a short script to trace when the kernel shrinks the congestion window due to idleness:

# Requires bpftrace; struct layouts come from kernel BTF on modern systems.
# Note there is no kernel function named after the sysctl — the actual
# shrink happens in tcp_cwnd_restart().
kprobe:tcp_cwnd_restart
{
    $sk = (struct sock *)arg0;
    printf("SSR triggered for %s:%d\n",
           ntop($sk->__sk_common.skc_daddr),
           bswap($sk->__sk_common.skc_dport));
}

If you run this and see your internal service IPs popping up, you’ve found your latency ghost.

Fix #3: Application-Level Padding (The "Dirty" Hack)

Sometimes you don't have root access to the underlying host (looking at you, serverless and some managed Kubernetes flavors). In those cases, you have to keep the cwnd open manually.

This involves sending "dummy" data. Not a TCP Keep-alive, but actual bytes at the application layer. In HTTP/2 or HTTP/3, you can use PING frames (which do count as activity in some implementations) or simply send a small, meaningless header update.

However, a more robust (if slightly "heavy") way is to ensure your heartbeat or health check happens more frequently than the RTO.

import time
import requests

# A naive example of keeping a connection truly 'hot'
session = requests.Session()

def keep_warm(url):
    while True:
        # Send a small HEAD request every 500ms so real payload bytes
        # cross the wire before the kernel's idle timer (the RTO) expires
        try:
            session.head(url, timeout=1)
        except requests.RequestException:
            pass
        time.sleep(0.5)

Is this efficient? Not really. It wastes cycles and bandwidth. But if your p99 requirements are sub-10ms and you can't touch the sysctl, this is the price of admission.

The Interaction with BDP (Bandwidth Delay Product)

The reason SSR is so devastating is related to the Bandwidth Delay Product. BDP is Bandwidth * Round Trip Time. It represents how much data can be "in flight" on the wire.

On a 10Gbps link with 10ms RTT, your BDP is about 12.5 Megabytes. To saturate this link, your cwnd needs to grow to roughly 8,500 segments.

If you let SSR reset you to a cwnd of 10, you are utilizing 0.1% of your available bandwidth capacity on that first burst. You have to wait for multiple round trips of exponential growth to get back to full utilization. For a small API response, the transfer is over before you even hit 10% utilization. You are essentially paying for a Ferrari but only being allowed to drive it in 1st gear because you stopped at a red light for a few seconds.
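The numbers above fall straight out of the definition (a sketch; the helper names are mine, and 10 Gbps / 10ms / 1460-byte MSS are the example values from the text):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-Delay Product: bytes that fit 'in flight' on the wire."""
    return bandwidth_bps * rtt_s / 8

def segments_to_fill(bandwidth_bps, rtt_s, mss=1460):
    """cwnd (in MSS-sized segments) needed to saturate the link."""
    return round(bdp_bytes(bandwidth_bps, rtt_s) / mss)

print(bdp_bytes(10e9, 0.010) / 1e6)  # 12.5 (MB in flight)
print(segments_to_fill(10e9, 0.010)) # ~8,500 segments
```

A post-reset cwnd of 10 against a target of ~8,500 is the "0.1% utilization" figure in concrete terms.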

Beyond SSR: The initcwnd Problem

While we are talking about the kernel being conservative, we should mention initcwnd. Even before a connection goes idle, it starts small. On older Linux kernels, the default initcwnd worked out to roughly 3 segments (per RFC 3390). Modern kernels (post-2011) use 10, following RFC 6928.

If you are dealing with large initial bursts, even 10 is too small. You can change this per-route without a global sysctl. This is particularly useful for edge servers sending large assets.

# Check current settings
ip route show

# Increase initcwnd and initrwnd for the default gateway
sudo ip route change default via <gateway_ip> dev eth0 initcwnd 32 initrwnd 32

By bumping initcwnd to 32, you can send ~46KB in the first RTT instead of ~14KB. Combined with tcp_slow_start_after_idle = 0, your connections become truly "instant" upon reuse.
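You can sanity-check those figures directly from the MSS (a sketch; first_rtt_bytes is mine, 1460 bytes is the MSS assumed throughout this post):

```python
def first_rtt_bytes(initcwnd, mss=1460):
    """Payload bytes that fit in the very first round trip."""
    return initcwnd * mss

print(first_rtt_bytes(10))  # 14600 (~14KB)
print(first_rtt_bytes(32))  # 46720 (~46KB)
```

For any response that fits inside initcwnd * MSS, the whole transfer completes in a single RTT.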

The Cloud Load Balancer Trap

Here is something nobody tells you: Your cloud provider’s Load Balancer might be doing SSR even if your servers aren't.

If you have an AWS ALB or an NGINX ingress controller between your services, that middlebox maintains two separate TCP connections:
1. Client -> Load Balancer
2. Load Balancer -> Backend

Even if you tune your Backend's kernel, the Load Balancer might still have SSR enabled. If the LB -> Backend connection sits idle, the LB will throttle the burst to the backend. This is why many high-performance architectures favor "Sidecar" proxies (like Envoy in a Service Mesh) where you have more granular control over the TCP stack settings, or even bypass standard L7 load balancing for direct L4 routing where possible.

Summary Checklist for Latency-Sensitive Systems

If you are fighting "spiky" latency on warm connections, follow this workflow:

1. Audit the sysctl: Run sysctl net.ipv4.tcp_slow_start_after_idle. If it’s 1, that’s your prime suspect.
2. Verify with `ss`: Watch the cwnd value of your active sockets during idle periods. If it drops to 10 regularly, SSR is active.
3. Trace the reset: Use bpftrace to confirm the kernel is explicitly calling tcp_slow_start_after_idle.
4. Tuning:
* Set net.ipv4.tcp_slow_start_after_idle = 0 on internal service hosts.
* Increase initcwnd to 30-50 for high-bandwidth internal routes.
5. Application Layer: If you can't tune the kernel, increase heartbeat frequency or use dummy traffic to prevent the RTO timer from triggering the reset.

TCP is a 50-year-old protocol designed for a world where wires were unreliable and bandwidth was a luxury. The "Slow Start Restart" policy is a relic of that era—a safety mechanism that, in the modern data center, usually does more harm than good. Don't let your kernel's cautiousness sabotage your system's performance. Turn off the amnesia and let your warm connections stay hot.