
The Network RTT Is a Lie: Using the TCP_INFO Struct to Audit Your Kernel-Level Socket Health

High-level timers are blind to kernel-level congestion, but querying the TCP_INFO struct allows your application to audit its own network health in real-time.


Your application’s latency metrics are lying to you. Every time you wrap a curl call or a database query in a stopwatch timer, you aren't measuring network speed; you're measuring the sum of your own application's overhead, kernel scheduling, and the network’s actual performance. By the time your application code realizes a packet has been delayed, the kernel has already known about it for several milliseconds, tried to fix it, and likely failed.

If you are relying on application-layer Round Trip Time (RTT) to diagnose network health, you are looking at a filtered, distorted version of reality. To see what’s actually happening, you have to go deeper—down into the tcp_info struct living in the Linux kernel.

The Blind Spot of High-Level Timers

When we measure RTT at the application level—say, in Python or Go—we usually do something like this:

import time
import requests

start = time.perf_counter()
requests.get("https://api.example.com")
end = time.perf_counter()
print(f"Latency: {end - start}s")

This is easy to write, but it’s fundamentally flawed for fine-grained debugging. This timer includes:
1. The time it takes for your language runtime to schedule the thread.
2. The time spent in the kernel's networking stack before the first bit hits the wire.
3. The actual transit time.
4. The remote server’s processing time (often the single biggest variable).
5. The time it takes for the response to climb back up the stack to your app.

If your "latency" spikes, you have no idea if the network is congested, if the remote CPU is pegged, or if your local machine is experiencing context-switch hell. The kernel, however, keeps a meticulous ledger for every single socket. It knows exactly how many packets were retransmitted, the smoothed RTT (SRTT), and the size of the congestion window.

Entering the Kernel: The tcp_info Struct

In Linux, the kernel maintains a struct tcp_info for every TCP connection. This struct is the "Source of Truth." It's defined in /usr/include/linux/tcp.h, and it contains fields that most developers never see but that SREs at places like Cloudflare or Google live by.

Here is what the simplified version looks like:

struct tcp_info {
    __u8    tcpi_state;
    __u8    tcpi_ca_state;
    __u32   tcpi_rto;             /* Retransmission timeout */
    __u32   tcpi_rtt;             /* Smoothed RTT in microseconds */
    __u32   tcpi_rttvar;          /* RTT variance */
    __u32   tcpi_snd_cwnd;        /* Sending congestion window */
    __u32   tcpi_total_retrans;   /* Total retransmits for the lifetime of the socket */
    // ... many more fields
};

By querying this struct directly from your application, you can differentiate between "the network is slow" and "the remote server is taking a long time to think."

How to Extract the Truth (C Implementation)

To get this data, you use the getsockopt system call with the TCP_INFO flag. Here is a practical example in C that opens a connection and then audits its own health.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>

void print_socket_stats(int sockfd) {
    struct tcp_info info;
    socklen_t len = sizeof(info); /* value-result: the kernel updates it to the bytes actually written */

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len) == -1) {
        perror("getsockopt");
        return;
    }

    // tcpi_rtt is in microseconds
    printf("\n--- Kernel-Level Socket Health ---\n");
    printf("Smoothed RTT: %u us\n", info.tcpi_rtt);
    printf("RTT Variance: %u us\n", info.tcpi_rttvar);
    printf("Total Retransmits: %u\n", info.tcpi_total_retrans);
    printf("Congestion Window: %u\n", info.tcpi_snd_cwnd);
    printf("Unacked Packets: %u\n", info.tcpi_unacked);
    printf("Lost Packets: %u\n", info.tcpi_lost);
}

int main() {
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in serv_addr;

    if (sockfd < 0) {
        perror("socket");
        return -1;
    }

    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(80);
    inet_pton(AF_INET, "1.1.1.1", &serv_addr.sin_addr);

    if (connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        perror("Connection Failed");
        return -1;
    }

    // Send some data to populate the stats
    char *msg = "GET / HTTP/1.1\r\nHost: 1.1.1.1\r\n\r\n";
    send(sockfd, msg, strlen(msg), 0);

    // Give the kernel a moment to receive an ACK and update RTT
    usleep(100000); 

    print_socket_stats(sockfd);

    close(sockfd);
    return 0;
}

Why This Matters

In the code above, info.tcpi_rtt is the kernel's smoothed measurement of the time between sending a segment and receiving its acknowledgment from the remote TCP stack. This value is stripped of application-layer scheduling noise. If your tcpi_total_retrans starts incrementing while your tcpi_rtt remains stable, you know you have packet loss (likely a bad switch or cable), but the path itself isn't necessarily congested.

Implementing in Go: Real-World Usage

Most of us aren't writing raw C for our web services. Fortunately, Go makes it relatively easy to drop down into the syscall layer to grab this info. If you're running a high-throughput microservice, you can use a "middleware" style approach to log socket health when a request takes longer than expected (a sketch of that pattern follows the example below).

package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
	"unsafe"
)

// We need to mirror the kernel's struct for the fields we care about.
// Note: This is Linux specific!
type TCPInfo struct {
	State       uint8
	CAState     uint8
	Retransmits uint8
	Probes      uint8
	Backoff     uint8
	Options     uint8
	_           [2]byte // covers the kernel's wscale bitfields + flags/padding byte

	Rto    uint32
	Ato    uint32
	SndMss uint32
	RcvMss uint32

	Unacked uint32
	Sacked  uint32
	Lost    uint32
	Retrans uint32
	Fackets uint32

	/* Times are in msecs */
	LastDataSent uint32
	LastAckSent  uint32
	LastDataRecv uint32
	LastAckRecv  uint32

	/* Metrics */
	Pmtu        uint32
	RcvSsthresh uint32
	Rtt         uint32
	Rttvar      uint32
	SndSsthresh uint32
	SndCwnd     uint32
	Advmss      uint32
	Reordering  uint32
}

func getTCPInfo(conn *net.TCPConn) (*TCPInfo, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return nil, err
	}

	var info TCPInfo
	var innerErr error
	
	err = raw.Control(func(fd uintptr) {
		// infoLen is value-result: the kernel rewrites it with the byte
		// count it actually filled in (older kernels return less).
		infoLen := uint32(unsafe.Sizeof(info))
		_, _, errno := syscall.Syscall6(
			syscall.SYS_GETSOCKOPT,
			fd,
			syscall.IPPROTO_TCP,
			syscall.TCP_INFO,
			uintptr(unsafe.Pointer(&info)),
			uintptr(unsafe.Pointer(&infoLen)),
			0,
		)
		if errno != 0 {
			innerErr = error(errno)
		}
	})

	if err != nil {
		return nil, err
	}
	if innerErr != nil {
		return nil, innerErr
	}

	return &info, nil
}

func main() {
	addr, err := net.ResolveTCPAddr("tcp", "google.com:80")
	if err != nil {
		fmt.Fprintf(os.Stderr, "resolve: %v\n", err)
		return
	}
	conn, err := net.DialTCP("tcp", nil, addr)
	if err != nil {
		fmt.Fprintf(os.Stderr, "dial: %v\n", err)
		return
	}
	defer conn.Close()

	conn.Write([]byte("GET / HTTP/1.0\r\nHost: google.com\r\n\r\n"))

	info, err := getTCPInfo(conn)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error: %v\n", err)
		return
	}

	// tcpi_rtt is reported in microseconds.
	fmt.Printf("Kernel RTT: %.1f ms\n", float64(info.Rtt)/1000.0)
	fmt.Printf("Congestion Window: %d segments\n", info.SndCwnd)
}
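
To make good on the middleware idea, you need a way to reach the underlying *net.TCPConn from inside an HTTP handler. Here is a minimal sketch of one approach, reusing the getTCPInfo helper above: stash each accepted connection in the request context via http.Server's ConnContext hook, then query it when a request runs long. It assumes the same package as the code above with context, log, net/http, and time added to the imports; the 50ms threshold and the connKey type are invented for illustration.

type connKey struct{}

func slowRequestAuditor(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		elapsed := time.Since(start)
		if elapsed < 50*time.Millisecond {
			return // fast enough; nothing to audit
		}
		// Recover the raw connection stored by ConnContext below.
		if tcpConn, ok := r.Context().Value(connKey{}).(*net.TCPConn); ok {
			if info, err := getTCPInfo(tcpConn); err == nil {
				log.Printf("slow request %s: app=%v kernel_rtt=%dus retrans=%d cwnd=%d",
					r.URL.Path, elapsed, info.Rtt, info.Retrans, info.SndCwnd)
			}
		}
	})
}

func runServer() error {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: slowRequestAuditor(http.DefaultServeMux),
		// Stash each accepted connection so handlers can find it later.
		ConnContext: func(ctx context.Context, c net.Conn) context.Context {
			return context.WithValue(ctx, connKey{}, c)
		},
	}
	return srv.ListenAndServe()
}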

The "Silent Killer": tcpi_retrans and tcpi_lost

In a standard application, you usually don't know a packet was lost until the entire request times out or takes significantly longer. But the kernel knows immediately when it has to retransmit.

If you are seeing 500ms response times from a database, check your tcpi_retrans. If it’s high, it doesn't matter how much you optimize your SQL queries; your network layer is flapping.
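
Here is a minimal sketch of that check, again reusing the TCPInfo mirror and getTCPInfo helper from above (plus log and time imports). It polls a connection once a second and flags the telltale signature: retransmissions or losses appearing while the smoothed RTT stays flat. The poll interval and the 20% RTT band are arbitrary illustrative values.

func watchSocket(conn *net.TCPConn, stop <-chan struct{}) {
	var baselineRtt uint32
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			info, err := getTCPInfo(conn)
			if err != nil {
				return
			}
			if baselineRtt == 0 {
				baselineRtt = info.Rtt // first reading becomes the baseline
			}
			rttStable := info.Rtt < baselineRtt+baselineRtt/5 // within ~20%
			if (info.Retrans > 0 || info.Lost > 0) && rttStable {
				// Retransmits with a flat RTT point at loss
				// (bad cable/switch), not path congestion.
				log.Printf("loss suspected: retrans=%d lost=%d rtt=%dus",
					info.Retrans, info.Lost, info.Rtt)
			}
		}
	}
}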

One of the most valuable use cases for TCP_INFO is Dynamic Load Balancing. Imagine a load balancer that doesn't just look at the number of active connections, but actually queries the kernel to see which upstream backend has the lowest tcpi_rtt and the fewest retransmissions. You can route around "gray failures"—nodes that are still alive but are experiencing NIC issues or are connected to a flaky Top-of-Rack switch.
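
A toy version of that selection logic might look like the following, again leaning on getTCPInfo from above (with math imported). The Backend type and the score weighting (one in-flight retransmit "costs" 10ms of RTT) are invented for illustration.

type Backend struct {
	Name string
	Conn *net.TCPConn // a live connection to the upstream
}

// pickBackend returns the upstream whose kernel-level stats look
// healthiest: low smoothed RTT, heavily penalized for retransmissions.
func pickBackend(backends []*Backend) *Backend {
	var best *Backend
	bestScore := uint64(math.MaxUint64)

	for _, b := range backends {
		info, err := getTCPInfo(b.Conn)
		if err != nil {
			continue // can't audit it, don't route to it
		}
		score := uint64(info.Rtt) + uint64(info.Retrans)*10_000
		if score < bestScore {
			bestScore = score
			best = b
		}
	}
	return best
}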

Deciphering the Congestion Window (tcpi_snd_cwnd)

If you've ever wondered why a 10Gbps link is only giving you 100Mbps of throughput, the answer is usually found in tcpi_snd_cwnd.

TCP uses a "Congestion Window" to determine how many unacknowledged segments can be in flight; on Linux, tcpi_snd_cwnd is reported in MSS-sized segments, not bytes. If the kernel detects loss, it slashes this window to prevent further congestion. By monitoring tcpi_snd_cwnd, your application can detect that the network path is "narrowing" before the actual throughput drops significantly.

For example, if you are building a video streaming service, you could use TCP_INFO to proactively downsample the bitstream if you see the tcpi_snd_cwnd shrinking, rather than waiting for the player's buffer to empty.
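
As a rough sketch of that idea (the chooseBitrate helper and the 70% safety factor are invented, and it reuses getTCPInfo from above), you could derive a bandwidth estimate from the window and step the encoder down when the path can no longer sustain the current bitrate:

// chooseBitrate estimates usable path bandwidth from the congestion
// window and returns a (possibly lower) target bitrate in bits/sec.
func chooseBitrate(conn *net.TCPConn, current int) int {
	info, err := getTCPInfo(conn)
	if err != nil || info.Rtt == 0 {
		return current
	}
	// cwnd (segments) * MSS (bytes) is roughly one RTT's worth of data,
	// so bytes/sec ≈ cwnd * MSS * 1e6 / rtt_us.
	bytesPerSec := uint64(info.SndCwnd) * uint64(info.SndMss) * 1_000_000 / uint64(info.Rtt)
	bitsPerSec := int(bytesPerSec * 8)

	// Only trust ~70% of the estimate; step down before buffers drain.
	if target := bitsPerSec * 7 / 10; target < current {
		return target
	}
	return current
}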

The Catch: It's Not Universal

Before you go and rewrite your entire monitoring stack, there are a few "gotchas" that I've learned the hard way:

1. Platform Locked: TCP_INFO is a Linux-ism. While BSD (and macOS) have similar concepts (like TCP_CONNECTION_INFO), the struct members and units of measurement are different. If you’re writing cross-platform code, you’ll need a lot of #ifdefs or build tags (see the sketch after this list).
2. Snapshot in Time: getsockopt provides a snapshot. If a burst of retransmissions happens and then stops, you might miss it if you don't poll frequently enough. However, fields like tcpi_total_retrans are cumulative, which helps mitigate this.
3. Kernel Versions: Over the years, the tcp_info struct has grown. If your code is compiled against a modern kernel header but runs on an ancient 3.x kernel, you might get truncated data or errors. Always check the size of the returned struct.
4. The "Smoothed" RTT: tcpi_rtt is an exponentially weighted moving average, so a single massive spike can be averaged out. If you need to see every individual jitter peak, you need to look at eBPF (but that's a whole other blog post).
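
For the cross-platform point above, the usual Go answer is to quarantine the Linux-only code behind build tags and give every other platform an honest fallback. A skeleton of that layout (the sockhealth package name and kernelRTT helper are hypothetical, and the Linux file assumes getTCPInfo from earlier):

// tcpinfo_linux.go
//go:build linux

package sockhealth

import (
	"net"
	"time"
)

// kernelRTT wraps the Linux-only getTCPInfo helper.
func kernelRTT(conn *net.TCPConn) (time.Duration, error) {
	info, err := getTCPInfo(conn)
	if err != nil {
		return 0, err
	}
	return time.Duration(info.Rtt) * time.Microsecond, nil
}

// tcpinfo_other.go
//go:build !linux

package sockhealth

import (
	"errors"
	"net"
	"time"
)

// Fallback so callers compile everywhere; a macOS implementation would
// instead query TCP_CONNECTION_INFO with its own struct layout.
func kernelRTT(conn *net.TCPConn) (time.Duration, error) {
	return 0, errors.New("kernel socket stats not implemented on this platform")
}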

Moving Beyond Blind Monitoring

We spend so much time instrumenting our code with OpenTelemetry and Prometheus, yet we treat the network as a black box that just works until it doesn't.

Querying TCP_INFO isn't something you need for every simple CRUD app. But if you're building high-performance systems, real-time data pipelines, or distributed databases, stop guessing. Stop using Stopwatch.Start(). Ask the kernel; it's already done the math for you.

When the network RTT is a lie, the tcp_info struct is the only way to find the truth.