loke.dev

A Surgical Look at the TCP Socket

Exposing the hidden TCP_INFO metrics that reveal exactly why your high-performance network stack is hesitating.

· 8 min read

When you call write() on a Linux socket, the data doesn't just "go to the network." It enters a complex, state-managed purgatory known as the socket send buffer. From there, the Linux TCP stack decides—based on a dizzying array of variables like congestion windows, retransmission timers, and path MTU—exactly when and how that data is serialized into packets. Most developers treat this as a black box, but the Linux kernel actually provides a surgical window into this process through the TCP_INFO socket option.

If you are hunting for a "micro-stutter" in a distributed system or wondering why your high-bandwidth link is plateauing at 20% utilization, you don't need a packet sniffer. You need to ask the socket itself what it's thinking.

The Anatomy of struct tcp_info

The primary way to extract internal TCP state is via getsockopt(..., IPPROTO_TCP, TCP_INFO, ...). This returns a struct tcp_info, defined in linux/tcp.h. It is a dense collection of counters and gauges that represent the kernel's current "best guess" about the health of the connection.

Here is what the basic data retrieval looks like in C:

#include <stdio.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <netinet/in.h>

void print_socket_stats(int fd) {
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
        printf("RTT: %u us\n", info.tcpi_rtt);
        printf("RTT Var: %u us\n", info.tcpi_rttvar);
        printf("CWND: %u\n", info.tcpi_snd_cwnd);
        printf("Retransmits: %u\n", info.tcpi_retransmits);
        printf("Total Retrans: %u\n", info.tcpi_total_retrans);
    }
}

This struct isn't static; it's a living snapshot. Each field tells a story about a specific layer of the TCP state machine. To understand why your application is hesitating, you need to look at three specific pillars: Latency (RTT), Throughput (CWND), and Loss (Retransmissions).

RTT: The Pulse of the Connection

The tcpi_rtt field is the "Smoothed Round-Trip Time" (SRTT). It’s not a single measurement; it’s an exponentially weighted moving average of the time between sending a segment and receiving an ACK.

If tcpi_rtt is significantly higher than the baseline ping between your two servers, your packets are likely sitting in a queue somewhere—either in a switch, a router, or the NIC’s ring buffer. This is Bufferbloat.

However, RTT alone isn't enough. You must also look at tcpi_rttvar (RTT Variance).
- Low RTT, Low RTTVar: A clean, stable path.
- Low RTT, High RTTVar: Jitter. This usually indicates cross-traffic on a shared link or CPU scheduling delays on the receiver side.
- High RTT, Low RTTVar: A long physical path or a persistently full queue that isn't fluctuating.
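These quadrants are easy to encode as a first-pass triage function. This is an illustrative sketch, not kernel logic: the thresholds (10x the baseline for "high RTT", variance above half the RTT for "high jitter") are assumptions you would tune for your own environment.

```python
def classify_path(rtt_us, rttvar_us, baseline_us=200):
    """Rough triage of a path from tcpi_rtt / tcpi_rttvar (microseconds).
    Thresholds are illustrative, not derived from the kernel."""
    high_rtt = rtt_us > 10 * baseline_us
    high_var = rttvar_us > rtt_us / 2
    if not high_rtt and not high_var:
        return "clean, stable path"
    if not high_rtt and high_var:
        return "jitter: cross-traffic or receiver scheduling delays"
    if high_rtt and not high_var:
        return "long path or persistently full queue (bufferbloat)"
    return "congested and unstable"
```

Feed it the two fields from TCP_INFO and you get a human-readable first guess before you ever open a packet capture.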

I’ve seen cases where a Go service was experiencing 200ms "network" latency to a database in the same rack. The tcpi_rtt was 100 microseconds, but the application-level latency was 200ms. This immediately proved the network was innocent; the delay was the application's garbage collector pausing the world before the data could be read from the socket.

The Congestion Window (CWND): The Gas Pedal

If you want to know why a 10Gbps link is only moving data at 100Mbps, tcpi_snd_cwnd is your first stop.

The tcpi_snd_cwnd value is measured in segments (usually 1448 or 1460 bytes). It represents the maximum amount of data the kernel will allow in flight before receiving an ACK.

Throughput is roughly calculated as:
Throughput = (CWND * MSS) / RTT
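Plugging real numbers into that formula makes the problem concrete. A minimal helper (names are mine, not from tcp_info):

```python
def estimate_throughput_bps(cwnd_segments, mss_bytes, rtt_us):
    # Throughput ~= (CWND * MSS) / RTT, converted to bits per second.
    rtt_s = rtt_us / 1_000_000
    return cwnd_segments * mss_bytes * 8 / rtt_s

# A CWND stuck at 10 segments with MSS 1448 over a 10 ms RTT:
# 10 * 1448 * 8 / 0.01 ~= 11.6 Mbit/s -- nowhere near 10 Gbps.
```

That's the whole story of a "slow" fat pipe: the ceiling isn't the link, it's the window divided by the round trip.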

If your CWND is stuck at a low value, the kernel's Congestion Control algorithm (like CUBIC or BBR) thinks the network is congested. You can check tcpi_ca_state to see what the kernel is currently doing:

| State | Meaning |
| :--- | :--- |
| TCP_CA_Open | Normal state. No congestion detected. |
| TCP_CA_Disorder | Received duplicate ACKs. Might be reordering. |
| TCP_CA_Recovery | Loss detected. Currently retransmitting. |
| TCP_CA_Loss | Timeout occurred. We've gone back to Slow Start. |

If you see your socket flipping between Open and Loss, you have a "noisy" link. If it's stuck in Recovery, you're likely dropping packets at a rate faster than TCP can recover.
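Since tcpi_ca_state comes back as a raw integer, a small lookup table makes logs readable. The numeric values below match the kernel's enum tcp_ca_state in linux/tcp.h (note TCP_CA_CWR, which the table above omits, sits between Disorder and Recovery):

```python
# Values from the kernel's enum tcp_ca_state in <linux/tcp.h>.
CA_STATES = {
    0: "TCP_CA_Open",      # normal, no congestion detected
    1: "TCP_CA_Disorder",  # duplicate ACKs seen, possible reordering
    2: "TCP_CA_CWR",       # congestion window reduced (e.g. ECN signal)
    3: "TCP_CA_Recovery",  # loss detected, fast retransmit in progress
    4: "TCP_CA_Loss",      # RTO fired, back to slow start
}

def ca_state_name(tcpi_ca_state):
    return CA_STATES.get(tcpi_ca_state, "unknown")
```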

Measuring Loss Without Packet Captures

Packet loss is the ultimate performance killer. In high-speed networks, even a 0.1% loss rate can cause throughput to collapse.

The tcp_info struct provides two critical counters:
1. tcpi_retrans: Number of segments currently being retransmitted.
2. tcpi_total_retrans: A cumulative counter of all retransmissions over the life of the socket.

Tracking tcpi_total_retrans over time is far more effective than running tcpdump. If you sample this value every second, you can identify exactly when loss events occur.

# A conceptual Python snippet using the 'socket' and 'struct' modules
import socket
import struct

# struct tcp_info only grows over time; new fields are appended,
# so fixed offsets for the early fields are stable across kernels
def get_tcp_info(sock):
    # TCP_INFO is opt 11 on Linux
    opt = sock.getsockopt(socket.IPPROTO_TCP, 11, 192) # 192 bytes is usually enough
    # tcpi_rtt sits at byte offset 68 and tcpi_total_retrans at 100
    # on x86-64 kernels; verify against your own <linux/tcp.h>
    rtt = struct.unpack_from('I', opt, 68)[0]
    total_retrans = struct.unpack_from('I', opt, 100)[0]
    return rtt, total_retrans
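The per-second sampling idea can be sketched as a small watcher. This is a hypothetical monitor, not production code: it assumes opt 11 for TCP_INFO and byte offset 100 for tcpi_total_retrans on x86-64 (verify against your own linux/tcp.h):

```python
import socket
import struct
import time

def total_retrans_from(buf):
    # tcpi_total_retrans at byte offset 100 on x86-64 kernels --
    # check the offset against your target environment.
    return struct.unpack_from('I', buf, 100)[0]

def watch(sock, interval=1.0):
    # Print a line whenever the cumulative retransmit counter jumps,
    # pinpointing the moment a loss event occurred.
    prev = total_retrans_from(sock.getsockopt(socket.IPPROTO_TCP, 11, 192))
    while True:
        time.sleep(interval)
        cur = total_retrans_from(sock.getsockopt(socket.IPPROTO_TCP, 11, 192))
        if cur > prev:
            print(f"loss event: {cur - prev} retransmits in the last {interval}s")
        prev = cur
```

Pointed at a long-lived connection, a log like this correlates loss events with application-level latency spikes far more cheaply than a rolling packet capture.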

Moving Beyond C: Extracting Metrics in Go

While C is the native language of the kernel, most of us are building systems in Go, Rust, or Python. Go’s syscall package makes it relatively easy to dip into these metrics without writing CGO.

Here’s a practical example of how to pull tcp_info from a standard net.TCPConn in Go:

package main

import (
	"fmt"
	"net"
	"syscall"
	"unsafe"
)

// Simplified struct for Linux tcp_info
type TCPInfo struct {
	State       uint8
	_           uint8
	_           uint8
	_           uint8
	_           uint8
	_           uint8
	_           uint8
	_           uint8
	Rto         uint32
	Ato         uint32
	Snd_mss     uint32
	Rcv_mss     uint32
	Unacked     uint32
	Sacked      uint32
	Lost        uint32
	Retrans     uint32
	Fackets     uint32
	Last_data_sent uint32
	Last_ack_sent  uint32
	Last_data_recv uint32
	Last_ack_recv  uint32
	Pmtu        uint32
	Rcv_ssthresh uint32
	Rtt         uint32
	Rttvar      uint32
	Snd_ssthresh uint32
	Snd_cwnd    uint32
	Advmss      uint32
	Reordering  uint32
}

func getTCPInfo(conn *net.TCPConn) (*TCPInfo, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return nil, err
	}

	var info TCPInfo
	size := uint32(unsafe.Sizeof(info)) // getsockopt expects a socklen_t (uint32), not a uintptr
	var errno syscall.Errno

	err = raw.Control(func(fd uintptr) {
		_, _, errno = syscall.Syscall6(syscall.SYS_GETSOCKOPT, fd, syscall.IPPROTO_TCP, syscall.TCP_INFO, uintptr(unsafe.Pointer(&info)), uintptr(unsafe.Pointer(&size)), 0)
	})

	if err != nil {
		return nil, err
	}
	if errno != 0 {
		return nil, errno
	}

	return &info, nil
}

func main() {
	// Example usage (error handling elided for brevity)
	addr, _ := net.ResolveTCPAddr("tcp", "google.com:80")
	conn, _ := net.DialTCP("tcp", nil, addr)

	// Send some data to populate the stats
	fmt.Fprintf(conn, "GET / HTTP/1.0\r\n\r\n")

	info, _ := getTCPInfo(conn)
	fmt.Printf("RTT: %dms, CWND: %d, Unacked: %d\n", info.Rtt/1000, info.Snd_cwnd, info.Unacked)
}

*Note: The exact memory layout of tcp_info can change between kernel versions. In production code, you should use a library like github.com/mikioh/tcpinfo or verify the offsets against your target environment.*

Why This Matters: The Tail Latency Ghost

I once worked on a cache-invalidation service that occasionally spiked from 2ms to 500ms latency. We looked at CPU, disk I/O, and memory. Everything was flat. We ran tcpdump, but the volume of traffic was so high that the dumps were unmanageable.

By instrumenting the application to log TCP_INFO when a request took longer than 100ms, we found the culprit. The tcpi_retransmits field was incrementing exactly when the latency spiked, yet the machine-wide retransmission counters stayed low.

The issue? A specific Top-of-Rack switch had a failing buffer on a single port. Only that one socket was suffering. Standard monitoring tools like netstat -s (which show global counters) smoothed over the error. The socket-level introspection allowed us to point the finger directly at the hardware.

The "Nagle" and "Delayed ACK" Interaction

If you see high RTTs but your network is physically fast, check tcpi_ato (ACK timeout).

TCP is a chatty protocol. To save bandwidth, Linux often uses Delayed ACKs—it waits up to 200ms to see if it can piggyback an ACK on a data packet going the other way. If your application is also using Nagle’s Algorithm (which buffers small writes to send one large packet), the two can deadlock each other.

The sender waits for an ACK before sending more data (Nagle), and the receiver waits for more data before sending an ACK (Delayed ACK). They stare at each other for 200ms until a timer expires.

If you see tcpi_ato creeping up and your throughput is miserable, you’ve likely found a Nagle/Delayed ACK conflict. The solution? Disable Nagle: int one = 1; setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
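From a higher-level language, the same toggle is a one-liner with the standard socket module:

```python
import socket

def disable_nagle(sock):
    # Turn off Nagle's algorithm so small writes go out immediately
    # instead of waiting to coalesce into a full segment.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
disable_nagle(s)
```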

Visualizing the Pipeline: Unacked and Sacked

The fields tcpi_unacked and tcpi_sacked tell you about the "in-flight" data pipeline.
- tcpi_unacked: Segments sent but not yet acknowledged.
- tcpi_sacked: Segments that the receiver has explicitly said they received via Selective ACKs (SACK).

If unacked is high and sacked is low, the data is likely still in flight.
If sacked is high but unacked is also high, it means the receiver has some later packets but is missing an earlier one (a hole in the stream). This is a clear indicator of packet loss and reordering.
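That decision tree is simple enough to encode directly. A hypothetical diagnostic helper following the logic above (the wording of the verdicts is mine):

```python
def pipeline_diagnosis(unacked, sacked):
    """Interpret tcpi_unacked / tcpi_sacked (both in segments)."""
    if unacked == 0:
        return "pipe empty"
    if sacked == 0:
        return "data in flight, no holes reported"
    # Receiver has SACKed later segments while an earlier one is missing.
    return f"hole in stream: {sacked} segments SACKed ahead of a gap"
```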

Real-world Gotchas

1. Units Matter: The core time-based fields (tcpi_rtt, tcpi_rttvar, tcpi_rto, tcpi_ato) are in microseconds, but the tcpi_last_* timestamp fields are in milliseconds. Always check your headers.
2. Permissions: You don't need root to call getsockopt(TCP_INFO) on your own sockets. This makes it safe to include in user-space application monitoring.
3. Kernel Versioning: The tcp_info struct is appended to over time. If you use a struct definition from a 5.15 kernel on a 4.4 kernel, the tail end of your struct will contain garbage. Always check the len returned by getsockopt.

Summary: Your Socket is a Sensor

The network stack isn't just a transport layer; it’s one of the most sophisticated telemetry systems on your server. By looking at TCP_INFO, you stop guessing whether the network is "slow" and start understanding *why* it's slow.

Is it a tiny tcpi_snd_cwnd? Is it a spike in tcpi_retrans? Or is it just an application failing to read from the buffer fast enough? The answer is already in the kernel, waiting for you to call getsockopt.

The next time you’re debugging a "ghost in the machine," stop looking at the dashboard and start looking at the socket. The math doesn't lie.