
What Nobody Tells You About Receive Side Scaling (RSS): Why Your Multi-Core Server Is Silently Choking on a Single Interrupt
An investigation into the hardware-level bottleneck where a 64-core machine fails to scale because the network interface card is pinning all traffic to CPU0.
It’s a strange feeling when you’ve just dropped twenty thousand dollars on a server with enough cores to run a small country, only to watch a single thread on CPU0 gasp for air while the other 63 cores sit around doing absolutely nothing. You’re looking at top, seeing a massive spike in %si (software interrupts), and wondering why your 10Gbps link is capping out at 1.2Gbps. You’ve been told that Linux scales linearly with hardware. In reality, your network card is likely treating your multi-core behemoth like a single-core Pentium III from 1999.
This is the "Single Interrupt Trap," and it’s the most common performance killer in modern data centers. The culprit is almost always a misunderstanding of Receive Side Scaling (RSS).
The Illusion of Parallelism
When a packet hits your Network Interface Card (NIC), the hardware has to tell the CPU, "Hey, I have data for you." It does this via an interrupt. In the old days, there was one wire for interrupts. One wire meant one CPU handled the packet, moved it from the NIC’s buffer into system memory, and let the kernel stack process it.
Modern NICs use MSI-X (Message Signaled Interrupts), which theoretically allows for hundreds of independent interrupt vectors. But here’s the rub: if your NIC isn't configured to spread those interrupts across multiple queues, or if your OS doesn't know where to send them, every single packet—no matter how many cores you have—will land on the same CPU. This creates a bottleneck where the processing power of your entire machine is limited by the clock speed of a single core.
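A quick way to see how many receive and transmit queues the kernel currently exposes for an interface (eth0 here; substitute your own):

```shell
# Each rx-N / tx-N directory under the interface is one queue the
# kernel knows about. A lone rx-0 on big hardware is a red flag.
IFACE=eth0
ls /sys/class/net/"$IFACE"/queues/
```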
Step 1: Diagnosing the Chokehold
Don't guess. Look at /proc/interrupts. This is the source of truth for how your hardware is talking to your kernel.
Run this command while your server is under load:
```
watch -n1 "cat /proc/interrupts | grep 'eth0' | awk '{print \$1, \$2, \$3, \$4, \$NF}'"
```
(Replace eth0 with your actual interface name, like ens1f0 or p2p1.)
If you see something like this, you have a problem:
```
            CPU0        CPU1   CPU2   CPU3
 48:   154829302           0      0      0   IR-PCI-MSI-edge   eth0-TxRx-0
```
Notice how CPU0 is doing all the heavy lifting while CPU1-3 are literally doing zero. This is exactly what it looks like when RSS is either disabled or misconfigured. You are "pinned" to a single core.
How RSS Actually Works (The "Nobody Tells You" Part)
RSS isn't just "randomly" throwing packets at cores. It uses a hashing algorithm (usually Toeplitz) to ensure that packets belonging to the same flow (same source/destination IP and port) always end up on the same CPU.
Why? Because if Packet A goes to CPU0 and Packet B (from the same stream) goes to CPU1, they might be processed out of order. Out-of-order packets trigger TCP retransmissions and window shrinking, which destroys performance.
The Gotcha: If you are testing your 100Gbps link with a single iperf3 stream, RSS cannot help you. A single flow stays on a single core by design to maintain packet order. To see RSS in action, you need multiple concurrent flows or multiple iperf clients.
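For example, with iperf3 (assuming a server is already listening at the hypothetical address 10.0.0.1), the difference is a single flag:

```shell
# One flow: a single 5-tuple hashes to a single queue, so one core
# does all the work no matter how many queues you configured.
iperf3 -c 10.0.0.1 -t 30

# 16 parallel streams: 16 distinct source ports, 16 distinct Toeplitz
# hashes, so the load can actually spread across queues and cores.
iperf3 -c 10.0.0.1 -t 30 -P 16
```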
Step 2: Checking Your Hardware Queues
Before you can scale, your hardware must support multiple queues. You can check this with ethtool:
```
# Check how many queues are currently active vs maximum supported
ethtool -l eth0
```
You might see output like this:
```
Channel parameters for eth0:
Pre-set maximums:
Combined: 64
Current hardware settings:
Combined: 1
```
This is a tragedy. Your NIC is capable of 64 queues, but it's only using 1. You can fix this immediately without a reboot:
```
# Set the number of combined queues to 16
sudo ethtool -L eth0 combined 16
```
Now, check /proc/interrupts again. You should see 16 different eth0-TxRx entries. But wait—they might still all be firing on CPU0. This leads us to the dark art of IRQ Affinity.
Step 3: Mastering IRQ Affinity
The Linux kernel has a service called irqbalance. In theory, it’s supposed to distribute interrupts across cores. In practice, for high-performance networking, irqbalance is often either too slow or makes poor decisions, like moving interrupts between NUMA nodes (more on that later).
To truly own your performance, you need to manually map interrupts to cores. This is done via bitmasks in /proc/irq/.
Let's say your NIC interrupt is ID 48. To see which CPUs it can run on:
```
cat /proc/irq/48/smp_affinity
# Output: ffffffff (this means "all cores allowed")
```
The mask ffffffff is a hexadecimal representation of a bitmask: bit N corresponds to CPU N. If you want an interrupt to only run on CPU0, the mask is 1. For CPU1, it’s 2. For CPU2, it’s 4. For CPU3, it’s 8. (On machines with more than 32 cores, the mask is written as comma-separated 32-bit groups, e.g. ffffffff,ffffffff for 64 cores.)
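You don't have to work these masks out by hand. Shell arithmetic will generate the hex mask for any CPU number, or for a contiguous range starting at CPU0:

```shell
# Mask for a single CPU N: set bit N.
cpu=5
printf '%x\n' $((1 << cpu))        # bit 5 -> 20 (hex for 0x20)

# Mask for CPUs 0..3: the four low bits set.
printf '%x\n' $(( (1 << 4) - 1 ))  # -> f
```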
If you have a NIC with 4 queues and you want to map them to the first 4 cores, you would do something like this:
```
# Queue 0 -> CPU 0 (mask 1)
echo 1 > /proc/irq/48/smp_affinity
# Queue 1 -> CPU 1 (mask 2)
echo 2 > /proc/irq/49/smp_affinity
# Queue 2 -> CPU 2 (mask 4)
echo 4 > /proc/irq/50/smp_affinity
# Queue 3 -> CPU 3 (mask 8)
echo 8 > /proc/irq/51/smp_affinity
```
(Run these in a root shell; `sudo echo 1 > ...` won't work because the redirect happens in your unprivileged shell.)

The NUMA Trap: Don't Cross the Streams
If you are on a dual-socket server, this is where you can really mess things up. Every NIC is physically connected to a specific PCIe bus, which is physically wired to a specific CPU socket.
If your NIC is attached to Socket 0, but you pin your interrupts to cores on Socket 1, every single packet has to travel across the QPI/UPI (the interconnect between CPUs). This adds latency and eats up memory bandwidth.
To find out which node your NIC is on:
```
cat /sys/class/net/eth0/device/numa_node
# Output: 0
```
If it returns 0, only pin your IRQs to the cores associated with Node 0. (A value of -1 means the platform isn't reporting NUMA locality for that device.) You can find those cores using lscpu.
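One way to list just the cores on a given node (node 0 here) without scanning the full lscpu output is the parsable -p mode:

```shell
# -p prints a machine-readable CPU,NODE table; keep rows on node 0.
lscpu -p=CPU,NODE | awk -F, '!/^#/ && $2 == 0 { print $1 }'
```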
When Hardware Fails You: RPS and RFS
Sometimes you’re stuck with a cheap NIC (looking at you, budget cloud providers) that doesn't support multiple hardware queues. This is where Receive Packet Steering (RPS) comes in.
RPS is essentially a software implementation of RSS. A single CPU receives the interrupt, but then it immediately hands off the packet to other CPUs for processing. It’s not as efficient as hardware RSS because the "distributor" CPU still has to do some work, but it’s much better than a single-core bottleneck. Its companion, Receive Flow Steering (RFS), goes a step further and steers each flow to the CPU where the consuming application is actually running, which improves cache locality.
To enable RPS, you have to write a bitmask of the CPUs you want to share the load to the rps_cpus file for each receive queue.
```
# Spread load for queue 0 across CPUs 0-7 (hex mask 0xff)
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

The Hashing Problem: Why Your Traffic Won't Balance
Even with RSS and IRQ affinity perfectly set, you might still see one core pegged while others are idle. This usually happens for one of two reasons:
1. Encapsulation (The VXLAN/GRE Problem): If you are running a tunnel, the NIC might only see the outer header. Since the outer IP and Port are often the same for all traffic, the hash results in the same value every time. High-end modern NICs can do "Inner Header RSS," but many can't.
2. Elephant Flows: You have one massive backup job or database replication stream that accounts for 90% of your traffic. Because RSS preserves flow order, that one stream can never be split across cores.
To see if your hashing is working, check the ethtool statistics:
```
ethtool -S eth0 | grep rx_queue_
```
If rx_queue_0_packets is 100 million and rx_queue_1_packets is 100, your hashing algorithm is failing you. You can try changing the hash fields:
```
# Include L4 ports in the hash for TCP over IPv4
ethtool -N eth0 rx-flow-hash tcp4 sdfn
```
*Note: sdfn stands for Source IP, Destination IP, Source Port, Destination Port.*
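To confirm what the NIC is actually hashing on after the change, the matching read-side command (for drivers that report it) is:

```shell
# Show the current receive hash fields for TCP-over-IPv4 flows.
ethtool -n eth0 rx-flow-hash tcp4
```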
Monitoring Like a Pro
The best tool for visualizing this is mpstat from the sysstat package. It gives you a per-core breakdown of where time is being spent.
```
mpstat -P ALL 1
```
Look specifically at %soft. This is the time the CPU spends handling software interrupts (the networking stack). If you see %soft at 90% on CPU0 and 0% on CPU15, your RSS configuration is broken. If you see %soft evenly distributed across all cores, you have achieved networking nirvana.
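For a raw per-CPU view straight from the kernel, /proc/net/softnet_stat has one row per CPU; the first two hex columns are packets processed and packets dropped. A small sketch to decode it:

```shell
# Decode /proc/net/softnet_stat: one row per CPU, values in hex.
# Column 1 = packets processed, column 2 = dropped (backlog overflow).
cpu=0
while read -r processed dropped _; do
    printf 'CPU%-3d processed=%-12d dropped=%d\n' \
        "$cpu" "$((0x$processed))" "$((0x$dropped))"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat
```

If one CPU's processed count dwarfs all the others, your steering isn't working, no matter what the configuration files say.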
Summary Checklist for a Multi-Core Server
If you're setting up a high-performance Linux box, do these things in order:
1. Stop `irqbalance`: For predictable performance, manage affinities yourself.
2. Maximize Queues: Use ethtool -L to match the number of queues to the number of physical cores (or at least spread them out).
3. Align to NUMA: Ensure the cores handling the interrupts are on the same physical socket as the NIC.
4. Verify Hashing: Use ethtool -S to ensure traffic is actually hitting different queues.
5. Use RPS/RFS as a Backup: If your hardware is limited, use the kernel's software steering to prevent a single-core meltdown.
The "silent choke" is a byproduct of how we've scaled hardware faster than we've updated our default configurations. A 64-core server is just a collection of 64 very fast, very bored processors until you actually give them a way to share the workload. Don't let your NIC be the gatekeeper that holds your hardware hostage.


