loke.dev

The SoftIRQ Is a Single-Core Trap

Discover why high-throughput servers often choke on network traffic despite low CPU usage, and how to rebalance the kernel's hidden interrupt processing load.


I was staring at a 64-core monster of a server last week that was effectively dying, even though top reported an aggregate CPU usage of only 4%. To the uninitiated, the machine looked idle. To the person managing the API gateway, it looked like a flickering lightbulb—requests were timing out, and the tail latency was measured in seconds, not milliseconds.

If you look closer at a system under heavy network load, you’ll often see one single core pinned at 100% "si" (software interrupt) while the other 63 cores are essentially taking a nap. This is the SoftIRQ trap. It’s a bottleneck that can cap the throughput of your high-performance Go or Rust service long before you’ve actually exhausted your hardware's compute power.

The "Bottom Half" Problem

When a network packet hits your NIC, the hardware triggers a physical interrupt (IRQ). The CPU has to stop what it’s doing and handle it immediately. However, if the CPU did *everything*—parsing headers, checksumming, routing—inside that hardware interrupt, the system would lock up. Hardware interrupts are high-priority and "expensive" because, while the handler runs, further interrupts are masked on that CPU.

To solve this, Linux splits the work into two halves:
1. Top Half (Hard IRQ): The kernel does the bare minimum. It acknowledges the hardware and schedules the heavy lifting for later.
2. Bottom Half (SoftIRQ): The kernel handles the actual packet processing (the NET_RX action) in a slightly lower-priority context.

The "trap" arises because, by default, the kernel often tries to handle the SoftIRQ on the same CPU core that received the hardware interrupt. If your network card is pinning all incoming traffic to a single hardware queue, a single CPU core becomes the janitor for the entire server's traffic.

Identifying the Bottleneck

You can't fix what you can't see. Most people use htop or top, but they look at the big green bar (User) or the blue bar (Low Priority). You need to look for the tiny orange or purple sliver labeled si.

Better yet, use mpstat from the sysstat package to see the per-core breakdown:

# Check CPU stats every 1 second
mpstat -P ALL 1

In a bottlenecked system, you'll see something like this:

02:14:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:14:02 PM  all    1.20    0.00    0.80    0.05    0.10    1.50    0.00    0.00    0.00   96.35
02:14:02 PM    0    0.50    0.00    1.10    0.00    2.50   95.90    0.00    0.00    0.00    0.00
02:14:02 PM    1    1.10    0.00    0.70    0.00    0.00    0.10    0.00    0.00    0.00   98.10
...

CPU 0 is gasping for air (%soft at 95.9%), while the others are chilling. This is the definitive signature of a SoftIRQ bottleneck.

To see exactly what kind of SoftIRQs are firing, check /proc/softirqs:

watch -d -n 1 "cat /proc/softirqs"

Look at the NET_RX line. If the numbers are incrementing rapidly on only one column, you've found your culprit.
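To turn that into something greppable, here's a small sketch (bash, assuming the standard /proc path; the helper name is mine) that prints the NET_RX count per CPU:

```shell
#!/bin/bash
# Print the per-CPU NET_RX counters from /proc/softirqs.
# The file's first column is the softirq name; the remaining columns
# are per-CPU counts in the same order as the CPU header row.
net_rx_per_cpu() {
    awk '/NET_RX/ { for (i = 2; i <= NF; i++) printf "CPU%d=%s\n", i - 2, $i }' \
        "${1:-/proc/softirqs}"
}

if [ -r /proc/softirqs ]; then
    net_rx_per_cpu
fi
```

Run it twice, a second apart; the CPU whose delta dwarfs the others is your hot core.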

Why Does This Happen?

Modern NICs are smart, but the default configuration is often "safe" rather than "performant." There are three main reasons you hit this wall:

1. Single RX Queue: Your NIC might only have one queue for receiving packets.
2. IRQ Affinity: The hardware interrupts are all being routed to a single core.
3. Lack of Scaling Logic: The kernel isn't configured to spread the "SoftIRQ" load across other cores once it leaves the hardware layer.

Level 1: Multi-Queue NICs and RSS

Receive Side Scaling (RSS) is the hardware-level solution. It uses a hash of the packet's source/destination IP and port to distribute traffic across multiple hardware queues. Each queue has its own IRQ, which can be handled by a different CPU.

First, check if your NIC supports multiple queues:

# Replace 'eth0' with your interface name
ethtool -l eth0

You’re looking for the "Combined" count. If Current is 1 but Maximum is 16, you’re leaving performance on the table. You can increase it like this:

sudo ethtool -L eth0 combined 16

Level 2: Forcing IRQ Affinity

Even with multiple queues, the kernel might still be routing all those interrupts to CPU 0 because the irqbalance daemon is missing, disabled, or poorly configured.

You can manually map an IRQ to a core. First, find the IRQ numbers for your queues:

grep eth0 /proc/interrupts

You'll see something like:

 125: ... PCI-MSI 524288-edge      eth0-rx-0
 126: ... PCI-MSI 524289-edge      eth0-rx-1

To pin IRQ 125 to CPU 0 and IRQ 126 to CPU 1, you write a bitmask to smp_affinity. A bitmask of 1 (binary 0001) is CPU 0; a bitmask of 2 (binary 0010) is CPU 1.

# Pin IRQ 125 to CPU 0
echo 1 | sudo tee /proc/irq/125/smp_affinity
# Pin IRQ 126 to CPU 1
echo 2 | sudo tee /proc/irq/126/smp_affinity
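If you're scripting this, the mask is just a bit shift. A tiny helper (the function name is mine; the math is standard):

```shell
#!/bin/bash
# smp_affinity is a hex bitmask: bit N set means "CPU N may handle this IRQ".
cpu_to_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

cpu_to_mask 0    # -> 1
cpu_to_mask 5    # -> 20 (binary 100000)
```

So pinning an IRQ to, say, CPU 3 becomes `cpu_to_mask 3 | sudo tee /proc/irq/125/smp_affinity`.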

Doing this manually is a pain. Most production systems use irqbalance, but for extreme performance, I prefer to disable irqbalance and use a custom script to pin queues to specific NUMA nodes to avoid cross-socket memory latency.

Level 3: Receive Packet Steering (RPS)

Sometimes your hardware is old, or you're running in a virtualized environment (like a small EC2 instance) where you only have one RX queue. This is where Receive Packet Steering (RPS) comes in.

RPS is a software implementation of RSS. When a packet arrives on a single core, the kernel calculates a hash and then delegates the "SoftIRQ" work to other cores. This adds a tiny bit of overhead (to move the data between cores), but it's much better than dropping packets.

To enable RPS, you need to write a bitmask of the CPUs you want to participate to the rps_cpus file of the receive queue.

If you have 8 cores and want to use all of them for queue 0:

# 'f' is 1111 in hex, 'ff' is 11111111 (8 cores)
echo "ff" | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
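To build that mask for an arbitrary core count, the usual trick is (1 << N) - 1. A quick sketch (helper name is mine):

```shell
#!/bin/bash
# rps_cpus mask covering the first N CPUs: N set bits, printed in hex.
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

rps_mask 4    # -> f  (CPUs 0-3)
rps_mask 8    # -> ff (CPUs 0-7)
```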

Level 4: Receive Flow Steering (RFS)

RPS is great for spreading the load, but it's blind to where your application is actually running. If RPS sends a packet to CPU 2, but your Nginx worker is running on CPU 5, you're going to suffer from cache misses.

Receive Flow Steering (RFS) tries to be smarter by routing the SoftIRQ to the CPU where the application is actually consuming the data.

To enable RFS, you first set the global flow table size:

sudo sysctl -w net.core.rps_sock_flow_entries=32768

Then you set the flow count per queue:

echo 4096 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
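The kernel's scaling documentation suggests sizing rps_flow_cnt as the global table divided evenly across your RX queues. As arithmetic (function name is illustrative):

```shell
#!/bin/bash
# Per-queue flow count = global rps_sock_flow_entries / number of RX queues.
flow_cnt_per_queue() {
    echo $(( $1 / $2 ))
}

flow_cnt_per_queue 32768 8    # -> 4096, the per-queue value used above
```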

The "ksoftirqd" Problem

When the SoftIRQ load becomes too heavy for the kernel to handle in the context of the interrupt, it wakes up a per-CPU kernel thread called ksoftirqd/n (where n is the CPU number).

If you see ksoftirqd/0 taking 100% CPU in top, it means the kernel has given up on trying to handle packets in the background and is now dedicating a specific process to it. This is usually a sign that you have reached the absolute limit of what that single core can process, often because of a complex firewall (iptables/nftables) or deep packet inspection.

Optimization Tip: Check your net_dev_budget

If you have plenty of CPU but ksoftirqd is still struggling, you might be hitting the "budget." The kernel limits how many packets it processes in one SoftIRQ cycle to ensure the CPU can eventually return to user processes.

# View current budget (default is usually 300)
sysctl net.core.netdev_budget

If you're on a 10Gbps+ link, raising this to 600 or 1000 lets each SoftIRQ cycle drain more packets before yielding, which cuts down on work being deferred to ksoftirqd, though it can slightly increase scheduling latency for user apps. (Kernels since 4.12 also expose net.core.netdev_budget_usecs, a time cap on the same processing loop.)

sudo sysctl -w net.core.netdev_budget=600
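To check whether the budget is actually being exhausted, look at the third column of /proc/net/softnet_stat: it's a per-CPU "time squeeze" counter (in hex) that increments every time a SoftIRQ cycle ran out of budget with packets still pending. A rough sketch:

```shell
#!/bin/bash
# Print how often each CPU exhausted netdev_budget (softnet_stat column 3).
squeezed_per_cpu() {
    cpu=0
    while read -r _processed _dropped squeezed _rest; do
        printf 'CPU%d squeezed=%d\n' "$cpu" $(( 16#$squeezed ))
        cpu=$(( cpu + 1 ))
    done < "${1:-/proc/net/softnet_stat}"
}

if [ -r /proc/net/softnet_stat ]; then
    squeezed_per_cpu
fi
```

If the squeezed counters climb steadily on your hot core, raising the budget is likely to help; if they're flat, the bottleneck is elsewhere.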

A Practical Example: The "Nginx Meltdown"

Imagine an Nginx proxy handling 50,000 requests per second. You notice that even though you have 32 cores, Nginx is sluggish. You check mpstat and see:

- CPU 0: 98% si
- CPU 1-31: 2% usr, 0% si

This happened to me because a cloud provider's virtual NIC only presented one RX queue. The fix wasn't "get a bigger CPU"—it was enabling RPS.

I ran this simple loop to enable RPS across all available cores:

#!/bin/bash
# Enable RPS on all RX queues for eth0
ncpu=$(nproc)
# Hex mask covering all CPUs (fine up to 63 cores; beyond that the
# kernel expects comma-separated 32-bit chunks)
mask=$(printf '%x' $(( (1 << ncpu) - 1 )))

for q in /sys/class/net/eth0/queues/rx-*; do
    echo "$mask" > "$q/rps_cpus"
done

Immediately, the si load shifted from one core to being evenly distributed across all 32. Latency dropped from 200ms to 10ms.

Gotchas and Edge Cases

The Iptables Tax

Every time a packet moves through the SoftIRQ, it hits the netfilter stack. If you have 500 iptables rules, that single-core bottleneck happens much sooner. If you find yourself stuck in SoftIRQ hell, consider moving your firewalling to eBPF/XDP. XDP (Express Data Path) allows you to drop or route packets *before* the SoftIRQ is even triggered, right at the driver level.

Busy Polling

For ultra-low latency (high-frequency trading or real-time systems), you might actually want to *avoid* SoftIRQs entirely. You can enable "busy polling," which allows the application to pull packets directly from the NIC.

# Allow global busy poll (microseconds)
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.core.busy_read=50

This increases CPU usage significantly (it'll show as 100% usr because the app is spinning), but it eliminates the interrupt latency.

GRO and LRO

Generic Receive Offload (GRO) is your friend. It aggregates small packets into a larger "super-packet" before passing them to the stack. This reduces the number of times the SoftIRQ has to fire. Most modern drivers have this on by default, but check it:

ethtool -k eth0 | grep generic-receive-offload

If it's off, turn it on. The caveat people remember actually belongs to its older cousin, LRO (Large Receive Offload): LRO merges packets irreversibly, so it must be disabled on boxes that forward traffic (bridges, routers), whereas GRO is designed to be reversible and is safe to leave on.
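Assuming the interface is still eth0, the toggles look like this:

```shell
# Turn GRO on; turn LRO off if this box forwards packets,
# since LRO merges frames irreversibly and breaks bridging/routing.
sudo ethtool -K eth0 gro on
sudo ethtool -K eth0 lro off
```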

Summary of the Strategy

If your server is choking but the CPU looks idle, follow this checklist:

1. Check `mpstat -P ALL`: Look for a single core with high %soft.
2. Check `ethtool -l`: Ensure you are using all available hardware queues.
3. Check `/proc/interrupts`: See if IRQs are balanced. If not, check irqbalance or manually set smp_affinity.
4. Enable RPS/RFS: If you have more cores than hardware queues (common in VMs), use software steering.
5. Audit your firewall: If si is high despite balancing, your iptables or nftables rules might be too complex for the packet rate.

The SoftIRQ isn't a bug; it's a fundamental part of how Linux maintains system stability under load. But in the era of 100Gbps networking and 128-core CPUs, the "one core per queue" default is a trap that will catch you if you don't actively manage the traffic distribution. Don't let a single core's exhaustion throttle a massive machine.