loke.dev

3 Ways the Linux 'Completely Fair Scheduler' Is Sabotaging Your Node.js Throughput

Is your OS-level task scheduler secretly clearing your L1 cache through aggressive thread-hopping?


I spent three days last month staring at a Grafana dashboard, trying to figure out why a simple Express API was hitting a latency ceiling at exactly 400 requests per second. The CPU usage sat at a comfortable 45%, the event loop lag was negligible, and there wasn't a hint of memory pressure. Every "best practice" blog post told me to scale out, but adding more pods didn't change the per-instance throughput—it just added more instances hitting the same invisible wall.

It wasn't a Node.js problem. It was a Linux problem. Specifically, it was the Completely Fair Scheduler (CFS) trying to be "fair" to a process that needed to be selfish.

If you’re running Node.js on Linux—especially inside Docker or Kubernetes—the OS is likely making decisions that actively degrade your performance. Here are three ways the CFS is sabotaging your throughput and what you can actually do to stop it.

1. The L1/L2 Cache Migration Tax

The "Fair" in Completely Fair Scheduler refers to how Linux distributes CPU time across all running processes. CFS uses a concept called vruntime (virtual runtime) to track how much time a process has spent on a core. If your Node.js process has been busy and another process (even a background daemon) has been idle, CFS will often yank your process off a core to give the other one its "fair share."

The problem? Node.js is effectively a single-threaded execution model for your JavaScript code. When CFS decides to move your process from CPU 0 to CPU 3 to balance the load, your L1 and L2 caches don't come with it.

The Impact of "Thread Hopping"

When your process lands on a new core, it's starting "cold." All the hot variables, frequent function addresses, and the V8 heap objects that were cached in the L1/L2 layers of CPU 0 are now miles away. The CPU has to reach out to the much slower L3 cache or, god forbid, the main RAM.

I ran a test using perf to see this in action. Look at the difference in cache-misses when the scheduler is allowed to move the process versus when it's pinned.

# Monitoring cache-misses for a Node.js process
perf stat -e cache-references,cache-misses,context-switches -p <NODE_PID> sleep 10

On a high-traffic instance, I’ve seen cache misses jump by 30-40% simply because the scheduler thought it was being "fair" by bouncing the process between cores. This manifests as jittery p99 latencies that seem to have no correlation to your code's complexity.

The Fix: CPU Affinity

You can tell the Linux kernel to keep its hands off your process using taskset. By setting CPU affinity, you're essentially telling the scheduler, "This process belongs to Core 0. Do not move it."

In a production environment, you might start your Node.js app like this:

# Pin the process to CPU core 0
taskset -c 0 node server.js

In the world of containers, this is handled via --cpuset-cpus. By restricting the process to specific cores, you ensure the cache stays "warm," drastically reducing the latency spikes caused by cross-core migration.

2. The CFS Bandwidth Control Throttling Trap

This is the most common silent killer in Kubernetes environments. When you set a CPU limit in K8s (e.g., limits: cpu: "500m"), you aren't actually limiting the *speed* of the CPU. You are triggering the CFS Bandwidth Control mechanism.

CFS works in periods, usually 100ms by default. If you limit a process to 0.5 CPU, the kernel gives you 50ms of execution time for every 100ms period.
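The quota arithmetic is worth making concrete. A small illustration of my own, using the 500m example above and the default 100ms period:

```javascript
// CFS bandwidth control: quota / period = effective CPUs.
// A K8s limit of "500m" becomes cfs_quota_us = 50000 against the
// default cfs_period_us = 100000.
function effectiveCpus(quotaUs, periodUs) {
  return quotaUs / periodUs;
}

// How many milliseconds of runtime you get per scheduling period.
function runtimePerPeriodMs(quotaUs) {
  return quotaUs / 1000;
}

console.log(effectiveCpus(50000, 100000));   // 0.5 CPUs
console.log(runtimePerPeriodMs(50000));      // 50 ms per 100 ms period
```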

The Micro-Stall Scenario

Node.js is incredibly fast at clearing its event loop. Imagine a burst of 50 requests arrives. Your Node.js process screams through them, consuming its entire 50ms allotment in the first 20ms of the period.

The result: The kernel throttles your process for the remaining 80ms. Even if the host CPU is 90% idle, your Node.js process is frozen. It's not "busy"; it's literally paused by the OS.

You can check if your app is being throttled by looking at the cgroup stats:

# Check for throttled periods (cgroup v1; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat)
cat /sys/fs/cgroup/cpu/cpu.stat

You'll see output like this:

nr_periods 1000
nr_throttled 250
throttled_time 15000000000

If nr_throttled is climbing, you are losing throughput to the scheduler's accounting, not your code's efficiency.

Code Example: Measuring Throttling from Within Node

You can actually monitor this within your application to trigger alerts.

const fs = require('fs');

// cgroup v1 and v2 expose the same counters at different paths
const STAT_PATHS = [
  '/sys/fs/cgroup/cpu/cpu.stat', // cgroup v1
  '/sys/fs/cgroup/cpu.stat',     // cgroup v2
];

function checkThrottling() {
  const path = STAT_PATHS.find((p) => fs.existsSync(p));
  if (!path) return;

  const stats = fs.readFileSync(path, 'utf8');
  const line = stats.split('\n').find((l) => l.startsWith('nr_throttled'));
  if (line) {
    const count = Number(line.split(/\s+/)[1]);
    console.log(`Current OS throttling count: ${count}`);
  }
}

// Check every 10 seconds
setInterval(checkThrottling, 10000);

The Fix: Don't Use Hard Limits

In Kubernetes, the current consensus among high-performance teams is to avoid CPU limits and only use CPU requests. This allows the process to burst into unused CPU cycles on the host without being arbitrarily paused by the CFS quota.
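In manifest terms, a requests-only configuration might look like this sketch (illustrative values, not a recommendation for your specific workload; note that dropping the limit moves the pod out of the Guaranteed QoS class):

```yaml
resources:
  requests:
    cpu: "500m"    # reserved share used for scheduling decisions
  # no limits.cpu — the process can burst into idle host CPU
  # without tripping CFS bandwidth throttling
```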

If you must use limits, ensure your cpu.cfs_period_us is tuned. Lowering the period can make throttling more granular and less painful, but it increases the overhead of the scheduler itself.

3. The "Fairness" of the Libuv Thread Pool

While your JavaScript code runs on a single thread, Node.js uses a thread pool (libuv) for heavy lifting like file I/O, crypto, and compression. By default, this pool has four threads.

The CFS tries to be fair to all of them. If you have a 4-core machine and 4 libuv threads plus 1 main event loop thread, the CFS will try to distribute these 5 threads across the 4 cores.

The Context Switching Overhead

Because the main thread and the worker threads often share data (like Buffer objects), the CFS's tendency to spread them across different physical CPU packages (NUMA nodes) can be disastrous. If the main thread is on CPU 0 (socket A) and the worker thread doing the fs.readFile is on CPU 8 (socket B), the data must travel across the socket interconnect (QPI/UPI), which is significantly slower than local L3 cache access.

Furthermore, the "fair" scheduling of these threads can lead to involuntary context switching: the kernel forcibly stops your thread to let another one run.

You can monitor this using pidstat:

# Watch context switches for your Node process every 1 second
pidstat -w -p <NODE_PID> 1

Look at cswch/s (voluntary) vs nvcswch/s (involuntary). If your nvcswch/s is high, the OS is frequently interrupting your Node.js execution.

The Fix: UV_THREADPOOL_SIZE and Topology Awareness

First, don't over-provision your thread pool. If you're on a 2-core machine, setting UV_THREADPOOL_SIZE to 128 is a recipe for constant involuntary context switching.

Second, use GOMAXPROCS-style logic or manual pinning to keep your workers near your main thread. While you can't easily pin individual libuv threads from JS, you can pin the entire process to a specific NUMA node using numactl.

# Run Node.js on a single NUMA node (all threads share the same local memory)
numactl --cpunodebind=0 --membind=0 node server.js

This ensures that when the CFS moves threads around, it at least keeps them within the same neighborhood where memory access is fast.

Measuring the "Real" Cost

If you suspect the CFS is your bottleneck, don't just take my word for it. Use a tool like bcc-tools to measure run-queue latency. This tells you exactly how long a process was "Ready" to run but was stuck waiting for the scheduler to give it a turn.

# Measure how long tasks spend waiting for a CPU
sudo /usr/share/bcc/tools/runqlat

If you see a lot of latency in the microsecond range while your CPU usage isn't at 100%, the CFS is holding you back.

Summary: A Checklist for High Throughput

The Linux kernel is designed to be a generalist. It’s built to make a desktop feel responsive while a background update is running. It is not, by default, tuned for a high-throughput, single-threaded event loop like Node.js.

To reclaim your throughput:
1. Pin your processes: Use taskset or --cpuset-cpus to keep your L1/L2 caches hot.
2. Audit your limits: Check /sys/fs/cgroup/cpu/cpu.stat for throttling. If you see it, raise your limits or switch to a request-only model.
3. Watch context switches: Use pidstat to ensure the OS isn't constantly interrupting your event loop.
4. Stay local: Use numactl on multi-socket servers to avoid the memory latency tax.

Node.js is fast, but it’s only as fast as the OS allows it to be. Stop letting the scheduler be "fair" to your performance.