loke.dev

Pool Starvation

Your Node.js app might be asynchronous, but it's still tethered to a secret four-thread pool that can silently strangle your performance during heavy I/O.


I spent three days once trying to figure out why a Node.js microservice was hitting 500ms latency spikes when the CPU was sitting comfortably at 15% and the memory usage was a flat line. I had read all the blog posts about the "Event Loop" and I was convinced I wasn't blocking it. I wasn't doing any heavy for loops, I wasn't JSON parsing giant objects, and I certainly wasn't using readFileSync.

It felt like a ghost was haunting the machine. Every time we increased the load slightly, the response times for cryptographic hashing and disk writes didn't just scale linearly—they hit a brick wall. I eventually discovered that while I wasn't blocking the *Event Loop*, I was starving the *Thread Pool*.

Node.js is famous for being "single-threaded," but that's a half-truth that often leads to performance disasters. Under the hood, Node uses a library called libuv to handle things the operating system doesn't provide a great asynchronous interface for. By default, this library uses a tiny, secret pool of only four threads. When you exhaust those four threads, your "asynchronous" app starts behaving like it’s stuck in a traffic jam.

The Lie of the Single Thread

We are told that Node.js is non-blocking and uses an event-driven architecture. This is true for network I/O. When you make an HTTP request or receive a TCP connection, Node uses the operating system's native non-blocking mechanisms (like epoll on Linux, kqueue on macOS, or IOCP on Windows). These don't require extra threads; one thread can watch thousands of connections.

But not everything is a network socket. Some things are "expensive" or don't have consistent non-blocking APIs across different operating systems. For these, libuv maintains an internal thread pool.

Here are the four horsemen of the Thread Pool:

1. File I/O: Most filesystem operations (fs).
2. Crypto: Functions like crypto.pbkdf2, crypto.scrypt, or crypto.randomBytes.
3. Zlib: Compression and decompression tasks.
4. DNS Lookups: Specifically dns.lookup (because of how the underlying getaddrinfo(3) works).

If your app does a lot of these, you aren't running in a single-threaded environment. You are running in a 1+4 environment (one event loop thread + four background workers).

Watching the Starvation in Real-Time

The best way to understand pool starvation is to trigger it. We can use the crypto.pbkdf2 function because it’s intentionally CPU-intensive and explicitly offloaded to the libuv thread pool.

Try running this script on your machine:

const crypto = require('crypto');
const start = Date.now();

function runHash(id) {
  crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512', () => {
    console.log(`Request ${id} finished in ${Date.now() - start}ms`);
  });
}

// Let's fire off 6 "asynchronous" requests
runHash(1);
runHash(2);
runHash(3);
runHash(4);
runHash(5);
runHash(6);

If Node were truly "asynchronous" in the way many beginners imagine, all six would finish at roughly the same time. But when you run this, you’ll see something like this:

Request 1 finished in 85ms
Request 3 finished in 88ms
Request 2 finished in 90ms
Request 4 finished in 92ms
Request 5 finished in 175ms
Request 6 finished in 178ms

Notice the jump? Requests 1 through 4 finished in about 90ms. Requests 5 and 6 took nearly double that time.

Why? Because the thread pool has exactly four slots. Requests 5 and 6 literally could not start until Requests 1 and 2 finished and vacated their spots in the pool. This is Pool Starvation. Your code thinks it's being non-blocking because it uses a callback, but the underlying system is queuing tasks because it's out of workers.

Why the Filesystem is a Sneaky Culprit

You might think, "I don't do heavy crypto, so I'm safe." But filesystem operations are the most common way to accidentally choke your pool.

While network sockets are truly non-blocking on modern kernels, file systems are a different story. Implementing truly non-blocking file I/O is notoriously difficult and inconsistent across OS platforms. To provide a consistent fs.readFile API, libuv simply wraps the synchronous, blocking system calls in a thread pool task.

Imagine a high-traffic web server that logs every request to a file, reads a config file, and serves some static assets.

const fs = require('fs');
const http = require('http');

http.createServer((req, res) => {
  // Every request triggers a thread pool task
  fs.readFile('./large-template.html', (err, data) => {
    // If 4 people hit this at once, the 5th person's 
    // DNS lookup or Crypto call is now queued.
    res.end(data);
  });
}).listen(3000);

If you have a sudden burst of five concurrent file reads, and then a sixth user tries to connect via a hostname (requiring a dns.lookup), that DNS lookup—which also uses the thread pool—will be stuck waiting for a file read to finish. To the user, it looks like the network is slow. In reality, your internal workers are just busy.

The DNS Trap

This is the one that catches people off guard. There are two ways to resolve a hostname in Node: dns.resolve and dns.lookup.

- dns.resolve uses a fast, asynchronous network-based approach that does not use the thread pool.
- dns.lookup uses the underlying OS's getaddrinfo call, which is synchronous. Therefore, Node has to run it in the thread pool.

Most Node.js functions (like http.get or mysql.createConnection) use dns.lookup by default because it respects your /etc/hosts file and local configuration. If your thread pool is saturated by fs calls, your outgoing API requests will suddenly start "hanging" at the DNS resolution phase. You'll blame the external API, but the fault is your own thread pool.

Fixing the Bottleneck: UV_THREADPOOL_SIZE

If you find yourself genuinely needing more than four threads, Node allows you to increase the limit. You can set the UV_THREADPOOL_SIZE environment variable. The maximum is 1024.

You have to set this *before* the Node.js process starts, or at the very least, before the thread pool is first used.

On Linux/macOS:

UV_THREADPOOL_SIZE=16 node app.js

Inside your script (must be at the very top, before any pool-backed work is submitted — and note that setting it in-process may not take effect on Windows):

process.env.UV_THREADPOOL_SIZE = 16;
const crypto = require('crypto');
// The pool itself is created lazily, when the first task is submitted

If we take our previous pbkdf2 example and set the size to 6, the output changes drastically:

Request 1 finished in 95ms
Request 2 finished in 96ms
Request 3 finished in 97ms
Request 4 finished in 98ms
Request 5 finished in 99ms
Request 6 finished in 100ms

Everything finishes together. The "bottleneck" is gone.

The Cost of More Threads

You might be tempted to just set UV_THREADPOOL_SIZE=128 and call it a day. Don't do that.

Threads aren't free. Each thread in the pool carries a memory overhead (typically 1MB for the stack, though this varies by OS). More importantly, having 128 threads trying to do CPU-heavy work (like pbkdf2 or zlib compression) on a machine with only 4 CPU cores will lead to context switching thrashing.

The CPU will spend more time swapping threads in and out of execution than actually doing the work. You’ll see your total execution time increase, even if the "wait time" in the queue decreases.

A good rule of thumb: If your work is CPU-bound (Crypto, Zlib), you shouldn't set the pool size much higher than your actual core count. If your work is I/O-bound (Filesystem, DNS), you can afford to set it higher, as those threads spend most of their time sleeping and waiting for the hardware.

Monitoring: How do you know you're starving?

This is the hard part. Node.js doesn't give you a process.threadPoolUsage() function out of the box. You have to be a bit more creative to detect this.

One strategy is to measure the "latency" of a known thread pool operation. If you periodically run a small, fast crypto.randomBytes call and it usually takes 1ms, but suddenly starts taking 50ms, you know the thread pool is backed up.

const crypto = require('crypto');

setInterval(() => {
  const checkStart = Date.now();
  crypto.randomBytes(16, (err, buf) => {
    const delta = Date.now() - checkStart;
    if (delta > 20) {
      console.warn(`Thread pool delay detected: ${delta}ms. Check for starvation!`);
    }
  });
}, 1000).unref(); // unref so it doesn't keep the process alive

This acts like a "canary in a coal mine." If the canary dies (the delay increases), you know your background workers are swamped.

Worker Threads: A Different Beast

It's vital to distinguish between the Libuv Thread Pool and the Worker Threads API (worker_threads).

- Libuv Thread Pool: Managed automatically by Node. Used for internal I/O and specific APIs. You don't manage the threads; you just trigger tasks that go there.
- Worker Threads: Managed by *you*. You spawn them to run your own JavaScript code in parallel.

If you have heavy calculation logic written in JavaScript (like processing a huge array of data), increasing UV_THREADPOOL_SIZE won't help you. JavaScript always runs on the main event loop unless you explicitly move it into a Worker Thread.

The Libuv pool is for the "built-in" heavy lifting. Worker Threads are for "your" heavy lifting.

Architectural Alternatives

Instead of just cranking up the thread pool size, consider if you can avoid the pool entirely.

1. For Filesystem operations: Can you use Streams? While fs.createReadStream still uses the thread pool, it does so in smaller, more manageable chunks rather than trying to shove a 500MB file into a single thread pool task.
2. For Crypto: If you are doing massive amounts of hashing, consider offloading that to a dedicated service or using a WebAssembly implementation of the hash function. Wasm can often run on the main thread (if the chunks are small) or in a Worker Thread, bypassing the 4-thread libuv limit.
3. For DNS: Use an IP address for internal service communication to bypass the need for getaddrinfo lookups entirely. Or, use a caching DNS wrapper.

Summary: The Hidden Limit

Node.js's scalability comes from its ability to handle thousands of concurrent network connections on a single thread. But that "single thread" narrative creates a blind spot.

When your application starts performing tasks that require the libuv thread pool, you are no longer operating in a world of infinite concurrency. You are operating in a world with a concurrency of four.

If you're seeing inexplicable latency while the CPU is idle:
1. Check if you're doing heavy File I/O.
2. Check if you're doing heavy Crypto or Compression.
3. Check if your outgoing requests are bottlenecked by DNS lookups.
4. Experiment with UV_THREADPOOL_SIZE.
5. Don't forget that threads have a cost; more isn't always better.

Understanding the thread pool isn't just a "deep dive" technical curiosity—it's a requirement for building production-grade Node.js systems that don't fall over the moment they're asked to do more than just proxy JSON.