
The Day My Code Stopped Guessing: How eBPF Finally Unlocked the Truth About My Node.js Production Latency

Moving beyond high-level flame graphs to explore how kernel-level tracing can pinpoint specific system-call bottlenecks that the V8 profiler is fundamentally blind to.



Most developers believe that a high-resolution Flame Graph is the final word in performance tuning. We’ve been taught that if a Node.js process is slow, we should fire up perf or the built-in V8 profiler, generate a beautiful forest of red and orange bars, and look for the widest peaks. But here is the uncomfortable truth: flame graphs are often lying to you. Or, more accurately, they are only telling you what the CPU is doing while it’s busy. They are functionally blind to the time your code spends doing absolutely nothing—waiting for the kernel to finish a task you didn't even know you asked for.

I spent three weeks chasing a P99 latency spike in a high-throughput Express.js gateway. Every tool in the standard arsenal—clinic.js, autocannon, Chrome DevTools, and even custom process.hrtime() hooks—insisted the event loop was healthy. The CPU usage was a comfortable 40%. Yet, every few minutes, a cluster of requests would stall for 200ms or more.

I was guessing. I was tweaking UV_THREADPOOL_SIZE, adjusting garbage collection flags, and rewriting "suspicious" middleware. None of it worked because the bottleneck wasn't in my JavaScript. It wasn't even in the V8 engine. It was in the syscall layer, and I needed eBPF to see it.

The Blind Spot of User-Space Profiling

Standard Node.js profilers work by sampling the stack trace at regular intervals. If the V8 engine is busy executing a function, the profiler records it. If the process is "off-CPU"—meaning it's waiting for disk I/O, a network response, or a mutex lock—the profiler often has nothing to sample.

In Node.js, the libuv event loop manages asynchronous tasks. When you call fs.readFile, the work is offloaded to a thread pool or the OS. To your profiler, that time simply vanishes into a "black box" called the kernel.

Consider this innocent-looking snippet that simulates a common production pattern: logging to a file and hitting a database.

const fs = require('fs');
const http = require('http');

const server = http.createServer((req, res) => {
  // A simple structured log
  const logData = { timestamp: Date.now(), url: req.url, headers: req.headers };
  
  // Potential hidden bottleneck: 
  // Even with 'appendFile', under high pressure, the kernel's write 
  // buffer might behave in ways the V8 profiler can't explain.
  fs.appendFile('/var/log/app.log', JSON.stringify(logData) + '\n', (err) => {
    if (err) console.error(err);
    
    // Simulate some business logic
    setTimeout(() => {
      res.writeHead(200);
      res.end('OK');
    }, 10);
  });
});

server.listen(3000);

On my machine, perf showed that 90% of the time was spent in node::InternalCallbackScope. It told me nothing about why the fs.appendFile callback was occasionally taking 150ms to fire.

Enter eBPF: Observing Without Interfering

eBPF (Extended Berkeley Packet Filter) allows us to run sandboxed programs inside the Linux kernel without changing the kernel source code or loading a dangerous module. For a Node.js developer, this is a superpower. It means we can hook into sys_enter_write or sys_enter_connect and measure exactly how long the kernel takes to fulfill those requests.

The beauty of eBPF is that it doesn't care about your application's internal state. It sits at the boundary between your code and the hardware.

To start investigating my latency, I stopped looking at my JavaScript and started looking at the syscalls. I used bpftrace, a high-level tracing language for eBPF.

The First Clue: Measuring Syscall Latency

I wanted to see the distribution of time spent in every write syscall made by my Node.js process. Here is a simple bpftrace script that does exactly that:

/* trace_writes.bt */
tracepoint:syscalls:sys_enter_write
/pid == $1/
{
  @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_write
/pid == $1 && @start[tid]/
{
  $duration_us = (nsecs - @start[tid]) / 1000;
  @write_latency_us = hist($duration_us);
  delete(@start[tid]);
}

You run this by passing the PID of your Node.js process:
sudo bpftrace trace_writes.bt $(pgrep -n node)

When I ran this against my "slow" production-like environment, I saw something disturbing:

@write_latency_us:
[16, 32)             1285 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              432 |@@@@@@@@@@@@@@@@@                                   |
[64, 128)              56 |@@                                                  |
...
[131072, 262144)       12 |@                                                   |

Most writes were taking 16-32 microseconds. But there was a long tail. Twelve writes took over 131 milliseconds!

Why would a simple log append take 131ms? V8 couldn't tell me. It just knew the callback hadn't fired yet. But the kernel knew. It was waiting on the file system's journal commit or competing for I/O priority.

Diving Deeper: The "Hidden" Sync Calls

In Node.js, we often assume that as long as we use the Async versions of functions, we are safe from blocking the event loop. This is a half-truth. While you aren't blocking the main thread, you might be saturating the libuv thread pool, which leads to the same outcome: your application stops responding.

I suspected something was calling stat or open synchronously, perhaps a deep dependency buried in node_modules. I wrote another bpftrace script to catch the specific filenames being accessed and the time they took.

/* trace_opens.bt */
tracepoint:syscalls:sys_enter_openat
/pid == $1/
{
  @filename[tid] = str(args->filename);
  @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_openat
/pid == $1 && @start[tid]/
{
  $ret = args->ret;
  $duration_ms = (nsecs - @start[tid]) / 1000000;
  if ($duration_ms > 5) {
    printf("Slow open: %s took %d ms (FD: %d)\n", @filename[tid], $duration_ms, $ret);
  }
  delete(@filename[tid]);
  delete(@start[tid]);
}

The output was the "Aha!" moment:
Slow open: /etc/resolv.conf took 42 ms (FD: 12)

It turned out that a specific database driver was re-reading /etc/resolv.conf on every new connection attempt during a DNS flap. Because it was using fs.readFileSync (a synchronous call), it was halting the entire event loop. No JavaScript profiler showed this as "busy" time because the thread was simply blocked in a read syscall, waiting for the disk.

Visualizing the Off-CPU Time

To really solve the mystery, I needed to see a merged view of user-space and kernel-space. This is where eBPF-powered tools like bcc (BPF Compiler Collection) come in. Specifically, offcputime.

offcputime summarizes the time a thread spent blocked and shows the stack trace that led to it. This effectively bridges the gap between your Node.js code and the kernel.

If you have BCC installed, you can run:
sudo offcputime-bpfcc -p $(pgrep -n node) 5

This captures 5 seconds of "sleeping" time. The output is a series of stack traces. Here’s a simplified version of what I found:

    finish_task_switch
    __schedule
    schedule
    schedule_timeout
    io_schedule_timeout
    do_wait_for_common
    __wait_for_common
    wait_for_common
    wait_for_completion
    dev_wait_for_completion
    ...
    entry_SYSCALL_64_after_hwframe
    __write
    uv__fs_write
    uv__worker
    thread_start
    [unknown]

Reading this from the bottom up:
1. A libuv worker thread started.
2. It called uv__fs_write (the internal libuv C function behind fs.appendFile).
3. It triggered a write syscall.
4. The kernel took over and eventually reached io_schedule_timeout.

This confirmed that the thread was being put to sleep by the kernel because the I/O subsystem was congested. The reason my P99 was spiking wasn't "bad code"—it was I/O contention at the OS level that was bubbling up through the libuv thread pool.

Fixing the Truth, Not the Symptom

Once eBPF showed me the truth, the fix was no longer a guess. I realized our logging volume was too high for the disk's IOPS limit during peak hours.

I implemented three specific changes based on the eBPF data:
1. Userspace Buffering: Instead of calling fs.appendFile for every log entry, I used a writable stream with a high highWaterMark to buffer logs in memory and flush them in larger, fewer chunks. This reduced the number of write syscalls significantly.
2. DNS Caching: I moved DNS resolution to a local nscd (Name Service Caching Daemon) to stop the constant synchronous reads of /etc/resolv.conf.
3. Thread Pool Tuning: I increased UV_THREADPOOL_SIZE from the default of 4 to 16, but only *after* I had data showing that the worker threads were indeed the bottleneck.

The Improved Logger Pattern

const fs = require('fs');

class BufferedLogger {
  constructor(path) {
    this.stream = fs.createWriteStream(path, {
      flags: 'a',
      // Backpressure threshold: write() returns false once 64KB is queued
      highWaterMark: 64 * 1024
    });
    this.corked = false;
  }

  log(data) {
    const entry = JSON.stringify(data) + '\n';

    // cork() holds writes in memory; uncork() on the next tick flushes
    // everything queued during this tick in a single writev() syscall,
    // instead of one write() syscall per log entry.
    if (!this.corked) {
      this.corked = true;
      this.stream.cork();
      process.nextTick(() => {
        this.corked = false;
        this.stream.uncork();
      });
    }

    const canWrite = this.stream.write(entry);

    // write() returning false means the buffer is past highWaterMark:
    // apply backpressure by pausing the producer until 'drain' fires
    if (!canWrite) {
      this.stream.once('drain', () => {
        // Resume the paused producer here
      });
    }
  }
}

const logger = new BufferedLogger('/var/log/app.log');

By buffering in user-space, we trade a little bit of memory for a massive reduction in kernel context switches and I/O wait time.

Why You Should Care About eBPF Now

The era of "just throw more RAM at it" is ending as we move toward highly dense containerized environments (Kubernetes, Fargate). In these environments, you aren't just competing with your own code; you're competing with every other container on that kernel for the same I/O and CPU resources.

If you only use Node-specific tools, you are looking at the world through a keyhole. You see your code, but you don't see the noisy neighbor on the same host causing your epoll_wait to take 10ms longer than it should.

Practical Steps to Get Started

You don't need to be a kernel engineer to use eBPF.

1. Install `bcc-tools` and `bpftrace`: Most modern Linux distributions (Ubuntu 20.04+, Amazon Linux 2, Fedora) have these in their default repos.
2. Use `execsnoop`: To see if your Node app is spawning hidden shell processes (a common cause of latency).
3. Use `ext4slower` (or your FS equivalent): To see which file writes are taking longer than 10ms.
4. Try `bcc/biolatency`: To see a histogram of disk I/O latency.

Here is a quick one-liner to see if your Node.js process is being slowed down by the CPU scheduler (Run-Queue Latency):

sudo runqlat-bpfcc -p $(pgrep -n node)

If you see a lot of time in the higher millisecond buckets, your process is ready to run, but the kernel is too busy to give it a CPU core. No amount of JavaScript optimization will fix that; you need to look at your orchestration or instance sizing.

The Gotchas

eBPF is powerful, but it's not a silver bullet.
- Kernel Version: You generally need a Linux kernel >= 4.1. For the most advanced features, 5.4+ is preferred.
- Permissions: You need CAP_SYS_ADMIN or root. This makes it hard to run inside a standard restricted container unless you have control over the host.
- Overhead: While eBPF is designed to be extremely low overhead, complex scripts that trigger on high-frequency events (like every single function call in a tight loop) *can* slow down your system. Stick to syscalls or tracepoints for production debugging.

Final Thoughts

The day I started using eBPF was the day I stopped treating the "System" row in my performance reports as a magical constant. We often treat the kernel as an infallible, zero-latency machine that executes our commands instantly. It isn't. It's a complex piece of software with its own locks, queues, and bottlenecks.

If you are chasing a latency bug in Node.js and the standard tools are coming up empty, stop looking at your JavaScript. Reach for bpftrace. Look at the syscalls. Look at the off-CPU time. The truth is in the kernel; you just have to know how to ask for it.