
The 'Failing' Health Check: How I Finally Rescued My Node.js Pods from the Kubernetes Liveness Trap
Discover why high-concurrency Node.js apps often trigger Kubernetes liveness failures and how to decouple your health checks from event loop lag.
There’s a specific kind of frustration that comes with watching a Kubernetes pod restart for the fifth time in ten minutes when you know for a fact the code isn’t crashing. It’s that slow-motion train wreck where your metrics show the app is handling thousands of requests, but Kubernetes—convinced the container has "failed"—mercilessly pulls the plug.
I spent a week chasing this ghost in a high-concurrency Node.js microservice. Everything looked fine under light load, but as soon as the traffic spiked, the pods would start cycling. It wasn't an OOM (Out of Memory) error. It wasn't a syntax error. It was the Liveness Trap.
The Lie We Tell Kubernetes
When we set up a livenessProbe in a K8s manifest, we’re telling the kubelet: "If this endpoint doesn't return a 200 OK within the configured timeout (say, 3 seconds), kill me and start over."
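For reference, that contract lives in the pod spec. The values here are illustrative, not recommendations:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  timeoutSeconds: 3   # the "3 seconds" window
  failureThreshold: 3 # consecutive misses before the kubelet restarts the container
```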
For most languages, that’s fine. But Node.js is a single-threaded beast. If your event loop is saturated—meaning it's busy processing a massive JSON payload or a complex calculation—it literally cannot get around to answering the /healthz ping.
Here is what a typical "standard" health check looks like in Express:
const express = require('express');
const app = express();

// The "Trap"
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

app.listen(3000);

Under heavy load, that /healthz request gets stuck in the event loop queue behind 500 other expensive operations. Kubernetes waits, times out, and decides your app is "dead." It kills the pod. Now, the remaining pods have to pick up the extra traffic, which makes *their* event loops lag, causing *them* to fail their health checks.
Congratulations, you’ve just triggered a cascading failure.
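You don't need a cluster to see the mechanism. Here's a minimal sketch (plain Node, no Express) where a synchronous busy-wait stands in for one expensive request, delaying a 0ms timer the same way it would delay the /healthz handler:

```javascript
// A 0ms timer should fire almost instantly. Watch what happens when a
// synchronous chunk of work hogs the single thread first.
const start = Date.now();

setTimeout(() => {
  // Fires only after the loop is free again (~200ms late here).
  console.log(`0ms timer fired after ${Date.now() - start}ms`);
}, 0);

// Stand-in for one expensive request handler: ~200ms of blocking CPU work.
const busyUntil = Date.now() + 200;
while (Date.now() < busyUntil) {
  // Busy-wait: nothing else (timers, I/O, health checks) can run.
}
```

Scale that 200ms up to a few seconds of JSON parsing and you've recreated the probe timeout exactly.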
Measuring the Pulse: Event Loop Lag
To fix this, I realized I needed to stop checking if the server was *alive* and start checking if it was *responsive*. The key metric here is Event Loop Lag.
If the loop takes 200ms to get back to a simple callback, the app is struggling. If it takes 2 seconds, it's basically unusable for real-time traffic, but it's still "alive."
Here’s a simple way to measure lag without adding heavy dependencies:
let eventLoopLag = 0;

function monitorLag() {
  const start = Date.now();
  setImmediate(() => {
    eventLoopLag = Date.now() - start;
    // Check again in a second
    setTimeout(monitorLag, 1000).unref();
  });
}

monitorLag();

By using setImmediate, we’re measuring how long it actually takes for the event loop to cycle back to our code.
The "Smart" Health Check
Instead of a binary "yes/no," your health check should be aware of the internal pressure. But here’s the trick: Don't use a liveness probe to report load.
If your liveness probe returns a 500 error because the event loop is slow, Kubernetes will kill the pod. That’s usually the *last* thing you want when you're under high load. You want to stop receiving *new* traffic, not die.
This is why we have readinessProbes.
Here’s how I restructured the health check logic:
const MAX_LAG = 300; // milliseconds

app.get('/ready', (req, res) => {
  if (eventLoopLag > MAX_LAG) {
    // We are too busy! Tell K8s to stop sending traffic,
    // but keep the process running to finish current work.
    return res.status(503).send('Busy');
  }
  res.status(200).send('Ready');
});

app.get('/live', (req, res) => {
  // Liveness should be very simple.
  // Only fail if the process is actually broken.
  res.status(200).send('Alive');
});

Decoupling the Probe in Kubernetes
The manifest needs to reflect this philosophy. I see a lot of developers copy-pasting the same settings for both probes. That’s a mistake. Your liveness probe should be extremely forgiving, while your readiness probe should be the one doing the heavy lifting.
livenessProbe:
  httpGet:
    path: /live
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 5 # Be very patient before killing the pod

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 2 # Quickly take the pod out of rotation if it's lagging

The "Startup" Savior
If your Node.js app takes a while to boot (e.g., connecting to a legacy database, pre-warming a cache), you might find your pods getting killed before they even finish starting.
Don't just increase initialDelaySeconds. If you set it to 60 seconds and your app starts in 5, you're wasting 55 seconds during a deployment. Use a startupProbe instead. It disables liveness and readiness checks until it passes.
startupProbe:
  httpGet:
    path: /live
    port: 3000
  failureThreshold: 30
  periodSeconds: 10 # Gives the app 300s to start

The Result
After decoupling the probes and monitoring event loop lag, the "flapping" pods stopped. When traffic spikes, the readinessProbe fails for a few seconds, the pod gets taken out of the LoadBalancer pool, the event loop catches up, and it re-enters the pool.
No restarts. No 502 errors for the users. Just a system that actually understands how Node.js breathes.
Pro-tip: If you’re using a framework like Fastify, they have built-in plugins (like under-pressure) that handle this event loop monitoring for you. If you're on Express, the manual lag check above is a lightweight lifesaver.
Stop letting Kubernetes bully your busy pods into an early grave. Just because a process is slow doesn't mean it's dead.


