
The 'Failing' Health Check: How I Finally Rescued My Node.js Pods from the Kubernetes Liveness Trap
Discover why high-concurrency Node.js apps often trigger Kubernetes liveness failures and how to decouple your health checks from event loop lag.
There’s a specific kind of frustration that comes with watching a Kubernetes pod restart for the fifth time in ten minutes when you know for a fact the code isn’t crashing. It’s that slow-motion train wreck where your metrics show the app is handling thousands of requests, but Kubernetes—convinced the container has "failed"—mercilessly pulls the plug.
I spent a week chasing this ghost in a high-concurrency Node.js microservice. Everything looked fine under light load, but as soon as the traffic spiked, the pods would start cycling. It wasn't an OOM (Out of Memory) error. It wasn't a syntax error. It was the Liveness Trap.
The Lie We Tell Kubernetes
When we set up a livenessProbe in a K8s manifest, we’re telling the kubelet: "If this endpoint doesn't return a 200 OK within the configured timeout (say, 3 seconds), kill me and start over."
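For reference, that contract lives in the pod spec. The values here are illustrative, not recommendations:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  timeoutSeconds: 3   # the "3 seconds" window
  failureThreshold: 3 # consecutive misses before the kubelet restarts the container
```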
For most languages, that’s fine. But Node.js is a single-threaded beast. If your event loop is saturated—meaning it's busy processing a massive JSON payload or a complex calculation—it literally cannot get around to answering the /healthz ping.
Here is what a typical "standard" health check looks like in Express:
const express = require('express');
const app = express();

// The "Trap"
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

app.listen(3000);

Under heavy load, that /healthz request gets stuck in the event loop queue behind 500 other expensive operations. Kubernetes waits, times out, and decides your app is "dead." It kills the pod. Now, the remaining pods have to pick up the extra traffic, which makes *their* event loops lag, causing *them* to fail their health checks.
Congratulations, you’ve just triggered a cascading failure.
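You don't need a cluster to see the mechanism. Here's a minimal sketch (plain Node, no Express) where a synchronous busy-wait stands in for one expensive request, delaying a 0ms timer the same way it would delay the /healthz handler:

```javascript
// A 0ms timer should fire almost instantly. Watch what happens when a
// synchronous chunk of work hogs the single thread first.
const start = Date.now();

setTimeout(() => {
  // Fires only after the loop is free again (~200ms late here).
  console.log(`0ms timer fired after ${Date.now() - start}ms`);
}, 0);

// Stand-in for one expensive request handler: ~200ms of blocking CPU work.
const busyUntil = Date.now() + 200;
while (Date.now() < busyUntil) {
  // Busy-wait: nothing else (timers, I/O, health checks) can run.
}
```

Scale that 200ms up to a few seconds of JSON parsing and you've recreated the probe timeout exactly.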
Measuring the Pulse: Event Loop Lag
To fix this, I realized I needed to stop checking if the server was *alive* and start checking if it was *responsive*. The key metric here is Event Loop Lag.
If the loop takes 200ms to get back to a simple callback, the app is struggling. If it takes 2 seconds, it's basically unusable for real-time traffic, but it's still "alive."
Here’s a simple way to measure lag without adding heavy dependencies:
let eventLoopLag = 0;

function monitorLag() {
  const start = Date.now();
  setImmediate(() => {
    eventLoopLag = Date.now() - start;
    // Check again in a second
    setTimeout(monitorLag, 1000).unref();
  });
}

monitorLag();

By using setImmediate, we’re measuring how long it actually takes for the event loop to cycle back to our code.
The "Smart" Health Check
Instead of a binary "yes/no," your health check should be aware of the internal pressure. But here’s the trick: Don't use a liveness probe to report load.
If your liveness probe returns a 500 error because the event loop is slow, Kubernetes will kill the pod. That’s usually the *last* thing you want when you're under high load. You want to stop receiving *new* traffic, not die.
This is why we have readinessProbes.
Here’s how I restructured the health check logic:
const MAX_LAG = 300; // milliseconds

app.get('/ready', (req, res) => {
  if (eventLoopLag > MAX_LAG) {
    // We are too busy! Tell K8s to stop sending traffic,
    // but keep the process running to finish current work.
    return res.status(503).send('Busy');
  }
  res.status(200).send('Ready');
});

app.get('/live', (req, res) => {
  // Liveness should be very simple.
  // Only fail if the process is actually broken.
  res.status(200).send('Alive');
});

Decoupling the Probe in Kubernetes
The manifest needs to reflect this philosophy. I see a lot of developers copy-pasting the same settings for both probes. That’s a mistake. Your liveness probe should be extremely forgiving, while your readiness probe should be the one doing the heavy lifting.
livenessProbe:
  httpGet:
    path: /live
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 5 # Be very patient before killing the pod

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 2 # Quickly take the pod out of rotation if it's lagging

The "Startup" Savior
If your Node.js app takes a while to boot (e.g., connecting to a legacy database, pre-warming a cache), you might find your pods getting killed before they even finish starting.
Don't just increase initialDelaySeconds. If you set it to 60 seconds and your app starts in 5, you're wasting 55 seconds during a deployment. Use a startupProbe instead. It disables liveness and readiness checks until it passes.
startupProbe:
  httpGet:
    path: /live
    port: 3000
  failureThreshold: 30
  periodSeconds: 10 # Gives the app 300s to start

The Result
After decoupling the probes and monitoring event loop lag, the "flapping" pods stopped. When traffic spikes, the readinessProbe fails for a few seconds, the pod gets taken out of the LoadBalancer pool, the event loop catches up, and it re-enters the pool.
No restarts. No 502 errors for the users. Just a system that actually understands how Node.js breathes.
Pro-tip: If you’re using a framework like Fastify, they have built-in plugins (like under-pressure) that handle this event loop monitoring for you. If you're on Express, the manual lag check above is a lightweight lifesaver.
Stop letting Kubernetes bully your busy pods into an early grave. Just because a process is slow doesn't mean it's dead.


