
Misaligned by a Millisecond: Solving the Node.js Keep-Alive Race Condition

Solve the mystery of intermittent 502 errors by synchronizing your server’s idle timeout with the aggressive connection closing of cloud load balancers.


You’ve checked your logs, and everything looks pristine. CPU usage is low, memory is stable, and your database isn't breaking a sweat, yet your monitoring tool keeps firing off intermittent 502 "Bad Gateway" alerts like a glitchy smoke detector.

These "phantom" errors are the bane of any SRE’s existence because they are notoriously hard to reproduce in a staging environment. If you’re running Node.js behind a cloud load balancer (like AWS ALB or GCP GCLB), there’s a high probability you’re falling victim to a race condition that happens in the blink of an eye—or more accurately, in the span of a single millisecond.

The Anatomy of a Stealthy 502

Here is the scenario: To save time and resources, your Load Balancer (LB) keeps "keep-alive" connections open to your Node.js server. This allows it to reuse the same TCP connection for multiple requests instead of doing the expensive handshake every single time.

Everything works great until the connection sits idle.

Your Load Balancer has an idle timeout (on AWS, the default is 60 seconds). Your Node.js server also has an idle timeout (keepAliveTimeout). The disaster happens when the Node.js server decides to close the connection at the exact same moment the Load Balancer decides to send a new request through it.

The Load Balancer sends the request, but before it reaches the app, Node.js sends a FIN packet to close the socket. The Load Balancer gets hit with a connection reset, panics, and serves the user a 502.

The Default Node.js Trap

Node.js ships with a default keepAliveTimeout of 5 seconds (5000 ms), which is far lower than the 60-second idle timeout many cloud load balancers default to.

If your Load Balancer is waiting 60 seconds but your Node.js server closes the connection after 5 seconds, you are essentially leaving a 55-second window where the Load Balancer might try to use a connection that Node.js is about to kill.
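The relationship is simple arithmetic, so it can live in one place. Here's a minimal sketch; the helper name `safeServerTimeouts` and the margin values are illustrative, not a standard API — the only rule that matters is that the server's timeout must exceed the load balancer's:

```javascript
// Derive LB-safe server timeouts from a known load-balancer idle timeout.
function safeServerTimeouts(lbIdleTimeoutMs, marginMs = 5000) {
  const keepAliveTimeout = lbIdleTimeoutMs + marginMs; // outlast the LB
  const headersTimeout = keepAliveTimeout + 1000;      // must exceed keepAliveTimeout
  return { keepAliveTimeout, headersTimeout };
}

// AWS ALB defaults to a 60-second idle timeout:
const timeouts = safeServerTimeouts(60000);
console.log(timeouts); // { keepAliveTimeout: 65000, headersTimeout: 66000 }
```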

Let's look at a standard, vulnerable HTTP server:

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello World\n');
});

// By default, this might be 5000ms (5 seconds)
// This is a recipe for 502s if your LB timeout is 60s
server.listen(3000);

The Fix: Outlasting the Load Balancer

The rule of thumb is simple: Your Node.js server must wait longer than your Load Balancer before closing an idle connection.

If the Load Balancer is the one to close the connection first, it knows the connection is dead and won't try to send data through it. If Node.js closes it first without the LB knowing, you get the race condition.

Here is how you actually configure the server to be "LB-safe":

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Fixed the race condition!\n');
});

// 1. Set keepAliveTimeout to be higher than your LB timeout.
// If AWS ALB is 60 seconds, set this to 65 seconds.
server.keepAliveTimeout = 65000; 

// 2. Set headersTimeout higher than keepAliveTimeout, as the
// Node.js docs recommend.
server.headersTimeout = 66000;

server.listen(3000, () => {
  console.log('Server running on port 3000');
});

Why two different timeouts?

You might wonder why we’re touching headersTimeout. Node.js uses this to limit how long it waits to receive all the HTTP headers from the client.

A weird quirk in the Node.js implementation (specifically in the _http_server.js core) requires headersTimeout to be larger than keepAliveTimeout to ensure that the timer for the next request’s headers is correctly reset after a keep-alive response is sent. If you don't do this, you might find your connections being dropped even more aggressively.

Real-world Framework Examples

If you are using Express, the logic remains the same because app.listen() returns the underlying HTTP server object.

Express.js Implementation

const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('Stable Connection'));

const server = app.listen(3000);

// Fix the race condition
server.keepAliveTimeout = 65000;
server.headersTimeout = 66000;

Fastify Implementation

Fastify handles things a bit differently since it encapsulates server creation, but you can reach the raw Node server through the serverFactory option (recent versions also expose a keepAliveTimeout option you can pass directly).

const fastify = require('fastify')({
  // Build the underlying Node server ourselves so we can tune its timeouts
  serverFactory: (handler) => {
    const server = require('http').createServer(handler);
    server.keepAliveTimeout = 65000;
    server.headersTimeout = 66000;
    return server;
  }
});

fastify.listen({ port: 3000 });

How do you know if this is your problem?

If you aren't sure whether your 502s are caused by this millisecond misalignment, check your Load Balancer's access logs (for AWS ALB, the access logs it ships to S3).

Look for:
1. Status codes: An elb_status_code of 502 with a target_status_code of - (the backend never produced a response).
2. Target processing time: A target_processing_time of -1, meaning the connection to the target failed or was reset.
3. Frequency: Errors that occur primarily during periods of low-to-medium traffic (high traffic keeps connections active, so they don't hit the idle timeout as often).
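A small script can scan log lines for that signature. This sketch assumes the standard space-delimited ALB access-log layout (target_processing_time at field index 6, elb_status_code at index 8, target_status_code at index 9) — verify those positions against your own log files before relying on it:

```javascript
// Flag ALB access-log lines matching the keep-alive race signature:
// a 502 from the LB, no response from the target, and a target
// processing time of -1 (the backend connection failed).
function looksLikeKeepAliveRace(line) {
  const fields = line.split(' ');
  const targetTime = parseFloat(fields[6]);
  const elbStatus = fields[8];
  const targetStatus = fields[9];
  return elbStatus === '502' && targetStatus === '-' && targetTime === -1;
}

// Synthetic example line (truncated to the fields we inspect):
const line =
  'http 2024-01-01T00:00:00Z my-alb 10.0.0.1:4000 10.0.0.2:3000 ' +
  '0.001 -1 -1 502 - 0 0 "GET / HTTP/1.1"';
console.log(looksLikeKeepAliveRace(line)); // true
```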

Final Thoughts

The cloud is full of "invisible" defaults that work 99% of the time but fail spectacularly when they clash. While it feels trivial to bump a timeout from 5 seconds to 65 seconds, it’s often the difference between a "flaky" infrastructure and a rock-solid one.

If you’re running on AWS, 65 seconds is your magic number. It gives the Load Balancer plenty of time to be the one to hang up the phone first, ensuring your Node.js process isn't pulling the rug out from under your traffic.