loke.dev

Why Does Your High-Performance Cache Still Trigger a Production Meltdown?

Standard TTLs are a trap: learn how the 'Thundering Herd' problem turns a simple cache miss into a full-scale infrastructure collapse.


The dashboard was a sea of green right up until the clock struck midnight. Suddenly, the database CPU spiked to 98%, the connection pool was exhausted, and API latency shot into the multi-second range—all because a single, high-traffic cache key reached its expiration time.

We often treat caching as a silver bullet for performance. We think: "If I put this data in Redis with a 10-minute TTL, my database is safe." For 9 minutes and 59 seconds, you’re right. But that one second where the cache expires can be the difference between a smooth-running system and a total production meltdown.

This isn't a failure of the cache itself. It’s a failure of the strategy. In high-concurrency environments, the transition from "cached" to "expired" creates a vacuum that the Thundering Herd is all too happy to fill.

The Anatomy of a Cache Stampede

When a frequently accessed key expires, the very next request sees a cache miss. In a naive implementation, that request goes to the database to fetch the fresh data. But in a system processing 5,000 requests per second, that first request hasn't finished its database trip before 100 other requests also realize the cache is empty.

Now, instead of one database query, you have 100. Or 1,000. Each one of these requests is trying to perform the exact same expensive calculation or query. This is a Cache Stampede.

The database, suddenly hit with a massive spike in concurrent, heavy queries, slows down. As it slows down, those queries take longer to finish. Because they take longer to finish, even *more* incoming requests miss the cache and pile onto the database. It’s a classic positive feedback loop that ends in a 503 Service Unavailable.

Here is what the "naive" (and dangerous) code usually looks like:

// The "Meltdown" Pattern
async function getProductData(productId) {
    let data = await cache.get(`product:${productId}`);
    
    if (!data) {
        // Here is the gap. 
        // 1,000 concurrent users can reach this line at the same time.
        data = await db.query("SELECT * FROM products WHERE id = ?", [productId]);
        await cache.set(`product:${productId}`, data, { ttl: 3600 });
    }
    
    return data;
}

Strategy 1: Promises as a Shield (SingleFlight)

If you’re running a language with a shared memory space (like Go or Node.js), you can solve a lot of this locally. The goal is simple: if one request is already fetching the data, make everyone else wait for that specific request to finish.

In Go, this is famously handled by the singleflight group. It’s one of those tools that feels like magic the first time you use it.

import "golang.org/x/sync/singleflight"

var g singleflight.Group

func getProductData(productId string) (interface{}, error) {
    // Check cache first
    data, err := cache.Get(productId)
    if err == nil {
        return data, nil
    }

    // Wrap the DB call in a singleflight 'Do'.
    // The key ensures only one execution happens for this productId.
    v, err, _ := g.Do(productId, func() (interface{}, error) {
        // Only one goroutine actually enters here
        result, err := db.FetchProduct(productId)
        if err != nil {
            return nil, err
        }
        // Cache inside the callback so the value is stored exactly once,
        // by the goroutine that actually did the work
        cache.Set(productId, result)
        return result, nil
    })

    if err != nil {
        return nil, err
    }

    return v, nil
}

By using a "key-based lock" on the application side, you consolidate those 1,000 queries into exactly one. The other 999 goroutines block until the first one returns, then they all receive the same result.

The Catch: This works beautifully for a single application instance. But if you have 50 instances of your service running behind a load balancer, you still might hit the database with 50 concurrent queries. That’s better than 5,000, but in some systems, even 50 heavy queries at once is enough to cause a stutter.
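In Node.js, the same trick falls out of memoizing the in-flight promise per key. Here is a minimal, illustrative equivalent in Python asyncio—the `SingleFlight` class and `fetch_product` coroutine are hypothetical stand-ins, not a real library:

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Future

    async def do(self, key, fn):
        if key in self._inflight:
            # Someone is already fetching this key: wait for their result
            return await self._inflight[key]

        future = asyncio.get_running_loop().create_future()
        self._inflight[key] = future
        try:
            result = await fn()
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            raise
        finally:
            # Drop the entry so the *next* miss triggers a fresh fetch
            del self._inflight[key]

calls = 0

async def fetch_product():
    global calls
    calls += 1  # count real "database" trips
    await asyncio.sleep(0.05)
    return {"id": 42}

async def main():
    sf = SingleFlight()
    # 100 concurrent requests for the same key collapse into one fetch
    return await asyncio.gather(
        *[sf.do("product:42", fetch_product) for _ in range(100)]
    )

results = asyncio.run(main())
print(calls, len(results))
```

The shared future is the whole mechanism: the first caller creates it, everyone else awaits it, and errors propagate to all waiters instead of triggering 99 retries.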

Strategy 2: Stale-While-Revalidate (The Soft TTL)

The most robust way to prevent a stampede is to never let the cache truly be "empty." Instead of a hard expiration where the data vanishes, we use a "Soft TTL" and a "Hard TTL."

When the Soft TTL expires, the system returns the stale data to the user immediately but triggers a background process to refresh the cache.

Think about it: for most applications, serving data that is 61 seconds old instead of 60 seconds old is perfectly acceptable if it prevents a site-wide outage.

interface CacheEntry {
    value: any;
    expiresAt: number; // The "Soft" TTL
}

async function getProductData(productId: string) {
    const entry: CacheEntry | null = await cache.get(`product:${productId}`);
    const now = Date.now();

    if (!entry) {
        // Cold start - we have to hit the DB
        return await refreshAndCache(productId);
    }

    if (now > entry.expiresAt) {
        // Data is stale! 
        // 1. Fire and forget the refresh logic
        // 2. Return stale data immediately
        refreshAndCache(productId).catch(err => console.error(err));
        return entry.value; 
    }

    return entry.value;
}

async function refreshAndCache(productId: string) {
    // Use a distributed lock (like Redlock) to ensure only one
    // instance is refreshing globally
    const lock = await distributedLock.acquire(`lock:${productId}`);
    if (!lock) {
        // Another instance is already refreshing. On a cold start we
        // still need data, so read through to the DB directly.
        return await db.query(...);
    }
    try {
        const freshData = await db.query(...);
        await cache.set(`product:${productId}`, {
            value: freshData,
            expiresAt: Date.now() + 60000 // 1 minute soft TTL
        }, { ttl: 86400000 }); // 24 hour hard TTL (ms)
        return freshData;
    } finally {
        await lock.release();
    }
}

With this approach, the user never waits for the database. The "Thundering Herd" is replaced by a single, calm background worker.

Strategy 3: Probabilistic Early Recomputation (PER)

This one sounds like it belongs in a research paper, but it’s surprisingly elegant for distributed systems where you don't want to manage complex locking logic.

The idea is to make the cache expiration "fuzzy." Instead of every request seeing the expiration at exactly the same time, each request performs a small calculation. As the expiration time approaches, the *probability* that a request will decide to refresh the cache increases.

If you have 100 requests per second, and the key is 5 seconds from expiring, maybe only one of those requests will "randomly" decide it's time to refresh.

The formula (often called X-Fetch) looks like this:

currentTime - (delta * beta * log(random())) > expiry

- delta: The time it took to recompute the data last time.
- beta: A configurable constant (usually 1.0) to make it more or less aggressive.
- random(): A uniform random float between 0 and 1.

Because log(random()) is always negative, the left side is effectively "now plus a random head start"—the closer you get to expiry (and the more expensive the recomputation), the more likely the check is to fire early.

import math
import random
import time

def get_with_per(key, beta=1.0):
    entry = cache.get(key)
    if not entry:
        return recompute_and_store(key)

    # entry['expiry'] is the timestamp when it should expire
    # entry['delta'] is the time it took to compute the value last time
    
    # The "magic" probabilistic check
    if (time.time() - (entry['delta'] * beta * math.log(random.random()))) > entry['expiry']:
        return recompute_and_store(key)
    
    return entry['value']

By introducing entropy, you ensure that recomputation is distributed over time. The "herd" doesn't stampede because they aren't all hearing the same starting pistol.

Strategy 4: The Jitter Factor

If you aren't ready to implement X-Fetch, at the very least, you should be using Jitter.

A common cause of production meltdowns is "The Midnight Alignment." This happens when you have a cron job or a bulk process that loads 100,000 items into the cache at once, all with a TTL of exactly one hour. At 1:00 AM, all 100,000 keys expire at the exact same millisecond.

The solution is to add a bit of random noise to your TTLs.

# Don't do this
ttl = 3600 

# Do this
ttl = 3600 + random.randint(0, 300) # Adds 0-5 minutes of "jitter"

It’s a low-tech solution that solves a high-concurrency problem. By spreading the expirations out, you transform a massive "spike" of database demand into a manageable "hump."

Why Your Connection Pool Is Part of the Problem

When we talk about cache stampedes, we focus on the cache and the database. But the Connection Pool is usually where the system actually breaks.

When a stampede happens, your application tries to check out far more database connections than it has. If your pool maxes out at 50 connections and 1,000 requests arrive at once, the other 950 sit in a queue inside your application's memory.

This consumes:
1. File Descriptors: Every connection is a socket.
2. Memory: Every queued request holds onto its stack and scope.
3. Event Loop Time: In Node.js or Python (asyncio), managing thousands of pending callbacks can lead to event loop lag.

If you don't have a timeout on how long a request can wait for a connection from the pool, your application will simply hang. It won't crash; it will just stop responding.

Pro-tip: Always set a connection_timeout or acquire_timeout. It is better to fail fast and return a 503 to 10% of users than to let the whole system grind to a halt and fail for 100% of users.

The "Negative" Cache (Caching the Void)

There is a specific type of meltdown that occurs when users (or attackers) request data that *doesn't exist*.

Imagine a request for /api/products/999999. Your code checks Redis. It's a miss. It checks the DB. It's a miss. Because there's no data, you don't store anything in Redis.

The next second, 1,000 requests come in for that same non-existent ID. Every single one of them hits the database because you aren't caching the *absence* of data.

This is solved by Negative Caching:

// Most cache clients return null on a miss, so storing a literal null
// is indistinguishable from "not cached." Use a sentinel instead.
const NOT_FOUND = "__NULL__";

async function getProduct(id) {
    const cached = await cache.get(`id:${id}`);
    if (cached === NOT_FOUND) return null; // We explicitly cached that this doesn't exist
    if (cached != null) return cached;     // A real hit

    const dbResult = await db.fetch(id);
    if (!dbResult) {
        // Cache the sentinel for a short time (e.g., 5 minutes)
        await cache.set(`id:${id}`, NOT_FOUND, { ttl: 300 });
        return null;
    }

    await cache.set(`id:${id}`, dbResult);
    return dbResult;
}

In high-scale systems, caching the "Not Found" result is just as important as caching the "Found" result.

Layers are Your Friend

Finally, consider that Redis (or Memcached) shouldn't be your only layer. For the most "radioactive" keys—those that would surely take down the system if they expired—use In-Process Caching (L1).

Keep the most popular 1,000 keys in your application's memory (using something like lru-cache in Node or groupcache in Go).

- L1 (Local Memory): 0ms latency, no network call.
- L2 (Redis): 1-2ms latency, network call.
- L3 (Database): 100ms+ latency, disk I/O.

When you have multiple layers, the "herd" has to break through two different gates before it can hit your database. You can sync the L1 cache via Pub/Sub or just let it have a very short, jittered TTL.

Summary: How to Stop the Meltdown

If you want to move beyond "hope-based caching," you need to acknowledge that cache expiration is an event that must be managed, not just a timer that runs out.

1. Use SingleFlight/Locking to ensure only one worker fetches data per key.
2. Implement Soft TTLs to serve stale data while refreshing in the background.
3. Add Jitter to your TTLs to prevent synchronized expirations.
4. Cache "Null" results to prevent your DB from being hammered by invalid lookups.
5. Set strict timeouts on your connection pools so a cache miss doesn't lead to a memory leak.

A high-performance cache is only as good as its failure mode. If your system relies on the cache being 100% full at all times to survive, you haven't built a cache—you've built a time bomb. Manage the misses, and the hits will take care of themselves.