loke.dev

The Compaction Stall: What Nobody Tells You About Linux Transparent Huge Pages

Most performance guides tell you to disable Transparent Huge Pages (THP) for databases, but few explain the underlying 'compaction stall' that actually triggers your p99 latency spikes.

8 min read

I remember debugging a Redis cluster a few years back that was behaving like a moody teenager. Most of the time, it was lightning fast—sub-millisecond responses, as expected. But every few minutes, like clockwork, a handful of requests would skyrocket to 500ms or even a full second. We checked the network; it was quiet. We checked the disks; they were idle. We even blamed the garbage collector, but Redis doesn't have one. It was maddening.

It turned out to be a "feature" called Transparent Huge Pages (THP). Specifically, it was the kernel trying to be helpful and failing spectacularly in a way that most monitoring tools don't surface.

If you've ever seen a database performance guide, you've seen the advice: "Disable THP." But very few of those guides explain *why* beyond a hand-wavy mention of latency. If you want to build truly high-performance systems, you need to understand the Compaction Stall.

The TLB Tax and the Promise of 2MB

To understand why the kernel stalls, we have to understand what it's trying to optimize.

Modern CPUs use a Translation Lookaside Buffer (TLB) to cache the mapping between virtual memory addresses (what your app sees) and physical memory addresses (the actual RAM chips). This cache is tiny. When your application has a large memory footprint—say, a 64GB Postgres buffer pool—the CPU spends a significant amount of time "walking the page tables" because the mapping it needs isn't in the TLB cache.

By default, Linux uses 4KB pages. If you have 64GB of RAM, that’s 16,777,216 pages the kernel has to manage. That’s a lot of overhead.

Huge Pages allow the kernel to use 2MB (or even 1GB) pages instead. With 2MB pages, that same 64GB of RAM only requires 32,768 pages. The TLB miss rate drops, and performance usually goes up by 10-15%.
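The arithmetic is worth doing once yourself. A quick shell sanity check (plain POSIX arithmetic, nothing system-specific):

```shell
# 64 GiB expressed in KiB
total_kib=$((64 * 1024 * 1024))

# How many pages does the kernel track at each page size?
echo "4KB pages: $((total_kib / 4))"        # 16,777,216 entries
echo "2MB pages: $((total_kib / 2048))"     # 32,768 entries
```

A 512x reduction in page-table entries is exactly why the TLB hit rate improves so much.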

Transparent Huge Pages (THP) was designed to make this "transparent." Instead of you manually managing hugetlbfs, the kernel would look at your memory usage and automatically promote contiguous 4KB pages into 2MB Huge Pages in the background. It sounds like a free lunch. It isn't.

When Fragments Become Walls

The problem is fragmentation. RAM isn't a neat, continuous block; it's a patchwork. As your system runs, it allocates and deallocates 4KB pages. Over time, your memory looks like a game of Tetris played by someone who isn't very good at it. You might have 10GB of free RAM, but you might not have a single contiguous 2MB block anywhere.

When a process wants a Huge Page but the kernel can't find a contiguous 2MB chunk of free memory, it triggers Memory Compaction.

This is where the nightmare begins. The kernel starts scanning your RAM, moving used 4KB pages around to "defragment" them and create a hole large enough for a 2MB page.

The Direct Compaction Stall

There are two ways the kernel does this.

1. khugepaged: A background thread that scans memory and collapses pages. This is usually fine.
2. Direct Compaction: This is the killer. When a process (like your database) needs to allocate memory and the THP defrag policy is set to always, the process stops. It hangs. It waits while the kernel synchronously tries to defragment memory to satisfy that 2MB allocation.

Your application thread isn't doing work. It isn't waiting on I/O. It’s sitting in D state (uninterruptible sleep) while the kernel shuffles memory blocks. This is the Compaction Stall.
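You can check for D-state tasks straight from /proc. A sketch: the state letter in /proc/&lt;pid&gt;/stat comes right after the parenthesised command name, so strip up to the closing paren first (the name itself may contain spaces):

```shell
# Print the scheduler state letter of a task (R, S, D, ...).
pid_state() {
    sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f1
}

# List any tasks currently stuck in D (uninterruptible sleep).
for dir in /proc/[0-9]*; do
    pid=${dir#/proc/}
    [ "$(pid_state "$pid" 2>/dev/null)" = "D" ] && echo "D-state: pid $pid"
done

# Sanity check: a process reading its own stat is Running.
sed 's/.*) //' /proc/self/stat | cut -d' ' -f1   # prints R
```

If your database threads show up here during a latency spike, with no I/O in flight, compaction is a prime suspect.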

Hunting the Stall: How to See It

You won't see this in top. You won't see it in iostat. To find it, you have to look at /proc/vmstat.

Try running this on a production server that’s been up for a while:

grep -E 'compact_stall|compact_fail|compact_success' /proc/vmstat

You'll get output like this:

compact_stall 4582
compact_fail 12093
compact_success 8321

- compact_stall: This is the number of times a process was made to wait for direct compaction. If this number is growing, your p99s are suffering.
- compact_fail: The kernel tried to defragment memory but gave up because it was too fragmented. This is double-bad: you paid the latency tax of the stall, but you didn't even get the Huge Page benefit.
- compact_success: The kernel successfully defragmented memory.

If you want to see this in real-time, you can use watch:

watch -n 1 "grep -E 'compact_stall|compact_fail' /proc/vmstat"

If these numbers jump when your application spikes in latency, you've found your smoking gun.
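The absolute totals only ever grow, so what you really want is the delta. A one-shot sketch: sample compact_stall, wait a second, sample again; a non-zero delta during a latency spike is the correlation you're looking for.

```shell
# How many direct-compaction stalls happened in the last second?
stalls() { awk '/^compact_stall /{print $2}' /proc/vmstat; }

before=$(stalls)
sleep 1
after=$(stalls)
echo "compact_stall delta over 1s: $((after - before))"
```

Run it in a loop (or under watch) alongside your latency dashboard and the spikes should line up.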

The "Always" Trap

Most Linux distributions (RHEL/CentOS used to be famous for this) ship with THP set to always. You can check your setting here:

cat /sys/kernel/mm/transparent_hugepage/enabled

If it says [always] madvise never, you are at risk.

The always setting means the kernel will attempt to use THP for every process. For a database that allocates huge chunks of memory, the kernel will aggressively try to find 2MB blocks. If your memory is fragmented, every new allocation becomes a potential landmine of direct compaction.

Monitoring with eBPF

If you want to be really sophisticated and see exactly how long these stalls are taking, you can use bcc (eBPF) tools. The compactsnoop tool is perfect for this.

# You might need to install bcc-tools first
sudo /usr/share/bcc/tools/compactsnoop

This will show you which process is stalling, how long it stalled for, and whether it succeeded. It’s the difference between "I think it's THP" and "I know it's THP."

Fixing the Mess: Tuning Strategies

So, do you just turn it off? Maybe. But there's a middle ground.

1. The Nuclear Option: never

If you are running a database like Redis, MongoDB, or Postgres, the standard advice is usually correct: disable it entirely.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Note: These changes aren't persistent. You'll need to add them to your grub config or a systemd unit.
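One common way to make them persistent is a small oneshot systemd unit (the unit name and path here are illustrative):

```ini
# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
```

Enable it with systemctl enable disable-thp.service. Alternatively, the enabled setting (though not defrag) can go on the kernel command line as transparent_hugepage=never via your grub config.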

2. The Sophisticated Option: madvise

This is often the best "Goldilocks" setting.

echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

When set to madvise, the kernel won't use Huge Pages for everything. It will only use them for memory regions where the application specifically asked for them using the madvise(MADV_HUGEPAGE) system call.

This gives the best of both worlds: your database (which probably doesn't handle THP well) gets 4KB pages, but other performance-critical applications that are THP-aware can still opt-in.
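You can verify who is actually using huge pages right now by reading AnonHugePages from /proc/&lt;pid&gt;/smaps_rollup (available on kernels since 4.14). A sketch, using the current shell's PID purely as an example:

```shell
# THP-backed anonymous memory for one process, in kB.
grep AnonHugePages "/proc/$$/smaps_rollup"

# Rank the biggest THP consumers system-wide.
for f in /proc/[0-9]*/smaps_rollup; do
    pid=${f%/smaps_rollup}; pid=${pid#/proc/}
    kb=$(awk '/^AnonHugePages:/{print $2}' "$f" 2>/dev/null)
    [ "${kb:-0}" -gt 0 ] && echo "$kb kB  pid $pid"
done | sort -rn | head
```

Under madvise, you should see non-zero numbers only for applications that opted in; your database should report 0 kB.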

3. Tuning the Defrag Policy

This is the setting that actually controls the "Stall."

cat /sys/kernel/mm/transparent_hugepage/defrag

The options usually are:
- always: If we don't have a huge page, stall the process until we make one. (The source of your pain).
- defer: Kick off background compaction (kcompactd) to defragment, but don't stall the current process. Give it 4KB pages for now.
- defer+madvise: Stall only for regions that explicitly requested huge pages via madvise(MADV_HUGEPAGE); defer for everything else.
- madvise: Stall only for madvise regions; don't even defer for the rest.
- never: Never defragment.

For most production workloads, defer or defer+madvise is significantly safer than always.

A Practical Example: The Latency Test

If you want to see the impact of THP fragmentation yourself, you can use a small C program to allocate a lot of small chunks, free every other one to create holes, and then try to allocate a large block. Or, more simply, use the transhuge-stress tool if your distro provides it.

But you can simulate the "pressure" using dd and some memory-heavy tasks while watching vmstat.

Here is a simple script to check your current THP stats and recommend an action:

#!/bin/bash

THP_ENABLED=$(grep -o '\[.*\]' /sys/kernel/mm/transparent_hugepage/enabled | tr -d '[]')
THP_DEFRAG=$(grep -o '\[.*\]' /sys/kernel/mm/transparent_hugepage/defrag | tr -d '[]')

echo "Current THP Status: $THP_ENABLED"
echo "Current Defrag Policy: $THP_DEFRAG"

STALLS=$(grep compact_stall /proc/vmstat | awk '{print $2}')

if [ "$STALLS" -gt 0 ]; then
    echo "Warning: Your system has experienced $STALLS compaction stalls."
    if [ "$THP_DEFRAG" == "always" ]; then
        echo "Recommendation: Change defrag policy to 'defer' or 'madvise' to reduce p99 latency."
    fi
else
    echo "No compaction stalls detected yet. Your memory might not be fragmented."
fi

Why isn't this fixed?

You might wonder why always remains such a common default if it causes such issues. (Strictly speaking, it's a kernel build-time choice, and many distributions simply ship with it.) The reason is that for many generic workloads—compiling code, video encoding, or scientific computing—the performance gain from 2MB pages outweighs the occasional stall.

But databases are different. They manage their own buffers. They have specific access patterns. They are extremely sensitive to the "long tail" of latency. A 10ms stall in the kernel is an eternity for a database that promises 500μs responses.

The "Cost" of 4KB Pages

Is there a downside to disabling THP? Yes. You will see a slight increase in CPU usage (usually 1-5%) because of the increased TLB misses.

In my experience, almost every SRE would trade a 2% increase in average CPU usage for a 50% reduction in p99 latency. Predictability is the currency of production systems.

Summary: A Checklist for Your Servers

If you are managing high-performance Linux servers, don't just blindly follow a "disable THP" guide. Understand the state of your system:

1. Check the current state: cat /sys/kernel/mm/transparent_hugepage/enabled
2. Audit your history: Check /proc/vmstat for compact_stall. If it's in the thousands, you have a problem.
3. Monitor the tail: Use eBPF compactsnoop to correlate stalls with application latency spikes.
4. Tune specifically:
- For dedicated DB servers: enabled=never.
- For mixed workloads: enabled=madvise and defrag=defer.
5. Make it persistent: Ensure your sysfs changes survive a reboot.

The "Compaction Stall" is one of those classic Linux internals gotchas. It’s a mechanism designed for efficiency that, under the wrong circumstances, becomes a bottleneck. Now that you know how to see it, you don't have to guess why your p99s are spiking anymore.