
The Cadence of the Dirty Page

The Linux kernel’s strategy for flushing memory to disk is often the hidden culprit behind mystery I/O latency spikes in high-throughput applications.


You’ve likely experienced that agonizing moment where a high-performance service suddenly flatlines for three seconds, even though your CPU usage is low and your NVMe drives are capable of 5GB/s. It feels like the system just decided to take a nap, and usually, the culprit is the Linux kernel finally deciding it’s time to pay its debts.

In the world of Linux I/O, we live on borrowed time. When your application calls write(), the kernel doesn't immediately spin up the disk platters or signal the flash controller. Instead, it copies that data into the page cache—a slice of RAM—marks the page as "dirty," and tells your application, "All set, I've got it from here." This lie is what makes modern computing feel fast. But like any debt, these dirty pages eventually have to be settled. The "cadence" of that settlement—how often and how aggressively the kernel flushes those pages to disk—determines whether your application runs like a precision watch or a stuttering engine.
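You can feel this lie directly by timing buffered write() calls against an explicit fsync(). Here is a minimal sketch (the filename is arbitrary, and the absolute numbers depend entirely on your hardware and filesystem):

```python
import os
import time

# Write 64MB into the page cache, then force it to disk with fsync.
# The buffered writes return as soon as the kernel has copied the data
# into RAM; fsync blocks until the dirty pages reach the backing store.
CHUNK = b"A" * (1024 * 1024)  # 1MB

fd = os.open("pagecache_demo.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

t0 = time.monotonic()
for _ in range(64):
    os.write(fd, CHUNK)
buffered = time.monotonic() - t0

t0 = time.monotonic()
os.fsync(fd)  # settle the debt: flush the dirty pages now
synced = time.monotonic() - t0
os.close(fd)

print(f"64 buffered writes: {buffered * 1000:.1f} ms")
print(f"single fsync:       {synced * 1000:.1f} ms")
os.unlink("pagecache_demo.bin")
```

On most machines the buffered loop finishes in a handful of milliseconds while the fsync absorbs nearly all of the real disk time.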

The Architecture of the Lie

To understand the latency spikes, we have to look at the Page Cache. Every time you read from or write to a file, the kernel caches that data in memory. This is why the second time you run grep on a large file, it finishes instantly.

When you write, the data stays in RAM. At this point, the page is dirty. It exists in memory but not on the "backing store" (the disk). The kernel is now in a race against power failure. If the system crashes now, that data is gone. To prevent catastrophe, the kernel employs a set of background threads (historically pdflush, now per-device writeback threads) to periodically move this data to permanent storage.

The problem isn't that the flushing happens; the problem is the *thresholds* that trigger it.

The Cliff: Background vs. Hard Throttling

There are two primary numbers that govern this behavior, and if you haven't tuned them, your high-throughput app is likely dancing on the edge of a cliff.

1. `vm.dirty_background_ratio`: This is the "gentle nudge." When dirty memory hits this percentage of total system memory, the kernel starts waking up flusher threads to write data to disk in the background. Your application usually won't feel this.
2. `vm.dirty_ratio`: This is the "emergency brake." When dirty memory hits this percentage, the kernel decides that things have gotten out of hand. Background writeback alone is no longer trusted to keep up: the process doing the writing is throttled inside the write path (in `balance_dirty_pages()`) and blocked until enough dirty pages have been flushed.

This is where the "mystery" latency comes from. If your application is writing logs at 500MB/s and you hit the dirty_ratio, your write() call—which usually takes microseconds—will suddenly block for seconds while the kernel forces your thread to wait for the physical hardware to catch up.
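You can check which regime your box is running in by reading these tunables straight from /proc/sys/vm. A quick sketch (note that the *_bytes counterparts read 0 when the ratio form is active, and vice versa):

```python
# Read the dirty-page tunables directly from /proc/sys/vm.
def read_vm_tunable(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read().strip())

for name in ("dirty_background_ratio", "dirty_ratio",
             "dirty_background_bytes", "dirty_bytes"):
    print(f"{name:>24} = {read_vm_tunable(name)}")
```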

Observing the Pressure

Before we tune anything, we need to see the cadence in action. You can see your current dirty page status by looking at /proc/meminfo.

watch -n 0.1 "grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo"

If you run a heavy write operation, you'll see Dirty climb. If it hits your dirty_ratio, you'll see Writeback spike as the kernel desperately tries to clear the backlog.

Let's look at a more programmatic way to monitor this. Here is a simple Python script that tracks the rate of "dirtying" vs "clearing" to help you visualize the rhythm of your specific workload.

import time

def get_dirty_stats():
    """Return Dirty/Writeback/Cached values (in kB) from /proc/meminfo."""
    stats = {}
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            # startswith avoids accidental matches such as SwapCached
            if line.startswith(('Dirty:', 'Writeback:', 'Cached:')):
                key, value = line.split()[:2]
                stats[key.rstrip(':')] = int(value)  # value is in kB
    return stats

last_dirty = get_dirty_stats()['Dirty']
print(f"{'Time':<10} {'Dirty (MB)':<12} {'Change (MB)':<12} {'Writeback (MB)':<14}")

while True:
    curr = get_dirty_stats()
    diff = (curr['Dirty'] - last_dirty) / 1024
    print(f"{time.strftime('%H:%M:%S'):<10} {curr['Dirty']/1024:<12.2f} "
          f"{diff:<12.2f} {curr['Writeback']/1024:<14.2f}")
    last_dirty = curr['Dirty']
    time.sleep(1)

The Danger of Large RAM

On a 4GB Raspberry Pi, the default vm.dirty_ratio of 20% is about 800MB. That’s manageable. But on a modern production server with 512GB of RAM, 20% is over 100GB.

Imagine your application fills up 100GB of dirty pages. The kernel finally snaps and says, "That's enough." It then forces your application to wait while it flushes 100GB to disk. Even with a fast NVMe drive writing at 2GB/s, your application could be blocked for 50 seconds.

This is why, for high-throughput systems, using percentages is often a mistake. You want to use the "bytes" version of these tunables.
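That back-of-envelope math generalizes. A tiny helper makes it easy to estimate the worst-case stall for your own hardware (a sketch; real flush times also depend on filesystem journaling and queue depth):

```python
def worst_case_stall_seconds(ram_gb, dirty_ratio_pct, disk_gbps):
    """Rough upper bound on how long a forced flush can block a writer:
    the maximum dirty backlog divided by sustained disk write bandwidth."""
    dirty_gb = ram_gb * dirty_ratio_pct / 100
    return dirty_gb / disk_gbps

# The two examples from the text:
print(worst_case_stall_seconds(4, 20, 2))    # Raspberry Pi: 0.4s
print(worst_case_stall_seconds(512, 20, 2))  # 512GB server: 51.2s
```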

Switching to Absolute Bytes

Instead of percentages, we can set hard limits in bytes. This ensures that the amount of unwritten data never exceeds a volume that the hardware can't handle in a reasonable timeframe (e.g., 500ms).

# Set background flushing to start at 256MB
sysctl -w vm.dirty_background_bytes=268435456

# Force synchronous flushing at 512MB
sysctl -w vm.dirty_bytes=536870912

By narrowing the gap between "background" and "forced," you create a more frequent, "faster" cadence. The disk stays busier more often, but the spikes in latency are shaved off.
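One detail worth knowing: the bytes and ratio forms are mutually exclusive counterparts. Writing one resets the other to 0, so only one is ever in effect. A quick sketch to confirm which form is active on your system:

```python
# dirty_bytes and dirty_ratio are counterparts: setting one zeroes the
# other, so exactly one form is in effect at a time.
def active_dirty_limit():
    with open("/proc/sys/vm/dirty_bytes") as f:
        dbytes = int(f.read())
    with open("/proc/sys/vm/dirty_ratio") as f:
        dratio = int(f.read())
    if dbytes:
        return f"dirty_bytes = {dbytes} ({dbytes / 2**20:.0f} MB)"
    return f"dirty_ratio = {dratio}% of reclaimable memory"

print(active_dirty_limit())
```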

The Role of Expiry and Writeback Intervals

The thresholds aren't the only thing controlling the cadence. There's also a "timer" component. Even if you haven't hit the dirty_background_ratio, the kernel won't let data sit in RAM forever.

* `vm.dirty_expire_centisecs` (Default: 3000, or 30 seconds): This defines how old data can be before it *must* be written out.
* `vm.dirty_writeback_centisecs` (Default: 500, or 5 seconds): This defines how often the flusher threads wake up to check if there is work to do.

If you have a bursty application, the default 30-second expiry can be a killer. You might have a massive burst of writes that stays in RAM for 29 seconds, and then suddenly the kernel decides all of it is "expired" at once.

To create a smoother flow, you can decrease these intervals:

# Wake up the flusher every 1 second
sysctl -w vm.dirty_writeback_centisecs=100

# Data can only stay dirty for 5 seconds
sysctl -w vm.dirty_expire_centisecs=500

Practical Example: Simulating a Stall

Let’s look at a small C program that writes a lot of data quickly. We can use this to see how the kernel reacts to being overwhelmed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define CHUNK_SIZE (1024 * 1024)              // 1MB
#define TOTAL_SIZE (1024LL * 1024LL * 2048LL) // 2GB

int main() {
    int fd = open("test_data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) { perror("malloc"); close(fd); return 1; }
    memset(buf, 'A', CHUNK_SIZE);

    struct timespec start, end;
    
    for (long long written = 0; written < TOTAL_SIZE; written += CHUNK_SIZE) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        
        if (write(fd, buf, CHUNK_SIZE) != CHUNK_SIZE) {
            perror("write");
            break;
        }

        clock_gettime(CLOCK_MONOTONIC, &end);
        double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        
        // If the write takes more than 10ms, log it
        if (elapsed > 0.01) {
            printf("Latency spike: %.4f seconds at %lld MB\n", elapsed, written / (1024 * 1024));
        }
    }

    close(fd);
    free(buf);
    return 0;
}

If you run this on a system with a very large dirty_ratio, the first 1GB of writes will likely show near-zero latency. Then, as soon as the threshold is hit, you'll see the latency numbers jump from 0.0001s to 0.5s or higher. This is the kernel reclaiming control.

When To Be Opinionated: SSDs vs HDDs

The Linux defaults were largely designed in an era when spinning rust was the norm. On an HDD, you want to buffer as much as possible because seeking is expensive. You want to write in huge, contiguous chunks.

But we live in the era of NVMe. On modern SSDs, random I/O is incredibly fast, and the penalty for frequent, smaller writes is much lower than the penalty of a blocked application.

If you are running on high-end flash storage:
1. Lower the thresholds significantly. Don't let 50GB of dirty data accumulate. Keep it under 1GB.
2. Flush more frequently. Set dirty_writeback_centisecs low. You want the disk constantly "ticking" rather than sleeping and then screaming.

The Gotcha: I/O Schedulers

While dirty pages are the software-side of the cadence, the I/O scheduler is the hardware-side gatekeeper. If you are using mq-deadline or kyber, they have their own ideas about how to prioritize writes.

If you find that even after tuning vm.dirty_bytes you still see spikes, look at /sys/block/<device>/queue/scheduler. For NVMe, none is often the best choice, as the hardware handles its own internal parallelism much better than the kernel can. For SATA SSDs, mq-deadline is usually fine, but check its writes_starved tunable (under /sys/block/<device>/queue/iosched/) to make sure reads and writes are being balanced the way you expect.
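To audit this quickly across devices, you can read the scheduler files from sysfs. A small sketch (the bracketed name is the active scheduler; the result may be empty inside a container that exposes no block devices):

```python
import glob

# List the active I/O scheduler for each block device. The scheduler
# file shows all available options with the active one in brackets,
# e.g. "[none] mq-deadline kyber".
def schedulers():
    result = {}
    for path in glob.glob("/sys/block/*/queue/scheduler"):
        device = path.split("/")[3]
        with open(path) as f:
            result[device] = f.read().strip()
    return result

for dev, sched in sorted(schedulers().items()):
    print(f"{dev:>12}: {sched}")
```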

Checking for "Stuck" Threads

Sometimes, the cadence isn't just slow; it's broken. If a process is stuck in "Uninterruptible Sleep" (the D state in top), it's often waiting on this exact mechanism.

You can use stack traces from the kernel to see exactly where the flusher is hung:

# Find processes in 'D' (uninterruptible sleep) state
ps -eo pid,stat,comm | awk '$2 ~ /^D/'

# Peek at a task's kernel stack (usually requires root)
cat /proc/<PID>/stack

If you see balance_dirty_pages in that stack, you have found the smoking gun. The kernel is actively throttling that process because there are too many dirty pages in the system.
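You can automate that hunt with a short script that walks /proc for D-state tasks and, where permissions allow, checks their kernel stacks. A sketch (reading /proc/<PID>/stack usually requires root, so the check degrades gracefully):

```python
import glob

# Scan /proc for tasks in uninterruptible sleep (state 'D') and, where
# permitted, report whether they are parked in balance_dirty_pages.
def throttled_tasks():
    hits = []
    for stat_path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat_path) as f:
                # /proc/<pid>/stat is "pid (comm) state ..."; split after
                # the closing paren so spaces in comm don't break parsing
                fields = f.read().rsplit(")", 1)[1].split()
            if fields[0] != "D":
                continue
            pid = stat_path.split("/")[2]
            try:
                with open(f"/proc/{pid}/stack") as f:
                    stack = f.read()
            except OSError:
                stack = ""  # not root, or task vanished
            hits.append((pid, "balance_dirty_pages" in stack))
        except (OSError, IndexError):
            continue  # task exited mid-scan
    return hits

for pid, throttled in throttled_tasks():
    print(f"PID {pid}: D state, balance_dirty_pages={throttled}")
```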

Tuning for the "Noisy Neighbor"

One of the most frustrating scenarios is when *Application A* is doing heavy I/O, and it causes *Application B* (your latency-sensitive API) to stall. Because the Page Cache is global, one process hitting the dirty_ratio can impact the entire system.

In cgroups v2, the io and memory controllers can cooperate to throttle writeback per cgroup, but this needs filesystem support and careful configuration, so in practice dirty page management often remains a global headache. The best defense here is to use O_DIRECT for heavy logging or data ingestion if you can. O_DIRECT bypasses the page cache entirely. It’s slower for the writing process (since it’s synchronous), but it prevents that process from "polluting" the global page cache and triggering a flush that stalls your other apps.
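O_DIRECT comes with alignment rules: the buffer address, transfer size, and file offset generally must be aligned to the device's logical block size. Here is a hedged Python sketch; the 4096-byte alignment and the direct_demo.bin filename are assumptions, and some filesystems (tmpfs, for example) reject O_DIRECT outright:

```python
import mmap
import os

# mmap gives us a page-aligned buffer for free, which satisfies the
# typical 4096-byte alignment requirement of O_DIRECT.
ALIGN = 4096

buf = mmap.mmap(-1, ALIGN)   # anonymous, page-aligned buffer
buf.write(b"A" * ALIGN)

fd = -1
try:
    fd = os.open("direct_demo.bin",
                 os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
    written = os.write(fd, buf)  # goes to the device, not the page cache
    print(f"wrote {written} bytes with O_DIRECT")
except OSError as e:
    # Filesystems such as tmpfs do not support O_DIRECT at all.
    print(f"O_DIRECT not supported here: {e}")
finally:
    if fd >= 0:
        os.close(fd)
    buf.close()
    if os.path.exists("direct_demo.bin"):
        os.unlink("direct_demo.bin")
```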

Summary of the "Smooth" Config

If you're looking for a starting point to eliminate the "big flush" spikes on a modern production server with 64GB+ of RAM and fast storage, try these settings in /etc/sysctl.conf:

# Start background writes early (at 256MB)
vm.dirty_background_bytes = 268435456

# Force sync writes at 1GB (don't let the default 20% of RAM accumulate!)
vm.dirty_bytes = 1073741824

# Wake up the flusher threads every second
vm.dirty_writeback_centisecs = 100

# Don't let data sit around for more than 10 seconds
vm.dirty_expire_centisecs = 1000
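Before dropping these into /etc/sysctl.conf, it's worth sanity-checking that the numbers relate the way the kernel expects. A small sketch encoding the two invariants that matter:

```python
# Sanity-check the proposed settings: background flushing must kick in
# before the hard throttle, and data should survive at least one
# writeback interval before it can expire.
settings = {
    "vm.dirty_background_bytes": 268435456,   # 256 MB
    "vm.dirty_bytes": 1073741824,             # 1 GB
    "vm.dirty_writeback_centisecs": 100,      # flusher wakes every 1 s
    "vm.dirty_expire_centisecs": 1000,        # dirty data expires at 10 s
}

assert settings["vm.dirty_background_bytes"] < settings["vm.dirty_bytes"]
assert settings["vm.dirty_expire_centisecs"] >= settings["vm.dirty_writeback_centisecs"]

for key, value in settings.items():
    print(f"{key} = {value}")
```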

Final Thoughts

The Linux kernel is a master of optimization, but its default settings prioritize "average throughput" over "consistent latency." In the world of high-performance systems, we usually care more about the 99th percentile latency than the raw throughput.

The cadence of the dirty page is a heartbeat. If that heart beats once every 30 seconds and moves a massive amount of blood, your system will feel erratic. If you tune it to beat once a second and move a small amount of blood, the system stays steady.

Don't let the kernel lie to you until it’s too late to pay the debt. Force it to settle its accounts early and often. Your users (and your monitoring alerts) will thank you.