
Why Does Your App Suffer from Mystery Latency Spikes Every 30 Seconds?

An investigation into the Linux kernel’s dirty page management and how aggressive background writebacks can paralyze application I/O even when CPU usage is low.


We’ve been conditioned to believe that more RAM is always a net positive for system performance. "Throw hardware at the problem" is the mantra of the modern DevOps era. But in the world of high-throughput Linux applications, there is a point where having a massive amount of memory actually becomes a liability. If your application handles a decent volume of writes and you start seeing inexplicable latency spikes that recur on a predictable, rhythmic cycle—specifically every 30 seconds—you aren't dealing with a code bug. You are likely a victim of the Linux kernel’s generosity.

The kernel tries to be helpful by buffering disk I/O in RAM. This "dirty page" management is supposed to smooth out performance, but on modern systems with massive memory pools, the default settings often lead to a phenomenon I call the "I/O Heart Attack."

The Anatomy of the 30-Second Stall

Imagine your app is humming along at 5ms p99 latency. Suddenly, for a two-second window, every request takes 500ms. Then, as quickly as it started, the latency drops back to 5ms. You check your CPU usage; it’s at 15%. You check the network; it’s clear. You check the application logs; nothing.

If you look at your system metrics with a fine-toothed comb, you’ll notice a pattern. These spikes occur exactly 30 seconds apart.

This happens because Linux manages data written to the disk using a system of dirty pages. When your application calls write(), the kernel doesn't immediately spin up the disk or send a command to the NVMe controller. That would be slow. Instead, it copies that data into the page cache (RAM) and marks those pages as "dirty." The call returns instantly, and your app thinks the work is done.
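You can watch this decoupling directly: a buffered write returns immediately while the Dirty counter in /proc/meminfo jumps. A minimal sketch (the /tmp path is an example; on a tmpfs-backed /tmp the pages won't be counted as Dirty):

```shell
# Rough demo of write-behind: dd returns as soon as the data is in
# the page cache, long before it reaches the disk.
grep '^Dirty:' /proc/meminfo                      # baseline
dd if=/dev/zero of=/tmp/dirty-demo bs=1M count=64 2>/dev/null
grep '^Dirty:' /proc/meminfo                      # usually noticeably higher
sync                                              # force the flush to disk
rm -f /tmp/dirty-demo
```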

The problem arises when it’s time to pay the piper. The kernel has a background thread (the writeback worker) that wakes up periodically to flush these dirty pages to the physical disk. By default, one of the primary triggers for this flush is the age of the data.

In most Linux distributions, the dirty_expire_centisecs parameter is set to 3000. That’s exactly 30 seconds.
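You can confirm the value on your own machine; the unit is centiseconds, so 3000 means 30 seconds:

```shell
# Age at which a dirty page becomes eligible for the next writeback
# pass. 3000 centiseconds == 30 seconds on most distributions.
cat /proc/sys/vm/dirty_expire_centisecs
```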

Visualizing the Problem

Before we tune anything, we need to see the beast in action. You can monitor the state of your system's dirty memory by looking at /proc/meminfo.

Try running this in a terminal while your application is under load:

watch -d -n 0.1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"

You will see the Dirty value climb and climb. Eventually, the Writeback value will spike as the kernel begins pushing that data to the disk. If your Dirty value is allowed to grow into the gigabytes, the eventual flush will saturate your I/O stack, creating a bottleneck that stalls any thread attempting a synchronous write—or even a simple fsync().
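If you want something you can graph afterwards rather than eyeball, a one-shot sampler works too (a sketch; wrap it in a loop under load):

```shell
# One sample per line: timestamp, Dirty kB, Writeback kB.
# Run as `while sleep 1; do <this>; done > samples.log` during load;
# the 30-second sawtooth becomes obvious when you plot the output.
awk -v ts="$(date +%T)" \
    '/^Dirty:/ {d=$2} /^Writeback:/ {w=$2}
     END {print ts, "Dirty:", d, "kB  Writeback:", w, "kB"}' \
    /proc/meminfo
```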

Why "More RAM" Makes It Worse

On a machine with 8GB of RAM, the default settings are usually fine. But let’s look at a modern production server with 256GB of RAM.

Linux uses two primary thresholds to decide when to flush dirty pages:
1. vm.dirty_background_ratio: The percentage of system memory that can be dirty before the kernel starts flushing data in the background. Default is often 10%.
2. vm.dirty_ratio: The absolute maximum percentage of memory that can be dirty. If you hit this, the kernel stops your application from writing entirely and forces it to help with the flush (synchronous I/O). Default is often 20%.

On a 256GB machine, a 10% dirty_background_ratio means the kernel will allow 25.6 GB of data to sit in RAM before it starts background writebacks.

Here is the kicker: If your disk can only write at 500MB/s, and the kernel suddenly decides to flush 25GB of data, your I/O subsystem is going to be pinned for 50 seconds. During this time, any other I/O operation—like reading a config file, writing a log line, or checking a database index—is forced to wait in a massive queue.
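The arithmetic is worth making explicit. With the figures above (256 GB of RAM, the 10% background ratio, a 500 MB/s disk — all numbers carried over from the text):

```shell
# Back-of-envelope stall estimate: dirty backlog / disk throughput.
mem_gb=256           # total RAM
bg_ratio=10          # vm.dirty_background_ratio default
disk_mbps=500        # sustained sequential write speed

backlog_mb=$(( mem_gb * 1024 * bg_ratio / 100 ))
echo "dirty backlog before background flush: ${backlog_mb} MB"   # 26214 MB (~25.6 GB)
echo "worst-case flush time: $(( backlog_mb / disk_mbps )) s"    # 52 s
```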

The "Stable Pages" Trap

You might ask, "Why does a background flush affect my application's read latency or small writes?"

It’s due to a kernel feature called Stable Pages. To ensure data integrity and calculate checksums properly, the kernel "locks" a page while it is being written to the disk. If your application tries to modify that same page during the flush, it is put into an uninterruptible sleep (D state) until the I/O is complete.

If you have 20GB of dirty data being flushed, the chance of your application touching a locked page rises sharply. This is why you see "Mystery Latency." Your CPU is idle, but your threads are stuck waiting for a page lock to clear.
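During a spike, the stuck threads are visible from the outside. A quick way to catch them with standard procps ps (the wchan column shows the kernel function the task is sleeping in):

```shell
# Tasks in uninterruptible sleep ("D" state) right now; run this
# during a latency spike to catch threads blocked on writeback.
ps -eo state,pid,wchan:32,comm | awk '$1 ~ /^D/'
```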

Diagnosing with perf and fio

If you suspect this is happening, you can use fio to simulate the pressure and perf to see where the kernel is spending its time.

First, let's create a scenario that generates heavy dirty page pressure:

# Simulating a heavy write load to fill dirty pages
fio --name=dirty-test --ioengine=libaio --rw=write --bs=64k --size=10G --numjobs=1 --direct=0 --runtime=60 --time_based

Notice we set --direct=0. This tells fio to use the page cache (buffered I/O), which is exactly what triggers the dirty page issue.

While that is running, use perf to look for wait_on_page_bit or io_schedule:

sudo perf top -U

If you see account_page_dirtied or wait_on_page_writeback at the top of the list, you’ve found your smoking gun. Your application is literally waiting for the kernel to finish its housekeeping.

How to Fix It: Tuning for Latency, Not Throughput

The default Linux settings prioritize throughput. It wants to bundle writes together to maximize the efficiency of the physical hardware. For a desktop user or a batch processing job, this is great. For a low-latency web app or a database, it's a disaster.

To fix the 30-second spikes, we need to change the philosophy: Flush early, flush often.

1. Shift from Percentages to Absolute Bytes

Using dirty_ratio (percentage) is dangerous on high-RAM machines. Instead, use dirty_bytes and dirty_background_bytes. This gives you deterministic control over how much data can stay in flight. Note that the bytes and ratio knobs are mutually exclusive: writing a nonzero value to dirty_bytes zeroes dirty_ratio, and vice versa.

Add these to /etc/sysctl.conf:

# Start flushing as soon as 64MB is dirty
vm.dirty_background_bytes = 67108864

# Force the app to slow down/block if 256MB is dirty
vm.dirty_bytes = 268435456

By setting dirty_background_bytes to a small value (like 64MB or 128MB), you ensure the kernel starts trickling data to the disk almost immediately. The "spikes" disappear because the I/O pressure is smoothed out into a constant, manageable stream.
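How small is "small"? A heuristic I find useful (a rule of thumb, not a kernel recommendation) is to cap the background backlog at roughly a quarter-second of your disk's sustained write speed, and the hard limit at about a second:

```shell
# Size the byte thresholds from measured disk throughput (heuristic).
disk_mbps=500                                  # measure with fio first
bg_bytes=$(( disk_mbps * 1024 * 1024 / 4 ))    # ~0.25 s of backlog
hard_bytes=$(( bg_bytes * 4 ))                 # ~1 s of backlog
echo "vm.dirty_background_bytes = ${bg_bytes}"
echo "vm.dirty_bytes = ${hard_bytes}"
```

With a 500 MB/s disk this lands at roughly 128 MB and 512 MB, the same order of magnitude as the fixed values above.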

2. Tighten the Expiration Timer

If you want to ensure no data sits in RAM for more than 5 seconds (reducing the "30-second heart attack" risk), adjust the expiration:

# Reduce the time dirty data can stay in RAM (from 30s to 5s)
vm.dirty_expire_centisecs = 500

# How often the background worker wakes up to check (from 5s to 1s)
vm.dirty_writeback_centisecs = 100

3. Applying Changes Dynamically

You don't need to reboot to test these. You can apply them on the fly:

sudo sysctl -w vm.dirty_background_bytes=67108864
sudo sysctl -w vm.dirty_bytes=268435456
sudo sysctl -w vm.dirty_expire_centisecs=500
sudo sysctl -w vm.dirty_writeback_centisecs=100
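After applying them, verify the kernel actually took the values; this is readable without root:

```shell
# Print each knob as "path:value" so you can eyeball the live config.
grep . /proc/sys/vm/dirty_background_bytes \
       /proc/sys/vm/dirty_bytes \
       /proc/sys/vm/dirty_expire_centisecs \
       /proc/sys/vm/dirty_writeback_centisecs
```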

The Trade-off: SSD Longevity and Throughput

There is no free lunch. By forcing the kernel to flush more frequently, you are reducing the "write merging" efficiency of the OS.

If your application writes the same block over and over again within a 10-second window, the default Linux settings would only write to the disk once. With our new settings, it might write to the disk five or ten times. On an SSD, this theoretically increases wear. However, in most modern data center environments, the impact on SSD life is negligible compared to the cost of 2-second application hangs.

Real-World Case: The Logging Bottleneck

I once worked on a Java service that processed high-frequency trading data. Every few minutes, we’d see a spike. We traced it back to logback writing to a file.

Even though the logging was supposed to be "async," the internal buffers would eventually fill up, and the call to write() would hit the kernel. Because the kernel was busy flushing a previous 2GB chunk of dirty pages from a separate database process on the same machine, the logging thread would block.

Since the logging thread was blocked, the application's internal ring buffer filled up. Once that filled, the main processing threads blocked. This is a classic backpressure cascade.

The fix wasn't in the Java code. It was setting vm.dirty_background_bytes to 128MB. The latency spikes vanished instantly because the disk never became the "unmovable object" in the system.

Edge Case: Containers and Cgroups

If you are running in Docker or Kubernetes, be careful. Most of these vm.* sysctl settings are global. If you change them on a node, you change them for every container running on that host.

Furthermore, writeback only became "cgroup-aware" with cgroup v2; on cgroup v1 it still isn't. This means a single container doing heavy I/O can trigger the global dirty threshold, causing *other* containers on the same host to suffer from writeback stalls. If you are in a multi-tenant environment, monitoring node_filesystem_device_error and node_disk_io_time_seconds_total in Prometheus is vital to see if a noisy neighbor is starving your I/O.
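On cgroup v2 hosts you can at least see each group's share of the dirty pool, since memory.stat exposes per-cgroup file_dirty and file_writeback counters (the path below is an example slice; substitute your container's cgroup directory):

```shell
# Per-cgroup dirty/writeback counters, cgroup v2 only.
cg=/sys/fs/cgroup/system.slice   # example path; adjust to your cgroup
grep -E '^file_(dirty|writeback) ' "$cg/memory.stat" 2>/dev/null \
    || echo "no cgroup v2 memory.stat at $cg"
```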

Checking for "Hung Tasks"

When this issue is at its worst, you’ll see entries in dmesg like this:

INFO: task my-app:1234 blocked for more than 120 seconds.

This is the kernel complaining that a process has been in "Uninterruptible Sleep" (the D state) for too long. In almost every case involving I/O, this is a sign that the writeback subsystem is overwhelmed and the application is stuck waiting for the disk queue to clear.
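The 120-second figure is itself a tunable, exposed as kernel.hung_task_timeout_secs (the knob only exists when the kernel is built with hung-task detection):

```shell
# 120 is the usual default; writing 0 disables the warning entirely.
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
    || echo "hung-task detector not enabled in this kernel"
```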

Summary of Best Practices

If your application is sensitive to latency, don't leave the kernel defaults alone. Modern hardware has outpaced the logic used when those defaults were set 15 years ago.

1. Monitor Dirty Pages: Keep an eye on /proc/meminfo. If Dirty grows to several gigabytes, you have a problem waiting to happen.
2. Bytes over Ratios: Use vm.dirty_background_bytes and vm.dirty_bytes to set hard caps that make sense for your disk speed.
3. Smooth the Flow: Reduce vm.dirty_expire_centisecs to prevent the "30-second pileup."
4. Hardware Matters: If you still see spikes after tuning, ensure your RAID controller cache (if you have one) is in write-back mode and has a working battery/capacitor.

Linux is incredibly powerful, but its "one size fits all" defaults are designed for general-purpose computing. For high-performance apps, you have to tell the kernel to stop trying to be so clever with your memory and just get the data to the disk.