loke.dev

What Nobody Tells You About the Postgres Checkpointer: Why Your P99 Latency Spikes Every 5 Minutes

Your database performance isn't just about query optimization; it’s about the silent I/O storm that triggers every time your WAL reaches its threshold.


I remember staring at a CloudWatch dashboard for a high-traffic Rails API three years ago. Every five minutes, like clockwork, our P99 latency would jump from a comfortable 40ms to a painful 600ms, creating a "sawtooth" pattern that looked like a jagged mountain range. We optimized the queries, added indexes, and threw more RAM at the instance, but the spikes remained, mocking our efforts every 300 seconds.

The culprit wasn't a slow query or a lack of memory. It was the Postgres checkpointer—a background process that is often ignored until it starts choking your I/O.

If you’ve seen those rhythmic performance dips and can't find a "heavy" query to blame, you’re likely witnessing a massive I/O storm triggered by your database trying to keep its promises of durability. Here is what is actually happening under the hood and how you can tame it.

The "Write-Ahead" Illusion

Postgres, like most modern databases, doesn't actually write your data to the final data files (heap) the moment you hit COMMIT. That would be incredibly slow because data files are scattered across the disk, and random I/O is the enemy of performance.

Instead, Postgres does two things:
1. It records the change in a Write-Ahead Log (WAL). This is a sequential write, which is very fast.
2. It updates the data in Shared Buffers (memory).

At this point, the page in memory is "dirty"—it’s different from what’s on the disk. If the power goes out, Postgres looks at the WAL, replays the changes, and everything is fine. But you can't let the WAL grow forever. Eventually, you have to reconcile the memory with the disk. This event is called a Checkpoint.
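The two-step commit path can be sketched as a toy model. This is plain Python, not Postgres internals; the class and method names are illustrative only:

```python
# Toy model of the write path: COMMIT appends to a sequential log and
# dirties an in-memory page; only a checkpoint writes pages to "disk".
class ToyPostgres:
    def __init__(self):
        self.wal = []       # sequential log (fast append)
        self.buffers = {}   # page_id -> contents (shared buffers)
        self.dirty = set()  # pages that differ from what is on disk
        self.disk = {}      # the heap / data files

    def commit(self, page_id, value):
        self.wal.append((page_id, value))  # 1. WAL record (sequential write)
        self.buffers[page_id] = value      # 2. update shared buffers
        self.dirty.add(page_id)            # page is now dirty

    def checkpoint(self):
        for page_id in self.dirty:         # flush every dirty buffer
            self.disk[page_id] = self.buffers[page_id]
        self.dirty.clear()
        self.wal.clear()                   # old WAL no longer needed for recovery

    def crash_recover(self):
        # replay the WAL over whatever made it to disk before the crash
        for page_id, value in self.wal:
            self.disk[page_id] = value
```

The key property: after a crash, nothing is lost as long as the WAL survives, even though the data files were never touched at commit time.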

Why Checkpoints Kill Your Performance

During a checkpoint, three major things happen:
1. All dirty buffers in memory are flushed to the data files.
2. A special checkpoint record is written to the WAL.
3. Old WAL files that are no longer needed for recovery are recycled or deleted.

The problem is the "flushing" part. If you have 32GB of Shared Buffers and 20% of them are dirty, the checkpointer suddenly decides to shove 6.4GB of data onto your disk as fast as possible. This creates a massive I/O bottleneck. Your application's INSERT and UPDATE queries, which also need to write to the WAL and sometimes read from disk, get stuck in the queue behind this tidal wave of data.
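The arithmetic from the paragraph above, as a quick sanity check. The 30-second burst window here is illustrative; the actual window depends on checkpoint_completion_target, covered below:

```python
GiB = 1024 ** 3

shared_buffers = 32 * GiB   # shared_buffers = 32GB
dirty_fraction = 0.20       # 20% of pages are dirty

dirty_bytes = shared_buffers * dirty_fraction
print(f"checkpoint must flush {dirty_bytes / GiB:.1f} GiB")

# Crammed into a 30-second burst, that demands roughly:
burst_mb_per_s = dirty_bytes / 30 / 1e6
print(f"~{burst_mb_per_s:.0f} MB/s of sustained writes")
```

That sustained write rate is enough to saturate many cloud-attached volumes on its own, before a single application query gets a turn.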

The Two Triggers: Time vs. Size

Postgres triggers a checkpoint based on whichever comes first: checkpoint_timeout or max_wal_size.

1. checkpoint_timeout

This is the "Time" trigger. By default, it’s 5 minutes. This explains why so many people see spikes every 5 minutes. If your database isn't doing much, this is usually fine. But for a busy system, 5 minutes is a very short window.

2. max_wal_size

This is the "Size" trigger. If your database generates more than this amount of WAL since the last checkpoint, a checkpoint is triggered immediately—even if the timeout hasn't been reached.

You can see which one is haunting you by querying pg_stat_bgwriter (on Postgres 17 and newer, the checkpoint counters moved to the pg_stat_checkpointer view):

SELECT 
    checkpoints_timed, 
    checkpoints_req, 
    checkpoint_write_time, 
    checkpoint_sync_time 
FROM pg_stat_bgwriter;

* checkpoints_timed: Checkpoints triggered by checkpoint_timeout.
* checkpoints_req: Requested checkpoints, most often forced by max_wal_size being exceeded (manual CHECKPOINT commands count here too).

The Golden Rule: You want checkpoints_timed to be much higher than checkpoints_req. If checkpoints_req is high, your max_wal_size is too small, and Postgres is being forced to flush data prematurely because it’s running out of log space.

Smoothing the Spike: checkpoint_completion_target

This is the most misunderstood setting in the Postgres ecosystem.

Imagine you have a checkpoint_timeout of 5 minutes. If checkpoint_completion_target is set to 0.1, Postgres will try to finish writing all dirty buffers in the first 10% of that time (30 seconds). That is an aggressive, violent burst of I/O.

If you set it to 0.9 (the modern recommended default), Postgres will spread the writes over 90% of the interval (4.5 minutes). Instead of a flood, you get a steady stream.
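As arithmetic, using the 6.4 GiB dirty set from the earlier example (a sketch; real checkpoints pace themselves against both time and WAL volume):

```python
timeout_s = 300                  # checkpoint_timeout = 5min
dirty_bytes = 6.4 * 1024 ** 3    # dirty set from the earlier example

def required_rate(completion_target):
    """Average flush rate (bytes/s) needed to finish inside the target window."""
    window_s = timeout_s * completion_target
    return dirty_bytes / window_s

# target = 0.1 -> a 30 s burst; target = 0.9 -> a gentle 270 s stream
print(f"0.1: {required_rate(0.1) / 1e6:.0f} MB/s")
print(f"0.9: {required_rate(0.9) / 1e6:.0f} MB/s")
```

Same data, same deadline, nine times less instantaneous pressure on the disk.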

-- Check your current settings
SHOW checkpoint_timeout;
SHOW checkpoint_completion_target;
SHOW max_wal_size;

If you are on an older version of Postgres (pre-14), the default was often 0.5. Change it to 0.9 immediately.

The "Full Page Writes" Problem

Here is the "hidden" reason why performance drops after a checkpoint starts.

To prevent data corruption from a partial page write (e.g., a crash while the OS is halfway through writing an 8KB Postgres block), Postgres uses Full Page Writes. The first time a data page is modified *after* a checkpoint, the entire 8KB page is written to the WAL, not just the row change.

This means that right after a checkpoint finishes, your WAL volume actually increases. If you checkpoint too frequently, you spend a huge portion of your time writing full pages to the WAL, which increases disk I/O and network traffic (if you have replicas).
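To get a feel for the amplification, compare a plain row-level WAL record with a full-page image. The ~100-byte record size and the 100,000-page workload are illustrative assumptions; real record sizes vary:

```python
PAGE_SIZE = 8192   # Postgres block size
row_record = 100   # assumed size of a normal row-change WAL record

# First touch of a page after a checkpoint logs the whole 8 KB page:
amplification = PAGE_SIZE / row_record   # ~80x more WAL for that one write

# If a checkpoint is followed by updates touching 100,000 distinct pages,
# the full-page images alone add roughly:
pages_touched = 100_000
extra_wal = pages_touched * PAGE_SIZE
print(f"{amplification:.0f}x amplification, "
      f"{extra_wal / 1024**2:.0f} MiB of full-page images")
```

Every one of those extra bytes also flows over the wire to your replicas, which is why frequent checkpoints can quietly inflate replication traffic too.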

How to Tune for Stability

If you're seeing those P99 spikes, here is the playbook I use to stabilize the system.

Step 1: Increase the WAL capacity

Don't be stingy with max_wal_size. On a modern server with plenty of disk space, 1GB is tiny. I often start with 16GB or 32GB for high-throughput systems.

-- Run this in psql (requires superuser)
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET min_wal_size = '4GB'; -- Keeps a buffer of WAL files ready
SELECT pg_reload_conf();

Step 2: Increase the timeout

If your storage can handle the recovery time (the time it takes Postgres to start up after a crash), increase checkpoint_timeout from 5 minutes to 15 or 30 minutes. This reduces the frequency of the "Full Page Write" penalty.
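You can put a rough number on that recovery cost. After a crash, Postgres replays all WAL written since the last checkpoint. The 150 MB/s replay rate below is an assumption for illustration; replay is single-threaded and you should measure your own hardware:

```python
max_wal_size_gb = 16     # WAL that may accumulate between checkpoints
replay_mb_per_s = 150    # assumed single-threaded WAL replay throughput

# Worst case: a crash lands just before the next checkpoint,
# so the full WAL budget must be replayed at startup.
worst_case_recovery_s = (max_wal_size_gb * 1024) / replay_mb_per_s
print(f"worst case: ~{worst_case_recovery_s:.0f} s of replay on startup")
```

If a couple of minutes of replay is acceptable for your availability targets, the longer timeout is almost always the right trade.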

ALTER SYSTEM SET checkpoint_timeout = '15min';
SELECT pg_reload_conf();

Step 3: Spread the load

Ensure checkpoint_completion_target is 0.9.

ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();

Monitoring the "Write Storm" in Real Time

You don't want to wait for users to complain to see if your tuning worked. You can monitor how many buffers are being written by the checkpointer vs. the background writer.

The Background Writer (bgwriter) is the checkpointer’s younger, quieter sibling. It tries to flush dirty buffers to disk in small batches *before* the checkpoint happens, making the checkpoint’s job easier.

Run this query to see if your background writer is doing its job:

SELECT 
    round(100.0 * checkpoints_req / NULLIF(checkpoints_timed + checkpoints_req, 0), 2) AS forced_checkpoint_percent,
    buffers_checkpoint,
    buffers_clean,
    maxwritten_clean
FROM pg_stat_bgwriter;
-- NULLIF guards against division by zero on a freshly started server

* If maxwritten_clean is high, your background writer is hitting its bgwriter_lru_maxpages limit too often. You might need to make it more aggressive.
* If buffers_checkpoint is vastly larger than buffers_clean, your checkpointer is doing all the heavy lifting, and your background writer is basically sleeping on the job.
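A tiny helper makes that interpretation mechanical. This is plain Python over counter values you would copy out of pg_stat_bgwriter; the function name is my own:

```python
def checkpoint_health(checkpoints_timed, checkpoints_req,
                      buffers_checkpoint, buffers_clean):
    """Summarize pg_stat_bgwriter counters into the two ratios that matter."""
    total_ckpt = checkpoints_timed + checkpoints_req
    forced_pct = 100.0 * checkpoints_req / total_ckpt if total_ckpt else 0.0

    total_buf = buffers_checkpoint + buffers_clean
    ckpt_share = 100.0 * buffers_checkpoint / total_buf if total_buf else 0.0
    return forced_pct, ckpt_share

# Example counters: 480 timed vs 120 forced checkpoints,
# 9M buffers flushed by the checkpointer vs 1M by the bgwriter.
forced, share = checkpoint_health(480, 120, 9_000_000, 1_000_000)
print(f"forced checkpoints: {forced:.0f}%  "   # want this near 0
      f"checkpointer's share of flushes: {share:.0f}%")
```

In that example the checkpointer does 90% of the flushing, a sign the background writer is sleeping on the job.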

The Kernel Factor: Don't Let Linux Lie to You

Postgres issues writes, but the Linux kernel has its own caching layer (the Page Cache). Even if Postgres thinks it's spreading writes out over 4 minutes, the kernel might buffer all of those writes in RAM and then decide to flush them all at once to the physical disk.

This is the "Double Spiking" effect. To fix this, you often need to tune Linux kernel parameters like vm.dirty_background_ratio and vm.dirty_ratio.

For a dedicated database server, you want the kernel to start writing to disk sooner so it doesn't build up a massive queue.

# In your /etc/sysctl.conf (apply with: sudo sysctl -p)
# Start background writeback when 5% of memory is dirty
vm.dirty_background_ratio = 5
# Throttle writers and flush synchronously when 10% of memory is dirty
vm.dirty_ratio = 10
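On a big-memory box those percentages translate into a lot of bytes, which is why some operators prefer the absolute vm.dirty_background_bytes / vm.dirty_bytes knobs instead. A quick calculation, assuming 64 GiB of RAM:

```python
GiB = 1024 ** 3
ram = 64 * GiB   # assumed server memory

dirty_background = 0.05 * ram   # kernel starts async writeback here
dirty_limit = 0.10 * ram        # writers get throttled / flush synchronously here

print(f"background writeback starts at {dirty_background / GiB:.1f} GiB dirty")
print(f"writers throttled at           {dirty_limit / GiB:.1f} GiB dirty")
```

Several gigabytes of buffered-but-unwritten data is exactly the kind of hidden queue that turns a smooth checkpoint into a second spike.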

The default Linux settings are often optimized for desktops (where you want to buffer a lot to keep the UI snappy), not for databases where consistent I/O is king.

Testing Your Changes with pgbench

Don't just change these settings in production and hope for the best. You can simulate a write-heavy load using pgbench and watch the pg_stat_bgwriter stats.

# Create a test database
createdb test_bench
# Initialize with a scale factor (100 = ~1.5GB of data)
pgbench -i -s 100 test_bench

# Run a heavy write test for 10 minutes
pgbench -c 16 -j 4 -T 600 -N test_bench

While this is running, watch your I/O latency with iostat -x 1 or a tool like netdata. If you see await (disk latency) spike every few minutes, your checkpoint settings still aren't wide enough.

Summary of the "Silent Killer" Checklist

When the P99s start spiking, stop looking at your SQL and start looking at your logs and stats:

1. Check the logs: Does it say "checkpoints are occurring too frequently"? Increase max_wal_size.
2. Check the ratio: Is checkpoints_req high? You are outgrowing your WAL space.
3. Check the target: Is checkpoint_completion_target below 0.9? You are flushing too fast.
4. Check the timeout: Is it still 5 minutes? You are likely suffering from frequent Full Page Writes. Increase it to 15m.
5. Check the OS: Is the Linux kernel holding onto "dirty" pages too long? Tune vm.dirty_background_ratio.

Postgres is a remarkably stable database, but its default configuration is notoriously conservative, designed to run on a toaster. For a modern production workload, the "silent I/O storm" of the checkpointer is a rite of passage for any DBA. Tune it to be slow, steady, and predictable, and your P99s will thank you.