loke.dev

Why Is Your Postgres WAL Still Inflating During Low-Traffic Periods?

An investigation into the 'Full-Page Write' mechanism and how it causes massive, invisible write amplification during routine database checkpoints.

· 11 min read

Most developers assume that if they update a single boolean column in a table, Postgres will write just a few bytes to the Write-Ahead Log (WAL). It’s a logical assumption: if I change one bit of data, the journal should reflect one bit of change. But Postgres doesn't work that way. In reality, that tiny 1-byte update can easily trigger an 8,192-byte write to your disk.

If you’ve ever stared at your CloudWatch metrics or Grafana dashboards during a low-traffic period and wondered why your WAL generation is spiking while your transactions per second (TPS) are flat, you’ve likely bumped into the Full-Page Write (FPW) mechanism. It is one of the most significant sources of write amplification in Postgres, and if you don't understand how it interacts with your checkpoint configuration, you’ll find yourself over-provisioning storage and IOPS for no good reason.

The Mystery of the Bloated WAL

Imagine a scenario. It’s 3:00 AM. Your application is doing almost nothing—just a few background workers updating a last_processed_at timestamp on a handful of rows every minute. Yet, your WAL archiving process is shipping gigabytes of data to S3.

The math doesn't add up. If you're updating 1,000 rows, and each row is roughly 100 bytes, you'd expect maybe 100KB of WAL. Instead, you're seeing 8MB or more.

To understand why, we have to talk about how Postgres handles the fear of "torn pages."

The "Torn Page" Nightmare

Postgres stores data in pages, which are 8KB blocks by default. However, most operating systems and filesystems (like EXT4 or XFS) and hardware (hard drives and SSDs) operate on smaller chunks, often 4KB or 512 bytes.

When Postgres decides to flush a modified page from memory (the Shared Buffers) to disk, it hands that 8KB block to the OS. If the power cuts out or the kernel crashes halfway through that write, you might end up with a page on disk that is half-new and half-old. This is a torn page.

The problem is that a torn page is garbage. It's inconsistent. On restart, if data checksums are enabled, verification fails and Postgres reports the page as corrupt; without checksums, it may silently read the corrupted data.

To prevent this, Postgres uses the WAL as a safety net. But there’s a catch: standard WAL records only contain the "delta" (the change). If the data on disk is a torn, corrupted mess, you can’t apply a delta to it. You need a clean starting point.

How Full-Page Writes Work

To solve the torn page problem, Postgres follows a specific rule: The first time a data page is modified after a checkpoint, the entire 8KB page is logged to the WAL.

Subsequent changes to that same page, as long as they happen before the *next* checkpoint, only log the delta. But once a checkpoint occurs, the "first-write" flag is reset for every page in the system.

This explains the "inflation" you see.
1. A checkpoint finishes.
2. Your background worker updates one row in Table A.
3. Postgres looks at the page containing that row and says, "This is the first time I've touched this page since the last checkpoint. I'm writing all 8KB to the WAL."
4. Your background worker updates another row in Table B.
5. Another 8KB hits the WAL.

Even if you only changed a few bytes, you just generated 16KB of WAL for two rows.
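That arithmetic is easy to sanity-check from any psql session. The numbers below assume the default 8KB block size (a build-time constant; check yours with SHOW block_size):

```sql
-- Two first-touch updates log two full pages, versus roughly
-- 200 bytes of row data actually changed (an assumed figure).
SELECT current_setting('block_size')::int * 2 AS fpw_wal_bytes,
       2 * 100                                AS bytes_actually_changed;
```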

Proving the Inflation with Code

Let’s see this in action. We can use the pg_current_wal_lsn() function to measure how much WAL is generated by a simple update.

First, let's create a dummy table and fill it with some data.

CREATE TABLE wal_test (
    id serial PRIMARY KEY,
    data text,
    updated_at timestamptz
);

-- Insert 10,000 rows
INSERT INTO wal_test (data, updated_at)
SELECT md5(i::text), now()
FROM generate_series(1, 10000) s(i);

-- Ensure everything is flushed and a checkpoint is triggered
CHECKPOINT;

Now, let's measure how much WAL is generated by updating exactly one row immediately after that checkpoint. The \gset below is a psql feature that stores the query result in a variable, so we don't have to copy the LSN around by hand.

-- Capture the starting LSN (Log Sequence Number) in a psql variable
SELECT pg_current_wal_lsn() AS start_lsn \gset

-- Update one single row
UPDATE wal_test SET updated_at = now() WHERE id = 500;

-- Calculate how much WAL was generated
SELECT
    pg_current_wal_lsn() AS end_lsn,
    pg_current_wal_lsn() - :'start_lsn'::pg_lsn AS wal_bytes_generated;

In my local environment, that single update generated 8,256 bytes of WAL. That's the 8KB page plus some WAL record overhead.

Now, if I run that same update *again* (before another checkpoint occurs):

SELECT pg_current_wal_lsn() AS start_lsn \gset

UPDATE wal_test SET updated_at = now() WHERE id = 500;

SELECT
    pg_current_wal_lsn() AS end_lsn,
    pg_current_wal_lsn() - :'start_lsn'::pg_lsn AS wal_bytes_generated;

The second time, it only generates about 70-100 bytes.

This is the Full-Page Write effect in a nutshell. If your workload touches many different pages infrequently, or if your checkpoints happen too often, your WAL volume will explode.

Why Does This Happen During "Low-Traffic" Periods?

You might think that low traffic means fewer checkpoints. Not necessarily. Checkpoints are triggered by two main factors:
1. max_wal_size: A threshold of WAL volume.
2. checkpoint_timeout: A time-based threshold.

In low-traffic periods, you aren't hitting max_wal_size, so the database waits for checkpoint_timeout (defaulting to 5 minutes) to trigger a checkpoint.

If you have a cron job that updates 1,000 random rows across a large table every 6 minutes, here is what happens:
* Minute 0: Checkpoint runs.
* Minute 1: Cron job runs. It hits 1,000 distinct pages. Result: ~8MB of WAL generated.
* Minute 5: checkpoint_timeout hits. A new checkpoint occurs.
* Minute 7: Cron job runs again. Since a checkpoint has occurred since the last run, every page it touches counts as a "first write" again. Result: Another ~8MB of WAL generated.

If that cron job ran every 30 seconds instead, only the *first* run after the checkpoint would cause the 8MB spike. The subsequent 9 runs would only generate a few hundred kilobytes total.

Counter-intuitively, spreading your tasks out can sometimes increase your total WAL volume because you're more likely to hit the "first-write after checkpoint" penalty every single time.
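The per-run cost can be sketched as pure SQL arithmetic, using the hypothetical numbers from above: 1,000 distinct pages per run, 8KB full-page images, roughly 100-byte deltas, and ten frequent runs sharing one checkpoint window:

```sql
-- Sparse schedule: every run lands in a fresh checkpoint window,
-- so every run pays the full-page price.
-- Frequent schedule: only the first of ten runs pays it;
-- the other nine log deltas only.
SELECT
    1000 * 8192                         AS sparse_bytes_per_run,
    (1000 * 8192 + 9 * 1000 * 100) / 10 AS frequent_bytes_per_run;
```

Under these assumptions the sparse schedule costs about 8MB per run, while the frequent one amortizes down to under 1MB per run.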

Identifying the Culprit: pg_stat_bgwriter

If you suspect FPWs are killing your disk I/O or filling up your disks, you need to check the statistics. Postgres tracks this in the pg_stat_bgwriter view (or pg_stat_checkpointer in Postgres 17+).

I usually look at the ratio of checkpoints requested (due to max_wal_size) vs checkpoints scheduled (due to checkpoint_timeout).

-- Postgres 16 and older; in 17+ the checkpoint columns
-- moved to the pg_stat_checkpointer view.
SELECT 
    checkpoints_timed, 
    checkpoints_req, 
    checkpoint_write_time, 
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_clean
FROM pg_stat_bgwriter;

- checkpoints_timed: High numbers here suggest your checkpoint_timeout is the primary trigger.
- checkpoints_req: High numbers here suggest your max_wal_size is too small, forcing checkpoints to happen too frequently under load.
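To turn those two counters into a single number, something like this works on Postgres 16 and older (in 17+ the equivalent columns live in pg_stat_checkpointer as num_timed and num_requested):

```sql
-- Percentage of checkpoints forced by WAL volume rather than the timer.
-- A persistently high value suggests max_wal_size is too small.
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req
             / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;
```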

However, to specifically see the impact of Full-Page Writes, we can look at the pg_stat_wal view (Postgres 13+):

SELECT 
    wal_records, 
    wal_fpi, 
    wal_bytes 
FROM pg_stat_wal;

The wal_fpi column stands for Full Page Images. If wal_fpi is high relative to wal_records, you are suffering from major write amplification.
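A derived view of the same data that I find easier to read at a glance; the ratio is a rough heuristic, but a high value is a strong hint of FPI-driven amplification:

```sql
-- FPIs per 100 WAL records, plus the average record size.
-- Large average record sizes usually mean full pages dominate.
SELECT wal_fpi,
       wal_records,
       round(100.0 * wal_fpi / NULLIF(wal_records, 0), 1) AS fpi_per_100_records,
       pg_size_pretty(wal_bytes / NULLIF(wal_records, 0)) AS avg_record_size
FROM pg_stat_wal;
```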

How to Tame the WAL Inflation

There are three main levers we can pull to fix this.

1. Increase checkpoint_timeout

This is the most effective way to reduce FPW volume. The default is 5 minutes, which is quite aggressive for many modern workloads. If you increase this to 15 or 30 minutes, you increase the window where updates to a page only log the delta instead of the full 8KB.

-- In postgresql.conf
checkpoint_timeout = 15min
max_wal_size = 2GB -- You usually want to increase this too

The Trade-off: Long checkpoint_timeout values mean that in the event of a crash, Postgres has more WAL to replay. This makes your "Mean Time To Recovery" (MTTR) longer. If your database crashes, it might take 5-10 minutes to start back up instead of 1 minute.
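To see what a given instance is actually running with, and how much WAL a crash would currently force it to replay, something like this works (pg_control_checkpoint() has been available since Postgres 9.6):

```sql
-- The three settings that govern checkpoint cadence.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size', 'checkpoint_completion_target');

-- Bytes of WAL written since the last checkpoint:
-- a rough proxy for crash-recovery replay work.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), checkpoint_lsn) AS bytes_since_checkpoint
FROM pg_control_checkpoint();
```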

2. Enable WAL Compression

Postgres can compress those 8KB full-page images before writing them to the WAL. This doesn't reduce the *number* of writes, but it drastically reduces the *size* of the WAL generated.

Postgres 14 added lz4 as a wal_compression option, and Postgres 15 added zstd.

-- In postgresql.conf
wal_compression = lz4 -- or 'pglz', or 'zstd' (Postgres 15+)

I personally prefer lz4. It has an incredible performance-to-compression ratio. In my experience, enabling lz4 compression can reduce WAL volume by 40-60% with negligible CPU overhead.

If you are on Postgres 13 or older, you likely only have on (which uses PGLZ) or off. Even PGLZ is better than nothing.

3. Normalize Your Updates

If you have a "hot" row that gets updated constantly (like a counter), try to keep it in its own small table. If that row is in a table with 50 other columns, every time you update that counter, you're potentially logging a massive 8KB page. If the table is "skinny," you fit more rows per page, which increases the chance that a page is already "dirtied" and you'll only pay the FPW penalty once for many different row updates.
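A sketch of that split, with hypothetical table and column names:

```sql
-- Before: a counter embedded in a wide row means each increment can
-- log an 8 KB image of a page holding only a handful of rows.
-- After: a dedicated narrow table packs hundreds of counters per page,
-- so one FPI covers many subsequent delta-only updates.
CREATE TABLE page_view_counts (
    page_id bigint PRIMARY KEY,
    views   bigint NOT NULL DEFAULT 0
);

INSERT INTO page_view_counts (page_id) VALUES (42);
UPDATE page_view_counts SET views = views + 1 WHERE page_id = 42;
```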

The "Dangerous" Solution: Turning off full_page_writes

You will see a setting in postgresql.conf called full_page_writes = on.

Do not turn this off unless you are 100% sure your hardware/filesystem prevents torn pages.

If you are running on standard SSDs with a standard filesystem (EXT4/XFS), turning this off is a recipe for data corruption. The only scenarios where it's generally safe to disable FPW are:
1. You are using a filesystem like ZFS that uses copy-on-write and has its own mechanisms to prevent torn pages.
2. You are using high-end storage controllers with battery-backed write caches that guarantee atomic 8KB writes.
3. You are running a transient database (like a CI runner) where you don't care if the data survives a crash.

For everyone else, keep it on and use the other tuning methods instead.

Investigating WAL Content with pg_waldump

If you're still not convinced, you can actually look inside your WAL files. This is a bit "low level," but it’s the best way to see exactly what's happening.

You'll need to find where your WAL files live (usually in the pg_wal subdirectory of your data folder) and use the pg_waldump utility.

# View the last few records of a WAL file
pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000001 | tail -n 20

Look for block references like rel 1663/13442/16384 fork main blk 0. Records that carry a full-page image are marked with FPW in the output.

You can also use the --stats flag to get a summary:

pg_waldump --stats /var/lib/postgresql/data/pg_wal/000000010000000000000001

This will give you a breakdown of how much of that WAL file is composed of Full-Page Images. In many production databases, it’s not uncommon to see FPIs taking up 70% to 90% of the total WAL volume.

The Hidden Cost of wal_log_hints

There's one more sneaky setting that can inflate your WAL: wal_log_hints.

Tools like pg_rewind require wal_log_hints = on (enabling data checksums, which you should, implies the same WAL behavior). This tells Postgres to log a full page image the first time a page is modified after a checkpoint, even if the only change is a hint bit, which normally wouldn't generate any WAL at all.

If you have a very read-heavy workload and you suddenly see a lot of WAL generation, check whether wal_log_hints is enabled. Hint bits are set during reads: they cache the fact that a row's inserting or deleting transaction is known to have committed or aborted, sparing future readers a commit-log lookup. With wal_log_hints on, even a SELECT statement can trigger a Full-Page Write if it's the first time that page's hint bits are modified since a checkpoint.
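Checking whether either setting is in play takes two commands:

```sql
SHOW wal_log_hints;
SHOW data_checksums;
```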

Summary: A Checklist for WAL Inflation

When your WAL is blowing up and the traffic doesn't justify it:

1. Check Checkpoint Frequency: Use pg_stat_bgwriter (or pg_stat_checkpointer on 17+). A high checkpoints_req count means max_wal_size is forcing extra checkpoints; if timed checkpoints dominate and the WAL still inflates, your checkpoint_timeout may be too aggressive for your write pattern.
2. Measure FPI Ratio: Use pg_stat_wal or pg_waldump to see if Full-Page Images are the bulk of your WAL.
3. Optimize Checkpoints:
* Increase checkpoint_timeout (e.g., to 15m or 30m).
* Increase max_wal_size (e.g., to 4GB or more) to avoid frequent "requested" checkpoints.
* Set checkpoint_completion_target to 0.9 (the default since Postgres 14) to spread the I/O load.
4. Enable Compression: Set wal_compression = lz4 to shrink the FPIs you *do* have to write.
5. Review Write Patterns: Are you updating rows in a "scatter-shot" way that touches thousands of pages just once per checkpoint cycle? If so, try to batch those updates or perform them more frequently so they "share" a checkpoint window.

Postgres is designed for durability first. The Full-Page Write mechanism is a brilliant, albeit heavy, solution to a hardware limitation. You can't avoid it entirely, but by understanding the rhythm of your checkpoints, you can stop the invisible inflation from eating your disk budget.