
What Nobody Tells You About the Postgres Free Space Map: Why Your 'Empty' Pages Are Still Forcing Table Growth
A deep dive into the max-heap binary tree behind the Postgres FSM and why it occasionally fails to report reusable space to the executor.
I spent the better part of Tuesday afternoon staring at a Prometheus dashboard, watching a table size metric climb steadily toward a disk quota limit. This was particularly annoying because, just an hour earlier, I’d deleted nearly forty percent of the rows in that same table. By all rights, the database should have been comfortably reusing that empty space. Instead, Postgres was acting like a hoarder, ignoring the gaps I’d just cleared and demanding more disk from the OS.
Most developers understand the high-level concept of "bloat" and how VACUUM is supposed to reclaim space for reuse. But there is a massive gap between "space is available" and "Postgres can actually find it." That gap is managed by a component called the Free Space Map (FSM).
The FSM is often treated as a black box—a sidecar file that magically points the way to empty bytes. In reality, it’s a complex, lossy, binary-tree-based data structure that can, under certain conditions, lie to the executor. If you've ever wondered why your "empty" pages are still forcing table growth, the answer usually lies in the way the FSM compresses reality.
The Fork in the Road
Every Postgres table (or "relation") is more than just a single file. It’s a collection of "forks." There’s the main fork where your data lives, the Visibility Map (_vm) which tracks which pages contain only all-visible tuples, and the Free Space Map (_fsm).
You can see these files in your data directory. If your table's relfilenode (often, but not always, the same as its OID) is 16384, the FSM will be a file named 16384_fsm.
The FSM exists because scanning the entire main table to find a spot for a new INSERT would be a performance disaster. Instead, Postgres consults the FSM to find a page with enough room. But the FSM doesn't store exact byte counts. It stores a single byte for every 8KB page in the main table.
The Granularity Trap: 32-Byte Buckets
Because the FSM only uses one byte (8 bits) to represent the free space in an 8,192-byte page, it has to be lossy. It doesn't store "there are 442 bytes free." It stores a value from 0 to 255.
To get that value, Postgres divides the page size by 256: 8192 / 256 = 32.
Every unit in the FSM represents a 32-byte "bucket." If you have 60 bytes of free space, the FSM records that as 1 (32 bytes). It rounds *down*. This is the first reason Postgres might ignore "empty" space: if the remaining space in a page is less than the next 32-byte increment, it's effectively invisible to the FSM for certain search operations.
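The bucket math can be sketched in a few lines. This is a toy model, not the actual Postgres internals (the function names here are illustrative), but the constants match an 8KB page:

```python
FSM_CATEGORIES = 256
PAGE_SIZE = 8192
BUCKET = PAGE_SIZE // FSM_CATEGORIES  # 32 bytes per FSM category

def space_to_category(avail_bytes):
    """Convert free bytes to the 1-byte FSM category (rounds down)."""
    return min(avail_bytes // BUCKET, FSM_CATEGORIES - 1)

def category_to_space(cat):
    """The number of bytes this category actually guarantees."""
    return cat * BUCKET

print(space_to_category(60))                       # 1: only one full bucket fits
print(category_to_space(space_to_category(60)))    # 32: the other 28 bytes vanish
```

The round-down is the key detail: 60 real bytes are recorded as a guarantee of only 32.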
Peeking into the Map
To stop guessing, you should use the pg_freespacemap extension. It’s a standard contrib module that lets you inspect the FSM as if it were a table.
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- Check the free space for a specific table
SELECT
blkno,
avail
FROM pg_freespacemap('users')
LIMIT 10;
The avail column here shows the number of bytes Postgres *thinks* are free. If you see a lot of pages with avail at 0, but you know you’ve deleted data, your FSM is out of date. If you see avail at 128 but your inserts are still triggering table growth, your new rows are likely larger than 128 bytes, and the FSM is doing its job by telling the executor to look elsewhere.
The Binary Tree Behind the Curtain
The FSM doesn't just store a flat list of these 1-byte values. That would still be too slow to search for a table with millions of pages. Instead, each FSM page is organized as a Max-Heap Binary Tree.
Inside a single FSM page, the bytes are arranged so that each parent node contains the *maximum* value of its two children.
Imagine a simplified tree where we track 4 pages:
- Page 0: 32 bytes free
- Page 1: 64 bytes free
- Page 2: 0 bytes free
- Page 3: 128 bytes free
The tree structure in the FSM would look something like this:
              [ 128 ]          <-- Root (max of children)
             /       \
        [ 64 ]       [ 128 ]
        /    \       /     \
     [32]   [64]   [0]   [128]
When the Postgres executor needs to insert a row that requires 100 bytes, it looks at the root: 128 >= 100, so it proceeds. It then looks at the left child: 64 < 100, so it immediately discards that entire branch of the tree and moves to the right.
This is incredibly efficient. But it's also where things get weird.
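The descent described above can be sketched as a toy max-heap search over the four-page example. This is an illustrative model only; the real fsm_search_avail in Postgres adds locking, a movable search start, and wrap-around logic:

```python
# Array-backed max-heap: parents before children, leaves last.
# Leaves mirror the 4-page example: pages 0..3 with 32/64/0/128 bytes free.
tree = [128, 64, 128, 32, 64, 0, 128]

def find_page(needed):
    """Return the heap page index with >= `needed` bytes free, or None."""
    if tree[0] < needed:
        return None                 # root says nothing in the table fits
    n_leaves = (len(tree) + 1) // 2
    first_leaf = len(tree) - n_leaves
    node = 0
    while node < first_leaf:        # descend until we hit a leaf
        left, right = 2 * node + 1, 2 * node + 2
        # prefer the left branch; prune it entirely if it can't fit the row
        node = left if tree[left] >= needed else right
    return node - first_leaf        # leaf position == heap page number

print(find_page(100))  # 3: the whole 64-byte branch is skipped in one comparison
print(find_page(200))  # None: the root alone rules out the entire table
```

The cost is logarithmic in the number of tracked pages, which is why the executor can afford to consult it on every insert.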
The Search "Memory" and the Failures
To prevent everyone from dog-piling onto the very first available page in the table (which would create massive lock contention), Postgres doesn't always start searching from the root. It maintains a "target page" for the relation in memory.
If the search starting at the target page fails, it updates its memory of where it should look next time. However, if multiple backends are trying to insert large rows simultaneously, they might all find the same "sufficient" page in the FSM, try to lock the main data page, find that another backend just filled it, and then trigger a FSM update.
If the FSM tree becomes inconsistent—which can happen because FSM updates aren't logged to WAL for performance reasons (they are considered "hints")—you end up with "phantom bloat." The data is gone, the main page is empty, but the binary tree in the FSM still thinks the page is full (or vice-versa).
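Phantom bloat is easy to model: if a leaf byte gets refreshed but a stale ancestor survives (plausible after a crash, since these updates aren't WAL-logged), the top-down search prunes the branch before it ever reads the leaf. A toy illustration with hypothetical values, not the on-disk format:

```python
# root + 2 internal nodes + 4 leaves, all claiming "page is full"
tree = [0, 0, 0, 0, 0, 0, 0]

def search_from_root(needed):
    # A 0 at the root prunes the entire relation in one comparison.
    return tree[0] >= needed

# VACUUM frees heap page 3, but imagine only the leaf byte survives a crash:
tree[6] = 128
print(search_from_root(100))   # False: 128 free bytes exist but are invisible

# Repairing the ancestors (what a later FSM vacuum pass effectively does):
tree[2] = max(tree[5], tree[6])
tree[0] = max(tree[1], tree[2])
print(search_from_root(100))   # True: the space is findable again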
Why 'Empty' Pages are Ignored
Let’s look at a concrete scenario where you’ve deleted data but the table keeps growing.
1. The Row Size vs. Bucket Size Discrepancy
If you are inserting rows that are 500 bytes each, and your pages have 480 bytes of free space, those pages are useless. The FSM will correctly report ~480 bytes available, but the executor will skip them.
You can check your average row width to see if this is happening:
SELECT avg_width FROM pg_stats
WHERE tablename = 'your_table_name'
AND attname = 'some_column'; -- or just look at the whole record width
2. The Vacuum FSM Limit
VACUUM is responsible for updating the FSM. However, VACUUM doesn't always scan the whole table. If you're relying on Autovacuum, it might skip pages that are "all-visible" (tracked in the Visibility Map) unless a freeze is required.
Furthermore, VACUUM only updates the FSM *after* it has finished scanning the heap. If you have a massive long-running transaction holding back the xmin (the oldest transaction ID), VACUUM cannot remove dead tuples. If it can't remove them, it can't update the FSM with new free space values.
3. FSM Update Latency
The FSM is updated lazily. When a page is found to have more free space during a VACUUM or when a page is found to have *less* space during an INSERT, the FSM is updated.
But consider this: if you perform a mass DELETE and then immediately start a mass INSERT, and VACUUM hasn't had a chance to run yet, the FSM still reflects the "full" state of the table. Postgres has no choice but to append new pages to the end of the file, increasing the table size, even though perfectly good space exists in the middle.
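A toy simulation of that sequence, assuming nothing updates the FSM between the DELETE and the INSERTs (real Postgres does have some opportunistic FSM update paths, so treat this as the worst case):

```python
# Four heap pages, freshly emptied by a mass DELETE...
heap_free = [8000, 8000, 8000, 8000]   # actual bytes free on each page
# ...but the FSM hasn't been told yet and still claims everything is full.
fsm = [0, 0, 0, 0]

def insert_row(size):
    for page, avail in enumerate(fsm):
        if avail >= size:              # the FSM claims this page fits the row
            heap_free[page] -= size
            fsm[page] = heap_free[page]
            return page
    heap_free.append(8192 - size)      # "nothing fits": extend the relation
    fsm.append(heap_free[-1])
    return len(heap_free) - 1

insert_row(500)
print(len(heap_free))   # 5: the table grew despite ~32KB of real free space

fsm[:4] = heap_free[:4]  # VACUUM runs and refreshes the FSM
insert_row(500)
print(len(heap_free))    # still 5: now the insert reuses page 0
```

The gap between the two inserts is exactly the window in which "the table keeps growing after a big DELETE" complaints live.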
Testing the FSM Behavior
Let's simulate a scenario where we create bloat and see if Postgres finds it.
-- Create a table with a low fillfactor to force space gaps
CREATE TABLE fsm_test (
id int,
data text
) WITH (fillfactor = 10);
-- Insert 10,000 rows
INSERT INTO fsm_test SELECT i, repeat('a', 100) FROM generate_series(1, 10000) i;
-- Check size
SELECT pg_size_pretty(pg_relation_size('fsm_test'));
-- Delete half the data
DELETE FROM fsm_test WHERE id % 2 = 0;
-- Even after DELETE, the FSM doesn't know yet
SELECT SUM(avail) FROM pg_freespacemap('fsm_test');
-- Run Vacuum to update FSM
VACUUM fsm_test;
-- Now check FSM
SELECT blkno, avail FROM pg_freespacemap('fsm_test') WHERE avail > 0 LIMIT 5;
If you run this, you'll see avail values jump after the VACUUM. If you then try to insert rows that are larger than the avail values, you'll watch the table size grow again despite the "free" space.
The Problem with "Upper-Level" FSM Pages
For very large tables, the FSM itself becomes multi-layered. One FSM page can only track about 4,000 heap pages: each FSM page is 8KB, with one byte per leaf slot plus roughly as many bytes again for the internal tree nodes.
When your table exceeds ~4,000 pages (~32MB), Postgres creates a second level of the FSM. This is a tree of FSM pages. Level 0 tracks heap pages. Level 1 tracks Level 0 FSM pages.
I’ve seen cases in multi-terabyte databases where the upper-level FSM pages became corrupted or stale. The bottom-level FSM page knew there was space, but the Level 1 page thought that entire 32MB chunk of the table was full. The executor checked the Level 1 page, saw "0 bytes free" at the root, and never even bothered to look at the 4,000 heap pages underneath it.
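Back-of-the-envelope math shows when those levels appear. This assumes ~4,000 slots per FSM page; the exact count depends on page headers and tree overhead, so treat it as an estimate:

```python
import math

SLOTS_PER_FSM_PAGE = 4000   # assumed, roughly what one 8KB FSM page tracks
HEAP_PAGE = 8192

def fsm_levels(table_bytes):
    """Toy estimate of how many FSM levels a table of this size needs."""
    heap_pages = max(1, math.ceil(table_bytes / HEAP_PAGE))
    levels = 1
    while SLOTS_PER_FSM_PAGE ** levels < heap_pages:
        levels += 1
    return levels

print(fsm_levels(16 * 1024**2))   # 16MB (2,048 pages): one level is enough
print(fsm_levels(1 * 1024**3))    # 1GB (~131k pages): needs a second level
```

Every extra level is another place where a stale byte can hide an entire subtree of free space, which is exactly the failure mode above.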
How to Fix It
If you’re seeing table growth that doesn't make sense, here’s the triage list:
1. Check for Long-Running Transactions
If VACUUM can't clean up dead tuples, it can't report free space to the FSM.
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY duration DESC;
2. Force an FSM Update
A standard VACUUM (VERBOSE, ANALYZE) will usually do the trick. If you suspect the FSM is truly corrupted (rare, but happens after crashes or storage issues), the only way to "rebuild" the FSM file specifically is to VACUUM FULL or use CLUSTER, but those require heavy locks.
There are lighter options for parts of the problem. Since Postgres 12, REINDEX CONCURRENTLY rebuilds indexes without an exclusive lock; for the table itself, pg_repack is the go-to tool to rebuild the heap and its FSM without blocking writes.
3. Adjust Fillfactor
If you have a table with frequent updates that increase row size, set a lower fillfactor.
ALTER TABLE your_table SET (fillfactor = 70);
VACUUM FULL your_table; -- To apply it
This leaves 30% of every page empty for future updates. While this technically "bloats" the table upfront, it keeps that space "local" to the page, so Postgres doesn't have to consult the FSM and find a new page for every update, which also prevents the FSM from needing frequent updates.
4. Tuning Autovacuum for FSM Health
If your FSM is consistently stale, your autovacuum isn't running often enough.
Decrease autovacuum_vacuum_scale_factor (e.g., to 0.05 or 0.01) so it triggers after 1% or 5% of rows change rather than the default 20%. This keeps the FSM tree much closer to reality.
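The trigger condition is a documented formula (autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples), so you can estimate how many dead tuples pile up before autovacuum even wakes up; the row counts below are examples:

```python
def autovacuum_trigger(reltuples, scale_factor=0.2, threshold=50):
    """Dead tuples needed before autovacuum fires, per the Postgres docs:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples."""
    return threshold + scale_factor * reltuples

rows = 10_000_000
print(int(autovacuum_trigger(rows)))        # defaults: 2,000,050 dead tuples
print(int(autovacuum_trigger(rows, 0.01)))  # tuned to 1%: 100,050 dead tuples
```

On a ten-million-row table, the default settings tolerate two million dead tuples of stale FSM state; at a 0.01 scale factor that window shrinks twentyfold.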
The Takeaway
The Free Space Map is a "best effort" structure. It's designed for speed, not 100% accuracy. It rounds your free space into 32-byte buckets, organizes them into a max-heap tree that might have multiple levels, and updates lazily without WAL logging.
When your table grows despite having empty space, don't just blame "bloat." Look at the pg_freespacemap. Check if your row sizes are slightly larger than the available buckets. Check if your autovacuum is being held back by a ghost transaction from a developer's forgotten psql session.
Understanding the binary tree in the FSM turns a "weird Postgres behavior" into a predictable engineering problem. The space is there; you just have to make sure Postgres can see it through the 32-byte lens of the FSM.


