
Your B-Tree Index Is a Space Leak: Scaling Massive Postgres Tables with BRIN
An in-depth analysis of Block Range Indexing (BRIN) and why trading B-Tree precision for range-based summaries is the only way to keep index sizes from eclipsing your data.
When your index size starts rivaling your actual data volume, you aren't just paying a storage tax—you’re slowly strangling your database’s ability to cache meaningful information. In Postgres, the B-Tree is the default, reliable workhorse, but on tables pushing into the hundreds of millions or billions of rows, that workhorse can turn into a resource-hogging monster.
We’ve been conditioned to think that an index is just a necessary cost of performance. If a query is slow, add an index. If it’s still slow, add a composite index. But on massive datasets—especially those where data is naturally ordered, like time-series logs or event streams—the traditional B-Tree is often a massive "space leak" that provides diminishing returns for an exorbitant price in RAM and disk I/O.
The Physics of the B-Tree Bloat
A B-Tree index is an exact map. For every single row in your table, there is a corresponding entry in the B-Tree leaf nodes. If you have a billion rows, you have a billion entries in the index. Each entry contains the indexed value and a TID (tuple identifier) pointing to the physical location of the row on disk.
This precision is great for high-concurrency OLTP workloads where you need to find one specific user ID out of millions. But this precision comes with a steep price:
1. Storage: The index can easily grow to 30-50% of the table size.
2. Maintenance: Every INSERT, UPDATE, and DELETE requires a tree traversal and potential page splits.
3. Cache Pressure: Because B-Trees are large, they push the actual data (the "hot" rows) out of the Postgres Buffer Cache and the OS Page Cache.
If you’re querying ranges of data—say, all logs from last Tuesday—you don't necessarily need a map that tells you exactly where every single millisecond is. You just need to know which chunks of the disk contain Tuesday's data.
Enter the Block Range Index (BRIN)
Postgres 9.5 introduced BRIN (Block Range Index), and it remains one of the most underutilized features for large-scale data. Instead of indexing every row, BRIN indexes ranges of physical blocks.
By default, BRIN looks at a range of 128 blocks (pages) on disk. For each range, it stores only the minimum and maximum values of the indexed column.
When you run a query with a WHERE clause, Postgres looks at the BRIN index and asks: "Does the value I’m looking for fall between this min and max?" If yes, it scans those 128 blocks. If no, it skips them entirely.
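That skip logic is simple enough to sketch in a few lines. Here is a toy Python model of it — hypothetical helper names, not Postgres internals — that summarizes physically ordered rows into per-range (min, max) pairs and prunes ranges for a query:

```python
# Toy model of a BRIN scan: each "block range" is summarized by only
# its (min, max), and a query skips every range it cannot overlap.

def build_brin(values, pages_per_range=128, rows_per_page=100):
    """Summarize physically ordered rows into per-range (min, max)."""
    rows_per_range = pages_per_range * rows_per_page
    summaries = []
    for start in range(0, len(values), rows_per_range):
        chunk = values[start:start + rows_per_range]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def ranges_to_scan(summaries, lo, hi):
    """Return indexes of block ranges whose [min, max] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(summaries)
            if mx >= lo and mn <= hi]

# Append-only data: values correlate perfectly with physical position.
data = list(range(1_000_000))            # e.g. epoch seconds
summaries = build_brin(data)
hits = ranges_to_scan(summaries, 500_000, 500_999)
# Only one range survives; the other 78 are skipped without being read.
```

The real index stores one tiny summary tuple per range instead of one entry per row, which is the entire source of the size difference discussed next.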
The difference in size is staggering. We aren't talking about a 20% reduction; we are talking about a 99% reduction in index size.
Let's Look at the Math
Imagine a table with 100 million rows of sensor data.
CREATE TABLE sensor_readings (
    id BIGSERIAL,
    created_at TIMESTAMP NOT NULL,
    reading FLOAT8,
    metadata JSONB
);

-- Insert 100 million rows (roughly 10GB of data), oldest first,
-- so that created_at increases with the physical row order
INSERT INTO sensor_readings (created_at, reading)
SELECT
    now() - ((100000000 - i) || ' seconds')::interval,
    random()
FROM generate_series(1, 100000000) s(i);

If we create a standard B-Tree index on created_at:
CREATE INDEX idx_sensor_btree ON sensor_readings (created_at);

That index will likely take up about 2.2 GB.
Now, let's try a BRIN index:
CREATE INDEX idx_sensor_brin ON sensor_readings USING brin (created_at);

The BRIN index will take up about 64 KB.
Yes, Kilobytes. You just traded 2 gigabytes of disk and memory pressure for the size of a small text file.
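You can verify the sizes yourself; pg_relation_size and pg_size_pretty are standard Postgres functions (index names from the examples above, exact numbers will vary by version and platform):

```sql
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('sensor_readings', 'idx_sensor_btree', 'idx_sensor_brin');
```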
The Correlation Dealbreaker
Before you drop every B-Tree in your production environment, there is a massive "if." BRIN only works if your data is physically sorted on disk relative to the indexed column.
Postgres tracks this via a statistic called correlation, which ranges from -1 to 1. If it's near 1, the data is stored on disk in the same order as the values (e.g., an auto-incrementing ID or a created_at timestamp in an append-only table); near -1 means exactly reverse order, which is just as good for BRIN. If it's near 0, the data is scattered randomly.
If your data is random, a BRIN index is useless because the min/max for every block range will likely be the same (the full range of the data), forcing Postgres to scan the entire table anyway.
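A quick way to convince yourself: simulate per-range (min, max) summaries over shuffled data — a toy sketch, not Postgres internals — and count how many ranges a narrow query can skip:

```python
# Why low correlation kills BRIN: with randomly placed values, every
# block range's (min, max) spans nearly the whole domain, so a narrow
# query overlaps every range and nothing can be skipped.
import random

def summarize(values, rows_per_range=12_800):
    return [(min(values[i:i + rows_per_range]),
             max(values[i:i + rows_per_range]))
            for i in range(0, len(values), rows_per_range)]

random.seed(0)
data = list(range(1_000_000))
random.shuffle(data)                       # correlation ~ 0
summaries = summarize(data)

lo, hi = 500_000, 500_999                  # narrow range query
overlapping = sum(1 for mn, mx in summaries if mx >= lo and mn <= hi)
# Every single range overlaps the query: the index prunes nothing,
# and Postgres falls back to reading the whole table anyway.
```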
You can check your correlation like this:
-- pg_stats is populated by ANALYZE; run ANALYZE sensor_readings first on a fresh table
SELECT
    attname,
    correlation
FROM pg_stats
WHERE tablename = 'sensor_readings'
  AND attname = 'created_at';

If your correlation for created_at is 0.99, BRIN is your best friend. If it’s 0.05, BRIN will effectively do nothing.
Tuning the Granularity
The default BRIN range is 128 blocks (usually 1MB of data). You can tune this using the pages_per_range parameter.
If you make it smaller, the index gets larger but more precise (fewer false positives during a scan). If you make it larger, the index gets even tinier, but you'll end up reading more data from disk for each query.
CREATE INDEX idx_sensor_brin_fine
    ON sensor_readings
    USING brin (created_at)
    WITH (pages_per_range = 32);

In a world where NVMe drives are standard, the penalty for scanning 128 blocks instead of 32 is often negligible compared to the massive benefit of keeping your entire index set in the CPU's L3 cache or the fastest tier of RAM.
The "Silent" Maintenance Gotcha
B-Trees are updated synchronously. When you insert a row, the B-Tree is updated immediately. BRIN is a bit more relaxed.
As you append data to the end of a table, BRIN summarizes the ranges. However, if you are doing a lot of out-of-order inserts or updates to older data, the min/max values in the BRIN index might grow to encompass too much, making the index "lossy" and inefficient.
Furthermore, new rows at the very end of the table might not be summarized immediately until a manual brin_summarize_new_values is called or the autovacuum daemon catches up.
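If you'd rather trigger summarization by hand — say from a scheduled job after a nightly bulk load — the function mentioned above can be called directly (index name from the earlier example):

```sql
-- Returns the number of previously unsummarized page ranges it processed
SELECT brin_summarize_new_values('idx_sensor_brin');
```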
To ensure your BRIN index stays tight on a high-velocity table, you can enable autosummarize:
ALTER INDEX idx_sensor_brin SET (autosummarize = ON);

When to Stick with B-Tree
I am making a strong case for BRIN, but it is not a silver bullet. You should stay with a B-Tree if:
1. You need Uniqueness: BRIN cannot enforce UNIQUE constraints.
2. Point Lookups: If your primary query pattern is SELECT * FROM table WHERE id = 12345, a B-Tree will always be faster. It takes you directly to the row. BRIN takes you to the neighborhood and makes you knock on 128 doors.
3. Low Correlation: If you’re indexing a UUID column that is generated randomly, BRIN is mathematically incapable of helping you.
4. Small Tables: If your table is 100MB, just use a B-Tree. The complexity isn't worth the 20MB of savings.
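Point 1 is enforced by Postgres itself — there is no way to get it wrong silently, because the index build fails immediately:

```sql
-- ERROR:  access method "brin" does not support unique indexes
CREATE UNIQUE INDEX idx_sensor_unique ON sensor_readings USING brin (id);
```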
Real-World Strategy: The Hybrid Approach
The most effective way to use BRIN is in a hybrid strategy, especially with partitioned tables.
Suppose you have a table partitioned by month.
* Current Month: Use B-Trees for high-speed inserts and frequent point-lookups while the data is "active."
* Historical Months: Once a month is finished, swap the B-Trees for BRIN indexes.
You keep the precision where it's needed and reclaim gigabytes of RAM from historical data that is usually only queried in large analytical ranges anyway.
-- The 'Old Data' migration
-- On a partition that still serves live queries, consider DROP INDEX CONCURRENTLY
-- and CREATE INDEX CONCURRENTLY (outside a transaction) to avoid blocking readers.
BEGIN;
DROP INDEX idx_logs_june_2023_btree;
CREATE INDEX idx_logs_june_2023_brin
    ON logs_y2023m06 USING brin (event_time)
    WITH (pages_per_range = 64);
COMMIT;

Final Thoughts
We often treat database performance as a CPU or Disk I/O problem, but frequently, it’s a memory geometry problem. When your indexes are too large to fit in RAM, your database performance falls off a cliff as it starts swapping index pages from disk just to find where the data lives.
BRIN allows you to cheat the physics of indexing. By accepting a small amount of extra work during the scan, you free up massive amounts of memory, allowing the rest of your database to run faster. If your data has a natural chronological or sequential order, stop leaking space into your B-Trees and let BRIN do the heavy lifting with a few kilobytes of summaries.


