
What Nobody Tells You About Postgres Statistics: Why Your Query Planner Suddenly Decided to Ignore Your Index
When your perfectly indexed query starts performing a sequential scan, the culprit isn't your code—it's a stale histogram in the system catalogs.
I remember staring at a 10-second query on a table with 50 million rows, feeling that specific brand of developer rage. I had the index. I’d verified it existed with \d+. I’d named it properly. But when I ran EXPLAIN ANALYZE, there it was in plain text: Seq Scan on heavy_table.
The database was ignoring me. It was like I’d built a high-speed rail line and the conductor decided to drive the train through the mud instead.
If you’ve spent any time with Postgres, you’ve likely hit this wall. You assume that if an index exists, the query planner will use it if the query filters by that column. But the Postgres Query Planner isn't a rule-based engine; it's a cost-based engine. It doesn't care about your feelings or your "perfectly optimized" index. It only cares about the math.
The culprit is almost always the same: the statistics in the system catalogs are lying to the planner, or the planner doesn't have the specific type of math it needs to understand your data distribution.
The Ghost in the Machine: pg_stats
Postgres doesn't look at your actual table data when you run a query. That would be too slow. Instead, it looks at a "sketch" of your data stored in a system view called pg_stats.
When you run a query, the planner asks: "How many rows will this return?" This is called cardinality estimation. If the planner thinks your filter will return 90% of the table, it will skip the index and do a sequential scan because reading the whole file is faster than jumping back and forth between an index and the heap.
The problem arises when the planner *thinks* you're getting 90% of the rows, but you're actually getting 0.01%.
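Those row estimates are built from two layers of metadata: coarse table-level counts in pg_class (reltuples, relpages) and the per-column distributions in pg_stats. A quick way to peek at the table-level half — a sketch, with 'orders' as a placeholder table name:

```sql
-- Table-level stats the planner combines with column-level stats.
-- reltuples and relpages are estimates refreshed by VACUUM/ANALYZE,
-- not live counts. 'orders' is a placeholder table name.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'orders';
```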
Let's look at what the planner sees. Run this on one of your tables:
```sql
SELECT
    attname,
    inherited,
    null_frac,
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds
FROM pg_stats
WHERE tablename = 'your_table_name';
```

In that output, you’ll see histogram_bounds. This is a list of values that divide the column's data into roughly equal-sized groups. If your data has changed significantly since the last time ANALYZE ran, these bounds are garbage.
The "Moving Frontier" Problem
The most common reason for a sudden index abandonment is what I call the Moving Frontier.
Imagine an orders table where you have an index on created_at. You’re querying for orders from the last hour.
```sql
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 hour';
```

If you just bulk-loaded 100,000 rows for today, but the last ANALYZE ran yesterday, the pg_stats view thinks the "maximum" value in that column is yesterday’s timestamp. To the planner, your query for "today's data" looks like it's asking for a range that exists entirely *outside* the known distribution of the table.
In older versions of Postgres, this often led to a rock-bottom estimate of 1 row, which could push the planner into a Nested Loop join that kills performance. In other cases, the planner assumes a high cost for the out-of-range scan and opts for a Seq Scan instead.
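Before reaching for fixes, it's worth checking how stale the stats actually are. A sketch, with 'orders' as a placeholder table name:

```sql
-- When were stats last refreshed, and how many rows have changed since?
-- n_mod_since_analyze is available from Postgres 9.6 onward.
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

A large n_mod_since_analyze relative to the table size is a strong hint that the histogram no longer matches reality.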
The Fix: You need to tell Postgres to update its map.
```sql
ANALYZE orders;
```

But don't just run it manually every time. If you have a high-ingest table, you might need to tune your autovacuum settings to be more aggressive so it analyzes more frequently.
```sql
ALTER TABLE orders SET (
    autovacuum_analyze_scale_factor = 0.01,  -- analyze after ~1% of rows change (default is 10%)
    autovacuum_analyze_threshold = 1000      -- plus a base of 1,000 changed rows (default is 50)
);
```

The Correlation Trap
This is the one that catches even senior DBAs. There is a column in pg_stats called correlation.
Correlation measures the statistical relationship between the physical order of data on the disk and the logical order of the values in the column.
* 1.0: The data is perfectly ordered on disk (e.g., an auto-incrementing ID).
* 0.0: The data is completely random.
* -1.0: The data is perfectly ordered in reverse.
If the correlation is near 0, using an index is "expensive" because the disk head (or the SSD controller) has to jump to a hundred different blocks to fetch a hundred rows. If the correlation is 1.0, those hundred rows are all sitting right next to each other in one or two blocks.
If your data starts out ordered but becomes fragmented over time due to UPDATEs or DELETEs, the correlation drops. Eventually, the planner decides the "random I/O penalty" of the index is higher than the "bulk read benefit" of a Seq Scan.
You can check this with:
```sql
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'users'
ORDER BY correlation DESC;
```

If you see a low correlation on a column you frequently range-scan, you might need to CLUSTER the table (which is a blocking operation, be careful) or use a tool like pg_repack to reorder the data physically.
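If you decide the physical order is worth fixing, a minimal sketch — the table and index names are placeholders, and remember that CLUSTER takes an ACCESS EXCLUSIVE lock for the duration:

```sql
-- Rewrite the table in index order; blocks reads and writes while it runs.
CLUSTER users USING idx_users_created_at;

-- CLUSTER does not keep future writes ordered, and the rewrite leaves
-- the old stats out of date, so re-analyze afterwards.
ANALYZE users;
```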
When 100 Buckets Aren't Enough
By default, Postgres splits your data into 100 buckets (the statistics target). For a table with 100 million rows, each bucket represents a million rows.
If your data distribution is "peakier" than that—meaning you have huge clusters of specific values mixed with thin spreads of others—the 100-bucket histogram is too blunt an instrument. The planner will round off the edges and make a bad guess.
You can increase the resolution for a specific column without bloating the stats for the whole database:
```sql
-- Increase stats target for the 'status' column to 1000 buckets
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;

-- You MUST run ANALYZE for the change to take effect
ANALYZE orders;
```

I’ve seen this single change turn a 30-second query into a 50ms query because it finally gave the planner enough detail to realize that status = 'PENDING' was rare, while status = 'SHIPPED' was common.
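You can confirm the richer statistics actually materialized by counting the entries after the re-analyze. One caveat worth knowing: for a low-cardinality column like status, most of the detail lands in the most-common-values list rather than the histogram, so n_hist_bounds may be small or NULL.

```sql
-- Count how many MCV entries and histogram bounds exist for the column.
SELECT attname,
       array_length(most_common_vals, 1) AS n_mcvs,
       array_length(histogram_bounds, 1) AS n_hist_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';
```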
The Cross-Column Correlation Headache
Postgres, by default, assumes that filter columns are statistically independent of each other — the same naive independence assumption a naive Bayes classifier makes.
Suppose you have a cars table with make and model.
* The planner knows make = 'Honda' matches 10% of rows.
* The planner knows model = 'Civic' matches 1% of rows.
If you query WHERE make = 'Honda' AND model = 'Civic', the planner multiplies the probabilities: 0.10 * 0.01 = 0.001.
This works... until it doesn't. What if you query WHERE make = 'Honda' AND model = 'F-150'? The math says 0.1% of rows, but the reality is 0. A Honda F-150 doesn't exist.
More importantly, what if you have state and city? If you filter by state = 'California' AND city = 'San Francisco', the planner thinks this is a very rare combination because it multiplies the probability of CA by the probability of SF. But in reality, almost every "San Francisco" record is also a "California" record. The planner underestimates the number of rows, thinks the result set is tiny, and picks a slow plan (like a Nested Loop) that falls over when it hits 50,000 rows instead of the 5 it expected.
Since Postgres 10, we have Extended Statistics to fix this.
```sql
CREATE STATISTICS city_state_dep (dependencies)
ON city, state
FROM addresses;

ANALYZE addresses;
```

This tells Postgres: "Hey, these two columns are linked. Don't just multiply their probabilities; actually look at how they correlate."
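Once built and analyzed, you can inspect what Postgres learned. The pg_stats_ext view exists from Postgres 12 onward (on v10/v11 you'd query the pg_statistic_ext catalog directly):

```sql
-- dependencies reports the measured functional dependency strength
-- between the columns; city => state should come out close to 1.0.
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE tablename = 'addresses';
```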
The "Cost" Parameters: Lies We Tell the Planner
Sometimes the statistics are perfect, but the planner's "worldview" is wrong. This usually happens because of the random_page_cost setting.
Historically, random_page_cost was set to 4.0, while seq_page_cost was 1.0. This was back in the days of spinning platters, where moving the disk head was much slower than reading sequentially.
If you are running on modern SSDs or NVMe, the penalty for random access is almost zero. If you leave random_page_cost at 4.0, you are essentially lying to Postgres, telling it that your index is four times slower than it actually is. This causes the planner to flip to a Seq Scan much earlier than it should.
Check your settings:
```sql
SHOW random_page_cost;
SHOW seq_page_cost;
```

On modern cloud infrastructure (AWS Aurora, RDS with GP3, etc.), I almost always set random_page_cost to 1.1 or even 1.0.
```sql
-- Set it for the whole session to test
SET random_page_cost = 1.1;
EXPLAIN ANALYZE SELECT ... -- See if the index is now used
```

How to Debug the Planner’s Brain
When you’re stuck, stop looking at the query and start looking at the estimates.
Run EXPLAIN (without the ANALYZE if the query is too slow to finish).
```
Index Scan using idx_orders_customer on orders (cost=0.42..8.44 rows=10 width=45)
```

Look at rows=10. Now run SELECT COUNT(*) with the same filters. If the count is 10,000 and the estimate was 10, your statistics are stale or your statistics target is too low.
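If the query finishes in a reasonable time, you can get both numbers in one shot by running it with ANALYZE and comparing estimate against reality on the same plan line. The filter here is a placeholder:

```sql
-- In the output, the first rows= is the planner's estimate; the rows=
-- inside the (actual ...) section is the real count. A 10x-or-more gap
-- between the two usually means stale or insufficient statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```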
Wait, what about `LIMIT`?
Here is a sneaky one: LIMIT can actually break your query. If you have a query that should use an index but has a LIMIT 1, the planner might think: "I could use the index to find these rows in order, OR I could just start a Seq Scan and I'll probably stumble upon one row that matches within the first few blocks."
If the row it's looking for is actually at the *end* of the table, the Seq Scan will take forever. This is the "Row Estimation Whack-a-Mole."
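One way to fight back, if you're confident the index is the right plan, is to force the filter to be evaluated before the LIMIT is considered — for example with a materialized CTE, which acts as an optimization fence. MATERIALIZED requires Postgres 12+; the table and columns here are placeholders:

```sql
-- Materialize the filtered set first, so the planner can't bet on
-- "stumbling upon" an early match during a sequential scan.
WITH candidates AS MATERIALIZED (
    SELECT * FROM events WHERE user_id = 42
)
SELECT *
FROM candidates
ORDER BY created_at DESC
LIMIT 1;
```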
Summary Checklist for When Your Index is Ignored
1. Check `pg_stat_user_tables`: When did last_analyze (or last_autoanalyze) last run? If it was hours ago and the table is busy, run ANALYZE.
2. Compare Estimate vs. Reality: Does EXPLAIN say rows=1 when the actual count is 1000? If so, your stats are the problem.
3. Check Correlation: Is the column's physical order a mess? Consider re-indexing or clustering.
4. Check Statistics Target: For skewed data, bump the column statistics target to 500 or 1000.
5. Use Extended Statistics: If you filter on multiple related columns, use CREATE STATISTICS.
6. Adjust the Cost Model: If you're on SSDs, make sure random_page_cost isn't stuck at the default 4.0.
Postgres is incredibly smart, but it's a mathematician, not a mind reader. It relies on the data in pg_catalog to make its choices. When the planner fails, don't just add more indices—fix the map it's using to navigate the ones you already have.


