loke.dev

Why Does Your Postgres Query Planner Fail on Correlated Columns?

Discover how the query planner's 'Independence Assumption' leads to catastrophic estimation errors and how multivariate statistics can restore your database performance.

· 8 min read

The Postgres query planner is a marvel of engineering, but it is also a hopeless optimist. It approaches every query with the naive assumption that the world is a series of unrelated events. It assumes that if you’re looking for a person named "John" who also happens to live in "London," the fact that his name is John has absolutely nothing to do with his location. Usually, this optimism works out fine. But when your data is deeply intertwined—like a car’s make and its model, or a zip code and a city—that optimism turns into a performance nightmare.

I spent most of a Tuesday last month staring at a query that was taking six seconds to return a handful of rows from a table with only a million records. On paper, the indexes were perfect. In practice, the planner was choosing a nested loop that was thrashing the CPU. The culprit wasn't a missing index or a bloated table; it was the "Independence Assumption."

The Math Behind the Curtain

To understand why the planner fails, we have to look at how it calculates "selectivity." Selectivity is just a fancy word for "what percentage of the rows will this filter match?"

When you run a query with multiple conditions in the WHERE clause, Postgres has to guess how many rows will come back to decide whether to use an Index Scan, a Bitmap Scan, or a Sequential Scan.

The formula Postgres uses by default is basic probability:
P(A and B) = P(A) * P(B)

If 10% of your rows have color = 'Red' and 10% have type = 'Sedan', Postgres assumes that color = 'Red' AND type = 'Sedan' will match exactly 1% of the table (0.10 * 0.10).
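Those per-column fractions are not guesses, either; ANALYZE stores them, and you can inspect them yourself. A quick sketch, assuming a hypothetical cars table with color and type columns:

-- Hypothetical table/column names; pg_stats is the standard view over
-- the per-column statistics that ANALYZE collects.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'cars'
  AND attname IN ('color', 'type');

The planner pulls its P(A) and P(B) from exactly these frequency lists before multiplying them together.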

This works if you’re flipping coins. It fails miserably when the columns are correlated.

When Logic Meets Data: A Practical Example

Let’s build a scenario. Imagine we're running a database for a massive auto parts supplier. We have a vehicles table.

CREATE TABLE vehicles (
    id SERIAL PRIMARY KEY,
    make TEXT,
    model TEXT,
    year INT,
    description TEXT
);

-- Insert 10 makes and 100 models, but ensure they are correlated
-- (e.g., 'F-150' only belongs to 'Ford')
INSERT INTO vehicles (make, model, year, description)
SELECT 
    'Make ' || (i % 10),
    'Model ' || (i % 100),
    2000 + (i % 24),
    md5(random()::text)
FROM generate_series(1, 1000000) s(i);

CREATE INDEX idx_vehicles_make_model ON vehicles (make, model);
ANALYZE vehicles;

In this dataset, Model 1 *only* exists for Make 1. If I filter for make = 'Make 1' AND model = 'Model 1', I should get about 10,000 rows (since 1,000,000 / 100 = 10,000).

Let’s see what Postgres thinks.

EXPLAIN ANALYZE
SELECT * FROM vehicles 
WHERE make = 'Make 1' AND model = 'Model 1';

The result often looks something like this:
Plan: Index Scan ... (cost=0.42..445.43 rows=1000 width=74)
Actual: rows=10000

The planner estimated 1,000 rows, but it found 10,000. It's off by a factor of 10.

Why? Because it calculated the probability of make = 'Make 1' (10%) and multiplied it by the probability of model = 'Model 1' (1%).
0.10 * 0.01 = 0.001 (or 1,000 rows)

In reality, if the model is 'Model 1', the make is *always* 'Make 1'. The columns are functionally dependent. The real selectivity should just be the selectivity of the model (1%).
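You can verify that functional dependency in the demo data directly. If model really determines make, no model should map to more than one make:

-- Returns zero rows when every model belongs to exactly one make,
-- i.e. model functionally determines make in this dataset.
SELECT model, count(DISTINCT make) AS makes_per_model
FROM vehicles
GROUP BY model
HAVING count(DISTINCT make) > 1;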

Why Small Errors Lead to Big Crashes

You might think, "So what? It's off by 9,000 rows. It still used the index."

In a simple SELECT, this error is a rounding mistake. But in a complex query involving three or four joins, these errors propagate exponentially.

If Postgres thinks a filter will return 100 rows when it actually returns 100,000, it might choose a Nested Loop Join. It thinks, "I'll just loop through these 100 rows and do a quick index lookup on the next table." But when it starts looping through 100,000 rows, your database hangs, your CPU spikes, and your Slack starts blowing up.

This is the "Selectivity Gap." It's the primary reason why queries that run fine in your staging environment (with small, random data) explode in production (with large, correlated data).

Enter Multivariate Statistics

Starting in Postgres 10, and significantly improved in later versions, we have a tool to fix this: CREATE STATISTICS. This command allows us to tell the planner, "Hey, these columns are related. Don't treat them as independent."

There are three main types of multivariate statistics we can collect:
1. dependencies: For functional dependencies (if I know A, I can guess B).
2. ndistinct: For cases where the combination of columns has a different number of unique values than the planner expects.
3. mcv (Most Common Values): A list of the most frequent combinations of values.

Let's fix our vehicles table.

CREATE STATISTICS stts_vehicles_make_model 
ON make, model FROM vehicles;

ANALYZE vehicles;

Now, let's run that EXPLAIN ANALYZE again.

EXPLAIN ANALYZE
SELECT * FROM vehicles 
WHERE make = 'Make 1' AND model = 'Model 1';

The New Result:
Plan: Index Scan ... (cost=0.42..445.43 rows=10000 width=74)
Actual: rows=10000

Perfect. The planner now sees the relationship and adjusts its estimate. It no longer multiplies the probabilities; it looks at the mcv or the dependencies list to see how often these two values actually appear together.
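If you're curious what ANALYZE actually recorded, the pg_stats_ext view (available since Postgres 12) exposes the collected extended statistics:

-- The dependencies column shows the measured degree of functional
-- dependency between the columns (1.0 means fully determined).
SELECT statistics_name, kinds, n_distinct, dependencies
FROM pg_stats_ext
WHERE tablename = 'vehicles';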

When to Use Which Statistic Type?

When you run CREATE STATISTICS, you can specify the types. If you don't, Postgres builds every kind it supports: dependencies and ndistinct, plus mcv from Postgres 12 onward.

1. Dependencies

This is best when one column is a "subset" of another. Think zip_code and city. If you know the zip code, you know the city.

CREATE STATISTICS s_geo (dependencies)
ON zip_code, city FROM locations;

2. N-Distinct

This helps the planner estimate the number of groups in a GROUP BY. If you have a department column and a sub_department column, the number of unique pairs is much lower than n_distinct(department) * n_distinct(sub_department), the product the planner would otherwise assume.

CREATE STATISTICS s_dept_groups (ndistinct)
ON department, sub_department FROM employees;
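With those stats in place, you can sanity-check the grouping estimate. A sketch, assuming the hypothetical employees table above:

-- Compare the planner's estimated group count (rows on the top node)
-- against the actual count; ndistinct stats should close the gap.
EXPLAIN ANALYZE
SELECT department, sub_department, count(*)
FROM employees
GROUP BY department, sub_department;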

3. MCV (Most Common Values)

This is the most powerful one for skewed data. It builds a map of the most common pairs. If 'Make 1' and 'Model 1' appear 10,000 times, but 'Make 2' and 'Model 1' appear 0 times, the MCV list will capture that.

CREATE STATISTICS s_skewed_data (mcv)
ON col_a, col_b FROM big_table;
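You can even look inside the MCV list once it's built. This sketch uses pg_mcv_list_items(), which Postgres 12+ provides for exactly this kind of inspection (the statistics name is the one from the example above):

-- Each row is one stored value combination, with its observed frequency
-- and the "base" frequency the independence assumption would predict.
SELECT m.index, m.values, m.frequency, m.base_frequency
FROM pg_statistic_ext e
JOIN pg_statistic_ext_data d ON d.stxoid = e.oid,
     pg_mcv_list_items(d.stxdmcv) m
WHERE e.stxname = 's_skewed_data';

A large gap between frequency and base_frequency for a pair is the correlation problem made visible.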

The "Join" Problem

It is important to manage expectations here: CREATE STATISTICS currently only works for columns within the same table.

If you have a correlation between orders.customer_id and customers.region, Postgres still uses the independence assumption because those columns live in different tables. Cross-table statistics are the "holy grail" of query optimization, but they are incredibly difficult to implement without making ANALYZE take three days to run.

If you hit a performance wall because of cross-table correlation, you usually have to resort to:
- Denormalization (moving the columns into the same table).
- Writing a CTE or a temporary table to "materialize" a result set and trick the planner.
- Using a join hint (if you’re using an extension like pg_hint_plan, though I usually advise against this).
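The CTE trick from that list can be sketched like this, assuming hypothetical orders and customers tables. The MATERIALIZED keyword (Postgres 12+) forces the planner to compute the CTE first instead of inlining it:

-- Materializing the selective side gives the planner an exact row count
-- for the join input instead of a cross-table guess.
WITH emea_customers AS MATERIALIZED (
    SELECT id FROM customers WHERE region = 'EMEA'
)
SELECT o.*
FROM orders o
JOIN emea_customers c ON c.id = o.customer_id;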

How to Find Correlation Issues in the Wild

You don't want to create statistics on every pair of columns in your database. That's a recipe for catalog bloat and slow maintenance. Instead, you should be a detective.

Look for queries where the estimated rows and actual rows differ by an order of magnitude.

Rather than a single magic query, I use this workflow to find candidates for multivariate stats in my slow query logs:

1. Identify a slow query.
2. Run EXPLAIN (ANALYZE, BUFFERS).
3. Look for the "widest" gap in row estimates.
4. Check if those columns are in the same table.
5. Check if they are logically related (State/City, Category/Subcategory, etc.).

Another trick is to query the pg_stats view to see the "null_frac" and "n_distinct" of columns. If you see two columns that both have low cardinality (few unique values) but are often queried together, they are prime candidates for CREATE STATISTICS.
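A starting point for that pg_stats sweep might look like this (the thresholds are arbitrary, and negative n_distinct values are ratios of the row count, which this sketch simply skips):

-- Low-cardinality columns that share a table are the usual suspects.
SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct BETWEEN 1 AND 1000
ORDER BY tablename, n_distinct;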

The Cost of Being Right

Nothing in databases is free. When you add multivariate statistics:
1. Analyze takes longer: Postgres has to do more work to sample the data and find correlations.
2. Catalog growth: The extended-statistics data lives in the system catalogs (pg_statistic_ext_data in Postgres 12+), and each object you add makes them larger.
3. Complexity: It’s one more thing to manage. If you change your schema, you need to remember these stats exist.
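Housekeeping is straightforward, at least: you can list every extended-statistics object from the catalog and drop the ones that stop paying their way.

-- stxkind encodes the statistic types: f = dependencies, d = ndistinct,
-- m = mcv.
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext;

DROP STATISTICS IF EXISTS stts_vehicles_make_model;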

I generally follow the "Rule of 10": If the planner's estimate is off by more than 10x, and that specific query is part of a critical path, it's time for CREATE STATISTICS.

A Gotcha: The MCV Limit

By default, Postgres only tracks a certain number of "most common values." If your data has a "long tail" of thousands of correlated pairs, the default statistics might not be enough.

You can raise the statistics target for the columns involved:

ALTER TABLE vehicles ALTER COLUMN make SET STATISTICS 500;

This increases the granularity of the histograms, which CREATE STATISTICS will then use. Just be careful—bumping this too high can make your ANALYZE runs painfully slow.
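Since Postgres 13 you can also attach a target to the statistics object itself rather than the individual columns (the name here is the one from the earlier example):

-- Sets the sample detail for this extended-statistics object only
-- (Postgres 13+); -1 resets it to inherit from the column targets.
ALTER STATISTICS stts_vehicles_make_model SET STATISTICS 500;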

Final Thoughts

The Postgres query planner is a black box that we often treat with a mix of awe and frustration. But it’s not magic; it’s a statistical engine. When it fails, it’s usually because the statistics it has don't match the reality of your data.

The Independence Assumption is a necessary shortcut for the planner to stay fast, but it’s a shortcut that leads off a cliff when your data has internal logic. By using CREATE STATISTICS, you’re essentially giving the planner a map of those shortcuts.

Next time you see a query plan choosing a nested loop for a hundred thousand rows, don't just add another index. Look at the correlations. The fix might be a single line of SQL that teaches your database a little bit more about how your data actually relates to itself.