
The Case for Partial Indexing: How to Shrink Your Postgres Storage Footprint by 80% Without Losing Query Speed
Stop indexing millions of redundant rows and learn how to use partial predicates to target only the 'interesting' data for hyper-efficient B-Trees.
I spent the first few years of my career treating database indexes like a "binary toggle." Either a column was indexed or it wasn't. If my queries against the orders table were slow when filtering for status = 'pending', I’d just slap an index on the status column and call it a day. It felt like the right thing to do until I looked at the statistics. I realized I was forcing Postgres to maintain a massive B-Tree containing 10 million rows where 9.9 million of them were marked as shipped or cancelled—rows I literally never searched for. I was paying a massive "storage tax" for data I didn't care about, and my INSERT performance was suffering because of it.
Then I found partial indexes. It’s one of those "level up" moments where you realize you can tell the database, "Only index the rows that actually matter."
The Bloat We Accept as Normal
When you create a standard B-Tree index in Postgres, every single row in your table gets an entry in that index. If your table has a hundred million rows, your index has a hundred million entries.
Think about the typical lifecycle of data. In a high-volume system, "hot" data—the stuff your users are actually interacting with—usually represents a tiny fraction of the total dataset.
* In a task manager, users care about incomplete tasks.
* In an e-commerce backend, admins care about processing or failed payments.
* In a messaging app, you care about unread notifications.
Once an order is delivered or a notification is read, it mostly sits there for historical purposes. It’s "cold." Yet, a standard index treats a 5-year-old delivered order with the exact same priority as a payment that failed five seconds ago.
This leads to three major problems:
1. Disk Space: Indexes take up room. On large tables, the indexes can often exceed the size of the raw data.
2. Memory Pressure: Postgres tries to keep indexes in the shared_buffers (RAM). If your index is 40GB but your RAM is 32GB, you’re constantly swapping index pages from disk. That kills performance.
3. Write Amplification: Every time you INSERT a row or UPDATE an indexed column, Postgres has to update the B-Tree. If you have 10 indexes on a table, every write is significantly more expensive.
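Before optimizing anything, it helps to quantify the problem. Here's a quick sketch using Postgres's built-in size functions, assuming a table named orders (substitute your own table name):

```sql
-- How much of this table's footprint is index overhead?
SELECT
    pg_size_pretty(pg_table_size('orders'))          AS table_data,
    pg_size_pretty(pg_indexes_size('orders'))        AS all_indexes,
    pg_size_pretty(pg_total_relation_size('orders')) AS total;
```

If the all_indexes figure rivals or exceeds table_data, you're a candidate for what follows.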
Enter the Partial Index
A partial index is simply an index with a WHERE clause. It tells Postgres: "Only include a row in this index if it meets this specific condition."
The syntax is remarkably simple:
CREATE INDEX idx_orders_unfulfilled
ON orders (created_at)
WHERE status NOT IN ('delivered', 'cancelled');

By adding that WHERE clause, you’ve fundamentally changed the economics of your database. If only 5% of your orders are currently "active," your index is now roughly 95% smaller than it would have been.
A Real-World Comparison
Let's look at a concrete example. Imagine an events table that tracks system logs. 99% of these are info or debug level. You only ever query for critical or error levels to build your alerting dashboard.
-- The "Default" Way
CREATE INDEX idx_events_level_full ON events (level);
-- The "Partial" Way
CREATE INDEX idx_events_level_partial ON events (level)
WHERE level IN ('critical', 'error');

On a table with 10 million rows:
- The Full Index might take up 280 MB.
- The Partial Index (assuming 1% error rate) might take up 3 MB.
That is a 98.9% reduction in storage footprint for that specific index. And because the partial index is so small, it is much more likely to stay cached in RAM, making the queries that use it lightning-fast.
The "One Default" Pattern: A Unique Use Case
Partial indexes aren't just for performance; they are also a powerful tool for enforcing data integrity.
I recently worked on a system where users could have multiple shipping addresses, but only one could be the "primary" address. If you try to enforce this with a standard unique constraint, you're stuck. You can't make the is_primary column unique because then only one user in the *entire database* could have a primary address.
This is where a partial unique index shines:
CREATE UNIQUE INDEX idx_user_single_primary_address
ON addresses (user_id)
WHERE is_primary IS TRUE;

This index only tracks rows where is_primary is true. It ensures that for any given user_id, the value can only appear once in the index. If a user tries to set a second address as primary, Postgres will throw a violation. Meanwhile, a user can have an unlimited number of addresses where is_primary is false, because those rows aren't part of the unique constraint's scope.
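You can exercise the constraint directly. A sketch, assuming an addresses table with user_id, street, and is_primary columns (illustrative names):

```sql
-- Two non-primary addresses for the same user: both fine.
INSERT INTO addresses (user_id, street, is_primary) VALUES (42, '1 Oak St', false);
INSERT INTO addresses (user_id, street, is_primary) VALUES (42, '2 Elm St', false);

-- First primary address for user 42: fine.
INSERT INTO addresses (user_id, street, is_primary) VALUES (42, '3 Ash St', true);

-- A second primary address for user 42 is rejected with a
-- "duplicate key value violates unique constraint" error.
INSERT INTO addresses (user_id, street, is_primary) VALUES (42, '4 Fir St', true);
```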
How the Query Planner Decides
One thing that tripped me up early on was understanding when Postgres actually decides to use a partial index. The query planner is smart, but it’s also pedantic.
To use a partial index, your query's WHERE clause must mathematically imply the index's predicate.
If you have this index:
CREATE INDEX idx_active_users ON users (last_login)
WHERE status = 'active';

This query will use the index:
SELECT * FROM users
WHERE status = 'active'
ORDER BY last_login DESC;

This query will NOT use the index (usually):
SELECT * FROM users
WHERE last_login > '2023-01-01';

Even if all users with a last_login > '2023-01-01' happen to be active, Postgres doesn't know that for sure. It won't risk using a partial index that might be missing data. You have to explicitly include the index's condition in your SQL so the planner knows it's safe.
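The match doesn't have to be verbatim, though. The planner can prove simple implications between comparison operators on the same column. A sketch with an illustrative total_cents column:

```sql
CREATE INDEX idx_big_orders ON orders (created_at)
WHERE total_cents > 10000;

-- This query CAN use idx_big_orders, because the planner can
-- prove that total_cents > 50000 implies total_cents > 10000.
SELECT * FROM orders
WHERE total_cents > 50000
ORDER BY created_at DESC;
```

This proof machinery only handles simple cases; anything involving cross-column reasoning or non-trivial expressions is beyond it, so keep predicates plain.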
The "Soft Delete" Performance Trap
If you use soft deletes (the infamous deleted_at column), you are likely wasting huge amounts of space. Most developers index deleted_at or include it in composite indexes.
But think about it: your application probably filters out deleted rows in 99% of its queries.
-- Standard approach: huge index including deleted rows
CREATE INDEX idx_posts_user_id ON posts (user_id);
-- Optimized approach: index only the data the app actually uses
CREATE INDEX idx_posts_user_id_active ON posts (user_id)
WHERE deleted_at IS NULL;

In a system with a lot of churn, the "active" index stays lean. As rows are deleted, they are automatically removed from the partial index. This keeps your "active" lookups fast and prevents the index from bloating with "tombstone" entries that you'll never query.
Gotchas and Limitations
It’s not all sunshine and free disk space. There are a few things that will bite you if you aren't careful.
1. Hardcoded Values
The predicate of a partial index must be built from constants and the table's own columns, using only operators and functions marked IMMUTABLE. Anything volatile or merely STABLE, like NOW(), is off-limits.
For example, you can't do this:
-- THIS WILL FAIL
CREATE INDEX idx_recent_orders ON orders (id)
WHERE created_at > NOW() - INTERVAL '30 days';

Postgres needs the index definition to be constant. If it depended on NOW(), the set of rows the index covers would shift every second, and Postgres would have to continuously re-evaluate and rebuild it to stay correct.
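One common workaround is to bake a literal cutoff into the predicate and recreate the index periodically, say from a nightly job. A sketch (the date is illustrative):

```sql
-- Rebuilt on a schedule so the cutoff stays roughly "30 days ago".
CREATE INDEX idx_recent_orders ON orders (id)
WHERE created_at > '2024-01-01';
```

Remember that queries must then include a matching created_at condition, or the planner won't consider the index.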
2. Dependency on the Predicate
If you use a partial index, your application code is now "coupled" to that logic. If you decide to change your "active" status from active to verified, your index becomes useless until you drop and recreate it with the new predicate.
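When the predicate does change, build the replacement before dropping the old index so lookups never fall back to sequential scans. A sketch, reusing the earlier idx_active_users example with a hypothetical new 'verified' status:

```sql
-- Build the new index first; CONCURRENTLY avoids blocking writes
-- (note: it cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_verified_users ON users (last_login)
WHERE status = 'verified';

-- Only then retire the stale index.
DROP INDEX CONCURRENTLY idx_active_users;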
3. Redundancy
If you have a partial index on WHERE status = 'pending' and a full index on the same column, you’re just wasting space. I often see people forget to remove the "general" index after adding a more specific partial one. Run a quick check on pg_stat_user_indexes to see if your full indexes are actually being used after you’ve deployed your partial ones.
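A query along these lines surfaces indexes that haven't been scanned since statistics were last reset, which are candidates for removal once a partial replacement has proven itself:

```sql
-- Never-used indexes, largest first.
SELECT relname                                        AS table_name,
       indexrelname                                   AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid))   AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Give it time before acting: an index used only by a monthly report will look "unused" for weeks.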
The Impact on Write Amplification
This is the hidden win. Every time you update a row in Postgres, the database doesn't actually "overwrite" the data (due to MVCC). It creates a new version of the row. This means every index pointing to that row also needs to be updated.
However, if you have a partial index and your update doesn't touch the columns in the predicate—or if the row still doesn't meet the predicate—Postgres can often skip updating that specific index.
By shrinking your index set to only include "interesting" data, you reduce the I/O load on every single INSERT, UPDATE, and DELETE. On a write-heavy system, this can be the difference between a database that's cruising at 20% CPU and one that's spiking to 90%.
Checking for "Index Worth"
Before you go off and create a hundred partial indexes, you should verify if they're actually needed. I usually start by looking for "low cardinality" columns with highly skewed data distributions.
You can use this query to find the most common values in a column:
SELECT status, count(*)
FROM orders
GROUP BY status
ORDER BY count(*) DESC;

If you see that one value (like shipped) accounts for 90% of the table, but your application logic almost always filters for the other 10%, you have a prime candidate for a partial index.
Summary of Best Practices
If you want to start shrinking your footprint today, here’s my checklist:
1. Identify the "Junk": Look for columns with is_processed, deleted_at, or status. Find the values that represent "finished" or "historical" states.
2. Check Query Patterns: Look at your slow query logs. Are you constantly adding WHERE status != 'archived'?
3. Measure First: Use SELECT pg_size_pretty(pg_relation_size('your_index_name')); to see how much space your current index takes.
4. Deploy Concurrently: Always use CREATE INDEX CONCURRENTLY in production to avoid locking the table.
5. Verify the Plan: Run EXPLAIN ANALYZE on your queries to ensure the planner is actually picking up the new partial index.
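For that last step, a minimal check using the idx_active_users example from earlier (substitute your own query and index names):

```sql
EXPLAIN ANALYZE
SELECT * FROM users
WHERE status = 'active'
ORDER BY last_login DESC
LIMIT 50;
```

You want to see a line like "Index Scan using idx_active_users" in the plan; a "Seq Scan" means the planner either can't prove your WHERE clause implies the predicate, or thinks the scan is cheaper.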
Partial indexing is one of those rare "free lunches" in database engineering. You get faster queries, smaller backups, and lower disk costs, all by writing one extra line of SQL. We shouldn't be indexing everything by default; we should be indexing with intent.


