loke.dev
Header image for An Extra Passenger for the Postgres Index

An Extra Passenger for the Postgres Index

How the INCLUDE clause allows you to sidestep the 'Heap Fetch' and achieve index-only scans without the overhead of a multi-column B-tree.

· 5 min read

Why is Postgres still hitting the disk to read the table even when you’ve clearly indexed the column in your WHERE clause?

It’s a common frustration. You’ve done the work, you’ve added the index, and yet, the query plan shows a "Heap Fetch." In Postgres-speak, the "Heap" is just the table itself. When Postgres looks up a value in an index, it usually has to take that information and then go knock on the table’s door to grab the actual data you requested in your SELECT statement.

This extra trip is called a Heap Fetch, and if you’re doing it millions of times, it’s the silent killer of performance.

The Tax of the Heap Fetch

Let’s look at a classic example. Imagine we have a users table with 5 million rows.

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email TEXT NOT NULL,
    username TEXT NOT NULL,
    last_login TIMESTAMP,
    bio TEXT
);

CREATE INDEX idx_users_email ON users(email);

If you run a query like this:

SELECT username FROM users WHERE email = 'hello@example.com';

Postgres will use idx_users_email to find the right row. But wait—the index only contains the email and a pointer to the row. It doesn't know what the username is. So, it has to go to the table (the heap) to find that specific row and pull the username string.

In a high-traffic app, these random I/O hits add up fast.

Entering the Index-Only Scan

The holy grail of Postgres performance is the Index-Only Scan. This happens when the index itself contains every single piece of data the query needs. If the index has the answer, Postgres never has to touch the heap.

Before Postgres 11, the only way to do this was to create a multi-column index:

-- The "Old Way"
CREATE INDEX idx_users_email_username ON users(email, username);

This works, but it’s heavy. Because username is part of the index key, Postgres has to keep the B-tree sorted by email *and* username. If you update a user's username, the database has to do more work to re-sort and maintain that B-tree structure.

The INCLUDE Clause: The Extra Passenger

Postgres 11 introduced a smarter way to handle this: the Covering Index using the INCLUDE clause. It allows you to attach "payload" data to the index without making it part of the search key.

Think of it like a sidecar on a motorcycle. The motorcycle (the index key) does the steering, while the sidecar (the included column) just sits there for the ride.

CREATE INDEX idx_users_email_include_username 
ON users(email) 
INCLUDE (username);

In this setup:
1. The Search Key is just `email`. The B-tree is only sorted by email.
2. The Payload is `username`. The data is stored right there in the leaf nodes of the index.

Now, when you run that SELECT username query, Postgres sees the data in the index and performs an Index-Only Scan. No heap fetch required.

Why is INCLUDE better than a multi-column index?

You might wonder why we don't just put everything in the main index key. There are three big reasons:

1. Size and Overhead: Since the INCLUDE columns aren't part of the tree's internal search structure, they don't affect the sorting logic. This makes the index slightly more efficient to maintain.
2. Unique Constraints: This is the killer feature. You can enforce uniqueness on one column but still "include" others for performance.

`sql
-- This ensures email is unique, but still lets us
-- get the last_login via an Index-Only Scan
CREATE UNIQUE INDEX idx_unique_email_covering_login
ON users(email)
INCLUDE (last_login);

3. Suffix Truncation: Postgres is smart enough to truncate long strings in index keys to save space, but it can’t do that as effectively if you have five different columns in the key.

Real-World Performance Comparison

I recently worked on a system where we were polling a tasks table for the next available job.

-- Standard Index
CREATE INDEX idx_tasks_status ON tasks(status);

-- The Query
SELECT id, priority FROM tasks WHERE status = 'pending';

With 10 million rows, the "Index Scan + Heap Fetch" was taking roughly 15ms per call. After switching to a covering index:

CREATE INDEX idx_tasks_status_covering 
ON tasks(status) 
INCLUDE (id, priority);

The query time dropped to under 1ms. We effectively removed the disk I/O bottleneck by making the index "self-sufficient."

The "Visibility Map" Gotcha

There is one catch that trips people up. For an Index-Only Scan to work, Postgres needs to know if the data in the index is actually "visible" to your current transaction (i.e., it hasn't been deleted or updated by someone else recently).

Postgres tracks this in a Visibility Map. If your table is under heavy UPDATE pressure and VACUUM isn't keeping up, the visibility map will be "dirty." When that happens, Postgres will be forced to check the heap anyway to verify the row's version, even if you use INCLUDE.

The lesson: If you’re using covering indexes, make sure your autovacuum settings are dialed in.

When should you use it?

Don't go adding INCLUDE to every index. It increases the size of your index on disk and in memory (the Buffer Cache).

Use it when:
- You have a specific, high-frequency query that is bottlenecked by Heap Fetches.
- You want to include columns that change frequently, but aren't used for filtering/sorting.
- You need a unique constraint on one column but want the benefits of a covering index for others.

It's a surgical tool. When used correctly, it’s one of the easiest ways to squeeze 10x performance out of a "slow" Postgres query without changing a single line of application code.