
A 48-Hour Deep Dive into the Postgres Transaction Log: How I Finally Visualized Multi-Version Concurrency Control
Discover why your database keeps 'ghost' versions of your rows and how mastering isolation levels can prevent the performance cliff of table bloat.
I was staring at a 50GB table that only held 2 million rows of small JSON objects. The math didn't add up until I realized I wasn't just storing my data; I was storing the history of every mistake I’d made in the last 48 hours. It turns out that in Postgres, the "current" state of your database is just the surface of a much deeper, messier ocean of versioned data.
The Lie of the "Update"
We’re taught from our first SQL lesson that UPDATE users SET name = 'Alice' WHERE id = 1; finds a row and changes a value. In Postgres, that is a lie. Postgres never actually updates a row in place. If it did, it would have to lock that row, preventing anyone else from reading it while the change was happening. To maintain high concurrency, Postgres uses Multi-Version Concurrency Control (MVCC).
When you "update" a row, Postgres performs a DELETE followed by an INSERT. The old version of the row stays on the disk, marked as "dead," while a new version is created.
This isn't just an implementation detail; it’s the core of how the database functions. To see this in action, you have to look at the hidden columns that Postgres tracks for every single row.
-- Create a dummy table to experiment
CREATE TABLE deep_dive (
id SERIAL PRIMARY KEY,
val TEXT
);
INSERT INTO deep_dive (val) VALUES ('Initial State');
-- Look at the hidden columns: xmin and xmax
SELECT xmin, xmax, ctid, val FROM deep_dive;
When you run that query, you'll see the hidden columns alongside your data:
- xmin: The ID of the transaction that inserted this row version.
- xmax: The ID of the transaction that deleted or updated this row version. If it's 0, the row is currently "live."
- ctid: The physical location of the row version on the disk (page number and slot).
Tracking the Transaction Log (WAL)
Before any of these changes hit the actual table data (the "Heap"), they are recorded in the Write-Ahead Log (WAL). If the power goes out, Postgres replays the WAL to restore the state.
I spent a good chunk of my 48-hour deep dive digging through the pg_wal directory. You can’t read these files with a standard text editor—they are binary segments. But you can peek at what’s happening using the pg_waldump tool.
The WAL is essentially an append-only stream of "This happened." It's the ultimate source of truth. When you commit a transaction, Postgres ensures the WAL record is flushed to disk. It doesn't necessarily flush the actual data table to disk immediately—that happens later during a "checkpoint." This is why your database can be so fast: sequential writes to the WAL are much cheaper than random writes to the data files.
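You can see this stream for yourself with pg_waldump. A rough sketch, assuming a local cluster and a PGDATA environment variable pointing at the data directory (the segment name shown is illustrative; use whatever pg_walfile_name returns on your system):

```shell
# Ask Postgres which WAL segment it is currently writing to
psql -c "SELECT pg_walfile_name(pg_current_wal_lsn());"

# Dump the first few records of that segment (substitute your filename)
pg_waldump "$PGDATA/pg_wal/000000010000000000000003" | head
```

Each line of output is one WAL record: the resource manager (Heap, Btree, Transaction), the LSN, and a description like "INSERT" or "COMMIT." Watching this while you run an UPDATE makes the "append-only stream" idea concrete.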
Visualizing MVCC: A Tale of Two Sessions
The best way to understand how Postgres manages these "ghost" versions is to open two terminal windows and watch the xmin and xmax values shift.
Session A: Start a Transaction
BEGIN;
SELECT txid_current(); -- Let's say this returns 1001
UPDATE deep_dive SET val = 'Updated in A' WHERE id = 1;
-- Look at the row inside the transaction
SELECT xmin, xmax, ctid, val FROM deep_dive;
Inside Session A, you'll see the xmin is now 1001 and the ctid has likely changed (e.g., from (0,1) to (0,2)).
Session B: The Outside Perspective
-- While Session A is still open
SELECT xmin, xmax, ctid, val FROM deep_dive;
Session B still sees the "Initial State." It sees the version where xmin was the original transaction ID and xmax is now 1001. Because transaction 1001 hasn't committed yet, Session B knows it should ignore any changes made by it. This is the essence of "Read Committed" isolation—the default in Postgres.
The Performance Cliff: Table Bloat
This versioning system is brilliant, but it has a massive tax: Bloat. Every time you update a row, you are consuming more disk space. If you update a row 1,000 times, you have 999 "dead" tuples taking up space until a process called VACUUM comes along to clean them up.
Here is the kicker: VACUUM cannot clean up dead rows if a transaction is still open that might need to see them.
I once worked on a system where a developer left a BEGIN open in a staging environment for three days. The production-sized database swelled from 200GB to 1.2TB. Why? Because the VACUUM process saw that old transaction and said, "I can't delete these dead rows because that old transaction might still need to see the version of the world as it existed three days ago."
To check if you have "ghost" rows clogging your pipes, you can use the pg_stat_all_tables view:
SELECT
relname AS table_name,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
last_vacuum,
last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'public';
If n_dead_tup is significantly higher than n_live_tup, your database is basically a graveyard. You're paying for the IO to skip over those dead rows every time you run a SELECT *.
Mastering Isolation Levels
Understanding the transaction log and MVCC changes how you think about isolation levels. Most people stick to the default, but sometimes you need to be stricter—or you need to understand why your reports are inconsistent.
1. Read Committed (Default)
Each query in a transaction sees a snapshot of the data as it existed when the *query* started. If someone commits a change halfway through your transaction, your next query will see that change.
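You can watch the snapshot move mid-transaction with the same two-session setup as before (a sketch against the deep_dive table from earlier; Session B's statement is shown as a comment):

```sql
-- Session A (Read Committed, the default)
BEGIN;
SELECT val FROM deep_dive WHERE id = 1;  -- sees the currently committed value

-- Meanwhile, in Session B:
--   UPDATE deep_dive SET val = 'Changed in B' WHERE id = 1;
--   COMMIT;

-- Back in Session A: still the SAME open transaction, but this
-- query takes a fresh snapshot and now sees B's committed change.
SELECT val FROM deep_dive WHERE id = 1;
COMMIT;
```

Run the same experiment under Repeatable Read and the second SELECT keeps returning the original value—that's the difference between the two levels in one screenful.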
2. Repeatable Read
This is where MVCC really shines. A transaction sees a snapshot of the data as it existed when the *transaction* started. No matter what other people commit, your view of the data remains static.
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Everything you do here sees the same "ghost" versions
-- until you COMMIT.
3. Serializable
The nuclear option. It guarantees that the result is the same as if all transactions ran one after another, sequentially. Postgres does this by monitoring "predicate locks"—basically keeping track of what you read so it can fail the transaction if someone else modifies that range of data.
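Opting in is one line, but your application has to be ready for the failure mode: Postgres aborts one of the conflicting transactions with SQLSTATE 40001 (serialization_failure), and the client is expected to retry. A minimal sketch (the ledger table is hypothetical):

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Read a range; SSI tracks this as a predicate
SELECT sum(amount) FROM ledger WHERE account = 42;

-- Write based on what we read
INSERT INTO ledger (account, amount) VALUES (42, -100);

-- If a concurrent serializable transaction invalidated our read,
-- this COMMIT fails with SQLSTATE 40001 and the app should retry.
COMMIT;
```

The retry loop lives in application code; any serializable workload without one will surface as mysterious intermittent errors under load.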
The "HOT" Optimization (Heap Only Tuples)
Postgres engineers aren't fans of bloat either. They introduced an optimization called HOT (Heap Only Tuples). Normally, an UPDATE requires updating the table data *and* every index pointing to that row. That’s a lot of write overhead.
If you update a column that isn't indexed, and there is enough space on the same "page" (an 8KB block of memory/disk), Postgres can create the new version without updating the index. The old version just points to the new version internally.
To make this work, you need to leave free space on each page, which is what the table's fillfactor setting controls.
-- Leave 10% of every page empty for HOT updates
ALTER TABLE deep_dive SET (fillfactor = 90);
VACUUM FULL deep_dive; -- Warning: This locks the table!
By lowering the fillfactor, you give Postgres room to store these "ghost" versions on the same page, drastically reducing the pressure on your transaction log and indexes.
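To check whether your updates are actually taking the HOT path, pg_stat_user_tables tracks both counters. A quick sketch:

```sql
-- Compare total updates vs. updates that took the HOT path
SELECT relname,
       n_tup_upd,      -- all row updates
       n_tup_hot_upd   -- updates that skipped the index writes
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
```

If n_tup_hot_upd is a small fraction of n_tup_upd on a hot table, either the updated columns are indexed or the pages are too full for HOT to kick in—both fixable once you know to look.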
Visualizing the Raw Pages
If you really want to see the "ghosts," you can use the pageinspect extension. This is where the 48-hour deep dive got weird. You can actually look at the raw bits of an 8KB page.
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Look at the first page (0) of our table
SELECT * FROM heap_page_items(get_raw_page('deep_dive', 0));
This returns a list of every tuple (row version) on that page. You'll see t_xmin, t_xmax, and t_ctid. You can literally see the history of your data sitting there. You'll see tuples where t_infomask tells you they are "dead" but still occupying space.
When Things Go Wrong: Long-Running Transactions
The biggest takeaway from my dive into the WAL and MVCC is a fear of long-running transactions. It’s not just about locking; it’s about preventing the database from cleaning itself.
Common culprits for "Database Constipation" (as I've started calling it):
1. Analytical queries on the primary: Running a 2-hour report on your production DB.
2. Forgotten `BEGIN` blocks: An application error that catches an exception but fails to call ROLLBACK.
3. Uncommitted Prepared Transactions: Using two-phase commits (PREPARE TRANSACTION) and leaving them hanging.
4. Idle Replication Slots: If you have a replication slot for a standby that goes down, Postgres will keep every WAL file since the standby was last seen. Your disk will fill up until the DB crashes.
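For the replication-slot case specifically, you can measure how much WAL each slot is pinning. A sketch using the built-in pg_replication_slots view and WAL functions:

```sql
-- How much WAL is each slot forcing Postgres to retain?
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with a large retained_wal is the disk-filler described above; dropping the slot (if the standby is truly gone) lets Postgres recycle those segments.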
To find these culprits:
SELECT
pid,
now() - xact_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start IS NOT NULL
ORDER BY duration DESC;
Practical Advice for the Weary Dev
After spending two days looking at hex dumps and transaction IDs, here is how I've changed my approach to Postgres:
1. Keep transactions short. Like, sub-second short. If you need to do a long-running batch update, do it in chunks and COMMIT between them.
2. Monitor bloat, not just disk space. A table might only have 10GB of data but take up 100GB of space. Use the pgstattuple extension for a precise measure of "dead" space.
3. Be careful with `UPDATE`. If you are updating a row frequently (e.g., a last_seen_at timestamp), consider moving that column to a separate table or using a cache. Frequent updates are a primary driver of table bloat.
4. Tune Autovacuum. Don't turn it off. If it's too slow, make it more aggressive. If it's hitting the IO too hard, give it more workers. But never, ever let it fall behind.
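The chunked-update advice in point 1 can be sketched as a small script you run in a loop (table and column names here are hypothetical):

```sql
-- One chunk: touch at most 10,000 rows, then commit
BEGIN;
UPDATE big_table
SET processed = true
WHERE id IN (
  SELECT id FROM big_table
  WHERE processed = false
  LIMIT 10000
);
COMMIT;
-- Re-run until UPDATE reports 0 rows. Between commits, the old
-- snapshot is released, so autovacuum can reclaim the dead tuples
-- each chunk just created instead of waiting for one giant COMMIT.
```

One long transaction updating 10 million rows pins 10 million dead tuples until it ends; a thousand short ones let VACUUM keep pace the whole way through.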
Why MVCC Wins
Despite the complexity of the transaction log and the constant battle against bloat, MVCC is the reason Postgres is so robust. It allows "Readers to not block Writers, and Writers to not block Readers." In a modern web app, that's everything.
You can have a heavy reporting query running against your orders table while thousands of customers are simultaneously placing new orders. Neither affects the other's performance or consistency. That’s the magic of the "ghost" rows. You are essentially seeing a snapshot of time that is uniquely yours, while the rest of the world moves on.
Visualizing the transaction log isn't just a party trick for DBAs. It’s the difference between a database that runs smoothly for years and one that hits a "performance cliff" on a random Tuesday because the VACUUM can't keep up with the ghosts of transactions past. Once you see the hidden columns and the WAL stream, you stop treating Postgres as a black box and start treating it as the sophisticated versioning engine it truly is.
Next time you see a query taking 500ms when it should take 5ms, don't just look at the EXPLAIN plan. Look at the n_dead_tup. The ghosts might be thicker than you think.


