loke.dev

Why Does Your 'Perfect' Postgres Transaction Still Fail with a Serialization Error?

An in-depth investigation into Serializable Snapshot Isolation (SSI) and why your database rejects valid-looking queries to prevent the invisible phantom of write-skew.

· 8 min read

You’ve written the logic perfectly, the unit tests pass, and the schema is tight, yet your logs are suddenly bleeding 40001 serialization errors. It feels like a betrayal—you chose the SERIALIZABLE isolation level specifically to stop worrying about race conditions, only for the database to start throwing your valid transactions back in your face.

The frustration is real. When you set a transaction to SERIALIZABLE, you’re asking Postgres to guarantee that the outcome is exactly the same as if every transaction ran one after another, in a single line. But Postgres doesn't actually run them one at a time. It runs them concurrently and uses a complex, optimistic mathematical model to "guess" if they might interfere. When it guesses wrong, or when it sees a conflict that violates the laws of serializable physics, it kills your transaction.

To stop these errors from being a mystery, we have to look at what's happening under the hood of Serializable Snapshot Isolation (SSI).

The Illusion of Total Isolation

Most developers are comfortable with READ COMMITTED (the Postgres default). You know that you might see "non-repeatable reads"—if you query a row twice, it might change in between. To fix this, you might move to REPEATABLE READ. This solves the non-repeatable read problem, but it leaves you wide open to a more sinister phenomenon: Write Skew.

Write skew occurs when two transactions read the same data, but modify *different* sets of data in a way that breaks a business rule.

Let's look at a classic "Doctor On-Call" scenario. The rule is simple: at least one doctor must be on call at all times.

The Setup

CREATE TABLE doctors (
    id SERIAL PRIMARY KEY,
    name TEXT,
    on_call BOOLEAN DEFAULT TRUE
);

INSERT INTO doctors (name, on_call) VALUES ('Alice', true), ('Bob', true);

The Conflict

Imagine Alice and Bob both feel sick and want to go off-call. They both start a transaction at the same time.

Transaction 1 (Alice):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Check if anyone else is on call
SELECT count(*) FROM doctors WHERE on_call = true;
-- Result: 2

Transaction 2 (Bob):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Check if anyone else is on call
SELECT count(*) FROM doctors WHERE on_call = true;
-- Result: 2

Transaction 1 (Alice):

UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT;

Transaction 2 (Bob):

UPDATE doctors SET on_call = false WHERE name = 'Bob';
COMMIT;

Under REPEATABLE READ, both transactions commit successfully. Alice saw there were 2 doctors, so she left. Bob saw there were 2 doctors, so he left. Now, 0 doctors are on call. The business rule is broken, and yet the database didn't complain because Alice and Bob updated *different* rows. No row-level lock was ever contested.
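The mechanics of that failure are easy to reproduce outside the database. The sketch below is a toy model in plain Python (illustrative names, not Postgres internals): each "transaction" reads from a private snapshot taken at BEGIN and applies its writes at COMMIT, exactly the behavior that lets write skew slip through.

```python
# Toy model of snapshot isolation: each transaction reads from a private
# snapshot taken at BEGIN, and its writes only become visible at COMMIT.
# Names and structure are illustrative, not Postgres internals.

def begin(committed):
    return dict(committed)          # snapshot = copy of committed state

def commit(committed, writes):
    committed.update(writes)        # apply this transaction's writes

db = {"Alice": True, "Bob": True}   # on_call flags

snap_alice = begin(db)              # Alice's transaction starts
snap_bob = begin(db)                # Bob's starts before Alice commits

# Each transaction validates the business rule against its own snapshot
assert sum(snap_alice.values()) == 2   # Alice sees 2 doctors on call
assert sum(snap_bob.values()) == 2     # so does Bob

commit(db, {"Alice": False})        # Alice goes off call
commit(db, {"Bob": False})          # different row, so no lock conflict

print(sum(db.values()))             # 0 doctors on call: the rule is broken
```

Both snapshots were internally consistent, both writes touched different rows, and the invariant still died, which is precisely why row-level locking alone can't catch this.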

This is the "invisible phantom" that SERIALIZABLE is designed to hunt down.

Why Postgres Kills Your "Perfect" Query

When you switch that example to ISOLATION LEVEL SERIALIZABLE, Postgres tracks not just what you write, but what you read. It uses something called SIREAD locks (predicate locks). These aren't traditional locks that make other processes wait; they are more like "bookmarks" that say: "I relied on this data to make a decision."

If Transaction A relies on a certain state of the data, and Transaction B changes that data, Postgres identifies a dependency.

The specific error you see—ERROR: could not serialize access due to read/write dependencies among transactions—happens when Postgres detects a "cycle" of dependencies. It realizes that if it allows both transactions to finish, there is no possible chronological order (A then B, or B then A) that would result in the current state of the database.

The SSI Logic

Postgres looks for a "dangerous structure" built from read/write dependencies:
1. T1 reads something that T2 then modifies (an "rw-conflict": T1's read is based on data T2 invalidates).
2. T2 reads something that T3 then modifies.
3. Two such edges in a row, combined with the right commit ordering, mean no serial order can explain the outcome, so one transaction must be aborted.

The cycle can involve as few as two transactions (T1 → T2 → T1), and the engine is deliberately conservative: it would rather fail a safe transaction (a false positive) than allow an unsafe one to compromise data integrity.
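Conceptually, the check amounts to asking whether the rw-dependency graph contains a cycle. Here's a simplified sketch (real SSI looks for two consecutive rw-edges rather than running a full graph search, and tracks far more state, but the intuition is the same):

```python
# Simplified sketch: build a directed graph of rw-conflicts and ask whether
# any transaction can reach itself. A cycle means no serial order could
# produce the observed result, so one participant must be aborted.

def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    def revisits(node, seen):
        if node in seen:
            return True                 # came back to a node on this path
        return any(revisits(nxt, seen | {node})
                   for nxt in graph.get(node, []))

    return any(revisits(n, set()) for n in graph)

# Doctor example: each transaction read the predicate the other later wrote
assert has_cycle([("T1", "T2"), ("T2", "T1")])   # unsafe: abort one
assert not has_cycle([("T1", "T2")])             # a single rw-edge is fine
```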

The "Perfect" Query that Fails

Here is where it gets annoying. You can have a transaction that is logically sound, doesn't actually violate any constraints, and has zero chance of causing write skew, but Postgres will still kill it.

Why? Granularity.

Postgres tracks these SIREAD locks in memory. It doesn't always track them at the row level. If it's tracking too many locks, it "promotes" them to page-level locks. If you read a few rows on a page, Postgres might mark the whole *page* as being read. If another transaction modifies a completely unrelated row on that same page, Postgres sees a conflict.

Example: The Over-Aggressive Failure

-- Transaction A
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM orders WHERE status = 'pending'; -- Reads rows on Page 45

-- Transaction B (simultaneous)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE users SET last_login = now() WHERE id = 500; -- Also happens to be on Page 45
COMMIT;

-- Transaction A
UPDATE orders SET status = 'processed' WHERE id = 10;
COMMIT; -- ERROR: could not serialize access!

In this case, Transaction B had nothing to do with orders, but because Postgres escalated its tracking to the page level to save memory, it assumed there was a potential for write skew and aborted Transaction A.
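The promotion logic itself is simple enough to sketch. This toy lock table (illustrative only; the real threshold is the max_pred_locks_per_page setting) shows how collapsing row locks into a page lock manufactures a false-positive conflict:

```python
# Toy predicate-lock table showing granularity promotion: once a transaction
# holds "too many" row-level SIREAD locks on a page, they collapse into one
# page-level lock, and *any* write to that page then looks like a conflict.

MAX_ROW_LOCKS_PER_PAGE = 2  # illustrative; the real knob is max_pred_locks_per_page
PAGE_LOCK = "*"             # sentinel meaning "the whole page is locked"

def record_read(locks, page, row):
    rows = locks.setdefault(page, set())
    if PAGE_LOCK in rows:
        return                          # already promoted, nothing to track
    rows.add(row)
    if len(rows) > MAX_ROW_LOCKS_PER_PAGE:
        locks[page] = {PAGE_LOCK}       # promote to a page-level lock

def write_conflicts(locks, page, row):
    rows = locks.get(page, set())
    return PAGE_LOCK in rows or row in rows

locks = {}
for row in (1, 2, 3):                   # Transaction A reads 3 order rows on page 45
    record_read(locks, 45, row)

# Transaction B updates an unrelated user row that happens to share page 45
print(write_conflicts(locks, 45, 999))  # True: a false-positive conflict
```

Transaction B never touched rows 1-3, but after promotion the lock table can no longer tell the difference.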

You Must Handle Retries

If you are using SERIALIZABLE, you are making a deal with the database: "I will let you handle the complex race condition logic, and in exchange, I promise to retry my transaction if you tell me to."

If your application code doesn't have a retry loop for 40001 errors, you shouldn't be using SERIALIZABLE. Period.

Here is how you might handle this in a Python-based worker using psycopg2 (the same pattern translates directly to psycopg3):

import time
import psycopg2
from psycopg2 import errorcodes

def execute_with_retry(conn, logic_func, max_retries=5):
    retries = 0
    while True:
        try:
            with conn: # commit on success, rollback on exception
                with conn.cursor() as cur:
                    cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;")
                    return logic_func(cur)
        except psycopg2.OperationalError as e:
            if e.pgcode == errorcodes.SERIALIZATION_FAILURE:
                retries += 1
                if retries > max_retries:
                    raise Exception("Max retries reached for serialization failure")
                
                # Exponential backoff to let the conflicting transaction finish
                wait_time = 0.1 * (2 ** retries)
                time.sleep(wait_time)
                continue
            else:
                raise # Re-raise if it's a different error

The 40001 error is not a bug in your code. It is a signal from the database saying: "I couldn't guarantee safety this time, please try again when the dust has settled."

Can You Avoid These Errors Without SERIALIZABLE?

Yes, and often you should. SERIALIZABLE is the most expensive isolation level in terms of performance and mental overhead. You can often achieve the same safety using SELECT ... FOR UPDATE.

In our doctor example, we could have done this:

BEGIN;
-- Lock the rows we are checking so nobody else can modify them
SELECT id FROM doctors WHERE on_call = true FOR UPDATE;
-- (FOR UPDATE can't be combined with aggregates like count(*),
-- so count the locked rows in application code before deciding)
UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT;

This forces Transaction 2 to wait until Transaction 1 is finished. Unlike SERIALIZABLE, which lets everyone proceed and then crashes the losers at the finish line, FOR UPDATE makes people wait in line.
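The contrast is easy to model. In this plain-Python toy (illustrative only; a threading lock stands in for the row locks FOR UPDATE takes), the second transaction blocks on the check-then-write step and re-reads committed state after the first one finishes:

```python
# Toy contrast with the optimistic approach: a pessimistic lock serializes
# the check-then-write step, so the second transaction re-reads committed
# state *after* the first one finishes, and the invariant holds.

import threading

db = {"Alice": True, "Bob": True}       # on_call flags
row_lock = threading.Lock()             # stands in for SELECT ... FOR UPDATE

def go_off_call(name):
    with row_lock:                      # wait here instead of aborting later
        if sum(db.values()) > 1:        # re-check against current state
            db[name] = False
            return True
        return False                    # rule would break: stay on call

assert go_off_call("Alice") is True     # Alice leaves (2 doctors visible)
assert go_off_call("Bob") is False      # Bob now sees only 1 and is refused
assert sum(db.values()) == 1            # the invariant survives
```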

When to use SERIALIZABLE vs FOR UPDATE:

* Use `FOR UPDATE` when you know exactly which rows are the source of the conflict. It’s more performant and doesn't require complex retry logic (though deadlocks are still a possibility).
* Use `SERIALIZABLE` when your business logic depends on the *absence* of data or a complex aggregate across many tables (e.g., "Sum of all account balances must stay positive"). It's hard to lock "the absence of data" manually, but SSI handles it beautifully.

The Memory Trade-off

One reason your "perfect" query fails is that the shared memory Postgres sets aside for predicate locks is fixed at server start. You can tune it with these settings in postgresql.conf:

* max_pred_locks_per_transaction
* max_pred_locks_per_page
* max_pred_locks_per_relation

If your transaction touches a lot of rows, Postgres will run out of slots to track individual rows and start locking whole tables. Once it starts table-locking for SSI, your serialization errors will skyrocket because every transaction is now seen as a potential conflict with every other transaction.

If you see an uptick in 40001 errors during high load, check if your transactions are too large. Keep `SERIALIZABLE` transactions short. The longer a transaction lives, the more likely it is to overlap with a conflicting write.

Practical Advice for the Weary Developer

1. Don't over-use it. If READ COMMITTED with a specific SELECT ... FOR UPDATE works, use that.
2. Short and Sweet. Do your heavy computation *outside* the transaction. Get in, check the data, write the change, and get out.
3. Read-only optimization. If you have a transaction that only reads, mark it as such: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY; (add DEFERRABLE if you can afford to wait for a guaranteed-safe snapshot). Postgres can use a "safe snapshot" optimization that reduces, or with DEFERRABLE eliminates, the chance of the transaction being the victim of a serialization failure.
4. Index Everything. SSI relies on predicate locking. If you do a sequential scan on a table because of a missing index, Postgres will effectively lock the entire table for serialization. An index allows it to lock only the specific B-tree pages or rows you actually touched.
5. Monitor Conflicts. Postgres doesn't expose a dedicated per-database counter for serialization failures, but aborted transactions show up in the xact_rollback column of the pg_stat_database view (alongside deadlocks). If rollbacks spike under load, you likely have a hot spot in your data model; track the specific 40001 errors from your application metrics or the server log.

SELECT datname, xact_rollback, deadlocks FROM pg_stat_database;

Summary

The 40001 error is Postgres doing its job. It’s the database equivalent of a "Check Engine" light that actually saves the car from exploding. While it's tempting to view these failures as a nuisance, they are proof that your data integrity is being protected against race conditions that are nearly impossible to catch in testing.

Understand the write-skew, implement a robust retry loop, and keep your transactions focused. If you do that, SERIALIZABLE becomes a powerful tool rather than a source of production dread.