
Write Skew Is the Invisible Database Killer
Even at high isolation levels, your database might be allowing subtle data inconsistencies that can only be prevented by understanding the underlying mechanics of write skew.
Imagine waking up to a production database where every row is technically valid according to your schema constraints, yet the business logic has completely collapsed. You have three doctors on call when there should be at least one, or a user has managed to withdraw money from an account that was already empty, all because your database allowed two transactions to pass each other like ships in the night.
This isn't a bug in your application code, and it’s not a hardware failure. It is Write Skew, the most deceptive anomaly in the world of relational databases. It’s "invisible" because it doesn't trigger the usual errors like deadlocks or unique constraint violations. It just silently corrupts the integrity of your system while your monitoring tools show everything is green.
The Mirage of "Repeatable Read"
Most developers are taught about the standard isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable. We are told that "Repeatable Read" prevents most problems. If you're using PostgreSQL, you might even think you're safe because its REPEATABLE READ implementation is actually stronger than the SQL standard requires.
But here is the reality: Repeatable Read does not prevent write skew.
Write skew occurs when two concurrent transactions read the same data, but then modify *different* objects in a way that violates a premise that was true when the transactions started. Because they are updating different rows, the row-level locks that databases usually rely on never collide.
Let's look at a concrete example to see how this happens in the wild.
The On-Call Disaster: A Practical Example
Imagine a hospital management system. The rule is simple: At least one doctor must be on call at all times.
Here is our table structure in PostgreSQL:
```sql
CREATE TABLE doctors (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    on_call BOOLEAN NOT NULL DEFAULT TRUE
);

INSERT INTO doctors (name, on_call) VALUES ('Alice', true), ('Bob', true);
```

Both Alice and Bob are currently on call. Both are feeling sick and want to opt out. They hit the "Leave Shift" button at the exact same millisecond.
Our application logic looks like this (pseudo-code):
```python
def leave_shift(doctor_id):
    with db.transaction():
        # 1. Check how many doctors are currently on call
        count = db.query("SELECT count(*) FROM doctors WHERE on_call = true")
        if count >= 2:
            # 2. If at least 2 are on call, it's safe for 1 to leave
            db.execute("UPDATE doctors SET on_call = false WHERE id = %s", doctor_id)
        else:
            raise Exception("Cannot leave shift: at least one doctor must remain on call.")
```

If these two transactions run sequentially, everything works: Alice leaves and the count drops to 1; when Bob tries to leave, the count is 1, and the application throws an error.
But under REPEATABLE READ (Snapshot Isolation), here is the execution flow:
1. Transaction A (Alice) starts. It reads the snapshot. count is 2.
2. Transaction B (Bob) starts. It reads the *same* snapshot. count is 2.
3. Transaction A executes the UPDATE for Alice. It succeeds.
4. Transaction B executes the UPDATE for Bob. Since Bob is updating a *different* row than Alice, there is no lock contention. It succeeds.
5. Both transactions commit.
Now, the doctors table has zero people on call. The business invariant is broken, and no database error was ever thrown.
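The race above can be reproduced without a database at all. In the sketch below, a plain Python dict stands in for the `doctors` table and a deep copy stands in for an MVCC snapshot taken at `BEGIN` time; this is a simulation of the anomaly, not real PostgreSQL behavior:

```python
# Simulate two snapshot-isolated transactions racing on the on-call check.
# Each transaction reads from a private snapshot taken when it began,
# so neither sees the other's write before deciding.

doctors = {1: {"name": "Alice", "on_call": True},
           2: {"name": "Bob", "on_call": True}}

def begin_snapshot(db):
    """Copy the current state, like an MVCC snapshot at BEGIN time."""
    return {k: dict(v) for k, v in db.items()}

def leave_shift(db, snapshot, doctor_id):
    """Check the invariant against the snapshot, then write to the live table."""
    on_call = sum(1 for d in snapshot.values() if d["on_call"])
    if on_call >= 2:
        db[doctor_id]["on_call"] = False  # different rows -> no lock conflict
        return True
    return False

snap_a = begin_snapshot(doctors)  # Alice's transaction begins
snap_b = begin_snapshot(doctors)  # Bob's transaction begins concurrently

leave_shift(doctors, snap_a, 1)   # Alice: her snapshot says count is 2 -> allowed
leave_shift(doctors, snap_b, 2)   # Bob: his snapshot ALSO says 2 -> allowed

print(sum(1 for d in doctors.values() if d["on_call"]))  # 0 doctors on call
```

Both writes succeed, both "commits" go through, and the invariant is silently gone.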
Why Databases Allow This
You might be asking, "Why doesn't the database just stop this?"
The answer is performance. To prevent write skew, the database has to track not just the rows you *changed*, but the rows you *read* to make the decision to change them.
In the example above, the database would need to know that Transaction A’s write was predicated on the result of a SELECT COUNT(*) query. If the data underlying that query changes before Transaction A commits, Transaction A must be aborted.
Tracking these "predicate locks" is computationally expensive. Many databases (PostgreSQL, Oracle, SQL Server) default to READ COMMITTED because it's fast. Even when you level up to REPEATABLE READ, the database only guarantees that a row you read twice will look the same within your transaction. It does *not* guarantee that the conditions you used to authorize a write will still be true at commit time.
Visualizing the Conflict
If we were to look at the timeline of the "Invisible Killer," it looks like this:
| Time | Transaction 1 (Alice) | Transaction 2 (Bob) |
| :--- | :--- | :--- |
| T1 | BEGIN ISOLATION LEVEL REPEATABLE READ; | |
| T2 | | BEGIN ISOLATION LEVEL REPEATABLE READ; |
| T3 | SELECT count(*) -> 2 | |
| T4 | | SELECT count(*) -> 2 |
| T5 | UPDATE ... WHERE id = 1 (Alice) | |
| T6 | | UPDATE ... WHERE id = 2 (Bob) |
| T7 | COMMIT; | |
| T8 | | COMMIT; |
At T8, the integrity is gone. The database didn't see a conflict because Alice and Bob touched different rows.
Strategy 1: The Heavy Hammer (Serializable Isolation)
The most direct way to kill write skew is to use SERIALIZABLE isolation. In PostgreSQL, this uses Serializable Snapshot Isolation (SSI). It tracks when a transaction’s results might have been affected by a concurrent write.
```sql
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... run logic ...
COMMIT;
```

If you ran the Alice/Bob scenario under SERIALIZABLE, the second transaction to commit would receive a serialization error:

```
ERROR: could not serialize access due to read/write dependencies among transactions
```
The Gotcha: You cannot just flip a switch and make everything serializable. SSI requires the application to be prepared to retry transactions. If you get a 40001 error (serialization_failure), your app must catch it and replay the entire logic. This adds latency and complexity to your service layer.
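The retry requirement usually lands in a small wrapper in the service layer. Here is a minimal sketch; `SerializationFailure` is a stand-in for whatever your driver raises on SQLSTATE 40001 (psycopg, for instance, has its own exception class for it), so adapt the `except` clause to your client library:

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver's SQLSTATE 40001 (serialization_failure) error."""

def with_retries(txn_fn, attempts=5, base_delay=0.05):
    """Run a transaction function, replaying it on serialization failures.

    The WHOLE body must be replayed, not just the final statement: the
    snapshot the original decision was based on no longer exists.
    """
    for attempt in range(attempts):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            # Jittered backoff so two retrying transactions don't collide again.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Usage sketch: a transaction that fails twice with 40001, then succeeds.
calls = {"n": 0}
def flaky_transaction():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(with_retries(flaky_transaction))  # committed
```

The jittered backoff matters: if both retrying transactions replay at the same instant, they can keep aborting each other indefinitely.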
Strategy 2: Explicit Locking (SELECT FOR UPDATE)
If you don't want the overhead of SERIALIZABLE, you can use pessimistic locking. By using SELECT ... FOR UPDATE, you tell the database: "I am reading these rows, and I intend to change them or the logic surrounding them. Lock them so no one else can touch them."
However, in our doctor example, SELECT COUNT(*) doesn't lock specific rows in a way that helps us, because the "on_call" status is what's changing.
A better approach is to lock the specific rows you are evaluating:
```sql
BEGIN;
-- Lock all doctors currently on call
SELECT * FROM doctors WHERE on_call = true FOR UPDATE;
-- Now perform the check in application logic
-- ...
UPDATE doctors SET on_call = false WHERE id = 1;
COMMIT;
```

Now, when Alice's transaction hits the SELECT ... FOR UPDATE, it locks all rows where on_call is true. When Bob's transaction tries to do the same, it will block until Alice commits. Once Alice commits, Bob's SELECT will execute again, see only one doctor left, and his application logic will correctly deny the request.
The Gotcha: This can lead to massive bottlenecks. If you have 1,000 doctors on call, you are locking 1,000 rows just to let one person leave their shift.
Strategy 3: Materializing Conflicts
Sometimes, there are no rows to lock. Imagine a meeting room booking system. You want to ensure no two people book Room A at 10:00 AM. If the room isn't booked yet, there is no "Booking" row to lock.
In this case, you can "materialize" the conflict. You create a dummy table specifically for locking.
```sql
CREATE TABLE room_locks (
    room_id INT,
    booking_hour INT,
    PRIMARY KEY (room_id, booking_hour)
);
```

To book a room, you must first lock the corresponding row in room_locks, inserting it if it doesn't exist. By forcing all transactions to touch the same row in the room_locks table, you turn a potential write skew into standard lock contention or a unique constraint violation.
Strategy 4: The Database Constraint (The Gold Standard)
Whenever possible, push the logic into the database schema itself. If you can express your business rule as a UNIQUE index or a CHECK constraint, write skew becomes impossible.
For example, if the requirement was "A doctor can only be on call for one ward at a time," a UNIQUE constraint on doctor_id in an on-call assignments table solves it perfectly. (Note that UNIQUE on the pair (doctor_id, ward_id) would only prevent duplicate assignments to the *same* ward, not a second ward.)
But for our "at least one doctor on call" rule, standard SQL constraints are usually not enough, because they only evaluate the row being written, not the state of the whole table. (The SQL standard does define CREATE ASSERTION for table-wide rules, but almost no database implements it.)
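The one-ward-at-a-time rule can be sketched with stdlib sqlite3; the `on_call_assignments` table is a hypothetical schema invented for this illustration, and the same UNIQUE index works identically in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical assignments table: the UNIQUE index encodes the rule directly,
# so there is no read-then-write check for write skew to sneak through.
conn.execute("""
    CREATE TABLE on_call_assignments (
        doctor_id INT NOT NULL UNIQUE,  -- one ward per doctor, enforced by the DB
        ward_id   INT NOT NULL
    )
""")

def assign(doctor_id, ward_id):
    try:
        with conn:
            conn.execute(
                "INSERT INTO on_call_assignments (doctor_id, ward_id) VALUES (?, ?)",
                (doctor_id, ward_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the doctor already has a ward

print(assign(1, 100))  # True  -- Alice takes ward 100
print(assign(1, 200))  # False -- rejected: Alice already has a ward
```

No isolation-level tuning is needed here: the constraint is checked at write time, under any isolation level, for every concurrent client.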
Dealing with the "Double Spend" in Fintech
Write skew is most dangerous in financial applications. Consider a user with two accounts: Savings ($100) and Checking ($100). The bank allows a "Total Credit" limit where the sum of both accounts must stay above $0.
1. Tx A reads Savings ($100) and Checking ($100). Total = $200.
2. Tx B reads Savings ($100) and Checking ($100). Total = $200.
3. Tx A withdraws $150 from Savings. (New total: $50). Valid.
4. Tx B withdraws $150 from Checking. (New total: $50). Valid.
5. Both Commit.
The user now has -$50 in Savings and -$50 in Checking. They have withdrawn $300 from a $200 limit.
In this scenario, FOR UPDATE is your best friend. But you must lock the parent record (e.g., the User or AccountGroup record).
```sql
-- Lock the user's account group to prevent concurrent balance shifts
SELECT id FROM account_groups WHERE user_id = 123 FOR UPDATE;
-- Perform balance checks and updates
-- ...
```

By locking the parent row, you serialize all financial activity for that specific user without locking the entire global accounts table.
Choosing Your Weapon
How do you decide which approach to take? I usually follow this hierarchy:
1. Schema Constraints: Can I make this a UNIQUE constraint or an EXCLUDE constraint? (If yes, do it).
2. Serializable Isolation: Is my write volume low enough that retrying failed transactions is acceptable? (If yes, this is the safest and cleanest code).
3. Pessimistic Locking (`FOR UPDATE`): Do I have a specific "anchor" row I can lock to serialize this specific operation? (This is usually the sweet spot for performance).
4. Optimistic Concurrency Control (OCC): Can I add a version column to my rows?
   ```sql
   UPDATE doctors SET on_call = false, version = version + 1 WHERE id = 1 AND version = 5;
   ```
   If the update returns 0 rows affected, someone else changed the data; re-read and retry.
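The OCC pattern can be sketched end-to-end with stdlib sqlite3; `rowcount` tells you whether the guarded update won. Note that per-row versioning only detects conflicts on that row, so to guard a cross-row invariant like the doctors rule, you would version a shared anchor row, as in the account_groups example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE doctors (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        on_call BOOLEAN NOT NULL DEFAULT 1,
        version INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("INSERT INTO doctors (id, name) VALUES (1, 'Alice')")

def set_off_call(doctor_id, expected_version):
    """Guarded update: only wins if nobody bumped the version since we read it."""
    with conn:
        cur = conn.execute(
            "UPDATE doctors SET on_call = 0, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (doctor_id, expected_version),
        )
    return cur.rowcount == 1  # 0 rows -> a concurrent writer got there first

print(set_off_call(1, 0))  # True  -- our version matched; version is now 1
print(set_off_call(1, 0))  # False -- stale version: a retry must re-read first
```

On a `False` result the application re-reads the row (picking up the new version), re-checks its precondition, and tries again, which is the same replay discipline SERIALIZABLE demands, just implemented by hand.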
The Danger of "Hidden" Write Skew
The most insidious part of write skew is that it often passes through testing. In a local environment or a staging server with a single user, you will never see it. It only appears under high concurrency—exactly when your business is finally succeeding and the stakes are highest.
If you are writing code that checks a condition (IF count > X) and then performs an action based on that condition, you are vulnerable.
Don't trust that your isolation level has your back. Most ORMs (like Sequelize, Hibernate, or Entity Framework) simply use the database's default isolation level, which is typically READ COMMITTED. Even if you've bumped it to REPEATABLE READ, you are still open to the "Invisible Killer."
Summary: A Checklist for the Paranoid
Before you ship that next migration or feature, ask yourself:
* Does my transaction's success depend on data it read but didn't modify?
* Am I using REPEATABLE READ or READ COMMITTED? (If so, write skew is possible).
* Can two users perform this action simultaneously and break a rule?
* Do I have a retry mechanism if I switch to SERIALIZABLE?
Database consistency is a lie we tell ourselves to sleep better at night. The reality is a messy web of snapshots and locks. Understanding write skew is the difference between building a system that works on your machine and building a system that survives the real world.
Stop letting your database be a silent accomplice to data corruption. Lock your rows, use the right isolation levels, and always assume that if something *can* happen concurrently, it *will*.


