
Why Is Your 'Safe' Nested Transaction Still Throttling Your Postgres Throughput?
SAVEPOINTs and nested exception blocks are powerful tools, but exceeding the 64-item subtransaction cache can silently turn your database into a bottleneck.
It’s one of those things you don’t think about until you’re staring at a CPU graph that looks like a mountain range, wondering why a "safe" code change just tanked your production throughput. Everything looked fine in staging, the logic is sound, and you’re using nested transactions specifically to *prevent* data corruption.
But Postgres has a secret, and it’s buried in how it handles subtransactions.
The "Safe" Pattern That Bites
We’ve all been there. You have a batch process, and you don't want the entire thing to fail just because one row is a bit wonky. You wrap each iteration in a "subtransaction." In Python’s Django, it looks like atomic() blocks; in Ruby on Rails, it’s transaction(requires_new: true); in raw SQL, it’s the humble SAVEPOINT.
```python
# The "Safe" Loop of Doom
from django.db import DataError, transaction

for record in massive_dataset:
    try:
        with transaction.atomic():  # a SAVEPOINT, assuming an outer transaction is open
            process_and_save(record)
    except DataError:
        log_error(record)
        continue
```

On the surface, this is great. If record #402 fails, the work for records #1 through #401 survives. But under the hood, you’re creating a chain of subtransactions. Postgres is remarkably good at many things, but it has a very specific, very hard limit on how many of these it can track efficiently.
The Magic Number: 64
Postgres keeps a small, fast cache in shared memory for each backend process to track subtransactions. This cache has exactly 64 slots.
As long as your transaction has 64 or fewer subtransactions, life is grand. When any backend needs to resolve one of your subtransaction IDs — say, to decide whether a row version is visible — it can answer straight from that in-memory cache. It’s nearly instantaneous.
The moment you hit number 65? You fall off a performance cliff.
When the 64-item cache overflows, Postgres marks the top-level transaction as "suboverflowed." This forces the database to stop relying on the fast in-memory cache and start hitting the pg_subtrans SLRU ("simple least-recently-used") buffers, which can spill to disk.
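To make the mechanics concrete, here is a toy Python model of that per-backend cache. This is purely illustrative — the real structure is a fixed C array in shared memory (the 64 comes from the `PGPROC_MAX_CACHED_SUBXIDS` constant in the Postgres source), and `BackendSubxidCache` is a name invented for this sketch:

```python
# Toy model of the per-backend subtransaction cache -- NOT real Postgres code.
PGPROC_MAX_CACHED_SUBXIDS = 64  # the actual constant name in the Postgres source


class BackendSubxidCache:
    def __init__(self):
        self.subxids = []        # the fast in-memory slots
        self.overflowed = False  # "suboverflowed": forces pg_subtrans lookups

    def register_subxact(self, xid):
        if len(self.subxids) < PGPROC_MAX_CACHED_SUBXIDS:
            self.subxids.append(xid)
        else:
            # Slot 65: the flag flips, and it stays set for the rest
            # of the top-level transaction.
            self.overflowed = True


cache = BackendSubxidCache()
for xid in range(1, 66):  # 65 subtransactions
    cache.register_subxact(xid)
print(cache.overflowed)  # True
```

Note that the flag is one-way: once a transaction overflows, it stays suboverflowed until it commits or rolls back, no matter how many savepoints you release afterwards.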
Here is what that looks like in raw SQL if you were doing it manually:
```sql
BEGIN; -- top-level transaction
SAVEPOINT s1;
-- ... do some work
SAVEPOINT s2;
-- ... do more work
-- Imagine 62 more of these...
SAVEPOINT s65; -- BOOM. You just overflowed the cache.
COMMIT;
```

Why This Is "Contagious"
The real kicker isn't just that the overflowing transaction gets slower. It’s that every other backend process in your database starts to suffer.
Postgres uses MVCC (Multi-Version Concurrency Control) to determine which rows a user can see. When any transaction overflows its subtransaction cache, other processes trying to determine visibility can no longer simply check if a transaction ID is "active." They have to check if that ID is a subtransaction of an overflowed transaction, which involves scanning the SLRU.
Suddenly, a single worker process running a "safe" loop is causing lock contention and I/O spikes for your entire API. You’ll see SubtransSLRU or SubtransControlLock wait events start to climb in your monitoring tools.
How to Spot the Overflow
If you suspect this is happening, you can query pg_stat_slru (available since Postgres 13) to see if there’s a spike in lookups or if you’re hitting the disk for subtransactions.

```sql
SELECT name, blks_read, blks_hit
FROM pg_stat_slru
WHERE name = 'Subtrans'; -- named 'subtransaction' on Postgres 17+
```

If blks_read is climbing rapidly during your batch jobs, you’ve got an overflow problem. You can also peek at pg_stat_activity to see if your sessions are stuck on SubtransSLRU.
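A query along these lines against pg_stat_activity will show which sessions are waiting on it right now (the wait event is named SubtransSLRU on Postgres 13 and later):

```sql
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'SubtransSLRU'; -- 'SubtransControlLock' on Postgres 12 and older
```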
Better Patterns to Keep Things Fast
If you find yourself hitting this limit, you need to change your strategy. Here are a few ways I’ve handled this in the past without sacrificing data integrity.
1. The "Batch and Reset" Strategy
Don't open one massive transaction with 10,000 savepoints. Open a transaction, process 50 items, commit, and start a new one. This keeps your subtransaction count well below the 64-item limit.
```python
# Better: chunking the work
from django.db import DataError, transaction

def process_in_chunks(dataset, chunk_size=50):
    for i in range(0, len(dataset), chunk_size):
        with transaction.atomic():  # top-level transaction for the chunk
            for record in dataset[i:i + chunk_size]:
                try:
                    with transaction.atomic():  # subtransaction (SAVEPOINT)
                        process_and_save(record)
                except DataError:
                    handle_error(record)
```

2. Validation Before Insertion
Often, we use subtransactions because we’re lazy. We let the database catch the error. If you can validate the data in your application logic *before* hitting the database, you can avoid the exception block entirely.
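As a sketch of what that looks like (the validation rule and field name here are invented for illustration — your real checks will mirror whatever constraint the database would have raised), you filter bad rows in application code first, so the insert loop needs no per-row savepoint:

```python
import re

# Hypothetical rule: a crude email shape check, standing in for whatever
# constraint would otherwise trigger the exception block.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def partition_valid(records):
    """Split records into (valid, invalid) before touching the database."""
    valid, invalid = [], []
    for rec in records:
        (valid if EMAIL_RE.match(rec.get("email", "")) else invalid).append(rec)
    return valid, invalid


valid, invalid = partition_valid([
    {"email": "ok@example.com"},
    {"email": "not-an-email"},
])
print(len(valid), len(invalid))  # 1 1
```

The valid batch can then go through a single plain transaction — no try/except, no savepoints — while the invalid rows get logged or queued for review.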
3. Use ON CONFLICT
If your subtransactions are mostly handling unique constraint violations, use INSERT ... ON CONFLICT DO NOTHING (or DO UPDATE). This is an internal Postgres mechanism that doesn't create a subtransaction in the same way an exception block does.
```sql
-- Instead of a SAVEPOINT/INSERT/ROLLBACK loop:
INSERT INTO users (email, name)
VALUES ('test@example.com', 'Test')
ON CONFLICT (email) DO NOTHING;
```

The Exception to the Rule: ORM "Nested" Blocks
Be careful with ORMs. Some, like SQLAlchemy, are very explicit about when a subtransaction is created. Others might sneak them in. In Django, for example, entering a transaction.atomic() block while already inside one creates a savepoint by default (you have to pass savepoint=False to opt out).
If you have a function decorated with @transaction.atomic and you call it inside a loop that is *also* wrapped in an atomic block, you are incrementing that 64-item counter every single time.
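A toy counter (this models the bookkeeping, not Django’s actual internals) makes the arithmetic plain — every nested entry is one more savepoint charged against the same top-level transaction:

```python
from contextlib import contextmanager


class TxnState:
    """Stand-in for an ORM's nested-transaction bookkeeping."""

    def __init__(self):
        self.depth = 0            # current nesting level
        self.savepoints_used = 0  # savepoints in this top-level transaction

    @contextmanager
    def atomic(self):
        if self.depth > 0:
            self.savepoints_used += 1  # nested entry => one SAVEPOINT
        self.depth += 1
        try:
            yield
        finally:
            self.depth -= 1


txn = TxnState()
with txn.atomic():              # the outer transaction
    for _ in range(100):
        with txn.atomic():      # the decorated helper called in a loop
            pass
print(txn.savepoints_used)  # 100, well past the 64-slot cache
```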
Final Thoughts
Subtransactions aren't "evil," but they are an expensive resource. They're like credit cards: incredibly useful in a pinch, but if you start using them for every small purchase without paying off the balance (committing the top-level transaction), the interest (performance overhead) will eventually bankrupt your throughput.
Next time you see a process looping through thousands of items with a try/except block inside, remember the magic number 64. Your database will thank you.


