
Why Does Your 'Exactly-Once' Delivery Still Result in Duplicate Side Effects?
A deep dive into why transactional producers and idempotency keys aren't enough to stop the 'ghost' executions that haunt distributed event-driven architectures.
If you’ve configured your message broker for "exactly-once" delivery, why did your customer just receive three "Order Shipped" emails for a single purchase?
It’s a frustrating realization many developers hit after spending weeks tuning Kafka configurations or setting up idempotent producers. You followed the documentation. You set `enable.idempotence=true`. You’ve wrapped your processing in transactions. Yet, the "ghost" side effects—the duplicate emails, the double-charged credit cards, the redundant Slack notifications—continue to haunt your logs.
The uncomfortable truth is that "exactly-once delivery" is a bit of a marketing misnomer. In a distributed system, exactly-once *processing* of a side effect is physically impossible without a global transaction that spans every piece of infrastructure involved. Since we rarely have that (and usually don't want the performance hit), we have to build for it ourselves.
The Scope of the Lie
When a technology like Kafka or Flink promises "exactly-once," it isn't lying, but it is speaking a very specific dialect of truth: it is talking about internal state consistency.
If a Kafka Streams application reads a message, increments a counter in its state store, and produces a result to another topic, the exactly-once guarantee ensures that if the system crashes, the counter won't be incremented twice and the output won't be duplicated in the final topic.
But your database? Your third-party API? Your legacy SOAP service? They aren't part of that transactional bubble.
Consider this common (and broken) logic:

```python
# A typical consumer loop that feels safe, but isn't.
def process_order(message):
    order_data = parse(message.value)

    # 1. Execute the side effect (The Trap)
    payment_gateway.charge_card(
        amount=order_data.total,
        card_token=order_data.token
    )

    # 2. Update the local database
    db.execute(
        "UPDATE orders SET status = 'PAID' WHERE id = %s",
        (order_data.id,)
    )

    # 3. Commit the offset to the broker
    consumer.commit()
```

If the service crashes at step 3, the message is re-delivered. The payment gateway is hit again. If it crashes at step 2, the payment is taken, but the database says it isn't. This is the Side Effect Gap.
The Ghost in the Network
Network failures are rarely clean. Most developers code for the "Server Not Found" error, but the real nightmare is the "Success but Timed Out" error.
You send a request to a billing API. The API processes the payment successfully, but the network connection drops before it can send you the HTTP 200 OK. Your code catches a timeout exception. Naturally, your retry logic kicks in, or the message broker re-delivers the message. You call the API again.
Without a way to tell the API "this is the same request as five seconds ago," you’ve just double-charged the user. No amount of Kafka configuration fixes this because Kafka doesn't know what happened inside your payment_gateway.charge_card() function.
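To make the failure mode concrete, here is a minimal, self-contained simulation of the "success but timed out" trap. `FlakyGateway` and every name in it are hypothetical stand-ins for a real payment API; the point is only the difference between retrying with a fresh key and retrying with a stable one.

```python
import uuid

class FlakyGateway:
    """Simulates a payment API whose response can be lost in transit.
    All names here are illustrative, not a real SDK."""
    def __init__(self):
        self.charges = {}  # idempotency_key -> amount

    def charge(self, key, amount, drop_response=False):
        # The charge always lands server-side...
        if key not in self.charges:
            self.charges[key] = amount
        # ...but the client may never see the 200 OK.
        if drop_response:
            raise TimeoutError("response lost in transit")
        return "ok"

# Naive retry: a fresh key per attempt means the retry is a brand-new charge.
gateway = FlakyGateway()
for attempt in range(2):
    try:
        gateway.charge(key=str(uuid.uuid4()), amount=100,
                       drop_response=(attempt == 0))
        break
    except TimeoutError:
        continue
print(len(gateway.charges))  # 2 charges for a single purchase

# Stable key: the retry collapses into the original charge.
gateway = FlakyGateway()
for attempt in range(2):
    try:
        gateway.charge(key="order-42:payment", amount=100,
                       drop_response=(attempt == 0))
        break
    except TimeoutError:
        continue
print(len(gateway.charges))  # 1 charge
```

The only difference between the two loops is the key; the stable one is what the next section formalizes.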
Strategy 1: The Idempotency Key (The Gold Standard)
The most effective way to kill ghosts is to make every side effect idempotent. An operation is idempotent if it can be performed multiple times without changing the result beyond the initial application.
In distributed systems, this usually requires an Idempotency Key. This is a unique identifier generated by the producer (or the source of truth) that travels with the event.
Here is how you actually implement it in a consumer:
```python
import redis
import stripe  # assumed: the Stripe Python SDK

cache = redis.Redis(host='localhost', port=6379, db=0)

def handle_event(event):
    # The key must be unique to the specific intent, not the execution.
    # Good: order_id + "payment"
    # Bad: uuid.uuid4() (this changes on retry!)
    idempotency_key = f"payment:{event['order_id']}"

    # Atomic SET NX (Set if Not Exists).
    # We set a TTL so we don't clog Redis forever.
    is_new_request = cache.set(idempotency_key, "processing", nx=True, ex=3600)

    if not is_new_request:
        status = cache.get(idempotency_key)
        print(f"Duplicate request detected. Status: {status}")
        return

    try:
        # Perform the side effect
        result = stripe.Charge.create(
            amount=2000,
            currency="usd",
            source="tok_visa",
            idempotency_key=idempotency_key  # Passing it forward!
        )
        # Update the status in our 'lock'
        cache.set(idempotency_key, "completed", ex=3600)
    except Exception:
        # If it's a transient error, delete the key so we can retry
        cache.delete(idempotency_key)
        raise
```

The Gotcha: Notice that I passed the idempotency_key *into* the Stripe API call. Many modern APIs (Stripe, AWS, Adyen) support this. If they do, use it. If they don't, you have to manage the state yourself, which brings us to the next problem.
Strategy 2: The Transactional Outbox Pattern
A common mistake is trying to talk to a database and a message broker in the same block of code without a shared transaction.
I’ve seen this countless times:
1. Start DB Transaction.
2. Update User Record.
3. Send Message to Kafka.
4. Commit DB Transaction.
If the DB commit fails at step 4 (maybe a constraint violation), the message has already been sent to Kafka. Other services will start acting on data that "doesn't exist" in the source database.
The Transactional Outbox Pattern solves this by using the database as a temporary message queue.
```sql
-- Instead of sending to Kafka directly, we write to an outbox table
-- within the same transaction as our business logic.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 'user_123';

INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES (
    gen_random_uuid(),
    'Account',
    'user_123',
    'MoneyWithdrawn',
    '{"amount": 100, "currency": "USD"}'
);

COMMIT;
```

A separate process (a Relay) polls the outbox table or watches the WAL (Write-Ahead Log) and pushes the messages to Kafka. Since the write to the outbox and the update to the accounts table are in the same atomic transaction, it's impossible for one to happen without the other.
Once the Relay ensures the message is in Kafka, it deletes the outbox entry or marks it as processed.
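The relay itself can be a small loop. Here is a minimal runnable sketch using sqlite3 and a plain list as stand-ins for the real database and the Kafka producer; the schema mirrors the outbox table above, but everything else is illustrative (a production relay would use something like Postgres with FOR UPDATE SKIP LOCKED so multiple relay instances can run safely).

```python
import json
import sqlite3

# sqlite stands in for the real database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
             "aggregate_id TEXT, type TEXT, payload TEXT)")
conn.execute("INSERT INTO outbox (aggregate_id, type, payload) VALUES "
             "('user_123', 'MoneyWithdrawn', '{\"amount\": 100}')")
conn.commit()

published = []  # stand-in for a Kafka producer

def relay_once(batch_size=100):
    rows = conn.execute(
        "SELECT id, aggregate_id, type, payload FROM outbox "
        "ORDER BY id LIMIT ?", (batch_size,)).fetchall()
    for row_id, agg_id, event_type, payload in rows:
        # Real code: producer.send(...) keyed by aggregate_id to preserve
        # per-entity ordering, then flush() and wait for broker acks
        # BEFORE deleting the rows.
        published.append((agg_id, event_type, json.loads(payload)))
    if rows:
        conn.executemany("DELETE FROM outbox WHERE id = ?",
                         [(r[0],) for r in rows])
        conn.commit()

relay_once()
print(len(published))  # 1 event published
print(conn.execute("SELECT count(*) FROM outbox").fetchone()[0])  # 0 rows left
```

Note that if the relay crashes after publishing but before the delete, the batch is published again on restart. The relay is at-least-once by design, which is exactly why downstream consumers still need idempotency keys.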
Strategy 3: Deterministic Keys and the Offset Trap
If you are using Kafka, you might think you don't need Redis to store idempotency keys. Can't you just use the Kafka offset?
"I'll just check if the current offset is greater than the last processed offset stored in my DB," is a common thought.
This works for updating your own database, but it fails for external side effects. Offsets are per-partition: after a consumer group rebalance, your "last processed offset" bookkeeping has to be airtight across every partition to avoid gaps. More fundamentally, committing an offset and calling an external API are two separate operations that can never be made atomic, so the offset alone can't tell you whether the side effect already happened.
Instead, generate a deterministic key based on the business logic.
```javascript
// Good: The key is derived from the data itself
const crypto = require('crypto');

function getIdempotencyKey(order) {
  // Copy before sorting so we don't mutate the caller's order object.
  const items = [...order.items].sort().join(',');
  const data = `${order.userId}-${items}-${order.total}`;
  return crypto.createHash('sha256').update(data).digest('hex');
}
```

By making the key deterministic, even if the producer sends the same logical event twice with different message IDs (because of a retry), the consumer can generate the same hash and recognize it as a duplicate.
The Problem of "Zombie" Instances
In a distributed environment, you often face the "Split Brain" or "Zombie" problem. Imagine Consumer A is processing a message. It takes a long time (a "stop-the-world" GC pause or a slow network). The broker thinks Consumer A is dead and reassigns the partition to Consumer B.
Now, both A and B are processing the same message. Consumer B finishes first. Then Consumer A wakes up and finishes its work.
To prevent this, you need Fencing Tokens. When using a database for idempotency, include a version number or a fence.
```sql
-- Update only if the version hasn't changed since we last checked
UPDATE process_state
SET status = 'COMPLETED', version = version + 1
WHERE task_id = 'task_456' AND version = 5;
```

If the update affects 0 rows, you know a "zombie" (another instance) has already progressed the state, and you should abort.
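Checking the affected-row count is how the fence actually takes effect in application code. A runnable sketch, with sqlite3 standing in for the real database (`cursor.rowcount` behaves the same way in most DB-API drivers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE process_state (task_id TEXT PRIMARY KEY, "
             "status TEXT, version INTEGER)")
conn.execute("INSERT INTO process_state VALUES ('task_456', 'RUNNING', 5)")

def complete_task(task_id, expected_version):
    cur = conn.execute(
        "UPDATE process_state SET status = 'COMPLETED', version = version + 1 "
        "WHERE task_id = ? AND version = ?", (task_id, expected_version))
    conn.commit()
    if cur.rowcount == 0:
        # The fence rejected us: another instance already advanced the state.
        return False
    return True

print(complete_task("task_456", expected_version=5))  # True: we won the race
print(complete_task("task_456", expected_version=5))  # False: stale fence, abort
```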
Making the Side Effect "Committable"
The hardest side effects are those that aren't inherently idempotent and don't support idempotency keys. Imagine a legacy internal API that just sends an SMS and returns { "status": "sent" }.
In these cases, you must transform the interaction into a two-step process:
1. The Reservation: Call the service to "reserve" an action (e.g., POST /sms/reserve). It returns a pending_id.
2. The Execution: Call the service to "confirm" the action (e.g., POST /sms/confirm/{pending_id}).
If the system crashes between step 1 and 2, you have a "pending" SMS that was never sent. You can then run a background job to clean up or re-trigger expired reservations. This is essentially implementing a mini-Two-Phase Commit (2PC) at the application level. It’s painful, but it's the only way to get close to 100% correctness.
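The two-step flow above can be sketched with an in-memory stand-in for the legacy service. Every name here is hypothetical; the contracts that matter are that reserving the same business reference twice reuses the reservation, and confirming the same pending_id twice is a no-op.

```python
import uuid

class SmsService:
    """In-memory stand-in for the legacy SMS service, extended with
    reserve/confirm endpoints. All names here are illustrative."""
    def __init__(self):
        self.pending = {}    # pending_id -> (ref, body)
        self.sent_refs = {}  # ref -> body, so very late retries are no-ops

    def reserve(self, ref, body):
        if ref in self.sent_refs:
            return None  # already delivered: nothing left to do
        for pid, (r, _) in self.pending.items():
            if r == ref:
                return pid  # a retry reuses the existing reservation
        pid = str(uuid.uuid4())
        self.pending[pid] = (ref, body)
        return pid

    def confirm(self, pending_id):
        entry = self.pending.pop(pending_id, None)
        if entry is not None:  # confirming twice is a no-op
            ref, body = entry
            self.sent_refs[ref] = body  # the SMS actually goes out here

svc = SmsService()

# Attempt 1: the worker reserves, then crashes before confirming.
svc.reserve(ref="order_42", body="Your order shipped!")

# Attempt 2 (redelivery): same ref reuses the reservation, then confirms.
pid = svc.reserve(ref="order_42", body="Your order shipped!")
svc.confirm(pid)

# Attempt 3 (a very late redelivery): reserve sees the ref is already done.
assert svc.reserve(ref="order_42", body="Your order shipped!") is None
print(len(svc.sent_refs))  # 1 SMS delivered
```

Note the `sent_refs` bookkeeping: without it, a redelivery arriving after a successful confirm would open a fresh reservation and send a second SMS, which is the very ghost this pattern exists to kill.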
Distributed Systems are Never "Finished"
The quest for exactly-once delivery often leads developers down a rabbit hole of configuration. But the truth is that the infrastructure can only take you 90% of the way there. The final 10%—the part where your code actually interacts with the world—is your responsibility.
To stop the "ghost" executions:
* Assume everything will be delivered at least twice.
* Use Idempotency Keys and pass them as far down the stack as possible.
* Use the Transactional Outbox Pattern to keep your internal state and your event stream in sync.
* Derive keys from business data, not from transient metadata like timestamps or random UUIDs generated at the time of sending.
Exactly-once is an illusion created by carefully managing at-least-once delivery combined with idempotent processing. Once you accept that "exactly-once" is a property of your *entire system architecture* rather than a toggle in a config file, you can finally stop the duplicates.
The next time someone tells you their system handles exactly-once delivery out of the box, ask them one question: "What happens if the network dies after the side effect but before the acknowledgment?" If they don't have an answer involving idempotency keys, keep your guard up.


