
The Day I Finally Navigated the Compaction Storm: A Deep Dive into LSM Trees

I tore apart the architecture of modern edge databases to understand why they favor Log-Structured Merge-Trees and how to survive the sudden latency spikes caused by background compaction.


B-Trees are a lie we tell ourselves to feel safe about random access, but in a realm of relentless, high-velocity write streams, they are essentially a performance tax that eventually bankrupts your latency budget. We’ve spent decades optimizing them, balancing their nodes, and clinging to their predictable O(log n) lookups, but the moment you move to the edge or start dealing with high-ingestion workloads, the B-Tree's insistence on in-place updates becomes a massive bottleneck.

A few months ago, I was monitoring a distributed edge KV store during a heavy data migration. Everything was smooth—until it wasn't. Out of nowhere, p99 latency spiked from 12ms to 450ms. The CPU didn't redline, and the network wasn't saturated. We were caught in a Compaction Storm.

To understand how to survive that storm, you have to understand why modern heavyweights like RocksDB, LevelDB, and ScyllaDB threw the B-Tree playbook out the window in favor of Log-Structured Merge-Trees (LSM Trees).

The Fundamental Bet: Append-Only Wins

The core philosophy of an LSM Tree is a bet: Sequential I/O is significantly faster than random I/O. Even with NVMe drives that handle random access better than spinning rust, the overhead of finding a specific block, reading it, modifying it, and writing it back (the B-Tree way) is astronomical compared to just shouting data onto the end of a file.

In an LSM Tree, we never modify a file in place. If you update a key, you don't find the old value and overwrite it. You just write the new value with a newer timestamp. Deleting a key? You write a "tombstone" marker.

This makes writes incredibly fast. But it creates a "debt" that we eventually have to pay.
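To make that contract concrete, here is a toy sketch (illustrative only, not any real engine's API): an update is just a newer record appended to the end, a delete is an appended tombstone, and a read scans newest-to-oldest so the most recent record wins.

```python
import time

TOMBSTONE = object()  # sentinel marking a deleted key (hypothetical representation)

class AppendOnlyLog:
    """Toy append-only store: every put/delete appends a record; nothing is rewritten."""

    def __init__(self):
        self.records = []  # (key, value, timestamp), newest entries at the end

    def put(self, key, value):
        self.records.append((key, value, time.monotonic_ns()))

    def delete(self, key):
        # A delete is just another write: a tombstone marker
        self.records.append((key, TOMBSTONE, time.monotonic_ns()))

    def get(self, key):
        # Scan newest-to-oldest; the most recent record for a key wins
        for k, v, _ in reversed(self.records):
            if k == key:
                return None if v is TOMBSTONE else v
        return None
```

Note that after a put, an update, and a delete, the log holds three records for the same key. That garbage is exactly the "debt" compaction must later collect.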

The Write Path: From MemTable to SSTable

Every write starts its life in two places: a Write-Ahead Log (WAL) on disk (for durability) and an in-memory sorted structure called a MemTable.

Here is a simplified look at how you might represent a MemTable in Python. A real engine keeps keys ordered with a skip list or a structure like SortedDict; a plain dict that we sort on flush keeps this sketch readable.

import threading

class MemTable:
    def __init__(self, capacity_bytes):
        self.data = {}  # In a real DB, use a SkipList or B-Tree for sorted keys
        self.capacity = capacity_bytes
        self.current_size = 0
        self.lock = threading.Lock()

    def put(self, key, value):
        with self.lock:
            # Simplified size tracking; subtract the old entry on overwrite
            # so repeated updates to the same key don't inflate the count
            if key in self.data:
                self.current_size -= len(key) + len(self.data[key])
            self.data[key] = value
            self.current_size += len(key) + len(value)

            if self.current_size >= self.capacity:
                return True # Signal that we need to flush to disk
            return False

    def get(self, key):
        return self.data.get(key)

    def get_sorted_items(self):
        return sorted(self.data.items())

Once that MemTable hits a size limit (say, 64MB), it becomes immutable. A new MemTable is created for incoming writes, and the old one is "flushed" to disk as an SSTable (Sorted String Table).
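A hedged sketch of that flush step (`flush_memtable` is a hypothetical helper, and the "key:value" line format is a toy stand-in for a real engine's block layout): the immutable MemTable's sorted items are written sequentially, then fsynced so the corresponding WAL segment can be dropped.

```python
import os
import tempfile  # used in the usage example below

def flush_memtable(sorted_items, directory, table_id):
    """Write an immutable MemTable's sorted items to disk as a naive SSTable.

    Real engines write length-prefixed binary blocks plus an index and a
    bloom filter; this sketch just appends "key:value" lines in key order.
    """
    path = os.path.join(directory, f"sstable_{table_id:06d}.sst")
    with open(path, "w", encoding="utf-8") as f:
        for key, value in sorted_items:
            f.write(f"{key}:{value}\n")
        f.flush()
        os.fsync(f.fileno())  # make the flush durable before dropping the WAL
    return path
```

Usage: `flush_memtable(memtable.get_sorted_items(), tempfile.mkdtemp(), 0)` produces one sorted, immutable file, which is all an SSTable fundamentally is.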

The Magic of the SSTable

The SSTable is the atomic unit of the LSM Tree. It is a file containing key-value pairs, sorted by key. Because it’s sorted, we can search it quickly, and more importantly, we can merge multiple SSTables efficiently using a merge-sort algorithm.

An SSTable usually consists of:
1. Data Blocks: The actual KV pairs.
2. Index Block: Offsets to help find keys without scanning the whole file.
3. Bloom Filter: A probabilistic data structure that tells us "This key definitely isn't in this file" or "This key *might* be in this file."

The Bloom Filter is the unsung hero. Without it, every GET request for a key that doesn't exist would force the database to check every single SSTable on disk.

from pybloom_live import BloomFilter # Requires: pip install pybloom-live

class SSTable:
    def __init__(self, filename, kv_pairs):
        self.filename = filename
        self.bloom = BloomFilter(capacity=len(kv_pairs), error_rate=0.1)
        self.index = {}
        
        # In reality, you'd write these to a file sequentially
        self._write_to_disk(kv_pairs)

    def _write_to_disk(self, kv_pairs):
        offset = 0
        with open(self.filename, 'wb') as f:
            for k, v in kv_pairs:
                self.bloom.add(k)
                self.index[k] = offset
                line = f"{k}:{v}\n".encode('utf-8')
                f.write(line)
                offset += len(line)

    def search(self, key):
        if key not in self.bloom:
            return None # Fast path: Key definitely doesn't exist here
        
        # Only if bloom says "maybe", we do the disk I/O
        offset = self.index.get(key)
        if offset is not None:
            return self._read_from_offset(offset)
        return None

    def _read_from_offset(self, offset):
        # Seek straight to the record and read one "key:value" line
        with open(self.filename, 'rb') as f:
            f.seek(offset)
            line = f.readline().decode('utf-8').rstrip('\n')
            return line.split(':', 1)[1]

The Compaction Storm: When Debt Comes Due

Here is where my latency spike happened. Over time, you end up with dozens, then hundreds of SSTable files. This leads to Read Amplification: to find a single key, you might have to check multiple files (even with Bloom filters).

To fix this, the database runs a background process called Compaction. It picks a few SSTables and merges them into a new, larger, sorted SSTable. During this merge, it throws away old versions of keys and removes tombstones.

There are two main strategies, and choosing the wrong one for your workload is exactly how you trigger a "storm."

1. Size-Tiered Compaction (STCS)

Common in Cassandra. It groups SSTables of similar sizes. When you have four 100MB tables, it merges them into one 400MB table.
- The Good: Great for write-heavy workloads. Very simple.
- The Bad: It requires a lot of "headroom." If you have a 500GB dataset, you might suddenly need another 500GB of free space just to perform the merge. This is where IOPS get throttled, and your p99s go to hell.
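A rough sketch of how size-tiered selection might pick its victims. The bucketing rule and thresholds here are illustrative, loosely modeled on Cassandra's behavior rather than its exact algorithm:

```python
def pick_compaction_bucket(table_sizes, min_threshold=4, bucket_ratio=2.0):
    """Group SSTables of similar size into buckets; return the first bucket
    with enough members to be worth merging, else None.

    min_threshold and bucket_ratio are illustrative knobs, not Cassandra's
    exact defaults or algorithm.
    """
    buckets = []
    for size in sorted(table_sizes):
        # Put this table in the first bucket whose smallest member is
        # within bucket_ratio of it, otherwise start a new bucket
        for bucket in buckets:
            if size <= bucket[0] * bucket_ratio:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket  # merge these into one bigger table
    return None
```

Notice that a returned bucket of four 100MB tables implies writing a fresh ~400MB file while the originals still exist on disk; that transient doubling is the headroom problem.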

2. Leveled Compaction (LCS)

Common in RocksDB and LevelDB. Data is organized into levels (L0, L1, L2...). Each level has a total size limit (e.g., L1 is 100MB, L2 is 1GB).
- The Good: Minimal Read Amplification. Most reads can be satisfied by looking at a very small number of files.
- The Bad: High Write Amplification. A single piece of data might be read and written many times as it moves from L0 down to L6.
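A back-of-envelope way to see that cost, assuming a fanout of 10 between levels (the constants here are illustrative accounting, not RocksDB's exact model):

```python
def leveled_write_amplification(compacted_levels, fanout=10):
    """Rough estimate of leveled-compaction write amplification.

    Each byte is written once to the WAL, once on flush to L0, and then
    roughly `fanout` times per level it descends, because merging it into
    a level rewrites ~fanout existing bytes alongside it. Illustrative
    arithmetic only.
    """
    return 2 + fanout * compacted_levels
```

With six compacted levels, every logical byte can turn into roughly 62 physical bytes written, which is why leveled compaction trades read amplification for write amplification.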

Navigating the Storm: Practical Fixes

When I looked at our edge DB logs during the spike, I saw that the compaction threads were fighting for the same I/O bandwidth as the flush threads. The system couldn't clear the MemTables fast enough because the disk was busy merging massive SSTables in the background.

If you find yourself in a Compaction Storm, here is your survival kit:

1. The "Backpressure" Lever

Most modern LSM engines (like RocksDB) have a feature called "Write Delay" or "Write Stalls." If the background compaction falls too far behind, the database intentionally slows down incoming writes.

It sounds counter-intuitive—why would I want to slow down my app? Because the alternative is the database crashing or latency jumping to 10 seconds. You’d rather have a consistent 50ms than a jagged 10ms-to-1000ms range.
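A sketch of how such a stall policy might compute its delay. The thresholds and the linear ramp are assumptions for illustration, not RocksDB's actual formula (RocksDB exposes similar knobs as soft/hard pending-compaction limits):

```python
def write_delay_seconds(pending_compaction_bytes,
                        slowdown_threshold=64 * 2**20,   # 64MB: start slowing down
                        stop_threshold=256 * 2**20,      # 256MB: full stall
                        max_delay=0.1):
    """Turn compaction backlog into an artificial per-write delay.

    Below slowdown_threshold: no delay. Between the two thresholds the
    delay ramps linearly up to max_delay. At stop_threshold we'd block
    entirely (modeled here as max_delay). All constants are made up.
    """
    if pending_compaction_bytes < slowdown_threshold:
        return 0.0
    if pending_compaction_bytes >= stop_threshold:
        return max_delay
    frac = (pending_compaction_bytes - slowdown_threshold) / \
           (stop_threshold - slowdown_threshold)
    return frac * max_delay
```

The point of the ramp is to shed load gradually: writers feel gentle backpressure long before the disk is hopelessly behind, instead of a cliff.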

2. Rate Limiting Compaction

Don't let compaction use 100% of your disk bandwidth. In RocksDB, you can set bytes_per_sync or use a RateLimiter.

// Example C++ snippet for RocksDB tuning
std::shared_ptr<rocksdb::RateLimiter> rate_limiter = 
    rocksdb::NewGenericRateLimiter(10 * 1024 * 1024); // Limit to 10MB/s

rocksdb::Options options;
options.rate_limiter = rate_limiter;

3. Partitioning the Load

If you’re running on the edge, you likely have many small tenants. Instead of one massive LSM tree, use multiple smaller ones. This prevents a single massive compaction event from locking up the entire I/O subsystem.
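A minimal sketch of that idea, with plain dicts standing in for per-tenant trees (the class and its sharding rule are hypothetical):

```python
class PartitionedStore:
    """Route each tenant to its own small tree (a dict stands in here),
    so one tenant's compaction can't monopolize I/O for everyone."""

    def __init__(self, num_shards=8):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, tenant_id):
        # A stable hash keeps each tenant pinned to one tree
        return self.shards[hash(tenant_id) % len(self.shards)]

    def put(self, tenant_id, key, value):
        self._shard_for(tenant_id)[key] = value

    def get(self, tenant_id, key):
        return self._shard_for(tenant_id).get(key)
```

Smaller trees mean smaller SSTables, which means each individual compaction moves less data and finishes before anyone's p99 notices.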

The Merge Logic: A Simplified View

To understand why compaction is so heavy, look at the code. It’s essentially a multi-way merge sort. We are constantly streaming data off the disk, comparing keys, and streaming it back.

import heapq

def merge_sstables(sstable_iterators):
    """
    sstable_iterators: a list of iterators, 
    each yielding (key, value, timestamp) from an SSTable
    """
    # Use a min-heap to keep track of the smallest key across all SSTables
    heap = []
    
    for i, it in enumerate(sstable_iterators):
        try:
            k, v, ts = next(it)
            heapq.heappush(heap, (k, -ts, i, v, it)) # -ts for latest timestamp first
        except StopIteration:
            pass

    last_key = None
    while heap:
        k, neg_ts, i, v, it = heapq.heappop(heap)
        
        # If we see the same key twice, we only keep the one with
        # the highest timestamp (the first one out of our heap).
        # A bottom-level compaction would additionally skip tombstone
        # values here instead of yielding them.
        if k != last_key:
            yield (k, v)
            last_key = k
        
        # Advance the iterator that we just popped from
        try:
            next_k, next_v, next_ts = next(it)
            heapq.heappush(heap, (next_k, -next_ts, i, next_v, it))
        except StopIteration:
            pass

Imagine running this logic on four files, each 2GB in size. You are reading 8GB and writing 8GB. If your disk can only do 200MB/s, you've just blocked your I/O for 80 seconds. That is your storm.

Why the Edge Loves LSM (Despite the Storms)

You might ask: "If B-Trees are more predictable, why bother with the complexity of LSM Trees?"

For edge databases like Turso (built on libSQL/SQLite) or local stores for IoT devices, the answer is SSD Health and Throughput.
1. Wear Leveling: SSDs hate random writes. They have a limited number of erase cycles. By writing sequentially (LSM style), we play nicely with the SSD’s internal flash management.
2. Concurrency: In an LSM tree, we don't need heavy row-level locking for writes because we are always appending. This is massive for high-concurrency edge environments.

Lessons Learned from the Trenches

The "Compaction Storm" isn't a bug; it's a fundamental characteristic of Log-Structured Merge-Trees. It is the cost of doing business in a write-optimized world.

If you are building or using a database based on LSM Trees:
- Monitor your SSTable count. If it's growing indefinitely, your compaction strategy is failing.
- Don't fill your disk beyond 70%. Compaction needs breathing room to merge files.
- Size your MemTables wisely. Too small, and you create too many tiny SSTables (Read Amplification). Too large, and the flush pauses become noticeable.

In my case, the fix was a combination of switching from Size-Tiered to Leveled compaction and implementing a strict rate limiter on background I/O. The p99s stabilized, the disk stopped screaming, and I finally understood that in the world of databases, you can't avoid the work—you can only decide when you're going to pay for it.