
The Linux Futex Is a Sleeping Giant

A deep dive into the fast userspace mutex and how it enables high-concurrency applications to synchronize without the heavy tax of kernel context switching.


If you run strace on a high-performance multithreaded Linux application, you’ll likely see a flurry of futex(0x..., FUTEX_WAIT_PRIVATE, ...) calls. Most developers treat these as background noise—the plumbing of the pthread library. But the futex (Fast Userspace Mutex) is arguably the most critical synchronization primitive in the Linux kernel. Before its introduction in kernel 2.5.7, every single lock acquisition required a transition into kernel space, even if there was no contention. In a world of high-concurrency C++, Rust, and Go applications, that overhead would be a death sentence for performance.

The magic of the futex lies in its name: Fast Userspace. It allows a program to attempt a lock acquisition entirely in userspace using atomic operations. The kernel only gets involved when things get messy—specifically, when a thread needs to actually go to sleep because a resource is blocked, or when a thread needs to be woken up.

The Problem: The Tax of the Syscall

To appreciate the futex, you have to remember how we used to do things. In the early days of Linux, synchronization often relied on heavy-duty system calls. Every lock() and unlock() meant a round trip into the kernel, whether anyone was contending for the lock or not.

A trip into the kernel isn't free. The CPU has to save registers, flush pipelines, and potentially deal with TLB (Translation Lookaside Buffer) misses, and that's before any actual context switch to another thread. On modern hardware, a syscall can take anywhere from several hundred to a few thousand clock cycles. If your critical section only does ten cycles of work (like incrementing a counter), the "tax" of the lock is easily 100x the cost of the work.
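You can measure this tax yourself. A rough sketch (the helper names are my own; the numbers vary wildly by CPU and kernel, so treat the results as an order-of-magnitude hint only):

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static inline long long ns_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

// Average cost, in nanoseconds, of one atomic increment.
long long atomic_ns_per_op(int iters) {
    static atomic_long counter;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        atomic_fetch_add(&counter, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ns_between(t0, t1) / iters;
}

// Average cost of one genuine kernel round trip. We go through syscall(2)
// so no userspace caching of getpid can short-circuit it.
long long syscall_ns_per_op(int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ns_between(t0, t1) / iters;
}
```

On typical hardware the second number is one to two orders of magnitude larger than the first.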

We want the "Fast Path" to be a simple atomic instruction.

#include <stdatomic.h>
#include <stdbool.h>

// A naive spinlock - no kernel involvement, but burns CPU
void spin_lock(atomic_int *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(lock, &expected, 1)) {
        expected = 0; // Reset expected for the next loop
        // This is the "Fast Path" but if it fails, 
        // we just spin and waste electricity.
    }
}

The spinlock is fast if the lock is held for a microsecond. But if the lock is held for 10 milliseconds, spinning is a disaster for power consumption and cache contention. We need a way to sleep. That’s where the futex bridges the gap between atomic userspace operations and kernel-level scheduling.

Anatomy of a Futex Call

A futex isn't a complex object in the kernel. In fact, the kernel doesn't even "own" the futex state. The state is just a 32-bit integer in your application's memory. You tell the kernel: "Hey, look at this address. If the value there is still 1, put me to sleep. If it's changed, don't bother."

The basic signature (simplified) looks like this:

long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
            const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3);

- uaddr: The address of your 32-bit integer in userspace.
- futex_op: What you want to do (FUTEX_WAIT, FUTEX_WAKE, etc.).
- val: The expected value.
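glibc provides no wrapper for this syscall, so programs define their own thin helpers. Something like the following (futex_wait and futex_wake are my own names, not a standard API):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdint.h>

// Sleep only if *uaddr still equals expected_val (the kernel re-checks
// this atomically). Returns 0 when woken, -1 with errno EAGAIN if the
// value had already changed.
static long futex_wait(uint32_t *uaddr, uint32_t expected_val) {
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected_val, NULL, NULL, 0);
}

// Wake up to nr_waiters threads sleeping on uaddr.
// Returns the number of threads actually woken.
static long futex_wake(uint32_t *uaddr, int nr_waiters) {
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_waiters, NULL, NULL, 0);
}
```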

Implementing a Basic Mutex with Futex

Let's build a minimalist mutex. It’s not as robust as pthread_mutex_t, but it shows the "Fast Path/Slow Path" logic. We'll use three states:
- 0: Unlocked.
- 1: Locked, no contention.
- 2: Locked, with potential waiters (requires a syscall to wake).

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_int state;
} simple_mutex_t;

void mutex_lock(simple_mutex_t *m) {
    int expected = 0;
    // FAST PATH: Try to flip from 0 to 1
    if (atomic_compare_exchange_strong(&m->state, &expected, 1)) {
        return; // Success! No kernel involvement.
    }

    // SLOW PATH: Contention detected. Mark the lock as contended (2).
    // The value the exchange returns tells us whether the lock was
    // actually free (0) at that instant; if so, we now own it.
    if (expected != 2)
        expected = atomic_exchange(&m->state, 2);

    while (expected != 0) {
        // Tell the kernel: sleep if the value at &m->state is still 2
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);
        // Woken up (or the value had already changed): try again to
        // take the lock in contended mode.
        expected = atomic_exchange(&m->state, 2);
    }
}

void mutex_unlock(simple_mutex_t *m) {
    // If the state was 1 (no waiters), fetch_sub takes it straight to 0
    // and we are done. No syscall needed.
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        // State was 2: there may be waiters. Fully release, then wake one.
        atomic_store(&m->state, 0);
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}

In the mutex_lock function, the first atomic_compare_exchange_strong is our golden ticket. If the mutex is free, we grab it and move on. This takes roughly 15-30 nanoseconds. If we fail, we mark the mutex as having waiters (state = 2) and call FUTEX_WAIT.
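A quick way to convince yourself the lock works is to hammer it from several threads: the shared counter below only ends up exact if mutual exclusion holds. The mutex is restated so the snippet compiles on its own, and run_contention_test is a hypothetical harness name:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <pthread.h>

typedef struct { atomic_int state; } simple_mutex_t;

static void mutex_lock(simple_mutex_t *m) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;                              // fast path
    if (expected != 2)
        expected = atomic_exchange(&m->state, 2);
    while (expected != 0) {                  // slow path
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);
        expected = atomic_exchange(&m->state, 2);
    }
}

static void mutex_unlock(simple_mutex_t *m) {
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}

static simple_mutex_t big_lock;
static long shared_counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        mutex_lock(&big_lock);
        shared_counter++;        // plain increment, protected by the lock
        mutex_unlock(&big_lock);
    }
    return NULL;
}

// Runs nthreads workers (up to 16); returns the final counter value.
long run_contention_test(int nthreads) {
    pthread_t tids[16];
    shared_counter = 0;
    atomic_store(&big_lock.state, 0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return shared_counter;
}
```

If the lock were broken, lost updates would leave the counter short of nthreads * 100000.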

The Race Condition That Almost Was

You might look at FUTEX_WAIT and see a massive race condition. What if the thread holding the lock releases it *after* we decide to sleep, but *before* we actually enter the kernel?

1. Thread A: Sees state == 2.
2. Thread B: Sets state = 0, calls FUTEX_WAKE. (No one is sleeping yet, so wake does nothing).
3. Thread A: Calls FUTEX_WAIT.

Thread A would sleep forever.

The kernel solves this by making FUTEX_WAIT atomic with respect to the value check. When you call FUTEX_WAIT, you pass in the val you expect (in our code, 2). The kernel re-reads the value at uaddr. If the value at uaddr is no longer 2, the system call fails immediately with EAGAIN. This is the "Sleeping Giant's" secret: it refuses to sleep if the world has changed since you last looked.
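You can see this guard from userspace: pass an expected value that is already stale, and FUTEX_WAIT refuses to sleep. A minimal demo:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <errno.h>
#include <stdint.h>

// Returns the errno produced by FUTEX_WAIT when the expected value is
// already stale. The kernel compares *uaddr against the expected value
// under its internal lock and bails out instead of sleeping.
int stale_wait_errno(void) {
    uint32_t futex_word = 0;   // the "world" has already changed
    long rc = syscall(SYS_futex, &futex_word, FUTEX_WAIT,
                      2 /* but we claim to expect 2 */, NULL, NULL, 0);
    return rc == -1 ? errno : 0;
}
```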

Why Not Just Spin?

I've seen many "performance" libraries try to replace futexes with pure spinlocks. It’s a seductive idea. "I have 64 cores, I can afford to spin!"

The reality is harsher. Spinlocks are vulnerable to priority inversion and preemption. If the thread holding your spinlock is preempted by the OS scheduler (maybe for a background cron job or a network interrupt), every other thread trying to get that lock will spin for their entire time slice, doing zero work and generating heat.

The futex is smarter. It allows a thread to yield its time slice back to the scheduler. It says, "I can't do anything useful right now, go find a thread that can."
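Production locks often split the difference: spin briefly in case the holder is mid-critical-section on another core, then fall back to sleeping. A sketch of that hybrid, using the same 0/1/2 protocol as before (the spin count of 100 is an arbitrary knob, not a recommendation):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>

typedef struct { atomic_int state; } adaptive_lock_t; // 0 free, 1 held, 2 held + waiters

void adaptive_lock(adaptive_lock_t *l) {
    // Phase 1: bounded spinning. Cheap if the holder releases soon.
    for (int i = 0; i < 100; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->state, &expected, 1))
            return;
    }
    // Phase 2: give the CPU back. Same contended protocol as the mutex.
    int prev = atomic_exchange(&l->state, 2);
    while (prev != 0) {
        syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
        prev = atomic_exchange(&l->state, 2);
    }
}

void adaptive_unlock(adaptive_lock_t *l) {
    if (atomic_fetch_sub(&l->state, 1) != 1) {
        atomic_store(&l->state, 0);
        syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}
```

This is roughly the strategy behind "adaptive" pthread mutexes: optimistic spinning with a sleeping safety net.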

Inside the Kernel: The Hash Table

Where does the kernel keep track of all these waiting threads? It doesn't attach them to the userspace address directly (the kernel doesn't want to pollute your memory). Instead, it maintains a global (but sharded) hash table of "wait queues."

When you call FUTEX_WAIT, the kernel:
1. Derives a key from uaddr (for process-private futexes, the mm plus the virtual address; for shared futexes, the underlying physical page) and hashes it.
2. Finds the corresponding bucket in the global futex hash table.
3. Adds your task to a linked list in that bucket.
4. Puts your task into the TASK_INTERRUPTIBLE state.

When someone calls FUTEX_WAKE, the kernel hashes the same address, finds the bucket, and wakes up to N tasks from the list, where N is the val argument. The call returns how many were actually woken.

The Gotcha: Because this is a hash table, collisions can happen. Two entirely different locks in two different applications might hash to the same bucket. This is why the kernel must verify the address carefully. It's also why, in extremely high-concurrency scenarios, the global futex hash table can become a bottleneck (though modern kernels shard this table significantly to reduce contention).
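The wake side of this lookup is observable from userspace: since FUTEX_WAKE returns how many tasks it actually found and woke, waking an address nobody sleeps on returns zero:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdint.h>

// Ask the kernel to wake up to 1000 waiters on an address nobody has
// ever slept on. The hash lookup finds no matching entries in the
// bucket's list, so the return value is 0.
long wake_with_no_waiters(void) {
    uint32_t lonely_futex = 42;
    return syscall(SYS_futex, &lonely_futex, FUTEX_WAKE, 1000, NULL, NULL, 0);
}
```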

FUTEX_REQUEUE: The Optimization You Didn't Know You Needed

If you've ever looked at how pthread_cond_broadcast could be implemented naively, it's a thundering herd nightmare: you wake up 100 threads, all 100 race to grab the same mutex, and 99 of them lose and go straight back to sleep. That's 99 wasted round trips through the scheduler.

The Linux-specific FUTEX_REQUEUE (and its smarter sibling FUTEX_CMP_REQUEUE) solves this. Instead of waking 100 threads, the kernel wakes 1 thread and moves the other 99 from the condition variable’s futex wait-queue directly to the mutex’s futex wait-queue. They stay asleep, but they are now "queued" for the lock. No unnecessary context switches.
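Calling it raw is awkward (the requeue limit travels in the argument slot normally used for the timeout pointer), but the shape looks roughly like this. cond_seq and mutex_state are hypothetical names, and building a correct condition variable on top of this primitive is famously subtle:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <limits.h>
#include <stdint.h>

// Wake one waiter on cond_seq and move up to INT_MAX others onto
// mutex_state's wait queue, but only if *cond_seq still equals
// expected_seq (otherwise the kernel returns EAGAIN, as with
// FUTEX_WAIT). Returns the number of threads woken plus requeued.
long cond_broadcast_requeue(uint32_t *cond_seq, uint32_t *mutex_state,
                            uint32_t expected_seq) {
    return syscall(SYS_futex, cond_seq, FUTEX_CMP_REQUEUE,
                   1,                           /* wake this many...   */
                   (void *)(uintptr_t)INT_MAX,  /* ...requeue the rest */
                   mutex_state, expected_seq);
}
```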

This is the kind of engineering that makes Linux scale.

The Complexity of Priority Inheritance

One of the darkest corners of synchronization is Priority Inversion. Imagine a low-priority thread (Low) holds a lock. A high-priority thread (High) wants it and sleeps. Then a medium-priority thread (Med) comes along and starts a CPU-intensive task. Since Med has higher priority than Low, Low never gets to run, and thus never releases the lock. High is now effectively blocked by Med.

The futex subsystem includes "PI-futexes" (FUTEX_LOCK_PI) to handle this. When a high-priority thread blocks on a futex held by a low-priority thread, the kernel temporarily boosts the low-priority thread to the priority of the waiter.

This adds massive complexity to the futex code. The kernel now has to track ownership of the futex—something it usually avoids.
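The userspace half of the PI protocol, as documented in futex(2), differs from our 0/1/2 mutex: the futex word holds the owner's thread ID, which is how the kernel knows whom to boost. A sketch of the fast paths; the slow paths defer entirely to the kernel:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>

// PI futex word: 0 = free, otherwise the owner's TID (plus kernel-managed
// flag bits such as FUTEX_WAITERS once contention appears).
void pi_lock(atomic_uint *futex_word) {
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    uint32_t expected = 0;
    // FAST PATH: free -> our TID. The kernel never hears about it.
    if (atomic_compare_exchange_strong(futex_word, &expected, tid))
        return;
    // SLOW PATH: the kernel records us as a waiter, boosts the current
    // owner's priority if ours is higher, and hands us the lock later.
    syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

void pi_unlock(atomic_uint *futex_word) {
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    uint32_t expected = tid;
    // FAST PATH: still just our TID, so nobody is waiting.
    if (atomic_compare_exchange_strong(futex_word, &expected, 0))
        return;
    // SLOW PATH: waiter bits are set; the kernel picks the next owner.
    syscall(SYS_futex, futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```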

Practical Advice for the Systems Programmer

You should rarely call syscall(SYS_futex, ...) directly. It’s unportable, the glibc wrappers are non-existent, and it’s easy to get the memory ordering wrong. However, understanding it changes how you write code.

1. Keep critical sections small. The futex's "Fast Path" only works if the lock is released quickly. If you hold a lock for a long time, you force everyone into the kernel, defeating the purpose.
2. Beware of false sharing. Since futexes are identified by their address, if you put two futexes on the same cache line, you might get "false sharing" performance hits at the CPU cache level, even if the kernel handles the hash buckets correctly.
3. Memory Ordering Matters. Notice I used atomic_compare_exchange_strong. In a real implementation, you'd likely use memory_order_acquire and memory_order_release. If you get these wrong, the "Fast Path" might see an unlocked state while the "Slow Path" thinks it's locked.
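To make point 3 concrete, here is the fast path with explicit orderings. The acquire CAS pairs with the release store, so the previous owner's writes are visible to the next owner. This is only a sketch of the uncontended half, not a full futex mutex:

```c
#include <stdatomic.h>
#include <stdbool.h>

// The acquire on lock synchronizes-with the release on unlock:
// everything written inside the critical section happens-before the
// next successful acquisition.
bool try_lock_fast(atomic_int *state) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        state, &expected, 1,
        memory_order_acquire,   // success: pull in the prior owner's writes
        memory_order_relaxed);  // failure: we learned nothing worth ordering
}

void unlock_fast(atomic_int *state) {
    // Release: publish our critical-section writes before freeing the lock.
    atomic_store_explicit(state, 0, memory_order_release);
}
```

Using seq_cst everywhere (as the examples above do) is correct but leaves performance on the table on weakly ordered architectures like ARM.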

The Future: futex2?

For years, there has been talk of a futex2 system call. The primary motivation? Variable-sized futexes. Currently, a futex must be a 32-bit int. But what if you want to synchronize on an 8-bit char to save memory in a massive array? Or a 64-bit value to avoid ABA problems?

The game-changer, however, is waiting on multiple futexes at once. This is something Windows has had for a long time (WaitForMultipleObjects). For Linux, it was a missing piece for things like Wine (the Windows compatibility layer), which had to emulate the behavior using complex worker threads. Kernel 5.16 introduced futex_waitv, allowing a thread to sleep on an array of futexes. This is a massive win for game engines and compatibility layers.
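Here is roughly what using it looks like. The struct layout and syscall number below are copied from the 5.16 uapi headers and guarded so this compiles against older userspace; on a pre-5.16 kernel the call simply fails with ENOSYS:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#ifndef FUTEX_32
#define FUTEX_32 2               /* "this futex is a 32-bit word" */
#endif
#ifndef __NR_futex_waitv
#define __NR_futex_waitv 449     /* same number on all arches since 5.16 */
#endif

struct my_futex_waitv {          // mirrors struct futex_waitv (5.16 uapi)
    uint64_t val;                // expected value at uaddr
    uint64_t uaddr;              // userspace address, widened to u64
    uint32_t flags;              // e.g. FUTEX_32
    uint32_t reserved;           // must be zero
};

// Sleep until either futex is woken; returns the index of the one that
// woke us, or -1 on error (EAGAIN here, since the test passes expected
// values that are deliberately stale; ENOSYS on older kernels).
long wait_on_two(uint32_t *a, uint32_t *b,
                 uint32_t expect_a, uint32_t expect_b) {
    struct my_futex_waitv waiters[2] = {
        { expect_a, (uint64_t)(uintptr_t)a, FUTEX_32, 0 },
        { expect_b, (uint64_t)(uintptr_t)b, FUTEX_32, 0 },
    };
    return syscall(__NR_futex_waitv, waiters, 2, 0, NULL, CLOCK_MONOTONIC);
}
```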

Closing Thoughts

The futex is a testament to the "do only what is necessary" philosophy of Linux kernel design. It recognizes that the kernel is an expensive mediator and provides a mechanism to bypass it whenever possible.

Next time you see a pthread_mutex or a Rust parking_lot::Mutex, remember that underneath that clean API is a 32-bit integer and a kernel that is trying very hard to stay out of your way. The "Sleeping Giant" only wakes up when it has to, and that is why our modern systems can handle thousands of threads without collapsing under the weight of their own management.