loke.dev

The Night My NVMe Drive Ignored Me: How I Finally Unmasked the Linux I/O Scheduler

Explore why your high-speed SSD still suffers from mystery latency spikes and how the Linux block layer's reordering logic can make or break your database performance.

· 9 min read

Have you ever stared at a monitoring dashboard, watching your P99 latency climb into the hundreds of milliseconds, while your "blazing fast" NVMe drive reports only 5% utilization?

It happened to me at 2:00 AM on a Tuesday. We were running a high-throughput PostgreSQL cluster on Gen4 NVMe drives. On paper, these drives can push hundreds of thousands of IOPS. In reality, our application was choking. Every few minutes, a simple SELECT by ID—something that should take microseconds—would hang for 200ms.

The disk wasn't busy. The CPU wasn't pinned. The network was quiet. The culprit was a silent middleman I hadn't thought about in years: the Linux I/O scheduler.

The Lie We Tell Ourselves About Modern Storage

We like to think of NVMe drives as magic black boxes where data goes in and comes out instantly. Back in the days of spinning rust (HDDs), we knew the kernel had to be smart. It had to group requests together so the physical disk arm didn't have to fly back and forth across the platters like a caffeinated hummingbird. We called this the "elevator" algorithm: sorting and merging I/O to minimize seeks.

But NVMe has no moving parts. The spec allows up to 65,535 I/O queues per device, each up to 65,536 commands deep. So, the conventional wisdom says: "Just use none (or, on legacy single-queue kernels, noop). Get the kernel out of the way."

That wisdom is often wrong.

The Linux block layer is a complex beast. Even with NVMe, your I/O requests don't just jump from your application to the NAND flash. They pass through the virtual file system (VFS), the block layer, and finally the blk-mq (Multi-Queue) architecture. If your scheduler is misconfigured for your specific workload, the kernel might spend more time "thinking" about how to order the I/O than the drive spends actually performing it.

Checking the Pulse of Your Block Device

Before you can fix the problem, you have to see it. Most people look at iostat, but iostat gives you averages. Averages are where latency spikes go to hide.

Instead, look at what the kernel actually thinks it's doing. You can find your current scheduler by catting the queue/scheduler file in sysfs. Replace nvme0n1 with your actual device name:

cat /sys/block/nvme0n1/queue/scheduler

On a modern Ubuntu or RHEL system, you’ll likely see something like this:

[none] mq-deadline kyber bfq

The brackets [ ] indicate the active scheduler. If it says none, the kernel is doing basic FIFO (First-In-First-Out) bypass. If it says mq-deadline, the kernel is trying to prevent "starvation" by ensuring no single request waits too long.
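If you're auditing a box with several devices, a quick loop over sysfs saves typing. A minimal sketch (the `list_schedulers` helper and its optional sysfs-root argument are mine, for illustration):

```shell
# Print "<device>: <scheduler line>" for every block device.
# Takes an optional sysfs root so it can be exercised against a fake tree.
list_schedulers() {
  root="${1:-/sys}"
  for f in "$root"/block/*/queue/scheduler; do
    [ -e "$f" ] || continue          # glob matched nothing
    dev="${f#"$root"/block/}"        # strip the leading path...
    dev="${dev%/queue/scheduler}"    # ...and the trailing path
    printf '%s: %s\n' "$dev" "$(cat "$f")"
  done
}

list_schedulers
```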

The Anatomy of a Hang: Why mq-deadline Might Be Killing You

In my 2:00 AM crisis, we were using mq-deadline. On the surface, mq-deadline is great. It prioritizes reads over writes, which is usually what you want for a database. But mq-deadline has a quirk: it tries to merge adjacent I/O requests to improve throughput.

When our database was doing a massive background write (a checkpoint), mq-deadline was busy trying to batch those writes together. Meanwhile, my tiny, urgent SELECT query was getting stuck in the "Read Queue." The scheduler was so focused on optimizing the "big" throughput that it added overhead to the "small" latency-sensitive tasks.

To see if this is happening to you, you need to look at the distribution of your I/O latency. I used biolatency from the BCC tools suite:

# Requires bpfcc-tools (Ubuntu) or bcc-tools (RHEL); the binary may be
# named plain "biolatency" depending on the package
sudo biolatency-bpfcc 1 10
# Useful flags: -Q includes time spent queued inside the kernel,
# -D prints a separate histogram per disk:
# sudo biolatency-bpfcc -Q -D 1 10

This tool produces a histogram. If you see a "bi-modal" distribution—where most I/O is fast but a small cluster is very slow—you have a scheduler or queuing problem.

The Contenders: Choosing Your Weapon

Linux gives us a few choices for the multi-queue era. Let's break down the logic behind each.

1. none (The Purist)

This is exactly what it sounds like. No reordering, no merging, no scheduling. It hands the I/O to the hardware as fast as the CPU can process it.
* Best for: High-end NVMe drives with massive internal controllers, extremely fast CPUs, and workloads that are already well-ordered.
* The Gotcha: If your application is poorly written and sends thousands of tiny, non-contiguous writes, you might actually lose performance because you're not taking advantage of the kernel's ability to merge them.

2. mq-deadline (The Safe Bet)

It groups requests into batches and enforces an expiry deadline on each request (by default 500ms for reads and 5 seconds for writes).
* Best for: General purpose servers, mixed workloads where you want to ensure reads don't get stuck behind a massive write dump.
* The Gotcha: The "merging" logic can introduce CPU jitter. On a system with 64+ cores, the locking required to manage these queues can become a bottleneck.
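If you want to inspect (or loosen) those deadlines on a live box, mq-deadline exposes them in milliseconds under queue/iosched/ while it is the active scheduler. A sketch; the helper name is mine:

```shell
# Show mq-deadline's expiry deadlines (milliseconds) for a device.
# The iosched/ knobs only exist while mq-deadline is active on it.
show_deadline_expiry() {
  dir="/sys/block/$1/queue/iosched"
  [ -e "$dir/read_expire" ] || { echo "mq-deadline not active on $1" >&2; return 1; }
  printf 'read_expire:  %s ms\n' "$(cat "$dir/read_expire")"
  printf 'write_expire: %s ms\n' "$(cat "$dir/write_expire")"
}

# Usage: show_deadline_expiry nvme0n1
```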

3. kyber (The Smart One)

Developed by Facebook, Kyber is designed specifically for fast flash storage. It works by setting target latencies. If the latency for reads exceeds the target, it throttles the "extra" work (like background syncs) to clear the path.
* Best for: Latency-sensitive web servers and databases on NVMe.
* The Gotcha: It’s not available on all kernels (requires 4.12+) and requires some tuning to find the "sweet spot" for your hardware.
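For that tuning, kyber exposes its latency targets in nanoseconds under queue/iosched/ while it is active. A hedged sketch (the helper name is mine; the knob names come from the mainline kyber scheduler, where the defaults are 2ms for reads and 10ms for writes):

```shell
# Set kyber's read latency target (nanoseconds) for a device.
# Fails cleanly if kyber isn't the active scheduler (or we're not root).
set_kyber_read_target() {
  knob="/sys/block/$1/queue/iosched/read_lat_nsec"
  [ -w "$knob" ] || { echo "kyber not active on $1 (or not root)" >&2; return 1; }
  echo "$2" > "$knob"
}

# Example: tighten the read target from the 2 ms default to 1 ms
# set_kyber_read_target nvme0n1 1000000
```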

4. bfq (Budget Fair Queuing)

This is the successor to the old CFQ. It’s incredibly complex and tries to give every process a fair share of bandwidth.
* Best for: Desktops. If you're compiling a kernel while trying to watch a 4K video, bfq makes sure the video doesn't stutter.
* The Gotcha: Never use this for a high-performance database. The CPU overhead of calculating "fairness" will destroy your IOPS.

Real-World Testing: How I Flipped the Switch

Back to my Tuesday morning. I decided to test kyber against none. Changing the scheduler is one of the few things in Linux you can do safely at runtime without unmounting the disk or restarting services.

# Switch to kyber
echo "kyber" | sudo tee /sys/block/nvme0n1/queue/scheduler

# Or switch to none
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

I ran a quick synthetic test using fio. I wanted to simulate our database workload: random 4k reads mixed with heavy sequential writes.

# Save this as test_io.fio
# WARNING: this targets the raw device and will destroy any data or
# filesystem on /dev/nvme0n1. Only run it against a scratch disk.
[global]
ioengine=libaio
direct=1
group_reporting
runtime=60
time_based

[random-read]
filename=/dev/nvme0n1
rw=randread
bs=4k
numjobs=4
iodepth=64

[sequential-write]
filename=/dev/nvme0n1
rw=write
bs=1m
numjobs=1
iodepth=8

Running fio test_io.fio with mq-deadline showed my random read P99 latency at 12ms.
Switching to kyber dropped it to 1.2ms.
Switching to none dropped it further to 0.8ms, but my sequential write throughput plummeted by 20%.

The "Aha" moment: Our application wasn't just doing reads; it was doing a lot of small, fragmented writes. none was forcing the drive to handle every tiny write individually. kyber was the winner because it merged those writes just enough to keep the drive happy without blocking the reads.

The Hidden Knob: nr_requests

While I was digging through /sys/block/nvme0n1/queue/, I found the real reason for the 200ms spikes. It wasn't just the scheduler—it was the queue depth.

Look at this file:

cat /sys/block/nvme0n1/queue/nr_requests

This defines how many requests the block layer will buffer before it forces the application to wait (block). If this number is too high, you get "Bufferbloat." Your requests sit in a massive software queue inside the kernel, waiting for their turn. To the application, this looks like disk latency. To the disk, it looks like it's idling because the kernel hasn't handed the work over yet.

For NVMe, you often want a *smaller* nr_requests than you'd think. You want the pressure to stay on the application or the hardware, not the kernel's intermediate buffer. We dropped ours from 1024 to 256, and the "mystery spikes" vanished.
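The arithmetic behind that is just Little's law: a request landing at the back of a full software queue waits roughly the queue depth divided by the drive's service rate. A back-of-the-envelope helper (the 100k IOPS figure is an assumption for illustration, not a measurement from our cluster):

```shell
# queue_wait_ms <queued_requests> <iops>
# Approximate queueing delay (ms) for the last request in a full queue,
# assuming the device drains it at a steady rate.
queue_wait_ms() {
  awk -v q="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", q / iops * 1000 }'
}

queue_wait_ms 1024 100000   # a 1024-deep queue at 100k IOPS: ~10 ms of pure queueing
queue_wait_ms 256  100000   # at 256: ~2.5 ms
```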

Automating the Fix with udev

You don't want to manually echo values into sysfs every time you reboot. The "Linux way" to handle this is a udev rule. This ensures that whenever a drive of a certain type is detected, the right settings are applied.

Create a file at /etc/udev/rules.d/60-scheduler.rules:

# Set kyber for NVMe drives
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="kyber"

# Increase the read-ahead for better sequential performance if needed
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{bdi/read_ahead_kb}="4096"

# Tighten the request queue to prevent bufferbloat
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/nr_requests}="256"

Reload your rules:

sudo udevadm control --reload-rules
sudo udevadm trigger
# Optionally dry-run one device to confirm the rule matches:
# sudo udevadm test /sys/block/nvme0n1 2>&1 | grep -i sched

When "None" is Actually Better

I should clarify: I chose kyber for that specific PostgreSQL workload. If you are running a purely NoSQL workload (like ScyllaDB or Cassandra) that manages its own I/O scheduling and uses io_uring or AIO, the Linux scheduler is your enemy.

Those databases are designed to bypass the kernel's opinions entirely. In those cases, none is the only correct answer. If you give ScyllaDB a scheduler, you're essentially putting two chefs in a tiny kitchen. They will spend more time arguing about who gets the frying pan than actually cooking.

Visualizing the Impact with iostat -x

If you want to verify your changes are working without fancy BPF tools, use iostat -x 1. Pay attention to two columns: aqu-sz (average queue size; older sysstat versions call it avgqu-sz) and %util.

If %util is pegged at 100% but your aqu-sz is low, the drive itself is the bottleneck.
If %util is low (e.g., 10%) but your aqu-sz is high (e.g., 50+), the kernel is the bottleneck: the scheduler is holding onto I/O for too long. (One caveat: %util under-counts saturation on highly parallel NVMe devices, so treat it as a hint rather than a verdict.)

# Example of a healthy NVMe system under load
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.00    0.00    2.50    0.10    0.00   92.40

Device            r/s     w/s     rkB/s     wkB/s   await r_await w_await  aqu-sz  %util
nvme0n1       1500.00  500.00   6000.00  20000.00    0.15    0.10    0.30    2.50   8.00

Notice the await is 0.15ms. That’s what health looks like. If that await starts climbing while %util stays low, it’s time to look at your scheduler.
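To watch just those two columns without eyeballing the full table, a small awk filter works. This assumes a recent sysstat layout where aqu-sz and %util are the last two columns; check your own header line before trusting it:

```shell
# Read iostat -x output on stdin and print aqu-sz and %util for one device.
iostat_queue() {
  awk -v d="$1" '$1 == d { printf "aqu-sz=%s util=%s\n", $(NF-1), $NF }'
}

# Usage: iostat -x 1 | iostat_queue nvme0n1
```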

Lessons from the Trenches

The biggest mistake I made that night was assuming that because I bought expensive hardware, I didn't need to care about the software stack beneath it.

Here is the hierarchy of I/O performance:
1. Hardware: Can the NAND keep up?
2. Driver: Is the NVMe driver configured correctly? (Usually, yes).
3. The Block Layer (The Scheduler): Is the kernel reordering things in a way that helps or hurts?
4. The Filesystem: Is XFS or EXT4 adding its own locking contention?
5. The Application: Are you sending 1-byte writes like a madman?

Usually, we optimize #1 and #5 and ignore everything in the middle. But as storage gets faster, the "middle" becomes the most common source of tail latency.

If you’re seeing spikes, don’t just blame the SSD. The kernel might be trying to be "helpful" by reordering your requests, and in the world of NVMe, sometimes the best help the kernel can give is to simply step aside.

Next time your drive seems to ignore you, check the scheduler. It might just be overthinking things.