
Stop Trusting Your Load Average (Use Linux Pressure Stall Information to Identify Real Bottlenecks Instead)

Discover why the traditional 'load average' is a ghost metric that fails to distinguish between CPU and I/O pressure, and how Linux PSI provides the surgical precision needed to diagnose production stalls.

· 8 min read


Your Linux load average is a ghost metric. It’s a relic of the 1970s that we’ve collectively decided to keep on our dashboards despite the fact that it tells us almost nothing about the actual health of a modern system. If you see a load average of 50 on a 16-core machine, you might panic. But that number is a black box—it doesn't tell you if your CPU is screaming, if your NVMe drive is failing, or if a remote NFS mount is hanging and dragging everything into the abyss.

We need to stop treating load averages as a diagnostic tool and start using them as what they are: a "something might be wrong" light that often blinks for the wrong reasons. To actually fix performance issues in production, you need Pressure Stall Information (PSI).

The Lie of the Load Average

The problem with the traditional load average (the three numbers you see in uptime or top) is that it conflates two very different types of "busy."

In the early days of Unix, the load average only counted processes that were either running or ready to run on the CPU. Then, in the 90s, Linux made a controversial change: it started including processes in "uninterruptible sleep" (TASK_UNINTERRUPTIBLE). These are processes blocked on I/O, typically a disk read or a hung NFS call; ordinary network socket waits are usually interruptible sleep and don't count.

The result is a metric that represents a weird soup of CPU demand and disk latency.

Imagine you have a backup script that gets stuck waiting on a hung network drive. Your CPU is 99% idle, but because those processes are stuck in D state (uninterruptible sleep), your load average will climb into the hundreds. On the flip side, you could have a CPU-bound task that is slowing down your web server, but because the load average is "only" 4.0 on a 4-core machine, your monitoring system stays silent.

Load average tells you that there is a queue. It doesn't tell you what the queue is for, or more importantly, how much that queue is actually hurting your application's performance.
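You can see exactly what the load average sums by scanning process states yourself. The sketch below (assuming a Linux /proc filesystem; the function name is mine) counts tasks in R (runnable) and D (uninterruptible sleep), the two states that feed the load number:

```python
import os

def runnable_and_dstate_counts():
    """Count tasks in R (runnable) and D (uninterruptible sleep),
    the two states the Linux load average sums together."""
    counts = {"R": 0, "D": 0}
    if not os.path.isdir("/proc"):
        return counts  # not a system with a Linux-style /proc
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/stat") as f:
                # The state letter follows the "(comm)" field;
                # rsplit on ")" copes with parentheses in names.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # process exited mid-scan
        if state in counts:
            counts[state] += 1
    return counts

print(runnable_and_dstate_counts())
```

A pile of D-state tasks on an otherwise idle CPU is the hung-I/O scenario above: high load, nothing computing.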

Enter PSI: The Surgical Alternative

Pressure Stall Information (PSI) was introduced in Linux kernel 4.20 (and backported to many LTS kernels like RHEL/CentOS 8). Instead of counting the number of tasks in a queue, PSI measures the time lost due to resource shortages.

It answers the question: "What percentage of time did my tasks sit around doing nothing because they were waiting for CPU, Memory, or I/O?"

PSI categorizes pressure into three files located in /proc/pressure/:
- /proc/pressure/cpu
- /proc/pressure/memory
- /proc/pressure/io

Each of these files provides a breakdown that looks like this:

$ cat /proc/pressure/io
some avg10=0.10 avg60=0.05 avg300=0.01 total=125406
full avg10=0.00 avg60=0.00 avg300=0.00 total=54201
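The format is line-oriented and trivially parseable. A minimal sketch (the `parse_psi` helper is mine, not a standard API) that turns the file contents into nested dicts:

```python
def parse_psi(text):
    """Parse /proc/pressure/* contents into nested dicts, e.g.
    {'some': {'avg10': 0.1, ..., 'total': 125406}, 'full': {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {key: float(val) if "." in val else int(val)
                     for key, val in (f.split("=") for f in fields)}
    return out

sample = ("some avg10=0.10 avg60=0.05 avg300=0.01 total=125406\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=54201")
print(parse_psi(sample)["some"]["avg10"])  # → 0.1
```

The avg fields are percentages over 10/60/300-second windows; total is cumulative stall time in microseconds.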

"Some" vs. "Full"

This is the most critical distinction in PSI, and it's why it's superior to load average.

1. some: This indicates the percentage of time that at least one task was stalled on a resource. For CPU, this means a task was ready to run but had to wait for a core.
2. full: This indicates the percentage of time that all non-idle tasks were stalled simultaneously.

Think about it: if io.full is at 10%, it means for 10% of the last interval, your system was basically doing nothing because *every* active process was waiting on the disk. That is a catastrophic bottleneck. If io.some is high but io.full is low, it means some processes are waiting, but others are still making progress.

Note: At the system level, the cpu metric only has a meaningful some value (kernels since 5.13 print a full line too, but it is defined only for cgroups and stays zero system-wide). Why? Because as long as one process is running, the CPU is doing work. You can't have a system-wide "full" CPU stall where the CPU is simultaneously busy and everyone is waiting for it (that’s just a busy CPU).

Investigating a Mystery Stall with PSI

Let's look at a real-world scenario. You have a database node that is feeling "sluggish." The load average is 12 on an 8-core box. Is it the disk? Is it the CPU?

If you check /proc/pressure/cpu:

some avg10=45.20 avg60=30.10 avg300=15.05

This tells you that for 45% of the last 10 seconds, tasks were delayed because the CPU was oversubscribed. This is a clear compute bottleneck.

If you check /proc/pressure/io:

some avg10=0.00 avg60=0.00 avg300=0.00
full avg10=0.00 avg60=0.00 avg300=0.00

The disk is fine. The load average was high because of the CPU-runnable tasks. In the old world, you'd be guessing. In the PSI world, you know exactly where to put your engineering effort.
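This triage is easy to script. Here is a small sketch (the `pressure_summary` helper name is mine) that reads all three pressure files and pulls out a single averaging window, so the bottleneck is visible at a glance:

```python
def pressure_summary(window="avg10"):
    """Return e.g. {'cpu.some': 45.2, 'io.full': 0.0, ...} for the
    chosen window, reading /proc/pressure/{cpu,memory,io}."""
    summary = {}
    for resource in ("cpu", "memory", "io"):
        try:
            with open(f"/proc/pressure/{resource}") as f:
                lines = f.read().splitlines()
        except OSError:
            continue  # kernel without PSI enabled
        for line in lines:
            kind, *fields = line.split()
            for field in fields:
                key, _, val = field.partition("=")
                if key == window:
                    summary[f"{resource}.{kind}"] = float(val)
    return summary

print(pressure_summary())
```

Sorting that dict by value answers the "CPU or disk?" question the load average cannot.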

Programming against PSI

Reading files in /proc is fine for a quick check, but the real power of PSI comes from its ability to trigger events. You don't want to poll /proc/pressure/io every second; that’s inefficient. Instead, the Linux kernel allows you to use poll() or select() on these files to get notified when pressure exceeds a threshold.

Here is a practical Python example that uses the poll interface (via Python's select module) to monitor for memory pressure. This is incredibly useful for writing "watchdog" scripts that can kill low-priority processes or clear caches before the OOM (Out Of Memory) killer nukes your primary database.

import select
import os

# Notify when memory pressure (some) exceeds 5% of a 1-second window.
# Format: <some|full> <stall_threshold_in_usec> <window_in_usec>
THRESHOLD_SETTING = "some 50000 1000000"

def monitor_memory_pressure():
    try:
        fd = os.open("/proc/pressure/memory", os.O_RDWR | os.O_NONBLOCK)
    except FileNotFoundError:
        print("PSI not supported on this kernel.")
        return

    # Arm the trigger by writing the threshold to the descriptor.
    # This can fail on kernels older than 5.2, which expose PSI
    # averages but not userspace triggers.
    try:
        os.write(fd, THRESHOLD_SETTING.encode())
    except OSError:
        print("PSI triggers unavailable on this kernel.")
        os.close(fd)
        return

    # Create a poll object
    poller = select.poll()
    poller.register(fd, select.POLLPRI)  # PSI delivers priority events

    print(f"Monitoring memory pressure: {THRESHOLD_SETTING}")

    while True:
        events = poller.poll(5000)  # 5-second timeout
        if not events:
            print("No pressure events detected in the last 5s.")
            continue

        for _, event_type in events:
            if event_type & select.POLLERR:
                # The kernel invalidated the trigger; re-arm or bail.
                print("Trigger lost (POLLERR); exiting.")
                os.close(fd)
                return
            if event_type & select.POLLPRI:
                print("!!! ALERT: High memory pressure detected !!!")
                # This is where you'd trigger your mitigation logic
                # e.g., stop_background_jobs()

monitor_memory_pressure()

This script doesn't waste CPU cycles constantly reading and parsing strings. It sleeps until the kernel itself says, "Hey, we're hitting that 5% stall threshold you asked about."

Why Memory Pressure is the Most Dangerous

When people talk about "system slowness," 90% of the time they are actually experiencing memory pressure. Linux is very aggressive about using spare RAM for page caching. When memory gets tight, the kernel starts reclaiming those pages.

If the kernel reclaims a page that an application needs back immediately, it results in a "thrashing" cycle. The application stalls while the page is read back from disk.

In /proc/pressure/memory, the full metric is your "Panic Meter." If full is consistently above 0, your system is actively thrashing. It’s not just "using a lot of RAM"—it is literally stopping execution because it can't keep the working set in memory. Load averages won't distinguish this from a heavy calculation, but PSI will.
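A watchdog can read that Panic Meter directly. A minimal sketch (the function name is mine) that returns the memory full avg10 value, or None when PSI is unavailable:

```python
def memory_full_avg10():
    """Return the 'full' avg10 percentage from /proc/pressure/memory,
    or None if PSI is unavailable. Persistently non-zero values mean
    every non-idle task was stalled on memory at some point."""
    try:
        with open("/proc/pressure/memory") as f:
            for line in f:
                if line.startswith("full"):
                    # e.g. "full avg10=0.00 avg60=0.00 ..."
                    return float(line.split()[1].split("=")[1])
    except OSError:
        return None
    return None
```

A watchdog that sheds load whenever this climbs above a few percent acts long before the OOM killer does.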

Visualizing PSI: The Modern Way

If you are running a recent build of top (procps-ng 3.3.16+), pressing m or t cycles the memory and CPU summary display modes; depending on your version and kernel, pressure information may be surfaced there as well.

Better yet, use htop. Since version 3.0 you can go to Setup (F2) -> Meters and add CPU/Memory/IO pressure gauges to the header.

For production clusters, the node_exporter for Prometheus exports PSI metrics by default. You can build a Grafana dashboard that shows "Stall Percentage" instead of "Load Average." This changes the conversation from "The load is 20, is that bad?" to "The application is stalling on I/O 15% of the time."

Checking for Support

Before you go off and rewrite your monitoring stack, check if your system supports PSI. You can do this with a simple grep:

grep CONFIG_PSI /boot/config-$(uname -r)

If it says y, you're good. If it's not there, you might be on an older kernel (pre-4.20).

One Gotcha: Some distributions (Ubuntu among them, depending on release) build the kernel with CONFIG_PSI_DEFAULT_DISABLED=y: the code is compiled in but switched off to save a tiny bit of overhead. In that case, add psi=1 to your kernel boot parameters in /etc/default/grub and run update-grub.
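There is also a simpler runtime check: the kernel only creates the files under /proc/pressure when PSI is both compiled in and enabled, so their existence is the real test. A tiny sketch (helper name is mine):

```python
import os

def psi_available():
    """True when the kernel exposes PSI: /proc/pressure/* only
    exists when PSI is built in and enabled (psi=1 if your distro
    ships it default-disabled)."""
    return all(os.path.exists(f"/proc/pressure/{r}")
               for r in ("cpu", "memory", "io"))
```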

The Verdict

The load average is a legacy tool for a legacy era. It was fine when we had one CPU and a slow spinning disk, and "busy" meant one or the other. Today, with multi-core processors, NVMe storage, and complex container orchestration, the load average is too blunt an instrument.

Pressure Stall Information is surgical. It gives you:
1. Isolation: Is it CPU, RAM, or Disk?
2. Severity: Is one process slow (some), or is the whole system deadlocked (full)?
3. Actionability: You can set kernel-level triggers to react to pressure before the system crashes.

The next time a server feels slow, ignore uptime. Open /proc/pressure/io and see if the hardware is actually keeping up with the software. Most of the time, the truth is hidden in the stalls, not the averages.