
Anatomy of a SIGKILL: How the Linux Kernel Ranks Your Containers for Execution
Stop guessing why your pods are crashing and start understanding the oom_score heuristics and pressure stall signals that govern process survival.
Your Linux kernel doesn’t care about your application’s uptime, your quarterly KPIs, or the fact that your database is in the middle of a massive transaction. When the system runs out of memory, the kernel stops being a resource manager and starts acting like a digital executioner. It has a specific hit list, a cold-blooded ranking system, and a shotgun called SIGKILL.
If you’ve ever had a Kubernetes pod disappear with an OOMKilled status, you’ve been on the receiving end of a very specific set of kernel heuristics. Understanding these heuristics is the difference between building a resilient system and playing Russian Roulette with your production environment.
The Myth of "Free" Memory
We’ve been conditioned to look at free -m and panic when the "free" column is low. In Linux, that’s usually a misunderstanding. The kernel wants to use as much RAM as possible for caching and buffering. RAM that isn't being used for something is a wasted resource.
The problem arises when the *available* memory—RAM that can be reclaimed from caches without hurting performance—hits a critical floor. At this point, the kernel invokes the Out of Memory (OOM) Killer.
The OOM Killer's job is simple but brutal: kill the process that is using the most memory but is the least "important" to the system’s overall survival. To do this, it assigns every process an oom_score.
The Anatomy of oom_score
Every process on your system has a score from 0 to 1000. You can see this right now in the /proc filesystem. Pick a high-memory process (like Chrome or a Java app) and check its score:
# Find the PID of a process (e.g., node)
PID=$(pgrep -f node | head -n 1)
# Check its OOM score
cat /proc/$PID/oom_score

The higher the score, the more likely the kernel is to kill that process when things get tight.
The kernel calculates this score primarily from the share of allowed memory the process is consuming (its resident set, plus swap entries and page-table pages). If a process uses 100% of the available RAM, it gets a score of 1000. If it uses 50%, it gets 500.
But there’s a massive caveat: the "available memory" changes depending on whether you are using Cgroups.
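As a sanity check, you can approximate the score yourself. The sketch below only counts the resident set (VmRSS) against the host's MemTotal, while the kernel's heuristic also adds swap and page-table usage (and, inside a cgroup, scores against the cgroup limit instead), so expect it to be close but not exact:

```shell
#!/bin/sh
# Approximate OOM badness: resident memory / total RAM * 1000.
# The kernel also counts swap and page-table pages, and scores
# against the cgroup limit when one applies, so this is only a
# ballpark figure.
PID=${1:-$$}

rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

est=$(( rss_kb * 1000 / total_kb ))
if [ -r "/proc/$PID/oom_score" ]; then
    echo "kernel's oom_score:  $(cat "/proc/$PID/oom_score")"
fi
echo "estimated badness:   $est"
```

Run it with no argument to score the shell itself, or pass a PID of a fat process to see the two numbers diverge as swap and page tables come into play.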
The Cgroup Factor
In a containerized world, your process isn't just limited by the host's physical RAM. It’s restricted by a memory.limit_in_bytes (Cgroups v1) or memory.max (Cgroups v2) setting. The OOM Killer is cgroup-aware. If your container hits its limit, the kernel will trigger an OOM event *specifically within that cgroup*, even if the host machine has 128GB of RAM sitting idle.
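To see what the OOM killer sees for your container, you can read the cgroup's own accounting. This sketch assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup (the v1 equivalents live under the memory controller), and falls back to a canned sample so the parsing is visible outside a cgroup:

```shell
#!/bin/sh
# Count cgroup-local OOM kills from memory.events (cgroup v2).
# EVENTS_FILE assumes the v2 hierarchy is mounted at /sys/fs/cgroup;
# adjust the path to inspect a specific container's cgroup.
EVENTS_FILE=/sys/fs/cgroup/memory.events

if [ -r "$EVENTS_FILE" ]; then
    events=$(cat "$EVENTS_FILE")
else
    # Sample output so the parsing below is visible outside a v2 cgroup
    events="low 0
high 12
max 12
oom 3
oom_kill 3"
fi

# "oom" counts times the cgroup hit its limit with nothing reclaimable;
# "oom_kill" counts processes actually killed inside this cgroup.
oom_kills=$(printf '%s\n' "$events" | awk '$1 == "oom_kill" {print $2}')
echo "OOM kills inside this cgroup: ${oom_kills:-0}"
```

A non-zero oom_kill counter on a host with plenty of free RAM is the telltale sign of a cgroup-local OOM rather than a system-wide one.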
Adjusting the Hit List: oom_score_adj
The kernel provides a knob to manually influence the hit list: /proc/[pid]/oom_score_adj. This value ranges from -1000 to 1000.
- -1000: Completely exempt from the OOM Killer (used by sshd or systemd).
- 1000: "Kill me first."
- 0: Neutral (default).
If you want to protect a specific process, you can write to this file. I've used this in the past for custom monitoring agents that *must* stay alive to report why other things are dying.
# Protect my critical process
echo -1000 > /proc/$PID/oom_score_adj

Note: You need root privileges to *lower* the score (make it less likely to be killed), but any process can *raise* its own score to make itself more likely to be killed.
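For a quick experiment, you can raise your own shell's score. This is a sketch against /proc/self, and the write is guarded because raising is unprivileged while lowering back down requires CAP_SYS_RESOURCE:

```shell
#!/bin/sh
# Raising your own oom_score_adj never needs privileges; lowering it
# back down requires CAP_SYS_RESOURCE. Writing to /proc/self means
# this only affects the current shell (children inherit the value).
echo "before: $(cat /proc/self/oom_score_adj)"
if echo 500 > /proc/self/oom_score_adj 2>/dev/null; then
    echo "after:  $(cat /proc/self/oom_score_adj)"
else
    echo "write failed (read-only /proc?)"
fi
```

Because children inherit the value across fork, this is also a cheap way to mark an entire process tree as expendable before launching it.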
How Kubernetes Hacks the Kernel
Kubernetes doesn't just let the kernel guess. It uses oom_score_adj to implement its Quality of Service (QoS) classes. This is why your pods die in a specific order.
1. Guaranteed: (Requests == Limits). These get an oom_score_adj of -997. They are the last to be killed, effectively acting like system daemons.
2. BestEffort: (No requests or limits). These get an oom_score_adj of 1000. They are the first to go.
3. Burstable: (Requests < Limits). K8s calculates the score based on the request. The formula is roughly:
1000 - (1000 * memory_request) / machine_capacity
This means a Burstable pod that requests a lot of memory is safer than a Burstable pod that requests very little.
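You can plug numbers into that formula to see the effect. The sketch below assumes a hypothetical 16 GiB node and ignores the clamping the real kubelet applies to keep Burstable pods strictly between Guaranteed (-997) and BestEffort (1000):

```shell
#!/bin/sh
# Burstable oom_score_adj sketch: 1000 - 1000 * request / capacity.
# Assumes a hypothetical 16 GiB node; the real kubelet also clamps
# the result so Burstable pods never tie with the other QoS classes.
capacity=$((16 * 1024 * 1024 * 1024))

for request_gib in 1 8; do
    request=$((request_gib * 1024 * 1024 * 1024))
    adj=$((1000 - 1000 * request / capacity))
    echo "request ${request_gib}GiB -> oom_score_adj ${adj}"
done
```

A 1 GiB request lands at 938 and an 8 GiB request at 500, which is exactly why generous requests buy Burstable pods more protection.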
Practical Example: The OOM Victim Finder
If you want to see exactly how your kernel views your processes right now, you can use this script to rank the potential victims. This is much more informative than top because it accounts for the adjustment values.
#!/bin/bash
# rank_oom.sh - See which processes the kernel wants to kill first
printf "%-10s %-10s %-10s %-20s\n" "PID" "Score" "Adj" "Command"
echo "---------------------------------------------------------"
for pid in /proc/[0-9]*; do
    PID=$(basename "$pid")
    if [ -f "$pid/oom_score" ]; then
        # Processes can exit mid-loop, so silence read errors
        SCORE=$(cat "$pid/oom_score" 2>/dev/null)
        ADJ=$(cat "$pid/oom_score_adj" 2>/dev/null)
        CMD=$(cat "$pid/comm" 2>/dev/null)
        if [ -n "$SCORE" ] && [ "$SCORE" -gt 0 ]; then
            printf "%-10s %-10s %-10s %-20s\n" "$PID" "$SCORE" "$ADJ" "$CMD"
        fi
    fi
done | sort -k2 -nr | head -n 15

Pressure Stall Information (PSI): The Early Warning System
Relying on oom_score is reactive. By the time the OOM Killer wakes up, the system is already thrashing, swapping, and likely unresponsive.
In 2018, Facebook contributed Pressure Stall Information (PSI) to the Linux kernel. This was a game-changer. Instead of looking at "how much RAM is left," PSI looks at "how much time is being wasted because we don't have enough RAM."
It measures the delay introduced when tasks have to wait for memory (e.g., waiting for page faults to be serviced).
You can read this in /proc/pressure/memory:
cat /proc/pressure/memory

Output looks like this:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
- some: Percentage of time that *some* tasks were stalled on memory.
- full: Percentage of time that *all* non-idle tasks were stalled (total system thrashing).
If full avg10 is consistently above 10, your application isn't just slow; it's starving. Modern tools like oomd (the userspace OOM killer) use these metrics to kill processes *before* the kernel OOM killer panics, allowing for a much cleaner shutdown.
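If you want to alert on this rather than eyeball it, the avg fields are easy to pull apart with awk. This sketch assumes a PSI-enabled kernel (4.20+) and falls back to a canned sample line where /proc/pressure is absent:

```shell
#!/bin/sh
# Extract the 10-second "full" memory stall average from PSI.
# Requires a kernel >= 4.20 built with CONFIG_PSI; the sample text
# below stands in for /proc/pressure/memory where PSI is missing.
if [ -r /proc/pressure/memory ]; then
    psi=$(cat /proc/pressure/memory)
else
    psi="some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=12.50 avg60=3.10 avg300=0.40 total=123456"
fi

# The "full" line means every non-idle task was stalled on memory
full_avg10=$(printf '%s\n' "$psi" | awk '/^full/ { sub("avg10=", "", $2); print $2 }')
echo "full avg10: ${full_avg10}%"
```

Drop that into a cron job or a node-exporter textfile collector and you have a starvation signal that fires long before the OOM Killer does.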
The Mystery of the "Invisible" OOM
Sometimes you check dmesg and see nothing, yet your process is gone. This usually happens in one of two scenarios:
1. The Parent Killed It
If a wrapper script or a process manager (like Node's cluster module) sees a child process consuming too much, it might send a SIGTERM or SIGKILL itself. This won't show up in kernel logs.
2. The Cgroup Limit was hit, and "OOM Kill" was disabled
In some configurations, you can set memory.oom_control (cgroup v1 only) to disable the OOM killer. Instead of killing the process, the kernel "suspends" it until memory becomes available. This is almost always a terrible idea in production, as it leads to "zombie" containers that are alive but don't respond to anything.
To find the truth, always check the kernel ring buffer:
dmesg -T | grep -i "oom"

Look for lines like Out of memory: Kill process 1234 (my_app) score 500 or sacrifice child. The "sacrifice child" part is literal—the kernel prefers to kill children to avoid destroying the parent process, but that's a rabbit hole for another day.
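One quick way to tell the two scenarios apart before reaching for dmesg is the exit status: anything killed by SIGKILL, kernel or not, exits with 137.

```shell
#!/bin/sh
# A process killed by SIGKILL exits with status 128 + 9 = 137, the
# same code Kubernetes reports for OOMKilled containers. The exit
# code only tells you the signal, not the sender: the kernel OOM
# killer logs to dmesg, a parent process sending SIGKILL does not.
status=0
sh -c 'kill -KILL $$' || status=$?
echo "exit status: $status"
```

So: exit code 137 plus a matching dmesg line means the kernel did it; 137 with a silent ring buffer points at a supervisor or wrapper script.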
Practical Strategies for Survival
How do you stop the kernel from murdering your containers?
Use "Guaranteed" QoS in Kubernetes
If your database is critical, set the requests and limits to the exact same value. This ensures an oom_score_adj of -997. It’s the closest you can get to "immortality" in a cluster.
Instrument Your Memory Allocation
Don't just measure "used" memory. Measure "Heap used" vs "RSS". In Python or Node, your garbage collector might not be returning memory to the OS as fast as you think.
// Example: Triggering GC manually in Node.js for testing
// Run with node --expose-gc index.js
if (global.gc) {
  global.gc();
} else {
  console.log('GC not exposed');
}

Profile Your Allocations
Use valgrind or memray (for Python) to find leaks. A slow leak will eventually hit any limit you set, no matter how much RAM you throw at it. The OOM Killer is patient; it will wait months for your leak to hit the limit.
Wrapping Up
The OOM Killer isn't a bug. It's the system's final defense mechanism against total hardware lockup. When your container gets a SIGKILL, the kernel isn't being mean—it's performing a triage.
By understanding how oom_score is calculated, how Kubernetes manipulates oom_score_adj, and how to monitor PSI, you can move from reactive firefighting to proactive resource management. The next time you see that OOMKilled status, don't just increase the RAM limit. Check the score, look at the pressure signals, and figure out why your process became the most "killable" thing on the system.

