
The Overcommit Bet
Explaining why the Linux kernel lies to your application about available RAM and the catastrophic cost of losing that gamble in production environments.
Your application is a liar, but only because the Linux kernel taught it how to be one. Every time you call malloc in C or allocate a large object in a higher-level language, you aren't actually grabbing physical RAM; you’re entering into a high-stakes gambling debt with the operating system.
Linux is a compulsive gambler. It bets that your application—and every other process running on the system—is lying about how much memory it actually needs. This "Overcommit" strategy is the only reason modern computing feels as fast and fluid as it does, but when the kernel loses that bet, it doesn't just go bankrupt. It starts executing processes to recoup its losses.
The Illusion of Availability
When you request memory from the kernel, the kernel checks its books, sees that it's technically out of physical RAM, and says "Sure, here’s a pointer" anyway.
This happens because of the distinction between Virtual Memory and Resident Memory. When a process asks for 1GB of RAM, the kernel allocates 1GB of *virtual address space*. No physical silicon is actually wired up to that request yet. The kernel is betting that you won’t actually touch all those bytes right away, or perhaps ever.
Let’s look at how easy it is to trick the kernel. Here is a simple C program that "allocates" 100GB of RAM on a machine that might only have 8GB.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t huge_amount = 100ULL * 1024 * 1024 * 1024; // 100 GB
    printf("Attempting to allocate 100 GB...\n");
    void *ptr = malloc(huge_amount);
    if (ptr == NULL) {
        perror("malloc failed");
        return 1;
    }
    printf("Success! I have a pointer to 100 GB at %p\n", ptr);
    printf("Checking /proc/self/status... (Press Enter)\n");
    getchar();
    return 0;
}
If you compile and run this, it will very likely succeed. (The default heuristic refuses only single requests larger than total RAM plus swap, so on a small machine you may need to shrink the number or set vm.overcommit_memory to 1.) If you check top or htop while the program is waiting for input, you’ll see something fascinating: VIRT (Virtual) will show 100G, but RES (Resident) will be nearly zero.
The kernel hasn't given you 100GB. It has given you a 100GB *IOU*.
Why the Kernel Lies
You might think this is a bug, or at least a dangerous design flaw. Why not just return NULL if the RAM isn't there?
The answer lies in how Linux processes are born. The fork() system call creates a copy of a process. In the early days, this meant literally copying every byte of memory from the parent to the child. On a modern system, if you have a 4GB Redis instance and you want to save a snapshot to disk, Redis forks itself.
If Linux didn't overcommit, that fork would require a second 4GB of free physical RAM just to *start* the process, even if the child process was only going to exist for a few seconds to write a file.
Linux uses Copy-on-Write (CoW). Both processes point to the same physical pages of RAM until one of them tries to modify a page. Only then does the kernel actually allocate a new physical page. Overcommit is the "grease" that makes CoW viable. Without it, you would constantly be running out of memory because the kernel would have to reserve the "worst-case scenario" for every single process.
The Three Modes of the Gambler
You can actually see and tune how the kernel handles this bet by looking at /proc/sys/vm/overcommit_memory. There are three possible values:
1. Mode 0 (Heuristic): The default. The kernel approves any request that isn't obviously impossible; on modern kernels the main check is that a single allocation can't exceed total RAM plus swap. It's remarkably permissive, which is usually fine, but it can be caught out when processes suddenly touch memory they were promised.
2. Mode 1 (Always): The kernel will always say yes to memory requests. It’s the "Yes Man" mode. This is useful for certain scientific applications that deal with massive, sparse arrays.
3. Mode 2 (Never): The kernel becomes a strict accountant. It will only allow total commitments up to a hard limit: Swap + (RAM × overcommit_ratio / 100), where overcommit_ratio defaults to 50.
To see your current setting:
cat /proc/sys/vm/overcommit_memory
If you want to compare the "CommitLimit" (the ceiling the kernel enforces in Mode 2) against "Committed_AS" (how much memory has currently been promised), you can check meminfo:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
If Committed_AS is significantly higher than your total RAM, you are currently living on borrowed time. You've placed the bet.
When the Bet Fails: Meet the OOM Killer
If the kernel overcommits and then every process suddenly decides to "touch" their memory at once, the kernel realizes it has promised 12GB of RAM but only has 8GB.
At this point, the kernel can’t just crash. It has to save itself. It triggers the Out Of Memory (OOM) Killer.
The OOM Killer is a janitor with a shotgun. It scans the process list and assigns a "badness" score to every process. The goal is to kill as few processes as possible to free up the most memory, while trying to avoid killing critical system tasks.
You can see the "score" of any running process:
# Replace <PID> with a real process ID
cat /proc/<PID>/oom_score
The scoring algorithm generally targets:
* Processes that are consuming a lot of RAM. On modern kernels, memory footprint (RSS, swap, and page tables) plus the oom_score_adj tunable is essentially the whole score.
* On older kernels, processes that had been running for a short time (younger processes were "cheaper" to kill).
* On older kernels, processes that weren't running as root (root-owned processes received a small score discount).
The Catastrophic Cost
In production, the OOM Killer is often a silent assassin. Your database might just disappear. Your web server might restart. If you haven't configured your logging correctly, you might not even know why.
You can check if the OOM Killer has been active by searching the kernel ring buffer:
dmesg | grep -i "killed process"
The real danger isn't just that a process dies; it's *which* process dies. The OOM Killer might kill your primary PostgreSQL instance to save a leaky Python script, simply because the database was using the most memory.
Practical Example: Forcing a Crash
To understand the transition from "Virtual" to "Resident," we need to modify our previous C code to actually *use* the memory. This is called "faulting" the pages.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Let's be more realistic: 2GB
    size_t size = 2ULL * 1024 * 1024 * 1024;
    char *buffer = malloc(size);
    if (!buffer) {
        perror("Failed to allocate");
        return 1;
    }
    printf("Allocated 2GB. Now writing to it...\n");
    // We write to every 4096th byte (the size of a standard page)
    // to force the kernel to actually provide a physical page.
    for (size_t i = 0; i < size; i += 4096) {
        buffer[i] = 'A';
        if (i % (512 * 1024 * 1024) == 0) {
            printf("Wrote %zu MB...\n", i / (1024 * 1024));
        }
    }
    printf("Done. All 2GB is now resident in RAM.\n");
    return 0;
}
If you run this on a machine with limited RAM, the malloc will succeed instantly, but the for loop will eventually slow down (as the system starts swapping) or the program will be "Killed" mid-way through. That's the moment the kernel lost the bet.
Defending Your Application
Since we know the kernel is gambling with our uptime, how do we protect ourselves?
1. Adjusting the OOM Score
If you have a mission-critical process (like a database or a monitoring agent), you can tell the kernel: "Don't kill this unless you have absolutely no other choice."
You do this by adjusting the oom_score_adj. The value ranges from -1000 (never kill) to 1000 (always kill first).
# Make a process nearly unkillable (requires root)
echo -1000 > /proc/<PID>/oom_score_adj
Many systemd service files include a setting for this: OOMScoreAdjust=-1000.
2. Disabling Overcommit (The Accountant Strategy)
For high-reliability database servers, it's often better to have a malloc fail (and handle that in the app) than to have the kernel kill the process later.
You can set overcommit to Mode 2:
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
In this mode, if an allocation would push the system's total commitments past swap plus 80% of RAM, malloc will actually return NULL. This forces you to handle the error gracefully in your code rather than letting the kernel take you out back with a shotgun.
3. Cgroups and Limits
If you are using Docker or Kubernetes, you are likely already using cgroups to limit memory. Cgroups provide a way to wall off memory so that one leaky container can't trigger the OOM Killer for the entire host. However, be aware that when a container hits its memory limit, the OOM Killer still triggers, but it's localized to the processes inside that container's cgroup.
The Edge Case: Swap
The "bet" is made even more complex by Swap. Swap is basically the kernel's credit card. When it runs out of cash (RAM), it starts charging things to the disk (Swap).
A lot of developers think "I have 64GB of RAM, I don't need Swap." This is often a mistake. Swap provides a "buffer" for the overcommit bet. It gives the kernel a place to move inactive pages so that the OOM Killer doesn't have to be triggered the microsecond physical RAM is full.
However, "Thrashing" happens when the kernel is constantly moving data between RAM and Swap because the overcommit bet was too aggressive. Your CPU usage will drop, but your Disk I/O will skyrocket, and the system will become unresponsive. Sometimes, an OOM kill is actually preferable to a system that is technically "up" but practically dead due to thrashing.
Summary
The Linux kernel overcommits memory because it prioritizes efficiency and performance over strict honesty. In most scenarios, this is the right choice. It allows for fast process creation and efficient use of sparse memory.
But as a developer or SRE, you must understand the terms of the debt:
1. `malloc` is not a guarantee. It's a promise that might be broken.
2. Resident memory is what matters. Watch your RSS (Resident Set Size), not just your VSZ (Virtual Size).
3. The OOM Killer is the house. And the house always wins. If you don't manage your memory, the kernel will manage it for you, and you won't like the results.
If you're running a critical system, don't just hope for the best. Check your oom_score, evaluate your overcommit_memory settings, and make sure you have enough swap to provide a safety net for when the kernel’s gambling streak inevitably ends.


