
3 Things I Wish I Knew About Linux Control Groups Before Setting My Container Memory Limits

Stop relying on 'free' and 'top'—uncover why your containerized apps are being OOM-killed despite having 'plenty' of reported headroom.


Why does a container with a 2GB limit die with an Out-Of-Memory (OOM) error when your monitoring tools insist it was only using 800MB?

I spent a week chasing this ghost. On the surface, the math didn’t add up. My Prometheus dashboard showed a flat line of memory usage well below the threshold, yet the kernel logs were screaming Memory cgroup out of memory: Killed process. It turns out that everything I thought I knew about "used memory" was a half-truth inherited from the era of bare-metal servers.

When you move into the world of Linux Control Groups (cgroups), the rules of physics—or at least the rules of accounting—change. If you are setting container limits based on what top tells you, you are setting yourself up for a 3:00 AM page.

Here are the three things I wish I’d known before I started carving up my RAM with cgroups.

1. The /proc Hallucination: Your Tools are Lying to You

The first thing you learn the hard way is that standard Linux utilities like free, top, and htop were never designed for containers. They are relics of a time when a process had a fairly transparent view of the hardware it ran on.

If you exec into a Docker container or a Kubernetes pod and run free -m, you’ll likely see the total memory of the underlying host node, not the limit you assigned to the container.

# Inside a container restricted to 512MB on a 32GB node
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31892       14201        2104        1204       15586       16012

This happens because these tools read from /proc/meminfo. In most container runtimes, /proc is not fully virtualized. It represents the host's perspective. Your application—especially if it’s running on a runtime like the JVM or Node.js—might see that 31GB of "total memory" and decide it’s perfectly safe to allocate a massive heap, completely unaware that the cgroup controller is standing over it with a guillotine at the 512MB mark.
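Some runtimes can at least be told to size themselves sensibly. The commands below are illustrations rather than a recipe (app.jar and server.js are hypothetical names): modern JVMs read the container limit themselves, while Node needs an explicit cap.

```shell
# JDK 10+ enables -XX:+UseContainerSupport by default, so the JVM
# reads the cgroup limit; MaxRAMPercentage sizes the heap from that
# limit instead of the host's total RAM.
java -XX:MaxRAMPercentage=75.0 -jar app.jar

# Node is not cgroup-aware: the old-space cap (in MB) must be set
# by hand, here leaving headroom inside a 512MB container.
node --max-old-space-size=400 server.js
```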

The Source of Truth

To see what the kernel actually sees regarding your container, you have to bypass the legacy tools and go straight to the cgroup filesystem.

On a modern system using cgroup v2 (which is the standard on almost any recent distro), the stats live in /sys/fs/cgroup/.

# Get the current usage in bytes
cat /sys/fs/cgroup/memory.current

# Get the hard limit in bytes
cat /sys/fs/cgroup/memory.max

If you’re still on cgroup v1, the files live under /sys/fs/cgroup/memory/ instead: memory.usage_in_bytes for current usage and memory.limit_in_bytes for the hard limit.

The "Why" matters here: a cgroup is an accounting boundary, not a virtualization layer. The application still sees the host's vast field of memory, but every page it allocates is charged to its group, and the moment that charge hits the cgroup limit, the kernel ignores the "free" memory on the host and triggers the OOM killer specifically for that group.

The Lesson: Never trust a tool that reads from /proc/meminfo inside a container unless it has been specifically patched (like newer versions of lxcfs) to be cgroup-aware. Use the /sys/fs/cgroup files for debugging.
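If your scripts need to work on both hierarchies, the fallback logic fits in a few lines of shell. This is a sketch, not an official interface: the function name is mine, and the root path is a parameter so you can point it at a specific container's cgroup directory (or a test fixture).

```shell
# Sketch: report the effective memory limit, whichever cgroup
# version is mounted. "max" (v2) and the huge v1 sentinel both
# mean "no limit". The path argument defaults to the usual mount.
cgroup_mem_limit() {
    local root="${1:-/sys/fs/cgroup}"
    local limit
    if [ -f "$root/memory.max" ]; then
        # cgroup v2
        limit=$(cat "$root/memory.max")
        if [ "$limit" = "max" ]; then echo "unlimited"; else echo "$limit"; fi
    elif [ -f "$root/memory/memory.limit_in_bytes" ]; then
        # cgroup v1: an enormous sentinel (close to 2^63) means unlimited
        limit=$(cat "$root/memory/memory.limit_in_bytes")
        if [ "$limit" -gt $((1 << 60)) ]; then echo "unlimited"; else echo "$limit"; fi
    else
        echo "unknown"
    fi
}
```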

2. The Page Cache is the "Silent" Memory Consumer

This is the one that actually broke my production environment. I had a Go-based service that processed large CSV files. The resident set size (RSS)—the "actual" memory used by the application code—was tiny, maybe 100MB. I set a limit of 512MB. The container kept dying.

It turns out that the cgroup memory limit includes the Page Cache.

When your application reads a file from disk, the Linux kernel is "helpful." It keeps the contents of that file in RAM (the Page Cache) in case you want to read it again soon. In a traditional OS environment, this is great; the kernel will just evict those pages if it needs the RAM for something else.

But inside a cgroup, the kernel treats that cache as part of the group’s "consumption."

How to Trigger a Page Cache OOM

You can test this yourself. Create a cgroup (using systemd for simplicity) with a strict limit, then read a file larger than that limit.

# Create a temporary cgroup with a 100MB limit
systemd-run --user --scope -p MemoryMax=100M /bin/bash

# Inside that shell, read a 500MB file and throw the bytes away.
# Nothing is stored in the app's variables, but every page passes
# through the Page Cache, so memory.current climbs to the limit.
cat large_500mb_file.bin > /dev/null

Watch memory.current while this runs and you'll see it pinned at the limit. A clean read like this usually survives on its own, because the kernel can evict clean pages on demand.

If the kernel can't reclaim that cache fast enough—or if the application is writing data to disk faster than the disk can flush it (causing "dirty" pages to pile up)—the cgroup hits its limit and the OOM killer strikes.
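You don't have to guess whether the OOM killer fired for your group: cgroup v2 keeps per-group counters in memory.events. Here is a small helper to pull out the oom_kill count; the function name is my own, and the path is parameterized so it works against any group (or a fixture).

```shell
# Count how many times the OOM killer has fired inside a cgroup.
# memory.events (cgroup v2) holds one "key value" pair per line;
# the oom_kill counter increments every time a process in the
# group is killed by the OOM killer.
oom_kill_count() {
    local cg=$1
    awk '$1 == "oom_kill" {print $2}' "$cg/memory.events"
}
```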

Checking for Cache Bloat

To see how much of your limit is being "wasted" on cache, look at memory.stat:

$ grep -wE "anon|file|inactive_file" /sys/fs/cgroup/memory.stat
anon 52428800
file 400000000
inactive_file 350000000

- anon (Anonymous memory): This is the memory your app actually allocated (the heap, the stack). This won't go away unless the app frees it.
- file (Page Cache): This is memory used to mirror files on disk.
- inactive_file: This is the portion of the cache that the kernel *should* be able to reclaim under pressure.

The Gotcha: If your inactive_file is high and your memory.current is near memory.max, you are in a dangerous zone. If the kernel spends too much time trying to reclaim those pages and can't keep up with the allocation rate, it stops being polite and kills the process. This is often called "thrashing."
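If you catch a group sitting in that danger zone, newer kernels let you ask for reclaim explicitly instead of waiting for the limit to force it: writing a byte count to memory.reclaim (cgroup v2, kernel 5.19+) tells the kernel to try to reclaim roughly that much from the group. A minimal wrapper, with the path parameterized and the function name my own:

```shell
# Ask the kernel to proactively reclaim memory (mostly page cache)
# from a cgroup. Requires cgroup v2 on a 5.19+ kernel; on older
# kernels the memory.reclaim file simply does not exist.
reclaim_bytes() {
    local cg=$1 bytes=$2
    if [ ! -w "$cg/memory.reclaim" ]; then
        echo "memory.reclaim not available in $cg" >&2
        return 1
    fi
    echo "$bytes" > "$cg/memory.reclaim"
}
```

Note that the write can fail if the kernel cannot reclaim the full amount; treat it as a request, not a guarantee.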

3. Cgroup v2 Changed the Rules (and You Might Not Know It)

For a long time, we lived in the world of cgroup v1. It was messy. Every resource (CPU, Memory, I/O) had its own separate tree. In 2016, cgroup v2 was declared stable in the kernel (4.5), and it’s now the default in Ubuntu 22.04+, RHEL 9+, and Fedora (since Fedora 31).

If you are using scripts or monitoring agents written five years ago, they are likely looking at the wrong files. But the changes aren't just in file paths; the behavior of how limits are enforced has shifted.

Soft Limits vs. Hard Limits

In v1, we had memory.limit_in_bytes (hard limit) and memory.soft_limit_in_bytes. The soft limit was notoriously flaky; it only kicked in when the entire host was under memory pressure.

In v2, the philosophy changed to a "high/max" model:
- `memory.max` (The Hard Limit): If you hit this, you are OOM-killed immediately. No questions asked.
- `memory.high` (The Throttling Limit): This is the "soft" limit that actually works. If a cgroup exceeds memory.high, the kernel puts the processes in that group into a "penalty box." It slows them down and aggressively tries to reclaim memory (like the page cache) to bring the usage back below the high mark.

Why this matters for your app

If you want to avoid OOM kills, you should set memory.high slightly lower than memory.max. This gives the kernel a "buffer zone" to clean up the page cache or throttle the application before it reaches for the nuclear option.

# Setting a 1GB hard limit but starting reclamation at 800MB
echo 1G > /sys/fs/cgroup/my_app/memory.max
echo 800M > /sys/fs/cgroup/my_app/memory.high

If you see your app's performance suddenly tanking while memory usage is hovering at exactly 800MB, you know the kernel is throttling you to save you from a crash. This is infinitely easier to debug than a sudden death.
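If you manage raw cgroups, that buffer can be applied mechanically. Here is a sketch that sets memory.high to a percentage of an existing memory.max; the 80% default is my own rule of thumb, not a kernel recommendation, and the path is a parameter so the arithmetic is easy to verify.

```shell
# Give a cgroup v2 group a reclaim buffer below its hard limit:
# read memory.max and write memory.high at a percentage of it.
set_high_watermark() {
    local cg=$1 pct=${2:-80}
    local max
    max=$(cat "$cg/memory.max")
    if [ "$max" = "max" ]; then
        echo "no hard limit set for $cg; nothing to do" >&2
        return 1
    fi
    echo $(( max * pct / 100 )) > "$cg/memory.high"
}
```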

Putting it into Practice: The "Working Set Size" Calculation

So, if free is a lie and the Page Cache is a trap, how do you actually calculate a safe memory limit?

The metric you are looking for is the Working Set Size (WSS).
In the cgroup world, WSS is generally calculated as:
Total Memory Usage - Inactive File Cache

Here is a quick script to calculate the "true" memory pressure of a cgroup v2 container:

#!/bin/bash
# Calculate the Working Set Size of a cgroup

CGROUP_PATH=$1

if [ ! -d "$CGROUP_PATH" ]; then
    echo "Usage: $0 /sys/fs/cgroup/your_container_path"
    exit 1
fi

CURRENT=$(cat "$CGROUP_PATH/memory.current")
INACTIVE_FILE=$(awk '$1 == "inactive_file" {print $2}' "$CGROUP_PATH/memory.stat")

# WSS = Current - Inactive File
WSS=$((CURRENT - INACTIVE_FILE))

echo "Current Usage: $((CURRENT / 1024 / 1024)) MB"
echo "Inactive Cache: $((INACTIVE_FILE / 1024 / 1024)) MB"
echo "Working Set Size: $((WSS / 1024 / 1024)) MB"

When I started using this WSS calculation to set my Kubernetes requests and limits, the OOM kills stopped. I realized that my apps weren't leaking memory; they were just busy reading logs and data files, and I hadn't accounted for the kernel's desire to cache that data.

The Edge Case: Swap

One final "I wish I knew" moment: in cgroup v2, RAM and swap are accounted separately. memory.max only tracks RAM; swap has its own knob, memory.swap.max.

The kernel's default for memory.swap.max is max (unlimited), but whoever creates the cgroup can override it: container runtimes often do, and Kubernetes has traditionally run with swap disabled entirely. If memory.swap.max is 0, your app will be OOM-killed even if there is plenty of swap available on the machine. No safety net.
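A quick way to audit this is to read the swap knobs next to the RAM ones. The helper below just reports them; the file names are real cgroup v2 interfaces, the function name is my own.

```shell
# Report a cgroup v2 group's swap configuration: the limit and
# current use. A memory.swap.max of 0 means the group cannot swap.
swap_status() {
    local cg=$1
    local max cur
    max=$(cat "$cg/memory.swap.max" 2>/dev/null || echo "n/a")
    cur=$(cat "$cg/memory.swap.current" 2>/dev/null || echo "n/a")
    echo "swap limit: $max, swap in use: $cur"
}
```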

Summary

Setting memory limits isn't just about knowing how much RAM your code takes. It's about understanding the contract between the kernel and the cgroup.

1. Ignore the host tools: Look at /sys/fs/cgroup to see what the kernel is actually enforcing.
2. Account for the Page Cache: Your application's "usage" includes every file it touches. If you do heavy I/O, you need a larger memory buffer.
3. Use `memory.high`: Give the kernel a chance to throttle and reclaim memory before it reaches for the OOM killer.

The next time your container disappears into the night, don't look at top. Look at memory.stat, subtract the inactive_file, and you'll likely find exactly where your missing megabytes went.