
The Page Cache Is an Implicit Memory Layer

An exploration of how the Linux kernel manages file I/O under the hood and why your application’s memory usage is often a beautiful lie.


I remember the first time I looked at top on a brand-new database server and felt a cold spike of panic. Out of 128GB of RAM, the "free" column showed a measly 400MB, while the "buff/cache" column was swollen to nearly 100GB. I thought a memory leak was devouring the machine, but in reality, the kernel was just doing its job.

That "missing" memory wasn't gone; it was working. It was the Page Cache, a silent, implicit layer of memory that sits between your application and the disk. If you’re writing code that touches the filesystem, you aren't just interacting with an SSD or an NVMe drive—you are interacting with an incredibly sophisticated caching engine that treats your storage like it's part of your RAM.

The Great Memory Lie

When you ask a Linux process how much memory it’s using, it gives you the Resident Set Size (RSS). This is the portion of the process’s memory held in RAM. But RSS is a deceptive metric because it often ignores the massive amount of file data the kernel is holding on that process's behalf.

The Linux kernel hates wasted RAM. To a kernel developer, "free" memory is a wasted resource. If you have 64GB of RAM and your applications only need 8GB for their heaps, the kernel will use the remaining 56GB to store every byte of file data you’ve recently read from or written to the disk.

This is why, if you run cat on a 10GB log file twice, the second time is nearly instantaneous. The data didn't come from the disk; it came from the Page Cache.

How the Kernel Intercepts Your I/O

When your application calls read(), it doesn't actually go to the hardware. Instead, the CPU switches to kernel mode, and the VFS (Virtual File System) layer checks the Page Cache.

1. Check the cache: Is the requested page (usually 4KB) already in memory?
2. Cache Hit: Copy the data from kernel space to your application’s buffer. Return.
3. Cache Miss: The kernel pauses your thread, issues a request to the block device, fills a page in the cache with the disk data, and *then* copies it to your application.

This means every read() is actually a "read-through" cache operation. Writing is even more interesting: when you call write(), the kernel simply copies your data into the Page Cache and marks the page as dirty. Your write() call returns almost immediately, even though the data might not hit the physical disk for another 30 seconds.

Let’s look at a quick way to observe this behavior using a simple Python script and the time command.

# generate_data.py
# Create a 1GB dummy file, writing in 1MB chunks so we don't
# allocate the whole buffer in memory at once
chunk = b"\0" * (1024 * 1024)
with open("big_file.bin", "wb") as f:
    for _ in range(1024):
        f.write(chunk)

Now, try reading it twice:

# Clear the cache first (the redirect needs root, so use tee with sudo)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# First read: Cold cache
time cat big_file.bin > /dev/null
# real    0m1.120s (depends on your disk speed)

# Second read: Hot cache
time cat big_file.bin > /dev/null
# real    0m0.145s (pure RAM speed)

The difference is the Page Cache in action. The second run is an order of magnitude faster because the "file" is essentially sitting in DRAM.

The mmap Shortcut: When Files Become Pointers

If the Page Cache is an implicit layer, mmap is how we make it explicit. Instead of calling read() and write()—which involve copying data between kernel space and user space—mmap maps a file directly into your process’s address space.

When you mmap a file, the kernel sets up the page tables so that a range of virtual addresses points directly to the Page Cache pages for that file. No copying. No context switching for every read.

Here is a C example demonstrating how to map a file and read it as if it were a simple array:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("big_file.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat sb;
    fstat(fd, &sb);

    // Map the file into memory
    char *ptr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // We can now access the file data directly via the pointer
    // This doesn't actually load the whole file yet;
    // it triggers "page faults" that load data into the Page Cache lazily.
    printf("First byte: %d\n", ptr[0]);

    munmap(ptr, sb.st_size);
    close(fd);
    return 0;
}

This is the secret sauce behind high-performance databases like LMDB or search engines like Lucene. They don't manage their own buffers; they mmap the whole database and let the Linux kernel's Page Cache handle the heavy lifting of paging data in and out.

The "Dirty" Writeback Problem

The Page Cache is a double-edged sword. Because writes are buffered (asynchronous), an application might think its data is safe when it isn't. If the power cuts out after a write() returns but before the kernel flushes the "dirty" pages to disk, that data is gone.

You can see the current state of dirty pages by checking /proc/meminfo:

grep -E "Dirty|Writeback" /proc/meminfo

If you see Dirty climbing into the hundreds of megabytes, your system is "behind" on its chores. The kernel uses a few knobs to control this, found in /proc/sys/vm/:

* dirty_background_ratio: At what percentage of total memory should the kernel start writing data to disk in the background? (Usually 10%).
* dirty_ratio: At what percentage should the kernel block *all* new writes until some data is flushed? (Usually 20%).

If you are writing a performance-critical application (like a transaction log), you must use fsync() or fdatasync() to force the kernel to flush those specific pages to the physical media. Without it, you are just writing to RAM and hoping for the best.

Bypassing the Cache: O_DIRECT

Sometimes, the Page Cache gets in the way. If you are building a database that implements its own sophisticated caching (like Postgres or MySQL with the InnoDB Buffer Pool), the Linux Page Cache is often redundant. You end up with "double buffering"—the data lives in the database's cache *and* in the kernel's Page Cache.

To solve this, you can open a file with the O_DIRECT flag.

int fd = open("data.db", O_RDWR | O_DIRECT);

This tells the kernel: "Don't cache this. When I read or write, go straight to the disk." However, O_DIRECT is a finicky beast. It requires your memory buffers to be aligned to the disk’s block size (usually 512 or 4096 bytes). If you aren't perfectly aligned, the I/O will fail with EINVAL.

Most developers should stay away from O_DIRECT unless they are writing a storage engine. The kernel's heuristics for pre-fetching and write-buffering are almost always smarter than what you'll write by hand.

When the Kernel Reclaims the Cache

The most common question I get is: "If the Page Cache uses all my RAM, what happens when my app needs to malloc more space?"

This is the beauty of the Page Cache: it is evictable. Because the cache is mostly backed by files on disk, the kernel can "discard" a page of the cache at a moment's notice. If a page is "clean" (identical to what's on disk), the kernel just repurposes the RAM. If it's "dirty," the kernel flushes it to disk and then repurposes it.

However, this eviction isn't free. If your system is under heavy memory pressure, the kernel can spend all its time thrashing: evicting a page to make room for a heap allocation, then immediately reading the same file data back from disk because another process called read(). That lost time shows up as iowait in your CPU stats, and it's the silent killer of application performance.

Visualizing Cache Pressure with pcstat

If you want to see which files are actually occupying your RAM, there are tools like pcstat (Page Cache Stat). It uses the mincore system call to query the kernel about which pages of a file are currently resident in memory.

Imagine you have a large SQLite database. You can see how much of it is cached:

pcstat production.db
# +---------------+----------------+------------+-----------+---------+
# | Name          | Size (bytes)   | Pages      | Cached    | Percent |
# |---------------+----------------+------------+-----------+---------|
# | production.db | 1073741824     | 262144     | 262144    | 100.000 |
# +---------------+----------------+------------+-----------+---------+

If you see a critical file with a low "Percent Cached" value, you've found your performance bottleneck. You can even "warm" the cache on startup by reading files into /dev/null, ensuring they are resident before the first user request hits.

The "Swappiness" Factor

People often think vm.swappiness controls *when* the kernel swaps. That's not quite right. It actually controls the *relative cost* of reclaiming anonymous memory (your app's heap) versus reclaiming the Page Cache.

* A high swappiness (e.g., 60-100) tells the kernel: "It's okay to swap out unused app memory to keep the Page Cache large."
* A low swappiness (e.g., 1-10) tells the kernel: "Keep the app memory in RAM at all costs, even if it means we have almost no Page Cache."

In modern systems with fast NVMe drives, I usually lean toward keeping swappiness at its default (60). Trying to "disable" swap or setting swappiness to 0 often leads to the OOM killer being triggered prematurely because the kernel lost its flexibility to balance the implicit memory layer.

Practical Takeaways for Developers

Understanding the Page Cache changes how you write software. Here are the rules I live by:

1. Trust the Kernel: Don't try to build your own file-caching layer in user space unless you have a very specific reason. The kernel has 30+ years of optimization in its LRU (Least Recently Used) algorithms.
2. Monitor `vmstat`: If your application is slow, check the bi (blocks in) and bo (blocks out) columns in vmstat. If they are consistently high, you are thrashing your Page Cache, and your RAM is too small for your working set.
3. Use `fadvise`: If you know you are going to read a file sequentially, tell the kernel: posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL). This triggers the kernel to pre-fetch pages more aggressively, so they are already in the Page Cache by the time your code asks for them.
4. Be careful with `fsync`: Calling fsync() in a tight loop is the fastest way to destroy your application’s throughput. Group your writes, then sync once.

The Page Cache is the most successful abstraction in the Linux kernel. It turns the messy, slow reality of physical storage into a smooth, fast, memory-like interface. It makes our memory usage look like a lie, but it’s a lie that makes modern computing possible. Next time you see "0 bytes free," don't panic—just appreciate that the kernel is putting every single transistor to work.