
The Garbage Collector Is a Scheduling Problem
Why your application's P99 latency spikes aren't just about memory pressure, but a fundamental conflict between the language runtime and the Linux kernel scheduler.
I spent three days staring at a Prometheus dashboard, watching a Go service's P99 latency hit 200ms every ten minutes like clockwork. We had 64GB of RAM on the box, and the heap was barely touching 4GB. I’d done the standard dance: reduced allocations, tuned GOGC, and even tried manual runtime.GC() calls during idle windows. Nothing worked. The "Stop the World" (STW) pauses reported by the runtime were only 5ms, yet the client-side latency was forty times that. It felt like the laws of physics were breaking until I looked past the application code and into how the Linux kernel was actually scheduling our threads.
The realization was a turning point for me: Garbage collection isn't a memory problem. It's a scheduling problem.
When we talk about GC, we usually focus on the "Garbage" part—how we find dead objects and reclaim bytes. But the "Collector" part is what kills your performance. A collector is just another set of threads competing for CPU time, and when those threads interact with the Linux kernel's Completely Fair Scheduler (CFS), things get messy.
The Myth of the "Stop the World" Pause
We’re taught that STW pauses are the enemy. In a typical modern runtime (like Go or the JVM with ZGC/Shenandoah), the goal is to keep these pauses under a millisecond. The runtime accomplishes this by doing most of its work "concurrently"—meaning the GC threads run alongside your application threads (mutators).
But "concurrent" doesn't mean "free."
If you have a 4-core machine and your application is using 100% of all 4 cores, and then the GC decides it needs to scan the heap, it has to steal CPU cycles from your application. In a managed runtime, the GC doesn't just ask politely; it triggers a Safepoint.
To reach a safepoint, the runtime signals every running thread to stop at a known "safe" instruction. This is often implemented via a signal (like SIGURG in Go) or by flipping a memory page to be non-readable, forcing a trap.
Here is the problem: the runtime assumes that once it sends the signal, the thread will stop immediately. But the Linux kernel has its own ideas. If a thread is currently in a syscall or stuck behind a heavy kernel-side lock, it might take several milliseconds to reach that safepoint.
When the Kernel and Runtime Fight
In Linux, the CFS (Completely Fair Scheduler) manages how processes get CPU time. It uses a concept called "vruntime" (virtual runtime) to ensure every process gets its fair share.
When your GC kicks in, it spawns or wakes up background worker threads. To the Linux kernel, these are just more threads demanding time. If your application is already running close to its CPU limit (especially in a containerized environment like Kubernetes), the kernel starts "throttling" the process.
Consider this scenario in Go:
```go
package main

import "runtime"

// A simple example of something that generates high allocation pressure
// and forces the GC to work hard.
func heavyWork() {
	for {
		// Allocate a bunch of small objects to trigger GC
		data := make([]byte, 1024)
		_ = data
		// Simulate some CPU-bound work
		calculatePrime(1000)
	}
}

// calculatePrime is a stand-in for CPU-bound work: a naive primality test.
func calculatePrime(n int) bool {
	for i := 2; i*i <= n; i++ {
		if n%i == 0 {
			return false
		}
	}
	return n > 1
}

func main() {
	// We limit the number of OS threads, but the GC will still
	// try to use a fraction of our CPU quota.
	runtime.GOMAXPROCS(4)
	for i := 0; i < 100; i++ {
		go heavyWork()
	}
	select {}
}
```

If you run this inside a Kubernetes pod with a cpu: 2 limit, you are in for a bad time. The Go runtime sees 4 cores (GOMAXPROCS), but the CFS quota only allows 200ms of CPU time per 100ms period.
When the GC starts, it might spin up 2 or 3 background threads. Now you have 4 application threads + 3 GC threads = 7 threads fighting for the time equivalent of 2 cores. The kernel will eventually hit the cfs_quota_us limit and "freeze" the entire process.
If the process is frozen while it's in the middle of a "Stop the World" pause, the pause duration doesn't just reflect the GC work; it reflects the time the process spent sitting in the kernel's "penalty box." This is why your P99s spike even when your heap is small.
The Cost of Context Switching
Every time the GC wakes up a thread, or the runtime forces an application thread to stop, the kernel performs a context switch.
You can see this in action using pidstat:
```shell
# Watch context switches for a specific process every 1 second
pidstat -w -p <PID> 1
```

A context switch isn't just about saving registers. It's about cache locality. When the kernel switches from an application thread to a GC thread, the incoming task finds the L1 and L2 caches full of someone else's data. When the GC finishes scanning a segment of memory and switches back to your application code, your CPU is "cold."
This "cold cache" effect is a hidden cost of GC scheduling. Your code isn't just stopped; it's made slower for a few hundred microseconds after it restarts.
The "Thundering Herd" of Wakeups
One of the most expensive things a GC does is restarting the world. After a STW phase, the runtime tells the kernel to wake up all the threads it just paused.
If you have 1000 goroutines or Java threads, and 64 CPU cores, the kernel suddenly has a massive influx of "runnable" threads. It has to decide which threads go on which cores. This causes massive "runqueue latency"—the time a thread spends waiting for a CPU to become available.
I’ve seen cases where the GC work took 500μs, but it took the kernel another 10ms to actually get the application threads back on the CPUs because they were all fighting for the same cores.
Practical Mitigation: CPU Affinity and Pinning
If you are dealing with extreme latency requirements, you have to stop the kernel from moving your threads around. This is where CPU Affinity (or "pinning") comes in.
On Linux, you can use taskset or the sched_setaffinity syscall to lock specific threads to specific cores. In a high-performance GC environment, you ideally want your application threads pinned to specific cores and your GC workers pinned to others, though most runtimes (like Go) make this difficult because they manage their own thread pools.
However, you can at least isolate the process:
```shell
# Run your application on cores 0-3 only,
# keeping them away from other system interrupts
taskset -c 0-3 ./my_app
```

Tuning for Scheduling, Not Just Memory
When your P99s are spiking, your first instinct shouldn't be to reach for -Xmx or GOGC. It should be to look at the scheduler.
You can read /proc/sched_debug (if your kernel exposes it; on newer kernels it lives at /sys/kernel/debug/sched/debug) to see how long your tasks are waiting in the runqueue. Look for the se.wait_sum value: the total time the task has spent waiting to be scheduled.
If wait_sum is increasing rapidly during GC cycles, you don't have a memory leak; you have CPU starvation.
Step 1: Fix the Container Limits
If you are in Kubernetes, set your CPU limits and CPU requests to the same value (the Guaranteed QoS class). If limits are higher than requests, the scheduler can place you on a node where that headroom doesn't actually exist, and CFS will throttle you under contention. If the limit is lower than the runtime's perceived core count, the runtime will over-schedule threads.
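In pod-spec terms, that means something like the fragment below (the values are illustrative; the point is that requests equals limits):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```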
For Go, always use uber-go/automaxprocs to automatically set GOMAXPROCS to match your Linux container CPU quota:
```go
package main

import _ "go.uber.org/automaxprocs"

func main() {
	// Now GOMAXPROCS matches the CGroup limit,
	// preventing the runtime from trying to use cores it doesn't have.
}
```

Step 2: Transparent Huge Pages (THP)
This is a classic Linux "gotcha." The kernel tries to be helpful by grouping 4KB memory pages into 2MB "Huge Pages." While this reduces TLB (Translation Lookaside Buffer) misses, the kernel's background process for this (khugepaged) often kicks in at the exact same time the GC is touching a lot of memory.
khugepaged can take a global lock on the memory management subsystem, causing your GC threads to hang while they wait for a page allocation.
Check if it's hurting you:
cat /sys/kernel/mm/transparent_hugepage/enabledIf it's set to always, try setting it to madvise or never to see if your P99 spikes disappear.
The "Work-Stealing" Problem
Modern GCs use work-stealing algorithms. If one GC thread finishes its job early, it looks at the queues of other threads and "steals" some of their work.
In a virtualized environment (like an AWS EC2 instance), "Steal Time" can ruin this. If the underlying hypervisor takes the physical CPU away from your VM while a thread is holding a GC-related lock, every other GC thread will spin waiting for that lock.
```shell
# Check for 'st' (steal time) in top or sar
sar -u 1
```

If you see non-zero steal time, the GC scheduling conflict isn't just between your runtime and Linux; it's between Linux and the hypervisor.
A Code Example: Measuring the "Silent" Pause
If you want to prove this to yourself, you can write a small shim to measure "Real" vs "Runtime" pause time. In Go:
```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func measurePause() {
	var stats debug.GCStats
	for {
		start := time.Now()
		// This only tells us what the runtime *thinks* happened
		debug.ReadGCStats(&stats)
		// We can track the wall-clock time between iterations
		// to see if the kernel descheduled us entirely.
		time.Sleep(100 * time.Millisecond)
		elapsed := time.Since(start)
		if elapsed > 150*time.Millisecond {
			fmt.Printf("Scheduling delay detected! Slept for 100ms, but took %v\n", elapsed)
		}
	}
}
```

If elapsed is significantly higher than your sleep duration plus the reported GC STW time, you have a scheduling issue. The kernel decided your thread wasn't important enough to wake up on time.
Why Generational GC makes scheduling worse
You might think that moving to a Generational GC (like Java's G1 or the newer Go tweaks) would solve this. It helps memory pressure, but it can actually aggravate scheduling. Generational GCs rely on "Write Barriers"—extra code that runs every time you update a pointer in the heap.
```c
// Logic inside a Write Barrier (Pseudo-code)
void oop_store(Object* obj, Object** field, Object* value) {
	// Storing a young-generation pointer into an old-generation object
	// dirties the old object's card, so the next minor GC knows to scan it.
	if (is_young_gen(value) && is_old_gen(obj)) {
		dirty_card_log(obj);
	}
	*field = value;
}
```

These barriers increase the instruction count of your application. More instructions mean more CPU cycles, which brings you closer to your CFS quota. Furthermore, some GCs use "Concurrent Marking", which relies on these barriers to do the work of the GC during the application's time slice.
Essentially, you are paying for the GC using your application's scheduling priority.
The Solution Isn't More Memory
When we treat GC as a memory problem, we throw RAM at it. We increase the heap size so the GC runs less often. And that works... until it doesn't. When the GC *does* finally run on a 128GB heap, the amount of work it has to do is massive, the number of pages it touches is huge, and the likelihood of hitting a kernel scheduling bottleneck or a page-fault storm rises sharply.
Instead, start thinking about your runtime as a guest living inside the kernel's house.
1. Align your runtime's parallelism with your kernel's reality. Don't let your app think it has 16 cores if it's throttled to 4.
2. Monitor `sched_wait`. Use tools like bcc or bpftrace to look at how long threads spend on the runqueue.
3. Watch your syscalls. GC threads often need to call madvise or mmap to return memory to the OS. These syscalls can trigger kernel-side locks that stall your application threads.
The "Perfect" GC doesn't exist, because the GC doesn't own the CPU. It’s just another tenant. The sooner you stop tuning your heap and start tuning your scheduler relationship, the sooner those P99 spikes will settle down into a flat line.


