
The CFS Quota Is a Silent Performance Tax
Your containerized app is stuttering despite having plenty of CPU headroom—here is why the Linux kernel is secretly stalling your threads.
Most developers believe that setting a CPU limit of 1000m in Kubernetes gives their application the dedicated power of a single core. We treat it like a physical constraint, imagining a world where our app gets to use 100% of a processor’s cycles, and if we need more, the kernel simply caps our speed.
That belief is fundamentally wrong.
In reality, CPU limits are not a "speed cap." They are a stopwatch. The Linux Completely Fair Scheduler (CFS) doesn't just slow you down when you hit your limit; it puts your entire process into a medically induced coma for the remainder of a "period." This is why your application might show 40% CPU utilization in Grafana while your P99 latency is exploding. You aren't out of CPU; you're being throttled by a silent, invisible tax collector.
The Periodic Table of Pain
To understand why your app is stuttering, we have to look at how the kernel actually enforces these limits. Under the hood, the CFS uses two primary knobs in the cgroup hierarchy:
1. cpu.cfs_period_us: The duration of the accounting window (usually 100ms).
2. cpu.cfs_quota_us: The total amount of CPU time your container can use within that window.
If you set a limit of 0.5 cores, Kubernetes translates that to a quota of 50ms every 100ms.
Here is the kicker: that 50ms is global for all threads in your container. If you have a multi-threaded Go or Java application running on a 32-core machine, those threads are all reaching into the same 50ms cookie jar. If 10 threads each work for 5ms at the exact same moment, you have exhausted your 50ms quota in exactly 5ms.
For the remaining 95ms of that period, the kernel deschedules your threads. They aren't "running slowly." They are stopped.
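The translation from a Kubernetes limit to these two knobs is simple proportional arithmetic. As a sketch (the helper function is mine, not kubelet code):

```go
package main

import "fmt"

// quotaFromMillicores converts a Kubernetes-style CPU limit (in millicores)
// into a cfs_quota_us value for a given cfs_period_us window.
// 1000m == one full core == a quota equal to the period.
func quotaFromMillicores(millicores, periodUs int64) int64 {
	return millicores * periodUs / 1000
}

func main() {
	periodUs := int64(100000) // 100ms, the default accounting window
	fmt.Println(quotaFromMillicores(500, periodUs))  // 0.5 cores -> 50000us (50ms)
	fmt.Println(quotaFromMillicores(2000, periodUs)) // 2 cores   -> 200000us (200ms)
}
```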
Seeing the Tax in Action
You can see this happening on any Linux machine running Docker or Kubernetes. You just need to know where to look. The kernel keeps a ledger of its "arrests" in the cgroup filesystem.
If you’re on cgroup v1, check here:

```bash
cat /sys/fs/cgroup/cpu/cpu.stat
```

If you’re on cgroup v2 (modern distros and Kubernetes nodes):

```bash
cat /sys/fs/cgroup/cpu.stat
```

You'll see output like this:

```
nr_periods 104520
nr_throttled 12430
throttled_time 752304921837
```

If nr_throttled is increasing, you are paying the tax. The throttled_time is in nanoseconds (on cgroup v2 the equivalent field is throttled_usec, in microseconds). In the example above, that's roughly 752 seconds of total wall-clock time where the application's threads were ready to run but the kernel refused to schedule them.
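A quick way to turn that ledger into a number you can alert on is the fraction of periods that ended in an arrest. A minimal sketch in Go, parsing v1-style output (on cgroup v2 the time field is throttled_usec, in microseconds):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat pulls the counters out of a cgroup v1 cpu.stat blob
// into a name -> value map.
func parseCPUStat(stat string) map[string]int64 {
	out := make(map[string]int64)
	for _, line := range strings.Split(strings.TrimSpace(stat), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	sample := "nr_periods 104520\nnr_throttled 12430\nthrottled_time 752304921837"
	s := parseCPUStat(sample)
	ratio := float64(s["nr_throttled"]) / float64(s["nr_periods"])
	fmt.Printf("throttled in %.1f%% of periods, %.0fs frozen total\n",
		ratio*100, float64(s["throttled_time"])/1e9)
}
```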
The "Averaging" Lie
The reason this catches so many teams off guard is that our monitoring tools are designed to lie to us. Prometheus scrapes metrics every 15 or 30 seconds. It calculates CPU usage as an average over that window.
Imagine your app has a 100ms quota. In the first 10ms of a period, it goes crazy and uses the full 100ms of CPU time (via 10 threads). For the next 90ms, it is throttled. To your user, the app was dead for 90ms. But to Prometheus, your CPU usage was "100% of quota," and if you only did that once a second, Prometheus would report a silky smooth 10% CPU usage.
You look at your dashboard, see 10% usage, and conclude that CPU isn't the bottleneck. You start chasing ghosts in the network stack or the database, when the culprit is sitting right in the kernel scheduler.
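The averaging arithmetic is worth making explicit. A tiny sketch (the helper name is mine) showing how a hard 90ms stall dissolves into a harmless-looking dashboard number:

```go
package main

import "fmt"

// averagedUtilization: what a scrape-window average reports, as a fraction
// of one core, when only cpuUsedMs of CPU time was burned during windowMs.
func averagedUtilization(cpuUsedMs, windowMs float64) float64 {
	return cpuUsedMs / windowMs
}

func main() {
	// One burst per second: the full 100ms quota is burned in the first
	// 10ms of a period (10 threads x 10ms), then the app is frozen for 90ms.
	fmt.Printf("dashboard shows %.0f%% CPU\n", averagedUtilization(100, 1000)*100)
	fmt.Println("but the user saw a 90ms stall inside that second")
}
```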
Why Multi-threading Makes it Worse
Modern runtimes, like Go's scheduler (governed by GOMAXPROCS) or Java's common ForkJoinPool, are "bursty" by nature. They want to finish work as fast as possible to free up resources.
Let's look at a simple scenario. You have a limit of 2 cores (2000m) and you're running on a massive 64-core node. You have a burst of 20 incoming requests. The runtime spawns 20 threads to handle them.
Each thread needs 10ms of CPU time to finish its task.
* Total CPU time needed: 200ms.
* Your Quota: 200ms per 100ms period.
In a perfect world, these 20 threads would run, consume the 200ms quota in 10ms of real-time (since they run in parallel on 20 different physical cores), and everyone would be happy.
But what if those threads need 11ms? Or what if a background GC (Garbage Collection) cycle kicks in at the same time and uses 5ms?
Suddenly, you hit the 200ms limit at the 9ms mark of the period. The kernel freezes all 20 threads. They sit there, holding locks, keeping connections open, and idling for the next 91ms. Your P99 latency just jumped from 10ms to 101ms, and you have no idea why because your "average CPU" is still extremely low.
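Setting the GC noise aside, the clean arithmetic behind that freeze can be sketched as follows (the function is my own illustration, assuming every thread runs in parallel on its own core):

```go
package main

import "fmt"

// exhaustionMs: wall-clock ms until the quota is gone when `threads`
// runnable threads each burn CPU continuously on their own core.
// The cookie jar drains at `threads` ms of quota per ms of wall time.
func exhaustionMs(quotaMs, threads float64) float64 {
	return quotaMs / threads
}

func main() {
	quotaMs := 200.0 // 2-core limit over a 100ms period
	threads := 20.0  // burst of 20 parallel request handlers
	periodMs := 100.0

	t := exhaustionMs(quotaMs, threads)
	fmt.Printf("quota gone after %.0fms; frozen for the next %.0fms\n",
		t, periodMs-t)
}
```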
A Practical Experiment: The Throttling Simulator
I wrote a small Go program to demonstrate this. It calculates prime numbers—a CPU-intensive task—but does so in short bursts to mimic a web server.
```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// work spins in a tight loop, burning CPU until the wall-clock deadline.
func work(duration time.Duration) {
	done := time.After(duration)
	for {
		select {
		case <-done:
			return
		default:
			// Just spin to consume CPU
		}
	}
}

func main() {
	// Use all available cores to maximize the burst
	cores := runtime.NumCPU()
	fmt.Printf("Running on %d cores\n", cores)
	for {
		start := time.Now()
		var wg sync.WaitGroup
		// Simulate a burst of 10 concurrent requests
		for i := 0; i < 10; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				work(20 * time.Millisecond)
			}()
		}
		wg.Wait()
		fmt.Printf("Burst took: %v\n", time.Since(start))
		// Wait for the next "second" to keep average utilization low
		time.Sleep(900 * time.Millisecond)
	}
}
```

If you run this in a container with a CPU limit of 1 (1000m), you’ll notice something strange. The "Burst took" output won't be 20ms. It will likely be closer to 120ms or 200ms. Even though each burst only needs 20ms of wall-clock time, the ten spinning goroutines demand CPU at ten times that rate, exhausting the 100ms quota within the first 10ms and forcing the app to wait out the rest of the period.
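A back-of-envelope model predicts that number. This sketch is my own and idealized (it treats the 20ms as CPU time each goroutine must receive, and assumes perfect parallelism), but it lands in the right ballpark:

```go
package main

import (
	"fmt"
	"math"
)

// modelBurstMs estimates wall-clock time for a burst under a CFS quota:
// each period grants quotaMs of CPU, consumed at `threads` ms of quota per
// ms of wall time, after which everything freezes until the period ends.
func modelBurstMs(totalCPUMs, quotaMs, periodMs, threads float64) float64 {
	fullPeriods := math.Ceil(totalCPUMs/quotaMs) - 1 // periods we stall through
	remainder := totalCPUMs - fullPeriods*quotaMs    // CPU left for the last one
	return fullPeriods*periodMs + remainder/threads
}

func main() {
	// 10 goroutines x 20ms of spinning = 200ms of CPU demand against a
	// 1-core quota (100ms per 100ms period).
	fmt.Printf("expected burst: ~%.0fms instead of 20ms\n",
		modelBurstMs(200, 100, 100, 10))
}
```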
The 2019 Kernel Bug (And Why It Still Matters)
It’s worth noting that for a long time, the CFS quota was actually broken. There was a bug in the Linux kernel where the quota wasn't being returned correctly to the global pool, leading to "unnecessary throttling."
Even if you had quota left, the kernel would sometimes throttle you anyway. This was fixed in Linux 5.4+ (and backported to many kernels), but many people's "fear" of CPU limits stems from this era.
However, even with the bug fixed, the design of CFS quota still causes the problems I'm describing today. It’s not a bug anymore; it’s the intended behavior.
Strategies to Kill the Tax
If you’re suffering from CFS throttling, you have three main paths forward. None of them are a "silver bullet," and each has trade-offs.
1. The Nuclear Option: Remove Limits
The most controversial advice in the Kubernetes world is: Don't use CPU limits.
Only use CPU requests.
* requests: Guarantees the container gets this much time and is used for scheduling.
* limits: Hard cap that triggers CFS throttling.
If you remove the limit, your container can "burst" into the unused capacity of the node. Since the CFS is a "Fair" scheduler, if other containers need that CPU, your container will be scaled back naturally based on its request weight, but it won't be hard-throttled just because it hit an arbitrary number.
*Warning: Some organizations require limits to prevent "noisy neighbors" from taking down a whole node. If a rogue process has an infinite loop, it could starve other processes if there are no limits.*
2. Runtime Awareness (The GOMAXPROCS Trick)
If you are running Go, you must use Uber's automaxprocs library.
By default, Go sees the number of cores on the host, not the container. If you are on a 64-core machine but have a 2-core limit, Go will still try to spawn 64 threads for the scheduler. This is a recipe for instant throttling.
automaxprocs looks at your cgroup limit and sets GOMAXPROCS to match.
```go
package main

import _ "go.uber.org/automaxprocs"

func main() {
	// Your app now respects the CFS quota limits
}
```

For Java, ensure you are on a modern version (JDK 11+) that is "container aware." Older versions of Java will see the total host memory and CPU count, leading to disastrous heap settings and thread pool sizes.
3. CPU Burst (Linux 5.14+)
The Linux kernel recently introduced a "Burst" feature for CFS (cpu.cfs_burst_us). This allows a container to "bank" unused quota from previous periods to use during a spike.
If your app is idle for 500ms, it can accumulate credit. When a burst of requests arrives, it can use more than its 100ms quota for a brief moment without being throttled.
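The banked credit is capped by cpu.cfs_burst_us. As a rough sketch of the budget arithmetic (my own simplification, not kernel code):

```go
package main

import "fmt"

// maxPerPeriodUs: with burst enabled, a single period can grant up to
// quota + min(banked credit, burst cap) of CPU time.
func maxPerPeriodUs(quotaUs, burstUs, bankedUs int64) int64 {
	credit := bankedUs
	if credit > burstUs {
		credit = burstUs
	}
	return quotaUs + credit
}

func main() {
	// 1-core quota (100ms per 100ms) with a 50ms burst cap, after a long
	// idle stretch banked plenty of unused quota.
	fmt.Println(maxPerPeriodUs(100000, 50000, 500000), "us available this period")
}
```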
Kubernetes does not yet expose cpu.cfs_burst_us directly; to use it, you currently have to set it on the cgroup on the node itself. What Kubernetes does let you tune is the accounting window, via the kubelet's CustomCPUCFSQuotaPeriod feature gate and a KubeletConfiguration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CustomCPUCFSQuotaPeriod: true
cpuCFSQuotaPeriod: "100ms" # the default; try reducing it
```

Reducing cfs_period_us (e.g., from 100ms to 10ms) makes the "arrests" shorter but more frequent, which often results in smoother tail latency, though it increases kernel overhead.
Identifying the Culprit
If you suspect you're being taxed, stop looking at your CPU usage percentage. Instead, set up monitoring for the container_cpu_cfs_throttled_seconds_total metric in Prometheus (exported by cAdvisor).
Here is a PromQL query to find the worst offenders in your cluster:
```promql
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod, container)
/
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, container)
```

If this ratio is high, your application is spending a significant portion of its life in a frozen state.
Final Thoughts
The CFS quota is a tool for resource isolation, but it's a blunt instrument. It assumes that work is evenly distributed over time, which is almost never true for web applications.
When you see a container stuttering, don't just throw more CPU at it. Look at the cpu.stat. Check your thread counts. If you’re running a bursty microservice, the "best" CPU limit might actually be no limit at all.
Don't let the kernel's stopwatch kill your performance. The tax is only silent if you aren't listening.


