
Is Your L4 Load Balancer Blind to Your Long-Lived gRPC Streams?
An investigation into why standard Kubernetes service balancing fails to distribute persistent HTTP/2 traffic and the technical strategies for eliminating pod hotspots.
Your high-availability Kubernetes cluster is likely lying to you. You’ve scaled your deployment to ten replicas, your HPA is configured, and your Readiness probes are green, yet one single pod is screaming at 90% CPU while the other nine are idling at 5%. This isn't a fluke or a "warm-up" issue; it is a fundamental architectural mismatch between the way Kubernetes handles networking and the way gRPC communicates.
Standard Kubernetes Services are functionally blind to the reality of modern, persistent traffic. If you are running gRPC, your Layer 4 load balancer isn't actually balancing your load—it’s just pinning it.
The Layer 4 Blind Spot
To understand why your traffic is bunching up, we have to look at how kube-proxy (the component responsible for Kubernetes Services) handles a connection.
By default, a Kubernetes Service acts as a Layer 4 (L4) load balancer. It operates at the TCP/UDP level. When a client wants to talk to a Service, kube-proxy (usually via iptables or IPVS) picks a backend pod and redirects the TCP connection there. From the perspective of the L4 balancer, its job is done once the connection is established. It doesn't look at what's inside the packets.
This worked perfectly for the old world of HTTP/1.1. In that era, clients often opened a new TCP connection for every request, or at least closed them frequently. Every new connection gave the L4 balancer a fresh chance to pick a different, less-busy pod.
Then came gRPC, built on top of HTTP/2.
gRPC is designed for performance. It uses long-lived TCP connections and multiplexes thousands of requests over a single pipe. This is great for reducing handshake latency, but it’s a nightmare for L4 load balancers. Once a gRPC client connects to a pod through a standard Kubernetes Service, it stays connected to *that specific pod* forever. Even if you scale your deployment from 1 pod to 100, the existing clients won't move. They are stuck in a "sticky" relationship they never asked for.
Visualizing the "Hot Pod" Problem
Imagine you have a gRPC client and a service with 3 replicas.
1. Client A connects to my-service.
2. kube-proxy picks Pod 1. The TCP connection is established.
3. Client A sends 10,000 gRPC requests. All 10,000 go to Pod 1.
4. You notice Pod 1 is struggling, so you scale to 10 pods.
5. Client A is still sending requests over that original TCP connection. Pod 1 continues to redline, while Pods 2 through 10 receive zero traffic.
This is the "Hot Pod" problem. Your infrastructure is scaled, but your traffic is stagnant.
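The difference between per-connection and per-request balancing can be sketched with a toy simulation. This is illustrative Go, not real kube-proxy or proxy code: `distributeL4` models a backend chosen once per connection, `distributeL7` models a request-aware round-robin.

```go
package main

import "fmt"

// distributeL4 models kube-proxy's behavior: one backend is picked when
// the TCP connection is established, and every request multiplexed over
// that connection lands on the same pod.
func distributeL4(requests, pods int) []int {
	counts := make([]int, pods)
	pinned := 0 // the pod the long-lived connection got stuck to
	for i := 0; i < requests; i++ {
		counts[pinned]++
	}
	return counts
}

// distributeL7 models a request-aware (Layer 7) proxy: each gRPC request
// is routed independently, round-robin across all pods.
func distributeL7(requests, pods int) []int {
	counts := make([]int, pods)
	for i := 0; i < requests; i++ {
		counts[i%pods]++
	}
	return counts
}

func main() {
	fmt.Println("L4:", distributeL4(10000, 10)) // [10000 0 0 0 0 0 0 0 0 0]
	fmt.Println("L7:", distributeL7(10000, 10)) // [1000 1000 1000 1000 1000 1000 1000 1000 1000 1000]
}
```

Ten healthy replicas, and under L4 semantics nine of them never see a single request.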
Why "Wait and See" Isn't a Strategy
I’ve seen teams try to solve this by aggressively killing pods to force re-balancing. It’s a band-aid that creates its own set of problems—increased error rates, cold starts, and unnecessary churn. If you want to fix this, you have to move the intelligence of the load balancing from the network layer up to the application layer (Layer 7).
There are three primary ways to solve this:
1. Client-side Load Balancing (The DIY approach)
2. L7 Proxy / Ingress (The middleman approach)
3. Service Mesh (The infrastructure approach)
---
Strategy 1: Client-Side Load Balancing (The Headless Solution)
If you don't want to introduce more "middlemen" into your network, you can make the client smarter. In this scenario, the client is responsible for looking up the IP addresses of all available pods and distributing requests among them.
In Kubernetes, we do this using a Headless Service. By setting clusterIP: None in your Service manifest, Kubernetes won't give you a single virtual IP. Instead, a DNS lookup for the service name will return the A-records (IPs) of all the individual pods.
The Headless Service Manifest
```yaml
apiVersion: v1
kind: Service
metadata:
  name: billing-service-headless
spec:
  clusterIP: None  # This makes it headless
  selector:
    app: billing-service
  ports:
    - protocol: TCP
      port: 50051
      targetPort: 50051
```

Now, the client needs to actually use these IPs. Most gRPC libraries (Go, Java, C++, etc.) have built-in support for DNS-based round-robin balancing.
Example: gRPC Client-Side Balancing in Go
In Go, you don't just dial the address; you specify a "resolver" and a "balancer" configuration.
```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	_ "google.golang.org/grpc/balancer/roundrobin" // Register the round_robin balancer
)

func main() {
	// The 'dns:///' prefix tells gRPC to use the DNS resolver.
	// The service name must match the Headless Service we created.
	serviceAddr := "dns:///billing-service-headless.default.svc.cluster.local:50051"
	conn, err := grpc.Dial(
		serviceAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// This is the magic: specify the round_robin policy
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
	// Now, requests made through 'conn' will be distributed across all pod IPs
}
```

The Gotcha: Client-side balancing requires your clients to be "well-behaved." If you have clients written in five different languages, you have to ensure every single one implements the resolver logic correctly. It also means your clients must re-resolve DNS often enough to notice pods scaling up or down; a stale DNS cache means a stale pod list.
---
Strategy 2: The L7 Proxy (The Envoy Pattern)
If you can't or don't want to change your client code, you need a proxy that understands HTTP/2. This proxy sits between the client and the pods. It terminates the gRPC connection from the client and opens new connections to the backend pods. Because it works at Layer 7, it can see individual gRPC requests and route them one-by-one.
Envoy is the industry standard for this. It was built specifically to solve the "long-lived connection" problem of modern microservices.
You can deploy Envoy as a standalone "edge" load balancer or as a gateway. When a request comes in, Envoy looks at its internal load balancing pool and sends the request to the pod with the fewest active requests or via a simple round-robin.
A Minimal Envoy Configuration for gRPC
```yaml
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                http2_protocol_options: {}  # This enables HTTP/2 support
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: billing_service_cluster }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: billing_service_cluster
      connect_timeout: 0.25s
      type: STRICT_DNS  # Envoy will resolve the DNS of the headless service
      lb_policy: ROUND_ROBIN
      http2_protocol_options: {}
      load_assignment:
        cluster_name: billing_service_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: billing-service-headless
                      port_value: 50051
```

This configuration tells Envoy: "Listen on port 10000, treat everything as HTTP/2, and balance it across the IPs found in billing-service-headless."
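To actually run this in-cluster, a common pattern is a small Envoy Deployment that mounts the config from a ConfigMap, fronted by a regular Service for clients to dial. This is a sketch; the names (`envoy-grpc-proxy`, `envoy-config`) are illustrative, and you should pin whatever Envoy image version you have tested against:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-grpc-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: envoy-grpc-proxy
  template:
    metadata:
      labels:
        app: envoy-grpc-proxy
    spec:
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.28-latest  # pin a tested version
          args: ["-c", "/etc/envoy/envoy.yaml"]
          ports:
            - containerPort: 10000
          volumeMounts:
            - name: envoy-config
              mountPath: /etc/envoy
      volumes:
        - name: envoy-config
          configMap:
            name: envoy-config  # holds the envoy.yaml shown above
```

Clients now dial the proxy's Service (a plain ClusterIP is fine here, since the hop from Envoy to the pods is the one doing per-request balancing).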
---
Strategy 3: Service Mesh (The "Hands-Off" Fix)
If your architecture is growing and you’re tired of manually configuring proxies for every single service, a Service Mesh (like Istio or Linkerd) is the natural evolution.
A Service Mesh essentially automates Strategy 2. It injects a "sidecar" proxy (usually Envoy) into every single pod. When Pod A wants to talk to Pod B, the request actually goes: Pod A -> Envoy Sidecar A -> Envoy Sidecar B -> Pod B.
Since the sidecars are L7-aware, they handle the gRPC load balancing transparently. You don't have to change your code, and you don't have to configure headless services manually.
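With Istio, for example, opting in is a single namespace label; every pod created in that namespace afterwards gets the Envoy sidecar injected automatically (the namespace name here is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: billing
  labels:
    istio-injection: enabled  # Istio's mutating webhook injects the sidecar
```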
Why wouldn't you do this? Overhead. A service mesh adds latency (usually minimal, but present) and significant operational complexity. If you only have two or three gRPC services, Istio is likely overkill. If you have 200, it’s a lifesaver.
---
The "Quick and Dirty" Fix: Connection Aging
Sometimes you're in a pinch. You have a production outage *now*, your pod is melting, and you can't rewrite the client or deploy a service mesh in the next ten minutes.
This is where MaxConnectionAge comes in.
Most gRPC server implementations allow you to set a maximum age for a connection. When the connection reaches this age, the server sends a GOAWAY frame to the client. This tells the client: "Finish what you're doing, but please open a new connection for your next request."
When the client opens that new connection, it goes back through the Kubernetes L4 balancer, which then has a chance to redirect it to a different, less-loaded pod.
Setting MaxConnectionAge in Go (Server-side)
```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	options := []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      time.Minute * 5,  // Force reconnect every 5 mins
			MaxConnectionAgeGrace: time.Second * 30, // Give 30s to wrap up active RPCs
		}),
	}
	server := grpc.NewServer(options...)
	_ = server // ... register services here, then call server.Serve(lis)
}
```

By forcing periodic reconnections, you are essentially mimicking the "frequent connection" behavior of HTTP/1.1, allowing the "dumb" L4 load balancer to do its job. (grpc-go even adds a small random jitter to MaxConnectionAge, so an entire fleet of clients doesn't reconnect in lockstep.) It’s not as elegant as true L7 balancing, but it effectively eliminates permanent hotspots.
---
How to Verify You Actually Have a Problem
Don't just take my word for it. If you suspect your gRPC traffic is unbalanced, you need to look at the metrics. If you’re using Prometheus, you can compare the request rate across different pods in the same deployment.
A simple PromQL query to identify imbalance:
```promql
sum(irate(grpc_server_handled_total{app="billing-service"}[5m])) by (kubernetes_pod_name)
```

If you see one pod handling 500 requests per second and another handling 10, you have a load balancing problem. If you see CPU usage correlating perfectly with these numbers, you’ve confirmed the L4 blind spot.
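To check that CPU correlation, a per-pod CPU query against the standard cAdvisor/kubelet metrics looks something like this (metric and label names vary with your Prometheus setup, so treat this as a template):

```promql
# CPU seconds per second, per pod; compare its shape to the request-rate query
sum(rate(container_cpu_usage_seconds_total{pod=~"billing-service-.*"}[5m])) by (pod)
```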
Summary: Choosing Your Path
The right solution depends on your scale and your control over the environment:
1. Small scale / Single language: Use Client-Side Balancing with a Headless Service. It’s lightweight and requires no extra infrastructure.
2. Legacy clients / Multi-language: Use an L7 Proxy (Envoy) as a gateway. It keeps the complexity out of your application code.
3. High complexity / Enterprise scale: Deploy a Service Mesh. It’s a "set it and forget it" solution for traffic management, albeit with a higher resource cost.
4. Emergency / Stop-gap: Implement MaxConnectionAge on your servers to force the "dumb" balancer to reshuffle the deck.
The worst thing you can do is ignore it. gRPC is a fantastic tool for high-performance systems, but if you treat it like old-school HTTP, you're leaving performance on the table and inviting a production paging event when one pod finally gives up under the pressure. Stop assuming Kubernetes is balancing your traffic just because you have a Service object. Look inside the pipe.

