The Night My Container Refused to Exit: How I Finally Mastered the Linux PID 1 Init Process

Understand why your Docker containers often ignore SIGTERM and accumulate un-reaped processes by diving into the kernel's unique rules for the init process.


You run docker stop on a container that’s been behaving perfectly in staging. You expect a clean exit in under a second. Instead, you stare at a blinking cursor for exactly ten seconds until the Docker daemon loses its patience and forcefully nukes the process from orbit with a SIGKILL. Your database connections aren't closed cleanly, your file buffers aren't flushed, and your logs look like they were cut off mid-sentence.

This isn't a bug in Docker. It’s a fundamental feature of the Linux kernel regarding Process ID (PID) 1. When you run a process inside a container, that process becomes the "init" process of its own PID namespace. And the rules for PID 1 are vastly different from the rules for any other process on your system.

The Ten-Second Stare-Down

In a standard Linux environment, if you send SIGTERM to a process, the kernel’s default behavior is to terminate that process. However, if that process has PID 1, the kernel treats it as the system's supervisor. The kernel assumes that if PID 1 dies, the whole "system" (the container) dies with it. To prevent accidental system crashes, the kernel ignores any signal sent to PID 1 unless the process has explicitly registered a handler for that signal.
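To see the contrast, here's a quick sketch (my own illustration in Python, not from any container) of the default behavior for an ordinary, non-PID-1 process: SIGTERM kills it even though no handler was ever registered.

```python
import subprocess
import sys

# An ordinary process (not PID 1) with no handler installed: the kernel's
# default action for SIGTERM is to terminate it.
p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
p.terminate()          # sends SIGTERM
p.wait()
print(p.returncode)    # negative value = killed by that signal (-15 for SIGTERM)
```

Run the same sleep loop as PID 1 in a container, and that SIGTERM would be silently discarded instead.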

When you run docker stop, Docker sends a SIGTERM. If your process is running as PID 1 and hasn't explicitly told the kernel, "Yes, I know what to do with SIGTERM," the signal is simply discarded. Docker waits ten seconds (the default timeout) for the process to exit, then gives up and sends SIGKILL (kill -9), which cannot be ignored but doesn't allow for a graceful shutdown.

The Shell Wrapper Trap

The most common reason for this behavior is the "Shell Form" vs. "Exec Form" in your Dockerfile.

Consider this Dockerfile snippet:

FROM ubuntu:22.04
COPY my_app /usr/local/bin/my_app
# This is the "Shell Form"
ENTRYPOINT /usr/local/bin/my_app

When you use the shell form, Docker doesn't execute my_app directly. It executes /bin/sh -c "/usr/local/bin/my_app". Inside the container, /bin/sh is PID 1, and my_app is a child process (PID 2 or higher).

Most shells (like sh or bash) do not forward signals to their child processes. When Docker sends SIGTERM to the container, it hits /bin/sh. The shell ignores it. It doesn't pass it down to my_app. The application keeps running, oblivious to the fact that the platform is asking it to shut down.

Demonstrating the failure

Let's look at a Python script (app.py) designed to handle signals:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("Received SIGTERM! Cleaning up...")
    time.sleep(2)
    print("Cleanup complete. Exiting.")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

print("App is running. Press Ctrl+C or send SIGTERM.")
while True:
    time.sleep(1)

If you run this in a Dockerfile using the shell form:

FROM python:3.9-slim
COPY app.py .
ENTRYPOINT python app.py

Running docker stop will take 10 seconds. But if you switch to the Exec Form (using JSON array syntax), it works instantly:

# This is the "Exec Form"
ENTRYPOINT ["python", "app.py"]

In the JSON array version, Docker executes python directly as PID 1. The Python interpreter sees the SIGTERM, triggers your signal handler, and exits gracefully.

The Zombie Apocalypse (Zombie Processes)

Ignoring signals is only half the problem. PID 1 has another vital responsibility: Reaping orphaned child processes.

In Linux, when a process dies, it doesn't immediately vanish from the process table. It enters a "Zombie" state (marked as <defunct> in ps). It stays there until its parent calls a wait() system call to collect its exit status. If the parent dies before the child, the child becomes an orphan.
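You can watch a zombie form from plain Python (a Linux-only sketch of mine; the /proc parsing is illustrative):

```python
import os
import time

# Fork a child that exits immediately, then read its state from /proc
# before the parent reaps it: 'Z' marks a zombie (<defunct>).
pid = os.fork()
if pid == 0:
    os._exit(0)                     # child dies instantly
time.sleep(0.2)                     # give the kernel time to mark it defunct
with open(f"/proc/{pid}/stat") as f:
    # /proc/<pid>/stat is "pid (comm) state ..."; state follows the last ')'
    state = f.read().rsplit(") ", 1)[1].split()[0]
print(f"child {pid} is in state {state}")   # 'Z' until the parent calls wait()
os.waitpid(pid, 0)                  # reaping frees the process-table slot
```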

The Linux kernel reparents every orphan to PID 1. It is then the duty of PID 1 to "reap" these orphans by calling wait().

Most applications (Node.js, Python, Go binaries) are not designed to be init systems. They have no logic that loops over waitpid() to clean up children they never knew they had. Over time, your container fills up with zombie processes. Zombies consume essentially no CPU or RAM, but they do occupy slots in the process table. Once you hit the PID limit (the kernel's pid_max, or a cgroup pids limit), fork() starts failing, surfacing as "Resource temporarily unavailable" or misleading "Cannot allocate memory" errors that are maddeningly difficult to debug because actual memory usage looks fine.

A Recipe for Zombies

Here is a C program that creates an orphan:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();

    if (pid > 0) {
        // Parent process: just sleeps forever
        printf("Parent (PID %d) is running...\n", getpid());
        while(1) sleep(1);
    } else if (pid == 0) {
        // Child process: exits immediately
        printf("Child (PID %d) exiting to become a zombie...\n", getpid());
        exit(0);
    }
    return 0;
}

If you run this as PID 1 in a container, that child process will remain a zombie forever. The "Parent" (your app) isn't looking for dead children, and since the Parent *is* PID 1, there is no higher authority to clean them up.
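For reference, here's a minimal sketch of the waitpid() loop an init process is expected to run, hooked to SIGCHLD (my own illustration, not production code):

```python
import os
import signal
import time

reaped = []

def reap(signum, frame):
    # Collect every exited child without blocking; more than one child
    # can be pending behind a single SIGCHLD.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return                  # no children left at all
        if pid == 0:
            return                  # children exist, but none have exited yet
        reaped.append(pid)

signal.signal(signal.SIGCHLD, reap)

child = os.fork()
if child == 0:
    os._exit(0)                     # child exits; parent's handler reaps it

time.sleep(0.5)                     # SIGCHLD fires during the sleep
print(f"reaped children: {reaped}")
```

This is exactly the chore that tools like tini take off your application's plate.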

Solving the PID 1 Problem

You shouldn't have to write custom signal handling and process reaping logic into every microservice you build. There are three standard ways to solve this.

1. The exec Trick (For Shell Scripts)

If you must use a shell script as your entrypoint (to set environment variables or download secrets), make sure the final command uses exec.

#!/bin/sh
# entrypoint.sh

export CONFIG_VAR="loaded"

# Use 'exec' to replace the shell process with the application
exec /usr/bin/my_app

The exec builtin replaces the shell's process image with the target application while keeping the same PID. The shell is gone; my_app is now PID 1 and receives signals directly.
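You can verify that exec preserves the PID without a container at all. In this sketch (mine, illustrative), the outer shell prints its own PID and then execs a second shell that prints its PID; both numbers come out identical:

```python
import subprocess

# The outer sh prints $$ (its own PID), then execs a second sh that prints $$.
# Because exec replaces the process image in place, both numbers match.
out = subprocess.check_output(
    ["sh", "-c", "echo $$; exec sh -c 'echo $$'"], text=True
).split()
print(out)          # e.g. ['41234', '41234'] - same PID before and after exec
```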

2. Using an Init Daemon (tini)

The most robust solution is to use a "micro-init" daemon. tini is the most popular choice (it’s actually baked into Docker now). Tini's sole job is to be PID 1. It handles signals (forwarding them to your app) and it aggressively reaps zombies.

In your Dockerfile:

# Add Tini
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

ENTRYPOINT ["/tini", "--", "/usr/bin/my_app"]

Now, the process tree looks like this:
- PID 1: tini
- PID 2: my_app

When docker stop sends SIGTERM, tini catches it and immediately sends it to my_app. If my_app spawns some weird sub-processes that get orphaned, tini will clean them up.
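Here's a toy sketch of the idea in Python (purely illustrative; tini itself is a small C program and handles many more edge cases): spawn the app as a child, forward incoming signals to it, and reap it when it dies.

```python
import os
import signal
import subprocess

# Pretend we are PID 1: run the real app as a child process.
child = subprocess.Popen(["sleep", "30"])

def forward(signum, frame):
    child.send_signal(signum)       # pass SIGTERM straight down to the app

signal.signal(signal.SIGTERM, forward)

# Simulate 'docker stop' by delivering SIGTERM to ourselves.
os.kill(os.getpid(), signal.SIGTERM)

child.wait()                        # reap the child; no zombie left behind
print("child exit code:", child.returncode)   # -15: terminated by SIGTERM
```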

3. The Docker --init Flag

If you don't want to modify your Dockerfile, you can use the --init flag at runtime.

docker run --init my-image

This tells Docker to mount its bundled init binary (tini, shipped as docker-init) into the container and run it as PID 1, automatically wrapping your ENTRYPOINT. It's the easiest fix, but it's often better to include an init in the Dockerfile itself so the image behaves correctly no matter how someone runs it (e.g., in Kubernetes).

Kubernetes and the Grace Period

In Kubernetes, the PID 1 problem becomes even more critical. When a Pod is terminated, Kubernetes sends a SIGTERM to the containers and waits for a terminationGracePeriodSeconds (default 30s).

If your app is PID 1 and ignores SIGTERM, it will sit there for the full 30 seconds. This slows down your deployments and can lead to service interruptions if your load balancer keeps sending traffic to a pod that is technically in a "Terminating" state but refuses to stop.

Furthermore, Kubernetes doesn't have a --init flag. If your application needs init-like behavior (like zombie reaping), you must include a tool like tini in your image or ensure your binary handles these responsibilities.

Why Go and Node.js are Different

Interestingly, some runtimes handle this better than others, but none are perfect.

Node.js: By default, Node.js does not handle SIGTERM. If it's PID 1, it will ignore the signal unless you explicitly add:

process.on('SIGTERM', () => {
  console.info('SIGTERM signal received.');
  server.close(() => {
    process.exit(0);
  });
});

Go: Go binaries are statically linked and very "clean," but they still suffer from the PID 1 signal-ignore rule. However, Go's os/signal package makes it very easy to handle:

c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
go func() {
    <-c
    // Run cleanup
    os.Exit(0)
}()

Java (JVM): The JVM actually handles SIGTERM reasonably well by triggering Shutdown Hooks. But even the JVM won't reap zombie processes created by sub-shells or native code calls.

The Kernel's Perspective

To truly understand why the kernel does this, we have to look at the signal-delivery logic in kernel/signal.c (roughly speaking). For a standard process, the kernel checks whether a signal is blocked or ignored and otherwise applies its default action. For the init of a PID namespace, it adds an extra rule: if init has not registered a handler for the signal, the signal is simply discarded rather than triggering its default action. Kernel-generated fault signals (a SIGSEGV from a bad memory access, say) still get through, and SIGKILL or SIGSTOP are honored only when they arrive from an ancestor PID namespace.

The kernel logic essentially says: "I cannot allow the system to be killed by a stray signal unless the system explicitly told me it's ready to handle it."

This is why docker stop's final SIGKILL works: the Docker daemon sends it from the parent PID namespace, where the protection doesn't apply. From inside the container, even kill -9 against PID 1 is silently ignored. And SIGKILL never reaches user space at all; the kernel sees it, skips any notion of a handler, and tears the task down immediately, which is why no graceful cleanup is possible.

Conclusion: The PID 1 Checklist

Next time you're building a container, don't just shove your binary into an image and hope for the best. Follow this mental checklist:

1. Use Exec Form: Always use ["executable", "param1"] in your CMD or ENTRYPOINT. Avoid the string format that invokes /bin/sh.
2. Handle Signals: Ensure your application code specifically listens for SIGTERM and starts a shutdown sequence.
3. Use an Init Tool: If your application spawns sub-processes (like a cron worker, a shell script, or a multi-process web server), use tini. It is 24KB of insurance against zombie processes and shutdown timeouts.
4. Test It: Run time docker stop <container>. If it takes exactly 10 seconds, you have a PID 1 problem.

Understanding the unique status of the init process turns "flaky containers" into predictable, enterprise-grade infrastructure. It’s the difference between a system that crashes and a system that gracefully bows out.