fix: reap zombie child processes in container environments #894

Open
toller892 wants to merge 1 commit into charmbracelet:main from toller892:fix/reap-zombie-children

@toller892

Problem

When soft-serve runs as PID 1 in a container (e.g. in Kubernetes) without an init supervisor such as tini, orphaned descendant processes are reparented to PID 1 and, once they exit, remain as zombies because nothing collects their exit status. This happens because:

  1. Git operations spawn child processes (e.g. git pack-objects, git index-pack)
  2. When the parent git process exits, its children are reparented to PID 1 (soft-serve)
  3. Go only waits on children it started itself (e.g. via os/exec); it never waits on reparented orphans
  4. With no waitpid() call to collect their exit status, these processes accumulate as zombies once they exit

In the reporter's Kubernetes environment, this caused 30k+ zombies in under 24 hours, leading to PID exhaustion and node failure.
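
To check whether a container is affected, one option is to count the processes that /proc reports in state Z. The following is a minimal sketch, not part of this PR; procState and countZombies are hypothetical helper names, and it assumes a Linux /proc filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// procState extracts the state field from the contents of /proc/<pid>/stat.
// The command name is wrapped in parentheses and may itself contain spaces
// or ')' characters, so the state is taken as the first field after the
// last ')'.
func procState(stat string) (byte, bool) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, false
	}
	return fields[0][0], true
}

// countZombies scans procRoot (normally "/proc") and counts the processes
// whose state is 'Z' (zombie).
func countZombies(procRoot string) int {
	paths, _ := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "stat"))
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process went away between the glob and the read
		}
		if s, ok := procState(string(data)); ok && s == 'Z' {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("zombie processes:", countZombies("/proc"))
}
```

A count that keeps growing while soft-serve is PID 1 would be consistent with the accumulation described above.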

Fix

Add a periodic zombie reaper goroutine that calls waitpid(-1, WNOHANG) every 10 seconds to clean up any zombie children. The reaper:

  • Runs only on Linux (where the PID 1 container issue manifests)
  • Is a no-op on other platforms (macOS, Windows)
  • Uses golang.org/x/sys/unix.Wait4 (already a dependency)
  • Stops cleanly when the server context is canceled
  • Logs reaped PIDs at debug level
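
The behavior listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: it uses the standard library's syscall.Wait4 in place of golang.org/x/sys/unix.Wait4 so that it has no external dependencies, the reapOnce helper and the reapZombies signature are hypothetical, and it logs through the standard log package rather than at debug level. It builds only on Unix-like systems.

```go
package main

import (
	"context"
	"log"
	"syscall"
	"time"
)

// reapOnce (hypothetical helper) collects every child that has already
// exited, without blocking, and returns the PIDs it reaped.
func reapOnce() []int {
	var reaped []int
	for {
		var status syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if err == syscall.EINTR {
			continue // interrupted by a signal: retry
		}
		if err != nil || pid <= 0 {
			// ECHILD means there are no children at all; pid == 0 means
			// children exist but none has exited yet.
			return reaped
		}
		reaped = append(reaped, pid)
	}
}

// reapZombies (assumed signature) reaps on every tick and returns once ctx
// is canceled. The caller is expected to run it in its own goroutine.
func reapZombies(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, pid := range reapOnce() {
				log.Printf("reaped zombie child pid=%d", pid)
			}
		}
	}
}

func main() {
	// Short demo run; a server would pass its own context and a 10s interval.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	reapZombies(ctx, 10*time.Millisecond)
}
```

Looping until Wait4 reports nothing left matters because a single call collects at most one child. Note that a PID of -1 matches any child of the process, including children started through os/exec.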

Changes

  • cmd/soft/serve/reap_linux.go — Linux implementation using unix.Wait4
  • cmd/soft/serve/reap_other.go — no-op stub for non-Linux platforms
  • cmd/soft/serve/serve.go — call reapZombies() during server startup

Testing

go build ./...    # passes
go test ./cmd/soft/serve/  # passes

Fixes #891

When soft-serve runs as PID 1 in a container without an init supervisor,
orphaned descendant processes (e.g. git pack-objects left behind when a
git parent exits) are reparented to PID 1 and become zombies because the
Go runtime only tracks children spawned via os/exec.

Add a periodic reaper goroutine that calls waitpid(-1, WNOHANG) every
10 seconds to clean up any zombie children. The reaper runs only on
Linux (where the PID 1 container issue manifests) and is a no-op on
other platforms.

Fixes charmbracelet#891
@linsein

linsein commented Jun 9, 2026

I would prefer to modify the Dockerfile to add tini as the PID 1 process.
If unix.Wait4 is called globally, it might reap the exit status of a child that cmd.Wait is waiting on, causing cmd.Wait to return an error.

Development

Successfully merging this pull request may close these issues.

soft serve doesn't reap child git processes, causing zombie accumulation