fix(spawn): opt-in PTY_SPAWNER_PID watchdog to prevent orphaned daemons #39

Merged
myobie merged 1 commit into myobie:main from schickling-assistant:fix/spawner-pid-watchdog on Jun 11, 2026

@schickling-assistant

Why

spawnDaemon() produces a detached daemon process that has no way to self-terminate when its spawner exits without calling disconnect() / kill(). Because detached: true puts the daemon in its own session, the kernel sends no signal when the spawner dies — the daemon is reparented to init and survives forever.

In practice this leaks daemons whenever a short-lived script, test harness, or scoped resource spawns a daemon and exits. We observed this in the wild: hundreds of @myobie/pty/dist/server.js processes accumulating multi-GiB RSS over days, pushing one host into swap thrashing.

Full write-up + scenario matrix (clean exit / SIGKILL / SIGHUP / SIGTERM): overengineeringstudio/effect-utils#677

Self-contained reproduction: https://github.com/schickling-repros/2026-05-myobie-pty-zombie-daemon-leak

What

New opt-in bindToSpawnerLifetime field on SpawnDaemonOptions. Default off — existing callers see no behaviour change.

When enabled:

  • spawn.ts injects PTY_SPAWNER_PID=<spawner pid> into the daemon's env.
  • server.ts reads the var on startup. If the PID is already dead (ESRCH), it exits via cleanShutdown(0) immediately. Otherwise it polls process.kill(pid, 0) every 5 s and calls cleanShutdown(0) once the spawner is gone.
  • Invalid / missing values disable the watchdog (defensive: garbage env vars don't kill daemons).
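The startup checks above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `parseSpawnerPid` is a hypothetical helper name, the strict positive-integer validation is an assumption, and treating `EPERM` as "still alive" is a choice made here (the PR text only mentions `ESRCH`).

```typescript
/**
 * Parse the PTY_SPAWNER_PID value. Returns undefined for missing or malformed
 * input so that a garbage env var disables the watchdog instead of killing
 * the daemon. (Hypothetical helper; the validation rule is an assumption.)
 */
function parseSpawnerPid(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const trimmed = raw.trim();
  // Accept only plain positive decimal integers.
  if (!/^[1-9]\d*$/.test(trimmed)) return undefined;
  const pid = Number(trimmed);
  return Number.isSafeInteger(pid) ? pid : undefined;
}

/**
 * Signal 0 asks the OS to run the existence and permission checks without
 * delivering a signal to the target process.
 */
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH: no such process. EPERM: the process exists but we may not
    // signal it, which this sketch treats as alive (an assumption).
    return (err as { code?: string }).code === "EPERM";
  }
}
```

With helpers like these, the daemon would call `parseSpawnerPid(process.env.PTY_SPAWNER_PID)` once at startup and skip the watchdog entirely when the result is `undefined`.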

How

  • New option on SpawnDaemonOptions; conditional env-var injection in spawnViaNode.
  • New installSpawnerWatchdog + isProcessAlive helpers in server.ts, wired into the existing entry-point block alongside the SIGTERM / SIGINT handlers. Uses setInterval(...).unref() so the watchdog never holds the event loop open on its own.
  • Mechanism is an env var rather than a constructor option so the daemon (separate process) needs no IPC handshake.
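A minimal sketch of the polling watchdog described above, assuming the liveness probe and the shutdown action are passed in as callbacks so the example stays self-contained. In the PR these roles are played by `isProcessAlive` and `cleanShutdown(0)`; the signature and the returned cancel function here are illustrative assumptions, not the merged API.

```typescript
/**
 * Poll the spawner PID and invoke onSpawnerGone once it disappears.
 * Returns a function that cancels the watchdog.
 * (Sketch only: parameter names and the return value are assumptions.)
 */
function installSpawnerWatchdog(
  spawnerPid: number,
  isAlive: (pid: number) => boolean,
  onSpawnerGone: () => void,
  intervalMs: number = 5_000,
): () => void {
  // Spawner already dead at startup: shut down immediately, no timer needed.
  if (!isAlive(spawnerPid)) {
    onSpawnerGone();
    return () => {};
  }

  const timer = setInterval(() => {
    if (!isAlive(spawnerPid)) {
      clearInterval(timer);
      onSpawnerGone();
    }
  }, intervalMs);

  // unref() prevents this timer from being the only thing that keeps the
  // Node.js event loop alive.
  timer.unref();

  return () => clearInterval(timer);
}
```

Because the timer is unref'ed, a daemon that has otherwise finished its work can still exit normally; the watchdog only adds an exit path and never blocks one.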

Rationale

The issue lists four fix options:

  1. Idle timeout in server.js — requires picking a default, and conflicts with the design intent that daemons survive client restarts.
  2. PTY_SPAWNER_PID env var ← this PR. Opt-in, no config to tune, no behaviour change unless requested.
  3. prctl(PR_SET_PDEATHSIG) — Linux-only, and order-sensitive after detached: true (the parent may already be init by the time we set it).
  4. Wrapper-side bookkeeping in the consumer (e.g. @overeng/pty-effect): treats only the symptom, and the same leak recurs in any other caller.

Option 2 is the minimal upstream fix: tight binding when the caller wants it, zero impact when they don't. Long-lived supervisors (e.g. the bundled pty supervisor) that want daemons to outlive them simply omit the option.
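To illustrate the opt-in semantics on the spawner side, here is a hedged sketch of conditional env-var injection. Only `bindToSpawnerLifetime` and `PTY_SPAWNER_PID` come from this PR; `SpawnDaemonOptionsSketch` and `buildDaemonEnv` are hypothetical names, and the real `spawnViaNode` may structure this differently.

```typescript
// Hypothetical, reduced shape of the options object; the real
// SpawnDaemonOptions has other fields that are omitted here.
interface SpawnDaemonOptionsSketch {
  bindToSpawnerLifetime?: boolean;
}

type Env = Record<string, string | undefined>;

/**
 * Build the environment for the daemon process.
 * Default off: without the option the env is copied through unchanged.
 */
function buildDaemonEnv(
  baseEnv: Env,
  options: SpawnDaemonOptionsSketch,
  spawnerPid: number,
): Env {
  if (!options.bindToSpawnerLifetime) {
    return { ...baseEnv };
  }
  // Opt-in: tell the daemon which PID to watch.
  return { ...baseEnv, PTY_SPAWNER_PID: String(spawnerPid) };
}
```

A short-lived script would pass `{ bindToSpawnerLifetime: true }` together with its own `process.pid`; a long-lived supervisor would omit the option, and its daemons would see no `PTY_SPAWNER_PID` at all.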

Test plan

  • New tests/spawner-pid-watchdog.test.ts covers:
    • Daemon shuts down when the spawner PID dies post-startup (within ~10 s, given the 5 s poll interval).
    • Daemon exits immediately when the spawner PID is already dead at startup.
    • Invalid PTY_SPAWNER_PID values are ignored (no behaviour change).
  • Manually reproduced the original leak (scenario A from the issue) against the patched dist — daemon dies within 5–7 s of spawner exit.
  • npm run typecheck and npm run build clean.
  • npm test: the new tests pass; the only failures locally are in tests/stats-cli.test.ts (macOS ps parsing — pre-existing on main, unrelated to this change).

Alternatives considered

  • prctl(PR_SET_PDEATHSIG, SIGTERM) — would give a kernel-driven signal on Linux, but is platform-locked and prone to "parent already reparented to init" races right after detached: true. Could be layered on top of this PR later as a Linux-only fast path; the env-var watchdog still works on macOS and BSDs.
  • Idle timeout in the daemon — useful complement but needs a tunable, and would change behaviour for callers who legitimately keep daemons up with no clients attached. Out of scope for this PR.
  • Always on — considered injecting PTY_SPAWNER_PID unconditionally, but that would change semantics for the bundled supervisor (daemons would die when the supervisor restarts). Keeping it opt-in preserves the documented "long-lived sessions that survive process restarts" contract.

Posted on behalf of @schickling

agent_name: 🦦 cl2-otter
agent_session_id: b1658469-b46b-45a0-9754-e3c03740fe05
agent_tool: Claude Code
agent_tool_version: 2.1.139
agent_runtime: Claude Code 2.1.139
agent_model: claude-opus-4-7
worktree: pty/fix/spawner-pid-watchdog
machine: mbp2025
tooling_profile: dotfiles@4e6515b

`spawnDaemon` produces a detached daemon process that has no mechanism to
self-terminate when its spawner exits without calling `disconnect()` /
`kill()`. Because `detached: true` puts the daemon in its own session,
the kernel sends no signal when the spawner dies — the daemon is
reparented to init and survives forever.

In practice this leaks daemons whenever a short-lived script, test
harness, or scoped resource (e.g. `@overeng/pty-effect`) spawns a daemon
and exits. We observed hundreds of `dist/server.js` processes
accumulating multi-GiB RSS over days. See effect-utils#677.

Add a new `bindToSpawnerLifetime` option (opt-in, default off). When set,
`spawn.ts` injects `PTY_SPAWNER_PID=<pid>` into the daemon's env, and the
daemon polls `process.kill(pid, 0)` every 5 s. ESRCH triggers a clean
shutdown via the existing `cleanShutdown(0)` path. An already-dead PID at
startup causes immediate shutdown; an invalid or missing value disables
the watchdog.

Long-lived supervisors that want daemons to outlive them simply omit
the option — full backwards compatibility.

Refs: overengineeringstudio/effect-utils#677

@schickling-assistant marked this pull request as ready for review May 26, 2026 09:55
@myobie merged commit 443717f into myobie:main Jun 11, 2026
1 check passed