fix: prevent memory leaks in runner that cause OOM over time #67

Open
archon-agent wants to merge 1 commit into moazbuilds:master from archon-agent:fix/runner-memory-leaks

@archon-agent

Summary

Three fixes for memory leaks in src/runner.ts that cause the bun process to grow unboundedly over time, eventually triggering OOM kills on long-running daemon deployments.

1. Capped stream collection (10 MB limit)

  • Replaced unbounded new Response(proc.stdout).text() with a collectStream() helper that caps output at 10 MB
  • Prevents runaway or very large Claude responses from consuming all available RAM
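The PR's `collectStream()` is not shown on this page, so the following is only a minimal sketch of how a capped stream reader could look. The signature, the `MAX_OUTPUT_BYTES` constant, and the choice to truncate silently are assumptions; only the helper's name and the 10 MB limit come from the description above.

```typescript
// Hypothetical sketch; the PR's actual collectStream() may differ.
const MAX_OUTPUT_BYTES = 10 * 1024 * 1024; // 10 MB cap, as described in the PR

async function collectStream(
  stream: ReadableStream<Uint8Array>,
  maxBytes: number = MAX_OUTPUT_BYTES,
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = "";
  let total = 0;
  try {
    while (true) {
      const result = await reader.read();
      if (result.done) break;
      const chunk = result.value;
      const remaining = maxBytes - total;
      if (chunk.byteLength >= remaining) {
        // Keep only the bytes that still fit, then stop reading.
        text += decoder.decode(chunk.subarray(0, remaining), { stream: true });
        total = maxBytes;
        await reader.cancel();
        break;
      }
      text += decoder.decode(chunk, { stream: true });
      total += chunk.byteLength;
    }
    // Flush the decoder; a multi-byte character cut at the cap
    // becomes a replacement character rather than throwing.
    text += decoder.decode();
  } finally {
    reader.releaseLock();
  }
  return text;
}
```

Assuming `proc.stdout` is a `ReadableStream`, as the original `new Response(proc.stdout).text()` call suggests, usage would be `const stdout = await collectStream(proc.stdout)`. Cancelling the reader at the cap means the remaining output is dropped instead of buffered.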

2. Timer cleanup after process exit

  • The setTimeout that enforces the process timeout was never cleared when the process completed normally
  • Added clearTimeout in both the success and error paths to prevent timer reference leaks
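One way to guarantee the timer is cleared on every path is a `finally` block. This is a sketch of the pattern only, not the code in `src/runner.ts`; the `withTimeout` helper and its `onTimeout` callback (which would typically kill the child process) are hypothetical names.

```typescript
// Hypothetical helper illustrating timer cleanup; not taken from the PR.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout: () => void,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. terminate the child process
      reject(new Error(`timed out after ${ms} ms`));
    }, ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, on failure of `work`, and after a timeout,
    // so the timer and its closure are always released.
    clearTimeout(timer);
  }
}
```

Without the `clearTimeout`, each completed run would leave a pending timer behind until it fired, and the timer's closure would keep whatever it captured reachable for that long.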

3. Queue chain reference reset

  • task.catch(() => {}) swallows rejections but passes fulfilled values through, so the queue's promise chain keeps references to previous task results
  • Changed to task.then(() => {}, () => {}), which always resolves to undefined and discards the result reference, allowing GC to reclaim it
  • Applied to both globalQueue and threadQueues
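The difference between the two forms can be sketched with a small serial queue. The `enqueue` function here is a hypothetical stand-in for the runner's queueing code, which is not shown; only the `globalQueue` name and the `task.then(() => {}, () => {})` expression come from the PR.

```typescript
// Hypothetical serial queue; only the tail-reset expression mirrors the PR.
let globalQueue: Promise<void> = Promise.resolve();

function enqueue<T>(fn: () => Promise<T>): Promise<T> {
  // Run fn only after everything already queued has settled.
  const task = globalQueue.then(fn);
  // The new tail settles to undefined whether task fulfils or rejects,
  // so the queue itself never holds on to a task's result.
  globalQueue = task.then(
    () => {},
    () => {},
  );
  return task;
}
```

The caller still receives the task's result through the returned promise. Only the queue's own tail is stripped of it, which is also why the tail can be typed as `Promise<void>`. Under the same assumptions, the pattern would apply per key to the `threadQueues` entries.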

Context

In production (24/7 daemon serving Telegram bot), these leaks caused the bun process to grow from ~43 MB to 23 GB over several days, eventually triggering an OOM kill and crashing the VM.

Test plan

  • Verified fixes on a production VM200 instance running claudeclaw as a systemd daemon
  • After applying fixes, daemon memory stayed stable at ~43 MB over extended operation
  • No existing tests to run, but changes are minimal and isolated to resource management

🤖 Generated with Claude Code

Three fixes for long-running daemon deployments:

1. collectStream with 10MB cap: replace unbounded `new Response().text()`
   with a streaming reader that caps output at 10MB. Prevents runaway
   Claude responses from consuming all available RAM.

2. clearTimeout after process completion: the timeout timer was never
   cleared on normal exit, leaking timer references indefinitely.

3. Queue chain reference reset: `task.catch(() => {})` still passes
   through fulfilled values, keeping references to all previous results.
   Changed to `task.then(() => {}, () => {})` which discards them.

In production, these leaks caused the bun process to grow to 23GB over
several days, eventually triggering an OOM kill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>