fix: add per-context subprocess timeout to prevent daemon freeze #38

Open
dmitryanchikov wants to merge 2 commits into moazbuilds:master from dmitryanchikov:fix/subprocess-timeout

@dmitryanchikov
Contributor

Problem

When a claude subprocess hangs indefinitely (e.g. on a stuck network call, or claude billing waiting forever), the daemon's serial queue blocks all subsequent messages, heartbeats, and jobs. There is no timeout anywhere in the call chain.

Observed incidents:

  • 2026-03-10: simple Telegram message hung 35+ min (exit code 1, empty output)
  • 2026-03-14: claude billing inside a bash chain hung 33+ min

Solution

Add a configurable per-context timeout to runClaudeOnce() with a SIGTERM → SIGKILL escalation sequence.

src/runner.ts

  • runClaudeOnce() accepts a timeoutMs parameter
  • On expiry: sends SIGTERM, then SIGKILL after a 5s grace period
  • The kill causes the process's pipes to close, so Response.text() and proc.exited resolve naturally
  • Fallback model retry is skipped on timeout (only meaningful for rate limits)
  • Log files record [TIMED OUT] for observability
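
The escalation described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `Killable`, `TimeoutGuard`, `armTimeout` and `KILL_GRACE_MS` are hypothetical names, and the process handle is assumed to be anything exposing a `kill(signal)` method. Only the SIGTERM → SIGKILL order and the 5s grace period come from the description.

```typescript
// Hypothetical minimal handle shape: anything exposing kill(signal).
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
}

interface TimeoutGuard {
  /** True once the timeout has fired and SIGTERM was sent. */
  timedOut(): boolean;
  /** Clears any pending timers; call this once the process has exited. */
  cancel(): void;
}

// 5s grace period between SIGTERM and SIGKILL, as stated in the description.
const KILL_GRACE_MS = 5_000;

function armTimeout(
  proc: Killable,
  timeoutMs: number,
  graceMs: number = KILL_GRACE_MS,
): TimeoutGuard {
  let fired = false;
  let killTimer: ReturnType<typeof setTimeout> | undefined;

  const termTimer = setTimeout(() => {
    fired = true;
    // Ask the process to stop first...
    proc.kill("SIGTERM");
    // ...and force-kill it if it is still being waited on after the grace period.
    killTimer = setTimeout(() => proc.kill("SIGKILL"), graceMs);
  }, timeoutMs);

  return {
    timedOut: () => fired,
    cancel: () => {
      clearTimeout(termTimer);
      if (killTimer !== undefined) clearTimeout(killTimer);
    },
  };
}
```

In this sketch the caller would invoke `cancel()` right after the process exits, so a process that honors SIGTERM is never sent SIGKILL, and would check `timedOut()` to decide whether to log [TIMED OUT] and skip the fallback retry.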

src/config.ts

  • New TimeoutsConfig interface added to Settings
  • Timeouts are read via getSettings() on every invocation — changes to settings.json take effect within the daemon's existing 30s hot-reload cycle, no restart required

Default timeouts

| Context | Default |
|---|---|
| telegram | 5 min |
| heartbeat | 15 min |
| jobs / other | 30 min |

All configurable in settings.json:

"timeouts": {
  "telegram": 5,
  "heartbeat": 15,
  "job": 30,
  "default": 5
}
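
As a rough illustration of how a per-context timeout might be resolved from this block, here is a sketch under stated assumptions. The values are assumed to be minutes (matching the defaults table), and any context name other than `telegram` or `heartbeat` is assumed to use the `job` value. The role of the `default` key is not spelled out in the description, so the sketch leaves it out. The PR's `resolveTimeoutMs(name)` reads settings itself; this version takes them as a parameter only to stay self-contained, and `DEFAULT_TIMEOUTS_MIN` is a hypothetical name.

```typescript
// Subset of the "timeouts" block shown above; values are assumed to be minutes.
interface TimeoutsConfig {
  telegram?: number;
  heartbeat?: number;
  job?: number;
}

// Defaults taken from the "Default timeouts" table.
const DEFAULT_TIMEOUTS_MIN = { telegram: 5, heartbeat: 15, job: 30 };

function resolveTimeoutMs(name: string, timeouts: TimeoutsConfig = {}): number {
  // Assumption: anything that is not telegram or heartbeat counts as a job.
  const key: keyof typeof DEFAULT_TIMEOUTS_MIN =
    name === "telegram" ? "telegram" : name === "heartbeat" ? "heartbeat" : "job";
  const minutes = timeouts[key] ?? DEFAULT_TIMEOUTS_MIN[key];
  return minutes * 60_000; // minutes to milliseconds
}
```

Because the value is looked up on every call, passing the latest settings object each time is enough for an edited `settings.json` to take effect without a restart.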

Vesper and others added 2 commits March 15, 2026 12:35
When a claude subprocess hangs indefinitely (e.g. on a stuck network
call), the serial queue blocks all subsequent messages/heartbeats/jobs.
This was observed multiple times: 33-35 min hangs on simple Telegram
messages and `claude billing` calls.

Changes:
- `runClaudeOnce()` accepts a `timeoutMs` parameter; on expiry sends
  SIGTERM then SIGKILL after a 5s grace period
- `resolveTimeoutMs(name)` picks the timeout from `settings.timeouts`
  based on invocation context (telegram / heartbeat / everything else)
- `TimeoutsConfig` added to `Settings` with hot-reload support — editing
  `settings.json` takes effect within the daemon's existing 30s reload
  cycle, no restart needed
- Fallback model retry is skipped on timeout (only retries on rate limit)
- Log entry and `.log` file record `[TIMED OUT]` for observability

Default timeouts (all configurable in settings.json):
  telegram:  5 min
  heartbeat: 15 min
  job / other: 30 min

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When a subprocess is killed by SIGTERM (143) or SIGKILL (137), the
Telegram handler now sends a human-readable explanation rather than
"Error (exit 143): Unknown error".

This complements the runner.ts timeout handling: runner returns exitCode 0
with a friendly message when *our* timeout fires, but telegram.ts now
also handles externally-killed processes (e.g. OOM, system signals)
with a clear message instead of the raw exit code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
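
The exit codes in the commit message above follow the common shell convention of 128 plus the signal number (SIGTERM is 15, SIGKILL is 9). A handler along these lines could translate them into readable text. This is a sketch only: `describeExit` and the message wording are hypothetical and not taken from telegram.ts.

```typescript
// Shell convention: a process killed by signal N is reported as exit code 128 + N.
const SIGNAL_EXIT_BASE = 128;

// Only the two signals mentioned in the commit message are mapped here.
const SIGNAL_NAMES: Record<number, string> = { 9: "SIGKILL", 15: "SIGTERM" };

// Returns a readable explanation for signal-style exit codes, or null when
// the code is not one of the mapped signals (hypothetical wording).
function describeExit(exitCode: number): string | null {
  const signal = SIGNAL_NAMES[exitCode - SIGNAL_EXIT_BASE];
  if (signal === undefined) return null;
  return `The process was terminated by ${signal} (exit ${exitCode}).`;
}
```

A caller could fall back to its generic error message whenever `describeExit` returns `null`.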
Fenrur added a commit to Fenrur/claudeclaw that referenced this pull request Mar 17, 2026
…oazbuilds#38)

- Configurable timeouts per context (telegram, heartbeat, job)
- SIGTERM → SIGKILL grace period for stuck subprocesses
- Timeout detection in Telegram error messages
- Skip fallback retry on timeout (only on rate limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>