
fix(observer): window invalid-output respawn so benign idle can't poison quiet sessions #3059

Open
crippledgeek wants to merge 7 commits into thedotmack:main from crippledgeek:feature/observer-respawn-windowed-burst

crippledgeek commented Jun 25, 2026

What

Replaces the unbounded consecutiveInvalidOutputs >= 3 respawn counter with a time-windowed burst counter, modelled on systemd's StartLimitBurst/StartLimitIntervalSec and Erlang/OTP supervisor intensity/period. The old counter wrongly counted benign idle/prose SDK output and poison-looped quiet sessions, wiping context and dropping all captured work.

Closes #3032
Refs #2935 #3007 #3037 #3022 #2955 #2960

Why

On low-signal sessions the SDK legitimately emits idle/empty/prose responses. The old counter treated every one as an "invalid output" and, after 3 in a row, killed and respawned the observer — which on a quiet session just produces more idle output, so the respawn itself re-triggers the counter: a self-sustaining poison→respawn loop. The fix only respawns when invalid outputs arrive as a burst within a bounded window, so isolated benign idles can never accumulate to the threshold.
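To make the difference concrete, here is a minimal sketch contrasting the two counting strategies. The class names (`ConsecutiveCounter`, `BurstWindow`) and the sample timestamps are hypothetical and are not taken from this PR's code; only the threshold of 3 and the 60000 ms window come from the defaults described here.

```typescript
// Hypothetical sketch: names and structure are illustrative, not the PR's code.

/** Old behaviour: an unbounded counter of consecutive invalid outputs. */
class ConsecutiveCounter {
  private count = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  /** Returns true when a respawn would be triggered. */
  recordInvalid(): boolean {
    this.count += 1;
    return this.count >= this.threshold;
  }
}

/** New behaviour: only invalid outputs inside the rolling window count. */
class BurstWindow {
  private timestamps: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  /** Records an invalid output at `now` (ms) and reports whether the burst threshold is met. */
  recordInvalid(now: number): boolean {
    // Drop entries that have aged out of the window before counting.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    this.timestamps.push(now);
    return this.timestamps.length >= this.threshold;
  }
}

// Three invalid outputs 90 s apart (a quiet session) versus three within 2 s (a burst).
const spacedTimes = [0, 90_000, 180_000];
const burstTimes = [0, 1_000, 2_000];

const oldCounter = new ConsecutiveCounter(3);
const oldTrips = spacedTimes.map(() => oldCounter.recordInvalid());

const spacedWindow = new BurstWindow(3, 60_000);
const newTripsSpaced = spacedTimes.map((t) => spacedWindow.recordInvalid(t));

const burstWindow = new BurstWindow(3, 60_000);
const newTripsBurst = burstTimes.map((t) => burstWindow.recordInvalid(t));
```

In this sketch the old counter trips on the third spaced output, while the windowed counter trips only for the burst.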

How

  • New module src/services/worker/agents/respawn-policy.ts: evaluateRespawn (windowed decision with an exhaustiveness guard over output classes), parseRespawnPolicy, cached settings-backed getRespawnPolicy, and a FailureWindow class.
  • ActiveSession.consecutiveInvalidOutputs → invalidOutputWindow (timestamps within the window, not a monotonic count).
  • poisoned output still respawns immediately; idle is exempt by default.
  • Respawn-policy telemetry dimension wired through the scrubber; an orphaned telemetry key retired.
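The decision described in the list above can be sketched roughly as follows. This is not the actual evaluateRespawn implementation: the `OutputClass` members, the `evaluateSketch` signature, and the choice to clear the window after a burst respawn or leave it untouched for exempt classes are all assumptions made for illustration.

```typescript
// Hypothetical sketch: types, names, and window-clearing rules are assumptions.
type OutputClass = "valid" | "idle" | "prose" | "poisoned";

interface RespawnPolicySketch {
  exemptClasses: ReadonlySet<OutputClass>;
  threshold: number;
  windowMs: number;
}

interface Decision {
  respawn: boolean;
  window: number[]; // timestamps (ms) of counted invalid outputs
}

function evaluateSketch(
  cls: OutputClass,
  window: number[],
  policy: RespawnPolicySketch,
  now: number,
): Decision {
  switch (cls) {
    case "valid":
      // A valid output resets the window.
      return { respawn: false, window: [] };
    case "poisoned":
      // Poisoned output respawns immediately and clears the window.
      return { respawn: true, window: [] };
    case "idle":
    case "prose": {
      // Exempt classes are never counted (assumed: window left untouched).
      if (policy.exemptClasses.has(cls)) return { respawn: false, window };
      const live = [...window, now].filter((t) => now - t < policy.windowMs);
      const respawn = live.length >= policy.threshold;
      return { respawn, window: respawn ? [] : live };
    }
    default: {
      // Exhaustiveness guard: a new OutputClass member fails to compile here.
      const unhandled: never = cls;
      throw new Error(`unhandled output class: ${String(unhandled)}`);
    }
  }
}

const policy: RespawnPolicySketch = {
  exemptClasses: new Set<OutputClass>(["idle"]),
  threshold: 3,
  windowMs: 60_000,
};

const first = evaluateSketch("prose", [], policy, 0);
const second = evaluateSketch("prose", first.window, policy, 1_000);
const third = evaluateSketch("prose", second.window, policy, 2_000);
const idleResult = evaluateSketch("idle", second.window, policy, 2_000);
const isolated = evaluateSketch("prose", first.window, policy, 70_000);
const poisoned = evaluateSketch("poisoned", second.window, policy, 2_000);
```

The `never` assignment in the `default` branch is what makes the switch exhaustive: adding a new output class without handling it becomes a compile-time error rather than a silent fall-through.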

New settings (all string, in SettingsDefaultsManager, defaults reproduce prior behavior minus the bug)

| Key | Default | Range |
| --- | --- | --- |
| CLAUDE_MEM_INVALID_OUTPUT_EXEMPT_CLASSES | idle | comma-list of output classes |
| CLAUDE_MEM_INVALID_OUTPUT_RESPAWN_THRESHOLD | 3 | 1..=100 |
| CLAUDE_MEM_INVALID_OUTPUT_WINDOW_MS | 60000 | 1000..=3_600_000 |
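A rough sketch of how string settings like these might be parsed and bounds-checked. The setting keys, defaults, and ranges are the ones listed here; the helper names (`boundedInt`, `parsePolicySketch`) and the per-field fallback to the default on invalid input are assumptions, and the real parseRespawnPolicy may behave differently.

```typescript
// Hypothetical sketch: helper names and per-field fallback are assumptions.
interface ParsedPolicy {
  exemptClasses: string[];
  threshold: number;
  windowMs: number;
}

/** Parses a base-10 integer string; returns `fallback` if missing, malformed, or out of range. */
function boundedInt(
  raw: string | undefined,
  min: number,
  max: number,
  fallback: number,
): number {
  if (raw === undefined) return fallback;
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return fallback;
  const n = Number(trimmed);
  return Number.isSafeInteger(n) && n >= min && n <= max ? n : fallback;
}

function parsePolicySketch(
  settings: Record<string, string | undefined>,
): ParsedPolicy {
  const rawExempt = settings["CLAUDE_MEM_INVALID_OUTPUT_EXEMPT_CLASSES"] ?? "idle";
  return {
    exemptClasses: rawExempt
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
    threshold: boundedInt(
      settings["CLAUDE_MEM_INVALID_OUTPUT_RESPAWN_THRESHOLD"], 1, 100, 3),
    windowMs: boundedInt(
      settings["CLAUDE_MEM_INVALID_OUTPUT_WINDOW_MS"], 1000, 3_600_000, 60000),
  };
}

const defaults = parsePolicySketch({});
const overridden = parsePolicySketch({
  CLAUDE_MEM_INVALID_OUTPUT_EXEMPT_CLASSES: "idle,prose",
  CLAUDE_MEM_INVALID_OUTPUT_RESPAWN_THRESHOLD: "2",
  CLAUDE_MEM_INVALID_OUTPUT_WINDOW_MS: "5000",
});
const invalid = parsePolicySketch({
  CLAUDE_MEM_INVALID_OUTPUT_RESPAWN_THRESHOLD: "0",
  CLAUDE_MEM_INVALID_OUTPUT_WINDOW_MS: "999",
});
```

Here out-of-range or non-numeric values fall back to the defaults instead of being clamped, so a typo in a setting cannot silently yield an extreme policy.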

Tests

  • tests/worker/agents/respawn-policy.test.ts (new, comprehensive: windowing, threshold boundaries, exempt classes, exhaustiveness).
  • tests/shared/settings-respawn-policy.test.ts (new, settings parse + validation bounds).
  • Extended tests/worker/poison-respawn.test.ts, response-processor.test.ts, scrub.test.ts.
  • Targeted suite: 71 pass / 0 fail; tsc --noEmit clean.
  • Live soak (prior session): real Haiku run emitted 5× idle + 3× prose → badCount ≤ 1, 0 poisons (old code would have respawned).

Notes

greptile-apps Bot commented Jun 25, 2026

Greptile Summary

This PR changes observer invalid-output recovery to use a time-windowed respawn policy. The main changes are:

  • Adds a settings-backed respawn policy with exempt classes, burst threshold, and window duration.
  • Replaces the session's unbounded invalid-output counter with a rolling failure window.
  • Keeps poisoned output as an immediate respawn trigger while exempting idle output by default.
  • Updates telemetry fields, documentation, and CLI output for the new respawn threshold key.
  • Adds focused tests for policy parsing, window behavior, response processing, scrubbed telemetry, and poisoned-session respawn.

Confidence Score: 5/5

The change is narrowly scoped to observer recovery policy and telemetry/settings wiring, with focused tests covering the new windowed behavior and parsing bounds.

No correctness issues were identified in the reviewed changes, and the added tests exercise the main behavioral boundaries described by the PR.

What T-Rex did

  • Compared the baseline respawn policy with the after-state to verify how idle output, invalidOutputWindow, and consecutiveInvalidOutputs behave. Prose respawns on the third output of a burst, and valid XML resets badCount to 0. Isolated prose events spaced beyond the 60000 ms window leave badCount at 1. Poisoned output respawns immediately, preserving the pending count and clearing the window.
  • Validated the respawn-settings changes by comparing before and after artifacts, documenting defaults and overrides such as idle/3/60000, idle+prose/2/5000, invalid fallback idle/3/60000, and the env threshold override to 7.
  • Checked how scrub output and ResponseProcessor changes affect what fields are preserved, confirming that respawn_threshold is present and consecutive_invalid_outputs is no longer part of the head output payload.

T-Rex Ran code and verified through T-Rex

Reviews (1): Last reviewed commit: "fix(observer): wire respawn telemetry di..."
