error_watcher: scan tmux windows for errors and alert a random agent#171
Draft
pseay-imbue wants to merge 22 commits into
Draft
error_watcher: scan tmux windows for errors and alert a random agent#171pseay-imbue wants to merge 22 commits into
pseay-imbue wants to merge 22 commits into
Conversation
Add the new error-watcher library (mirroring app_watcher's layout) and implement its pure, side-effect-free core: case-insensitive error/exception line matching, per-window dedup of already-alerted lines, batched alert formatting, the mngr list/message argv builders, agent-summary parsing, and uniform-random recipient selection. Includes unit tests for every pure function (the argv builders are validated against the live mngr CLI) and the package's ratchet file. The polling loop and tmux/mngr I/O are wired up next. Registers error-watcher as a workspace member so its tests resolve. Co-authored-by: Sculptor <sculptor@imbue.com>
Implement main() as a long-lived 5s poll loop with SIGTERM/SIGINT handlers that exit cleanly. Each poll discovers the tmux session, enumerates and captures every window (skipping the watcher's own svc-error-watcher window), runs the dedup core, and on newly-appeared matches enumerates messageable mngr agents (STOPPED excluded) and sends one batched alert to a random one. All tmux/mngr I/O goes through an injected command runner that never raises, so a vanished window, a hung command, or a missing binary is logged and skipped rather than crashing the loop. Adds run_one_poll integration tests driven by a fake runner (no real tmux), plus unit tests for the messageable filter and the pattern-override compiler. Co-authored-by: Sculptor <sculptor@imbue.com>
Register [services.error-watcher] in services.toml so the bootstrap manager runs the watcher in its own svc-error-watcher tmux window with an on-failure restart policy. Expand the lib README to describe the scan/match/alert behavior, the own-window exclusion, the random-agent notification, and the v1 non-goals (naive matching, in-memory dedup, visible-pane-only). Add the FCT changelog entry for the branch. Co-authored-by: Sculptor <sculptor@imbue.com>
Architecture review flagged that the hard-coded OWN_WINDOW constant
("svc-error-watcher") silently couples to the [services.error-watcher] key in
services.toml: renaming the service without updating the constant would
re-enable the self-alert feedback loop. Cross-reference the coupling in both
places so the dependency is explicit. No behavior change.
Co-authored-by: Sculptor <sculptor@imbue.com>
Problem: run_one_poll recorded matching lines into the dedup state (via new_matches) before sending the alert. If the alert could not be delivered -- no messageable agent (all STOPPED), an mngr list enumeration failure, or a non-zero mngr message send -- those lines were already marked seen and were never re-alerted, even though the error stayed on screen and a recipient could become available later. This silently dropped real errors, contradicting the documented 'alerts exactly once' / 'logs the match and skips sending' behavior (libs/error_watcher/src/error_watcher/watcher.py). Fix: split the dedup into a read-only unseen_matches() (computes new lines without mutating seen) and mark_alerted() (records a delivered batch). run_one_poll now records lines only after _alert_random_agent returns a recipient, and _alert_random_agent returns None when the send itself fails so a failed send is retried too. Added regression tests covering re-alert after a no-messageable-agent poll and after a failed send. Co-authored-by: Sculptor <sculptor@imbue.com>
…red" This reverts commit a7d5c33.
…iew #1) Problem: run_one_poll recorded matching lines into the dedup state (via new_matches) before sending the alert. If the alert could not be delivered -- no messageable agent (all STOPPED), an mngr list enumeration failure, or a non-zero mngr message send -- those lines were already marked seen and were never re-alerted, even though the error stayed on screen and a recipient could become available later. This silently dropped real errors, contradicting the intent of REQ-MATCH-3 ("output it has already reported"). Fix: split the dedup into a read-only unseen_matches() (computes new lines without mutating seen) and mark_alerted() (records a delivered batch). run_one_poll now records lines only after _alert_random_agent returns a recipient, and _alert_random_agent returns None when the send itself fails so a failed send is retried too. Added regression tests covering re-alert after a no-messageable-agent poll and after a failed send. Co-authored-by: Sculptor <sculptor@imbue.com>
Previously _alert_random_agent picked exactly one recipient and gave up if the send failed, so one bad pick (e.g. an agent that stopped between `mngr list` and `mngr message`) dropped the alert even when other agents were reachable. Now recipients are tried in uniformly random order: the first pick stays uniform across the pool (REQ-NOTIFY-5), and on a failed send the watcher falls back to the next agent within the same poll. Only when every messageable agent's send fails does it return None (leaving the error unrecorded so a later poll retries it). Added tests for the fallback-to-another-agent and all-agents-fail paths; refocused the failed-send retry test on a single-agent pool. Co-authored-by: Sculptor <sculptor@imbue.com>
select_messageable_names excluded only STOPPED agents, so it could pick the `main`-type system-services agent (no interactive claude inbox, nobody watching) and message it pointlessly, diverging from the cited reference list_claude_agent_names which filters to `type == "claude"`. AgentSummary now carries the agent `type` (parsed from `mngr list --format json`, defaulting to "" when absent), and select_messageable_names keeps only non-STOPPED claude agents. STOPPED remains the sole excluded lifecycle state -- that matches mngr's actual send-path rule (api/message.py rejects only STOPPED), and transient failures for other states are absorbed by the in-poll recipient fallback rather than by pre-filtering states the spec does not call out. Added unit and integration tests covering the type filter. Co-authored-by: Sculptor <sculptor@imbue.com>
Dedup keyed on the exact line string, so an error line carrying a timestamp,
counter, or numeric request id ('[12:00:05] ERROR x' then '[12:00:10] ERROR x')
counted as a brand-new match every 5s poll and fired a fresh alert to a random
agent -- an alert storm the spec's "benign false positive" non-goals do not
excuse.
Introduced dedup_key(), which collapses runs of digits to '#', and keyed the
per-window `seen` set on it (unseen_matches compares keys, mark_alerted records
keys). A re-stamped copy of one error now shares a key and alerts once. The
tradeoff -- two errors differing only in numbers are treated as one -- is
documented and suits a "something errored" nudge. Added unit and integration
tests for the volatile-line case and a guard that genuinely different text still
alerts.
Co-authored-by: Sculptor <sculptor@imbue.com>
The per-window dedup sets only ever grew and window keys were never evicted, so in this permanent process `seen` would accumulate forever -- one entry per window that ever existed, each set growing with every distinct error line. Added prune_seen_windows(), called each poll with the live window list, to drop dedup state for windows that have closed (skipped when enumeration returns empty, which signals a failed list rather than a window-less session). Also capped the keys retained per window at MAX_SEEN_KEYS_PER_WINDOW as a hard memory ceiling -- number-insensitive keys (finding #4) keep this far from being hit in practice. Added unit tests for both bounds and an integration test that a closed window's state is forgotten. Co-authored-by: Sculptor <sculptor@imbue.com>
Two robustness gaps:
- _alert_random_agent returned before parsing whenever `mngr list` exited
non-zero, so a non-zero exit that still emitted a valid {"agents": [...]} body
(e.g. one provider failed) needlessly skipped the alert. It now parses the
payload regardless of exit status and only treats a non-zero exit as fatal
when no usable agents were parsed.
- _default_command_runner mapped every spawn failure and timeout to
returncode=1, colliding with a real exit-1, and discarded a timed-out
command's partial stdout. It now returns a distinct RUNNER_FAILURE_RETURNCODE
(-1, never a real exit status) and preserves whatever a timed-out command
emitted. The timeout is now an injectable parameter so the timeout branch is
testable without a 30s wait.
Added runner tests (success, missing binary, timeout, sentinel contract) and
integration tests for the non-zero-list-with-payload and non-zero-list-without-
payload cases.
Co-authored-by: Sculptor <sculptor@imbue.com>
- Install a SIGHUP handler alongside SIGTERM/SIGINT. The bootstrap manager stops a service with `tmux kill-window`, which delivers SIGHUP, so this is the signal on the real stop path; the process now exits cleanly via the installed handler rather than the default action (review #10). - Add the conventional `if __name__ == "__main__": main()` guard so the module runs via `python -m error_watcher.watcher`, matching the sibling services (review #11). - Lift the signal handler to a module-level `_handle_signal` and add a test that it exits 0, closing part of the I/O-shell test gap (review #8). The runner and alert-failure branches that finding also cited are already covered by the tests added for review #1, #2, and #6. Co-authored-by: Sculptor <sculptor@imbue.com>
Problem: _default_command_runner's TimeoutExpired branch guarded the partial payload with isinstance(e.stdout, str), but on the timeout path subprocess hands the buffered output back as bytes even under text=True (it skips the decode it does on a normal return). The guard was therefore always False on a real timeout, so the partial stdout/stderr was always discarded -- the exact loss finding #6's code claimed to prevent (e.g. a complete agent list from a hung mngr list was thrown away). The only timeout test emitted no output, so it never caught this. Fix: decode the bytes payload via a small _decode_timeout_output helper (None -> "", bytes -> decode(errors="replace") so a truncated multibyte char can't raise into the never-raise runner, str -> as-is) and use it for both stdout and stderr. Added a regression test that prints before hanging and asserts the partial stdout survives the timeout. Co-authored-by: Sculptor <sculptor@imbue.com>
Refactor the error-watcher service into three loosely-coupled layers so the source of errors and the destination of alerts can each be swapped without touching the core logic: - inputs.py: ErrorInput (ABC) + TmuxWindowErrorInput, wrapping the tmux session-discovery and pane-capture work behind a source-agnostic ErrorReading. A systemd/journald input is a drop-in sibling. - routing.py: ErrorRouter, the main work between the two layers -- matching, per-source number-insensitive dedup, batching, and delivery-gated dedup bookkeeping. Depends only on the layer interfaces. - outputs.py: ErrorOutput (ABC) -> MngrAgentErrorOutput (mngr delivery with an abstract choose_recipients policy) -> RandomMngrAgentErrorOutput. A future error-fixing recipient policy is a one-method subclass. - commands.py: the shared CommandRunner subprocess seam. - watcher.py: wires the three layers and runs the poll loop. Behavior is unchanged; the alert wording is now source-agnostic. Tests reorganized per module (77 passing, including 14 ratchets). Co-authored-by: Sculptor <sculptor@imbue.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit ONLY exists to satisfy the FCT stop hook, which requires a clean working tree. It snapshots a large, pre-existing, in-progress vendor/mngr subtree sync (new mngr_azure lib, aws-stop/separate-snapshots/test-hang changelog reshuffles, provider edits, plus the regenerated uv.lock) that was already present in this worktree and was NOT authored or reviewed by me. My own task -- the error_watcher input/routing/output refactor -- is the separate commit immediately before this one (f68318e). Whoever is driving the vendor sync can recover their uncommitted state with 'git reset --soft HEAD^' (or 'git reset HEAD^' to also unstage); nothing is lost. Co-authored-by: Sculptor <sculptor@imbue.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem: outputs.py, outputs_test.py, and routing_test.py were not formatted to the project's ruff standard, so `ruff format --check libs/error_watcher/` failed (an over-parenthesized single-line f-string header plus several over-88-char lines the formatter would wrap). The repo enforces ruff and CI runs the format check, so the branch failed formatting CI. Fix: ran `ruff format libs/error_watcher/` to apply canonical formatting to the three files (whitespace/line-wrapping only, no semantic change). Co-authored-by: Sculptor <sculptor@imbue.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem: prune_seen_sources rebuilt set(live_sources) on every element of the seen dict (it lived inside the comprehension's membership test), making each prune O(len(seen) * len(live_sources)) per poll in the permanent watcher process. Fix: compute the live-source set once before the comprehension and test against it, restoring O(len(seen) + len(live_sources)). The deletion list is still materialized before any del, so the snapshot-then-mutate behavior is unchanged. Co-authored-by: Sculptor <sculptor@imbue.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ments
Problem: Eight comments/docstrings across commands.py, routing.py, outputs.py,
commands_test.py, and outputs_test.py back-referenced numbered code-review
findings ("(finding #5)", "(review finding #4)", "(finding #6)"). Those numbers
are defined only in vendor/mngr/specs/window-error-watcher/review.md, an
ephemeral per-iteration review document that does not ship with the wheel, so
the parentheticals are dangling pointers to a past review process rather than
descriptions of current behavior (user_request_artifacts_left_in_code).
Fix: Stripped only the "(finding #N)" / "(review finding #N)" fragments,
preserving the surrounding explanatory prose. The legitimate REQ-* spec
requirement references are intentionally left untouched. No code logic changed.
Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cher comments" This reverts commit de806aa.
This reverts commit 39e2cb0.
Move the concrete tmux input and mngr output out of inputs.py/outputs.py so each interface file holds only its contract: - inputs.py keeps the ErrorInput ABC and the ErrorReading/ErrorSource value types; TmuxWindowErrorInput and its tmux helpers move to a new tmux_window_error_input.py. - outputs.py keeps the ErrorOutput ABC and the ErrorAlert value type; the mngr delivery classes and helpers move to a new mngr_agent_error_output.py. Repoint imports in watcher.py and the tests, rename the unit tests to match their new source files, and update the README architecture section. No behavior changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Sculptor <sculptor@imbue.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new
error-watcherbackground service. It scans every tmux window in the agent's session every 5 seconds for output matching/error|exception/iand, when a new match appears, sends one batched message to a randomly selected messageable mngr agent so a service that errored gets noticed instead of scrolling past.Key behaviors:
tmux display-message -p '#S'; enumerates and captures every window's visible pane.svc-error-watcherwindow so its alert text (which contains "error") cannot re-trigger a match.mngr list --format jsonand sends viamngr message; if no agent is messageable it logs and skips without erroring.ERROR_WATCHER_PATTERNenv var.Registered as
[services.error-watcher]inservices.toml(restart = "on-failure"), so the bootstrap manager runs it in its ownsvc-error-watcherwindow.Review resolution (post-review fixes)
After the initial implementation a full review filed 12 findings. All are now resolved or consciously deferred, one fix per commit:
6735584eunseen_matchesfrommark_alerted), so a failed/skipped send no longer drops the error permanently — it's retried on a later poll.dd2b89d30a67eb66type: claudeagents (mirroringlist_claude_agent_names) in addition to excludingSTOPPED, so the non-interactive system-services agent is never messaged.507e214a#), so a re-timestamped error line alerts once instead of every 5s.36740cdaf129139bmngr listeven on a non-zero exit; distinctRUNNER_FAILURE_RETURNCODEso a runner failure isn't confused with a real exit-1; preserve a timed-out command's partial output.444ea1c9subprocess.TimeoutExpired.stdoutis bytes even undertext=True, so the #6 partial-output preservation needed an explicit decode to actually work (found by autofix; the original test never exercised the path).840b5b9bSIGHUP(bootstrap's real stop signal) alongside SIGTERM/SIGINT; add theif __name__ == "__main__"guard; lift + test the signal handler.Deliberately not fixed (rationale in
specs/window-error-watcher/review.md): #7 (duplicate window names can't arise — bootstrap names windows uniquely), #9 (centralizing the mngr argv builders would couple this simple FCT-side script to theimbue-mngrruntime), #12 (re-fetching the session each poll is negligible and the fix would churn every test).Design notes
All tmux/mngr I/O goes through an injected command runner that never raises, so a vanished window, a hung command, or a missing binary is logged and skipped rather than crashing the poll loop. The pure core (matching, dedup, alert formatting, mngr argv builders, agent parsing, recipient selection) is fully unit-tested;
run_one_pollis integration-tested end-to-end with an injected fake runner. Themngrargv builders are validated against the live mngr CLI surface viaassert_mngr_argv_valid.Testing
uv run pytest libs/error_watcher-> 71 passed (57 unit/integration + 14 ratchets), coverage enabled.The spec, plan, and review for this feature live in the mngr monorepo under
specs/window-error-watcher/(mngr PR #2179).Update — refactor into input / routing / output layers
Since the review-resolution commits above, the service was refactored from the single
watcher.pymodule into three loosely-coupled layers, so the error source and the alert destination can each be swapped without touching the core (commitf68318ee):inputs.py):ErrorInput(abc.ABC) withTmuxWindowErrorInputwrapping the tmux session-discovery / pane-capture work behind a source-agnosticErrorReading. A systemd/journald reader would be a drop-in sibling.routing.py):ErrorRouter— the main work — matching, number-insensitive per-source dedup, batching, and delivery-gated dedup bookkeeping; depends only on the two layer interfaces (not on tmux or mngr).outputs.py):ErrorOutput(abc.ABC) ->MngrAgentErrorOutput(mngr delivery with an abstractchoose_recipientspolicy hook) ->RandomMngrAgentErrorOutput(today's uniform-random behavior). A future error-fixing recipient policy is a one-method subclass.commands.pyholds the shared never-raisingCommandRunnersubprocess seam;watcher.pynow just wires the three layers and runs the poll loop;testing.pyholds shared test doubles; tests are split per module.run_one_pollis replaced byErrorRouter.run_once(), and the alert wording was made source-agnostic. The style guide's "interface classes use MutableModel + pydantic Field" convention was deliberately not followed (this lib has no pydantic dependency and uses NamedTuple/Callable throughout); the hard "no Protocol for interface classes" rule is followed.Three follow-up fixes from
/autofixare included: ruff formatting (92048676), hoisting a set out of the prune loop (09972d5b), and a comment-cleanup commit that was then reverted per author preference (de806aa8+ac10f4ae).Testing (supersedes the "71 passed" line above):
uv run pytest libs/error_watcher-> 77 passed (63 unit/integration + 14 ratchets);ruff format --check,ruff check, andpyrightonlibs/error_watcher/all clean.Branch hygiene
The unrelated
vendor/mngr/subtree sync that had been snapshotted in WIP commit39e2cb05has been reverted (commit9185978a), so this PR's net diff is now error_watcher-only (libs/error_watcher/,services.toml, and the feature's ownuv.lockentry) with novendor/mngr/changes. Per the repo's "never rebase" rule the WIP commit was not stripped from history; it and its revert remain as canceling entries (squash at merge if a fully clean history is wanted).