Skip to content

error_watcher: scan tmux windows for errors and alert a random agent#171

Draft
pseay-imbue wants to merge 22 commits into
mainfrom
preston/error-checker
Draft

error_watcher: scan tmux windows for errors and alert a random agent#171
pseay-imbue wants to merge 22 commits into
mainfrom
preston/error-checker

Conversation

@pseay-imbue

@pseay-imbue pseay-imbue commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Adds a new error-watcher background service. It scans every tmux window in the agent's session every 5 seconds for output matching /error|exception/i and, when a new match appears, sends one batched message to a randomly selected messageable mngr agent so a service that errored gets noticed instead of scrolling past.

Key behaviors:

  • Discovers its own session via tmux display-message -p '#S'; enumerates and captures every window's visible pane.
  • Skips its own svc-error-watcher window so its alert text (which contains "error") cannot re-trigger a match.
  • Alerts only on newly-appeared output (per-window in-memory dedup), so a static error on screen is reported once.
  • Enumerates agents via mngr list --format json and sends via mngr message; if no agent is messageable it logs and skips without erroring.
  • Multiple windows erroring in one poll produce a single batched message.
  • Match pattern is overridable via the ERROR_WATCHER_PATTERN env var.

Registered as [services.error-watcher] in services.toml (restart = "on-failure"), so the bootstrap manager runs it in its own svc-error-watcher window.

Review resolution (post-review fixes)

After the initial implementation a full review filed 12 findings. All are now resolved or consciously deferred, one fix per commit:

Fix Commit What it does
#1 lost-alert ordering 6735584e Record an error as "alerted" only after the message is delivered (split read-only unseen_matches from mark_alerted), so a failed/skipped send no longer drops the error permanently — it's retried on a later poll.
#2 send-failure fallback dd2b89d3 Try messageable agents in uniformly-random order, falling back to the next on a failed send instead of giving up after one bad pick.
#3 messageable filter 0a67eb66 Restrict recipients to type: claude agents (mirroring list_claude_agent_names) in addition to excluding STOPPED, so the non-interactive system-services agent is never messaged.
#4 volatile-line dedup 507e214a Dedup on a number-insensitive key (digit runs → #), so a re-timestamped error line alerts once instead of every 5s.
#5 bounded state 36740cda Prune dedup state for closed windows + cap keys per window, so the permanent process doesn't leak memory.
#6 runner/list hardening f129139b Parse mngr list even on a non-zero exit; distinct RUNNER_FAILURE_RETURNCODE so a runner failure isn't confused with a real exit-1; preserve a timed-out command's partial output.
#6 follow-up 444ea1c9 subprocess.TimeoutExpired.stdout is bytes even under text=True, so the #6 partial-output preservation needed an explicit decode to actually work (found by autofix; the original test never exercised the path).
#8/#10/#11 840b5b9b Handle SIGHUP (bootstrap's real stop signal) alongside SIGTERM/SIGINT; add the if __name__ == "__main__" guard; lift + test the signal handler.

Deliberately not fixed (rationale in specs/window-error-watcher/review.md): #7 (duplicate window names can't arise — bootstrap names windows uniquely), #9 (centralizing the mngr argv builders would couple this simple FCT-side script to the imbue-mngr runtime), #12 (re-fetching the session each poll is negligible and the fix would churn every test).

Design notes

All tmux/mngr I/O goes through an injected command runner that never raises, so a vanished window, a hung command, or a missing binary is logged and skipped rather than crashing the poll loop. The pure core (matching, dedup, alert formatting, mngr argv builders, agent parsing, recipient selection) is fully unit-tested; run_one_poll is integration-tested end-to-end with an injected fake runner. The mngr argv builders are validated against the live mngr CLI surface via assert_mngr_argv_valid.

Testing

  • uv run pytest libs/error_watcher -> 71 passed (57 unit/integration + 14 ratchets), coverage enabled.
  • Real-tmux end-to-end verification is intentionally manual per the repo CLAUDE.md; manually verified working in a minds Docker agent on this branch.

The spec, plan, and review for this feature live in the mngr monorepo under specs/window-error-watcher/ (mngr PR #2179).


Update — refactor into input / routing / output layers

Since the review-resolution commits above, the service was refactored from the single watcher.py module into three loosely-coupled layers, so the error source and the alert destination can each be swapped without touching the core (commit f68318ee):

  • Input layer (inputs.py): ErrorInput (abc.ABC) with TmuxWindowErrorInput wrapping the tmux session-discovery / pane-capture work behind a source-agnostic ErrorReading. A systemd/journald reader would be a drop-in sibling.
  • Routing layer (routing.py): ErrorRouter — the main work — matching, number-insensitive per-source dedup, batching, and delivery-gated dedup bookkeeping; depends only on the two layer interfaces (not on tmux or mngr).
  • Output layer (outputs.py): ErrorOutput (abc.ABC) -> MngrAgentErrorOutput (mngr delivery with an abstract choose_recipients policy hook) -> RandomMngrAgentErrorOutput (today's uniform-random behavior). A future error-fixing recipient policy is a one-method subclass.
  • commands.py holds the shared never-raising CommandRunner subprocess seam; watcher.py now just wires the three layers and runs the poll loop; testing.py holds shared test doubles; tests are split per module.

run_one_poll is replaced by ErrorRouter.run_once(), and the alert wording was made source-agnostic. The style guide's "interface classes use MutableModel + pydantic Field" convention was deliberately not followed (this lib has no pydantic dependency and uses NamedTuple/Callable throughout); the hard "no Protocol for interface classes" rule is followed.

Three follow-up fixes from /autofix are included: ruff formatting (92048676), hoisting a set out of the prune loop (09972d5b), and a comment-cleanup commit that was then reverted per author preference (de806aa8 + ac10f4ae).

Testing (supersedes the "71 passed" line above): uv run pytest libs/error_watcher -> 77 passed (63 unit/integration + 14 ratchets); ruff format --check, ruff check, and pyright on libs/error_watcher/ all clean.

Branch hygiene

The unrelated vendor/mngr/ subtree sync that had been snapshotted in WIP commit 39e2cb05 has been reverted (commit 9185978a), so this PR's net diff is now error_watcher-only (libs/error_watcher/, services.toml, and the feature's own uv.lock entry) with no vendor/mngr/ changes. Per the repo's "never rebase" rule the WIP commit was not stripped from history; it and its revert remain as canceling entries (squash at merge if a fully clean history is wanted).

Preston Seay and others added 22 commits June 16, 2026 12:15
Add the new error-watcher library (mirroring app_watcher's layout) and
implement its pure, side-effect-free core: case-insensitive error/exception
line matching, per-window dedup of already-alerted lines, batched alert
formatting, the mngr list/message argv builders, agent-summary parsing, and
uniform-random recipient selection. Includes unit tests for every pure
function (the argv builders are validated against the live mngr CLI) and the
package's ratchet file. The polling loop and tmux/mngr I/O are wired up next.

Registers error-watcher as a workspace member so its tests resolve.

Co-authored-by: Sculptor <sculptor@imbue.com>
Implement main() as a long-lived 5s poll loop with SIGTERM/SIGINT handlers
that exit cleanly. Each poll discovers the tmux session, enumerates and
captures every window (skipping the watcher's own svc-error-watcher window),
runs the dedup core, and on newly-appeared matches enumerates messageable
mngr agents (STOPPED excluded) and sends one batched alert to a random one.

All tmux/mngr I/O goes through an injected command runner that never raises,
so a vanished window, a hung command, or a missing binary is logged and
skipped rather than crashing the loop. Adds run_one_poll integration tests
driven by a fake runner (no real tmux), plus unit tests for the messageable
filter and the pattern-override compiler.

Co-authored-by: Sculptor <sculptor@imbue.com>
Register [services.error-watcher] in services.toml so the bootstrap manager
runs the watcher in its own svc-error-watcher tmux window with an on-failure
restart policy. Expand the lib README to describe the scan/match/alert
behavior, the own-window exclusion, the random-agent notification, and the v1
non-goals (naive matching, in-memory dedup, visible-pane-only). Add the FCT
changelog entry for the branch.

Co-authored-by: Sculptor <sculptor@imbue.com>
Architecture review flagged that the hard-coded OWN_WINDOW constant
("svc-error-watcher") silently couples to the [services.error-watcher] key in
services.toml: renaming the service without updating the constant would
re-enable the self-alert feedback loop. Cross-reference the coupling in both
places so the dependency is explicit. No behavior change.

Co-authored-by: Sculptor <sculptor@imbue.com>
Problem: run_one_poll recorded matching lines into the dedup state (via
new_matches) before sending the alert. If the alert could not be delivered --
no messageable agent (all STOPPED), an mngr list enumeration failure, or a
non-zero mngr message send -- those lines were already marked seen and were
never re-alerted, even though the error stayed on screen and a recipient could
become available later. This silently dropped real errors, contradicting the
documented 'alerts exactly once' / 'logs the match and skips sending' behavior
(libs/error_watcher/src/error_watcher/watcher.py).

Fix: split the dedup into a read-only unseen_matches() (computes new lines
without mutating seen) and mark_alerted() (records a delivered batch).
run_one_poll now records lines only after _alert_random_agent returns a
recipient, and _alert_random_agent returns None when the send itself fails so a
failed send is retried too. Added regression tests covering re-alert after a
no-messageable-agent poll and after a failed send.

Co-authored-by: Sculptor <sculptor@imbue.com>
…iew #1)

Problem: run_one_poll recorded matching lines into the dedup state (via
new_matches) before sending the alert. If the alert could not be delivered --
no messageable agent (all STOPPED), an mngr list enumeration failure, or a
non-zero mngr message send -- those lines were already marked seen and were
never re-alerted, even though the error stayed on screen and a recipient could
become available later. This silently dropped real errors, contradicting the
intent of REQ-MATCH-3 ("output it has already reported").

Fix: split the dedup into a read-only unseen_matches() (computes new lines
without mutating seen) and mark_alerted() (records a delivered batch).
run_one_poll now records lines only after _alert_random_agent returns a
recipient, and _alert_random_agent returns None when the send itself fails so a
failed send is retried too. Added regression tests covering re-alert after a
no-messageable-agent poll and after a failed send.

Co-authored-by: Sculptor <sculptor@imbue.com>
Previously _alert_random_agent picked exactly one recipient and gave up if the
send failed, so one bad pick (e.g. an agent that stopped between `mngr list` and
`mngr message`) dropped the alert even when other agents were reachable.

Now recipients are tried in uniformly random order: the first pick stays uniform
across the pool (REQ-NOTIFY-5), and on a failed send the watcher falls back to
the next agent within the same poll. Only when every messageable agent's send
fails does it return None (leaving the error unrecorded so a later poll retries
it). Added tests for the fallback-to-another-agent and all-agents-fail paths;
refocused the failed-send retry test on a single-agent pool.

Co-authored-by: Sculptor <sculptor@imbue.com>
select_messageable_names excluded only STOPPED agents, so it could pick the
`main`-type system-services agent (no interactive claude inbox, nobody watching)
and message it pointlessly, diverging from the cited reference
list_claude_agent_names which filters to `type == "claude"`.

AgentSummary now carries the agent `type` (parsed from `mngr list --format
json`, defaulting to "" when absent), and select_messageable_names keeps only
non-STOPPED claude agents. STOPPED remains the sole excluded lifecycle state --
that matches mngr's actual send-path rule (api/message.py rejects only STOPPED),
and transient failures for other states are absorbed by the in-poll recipient
fallback rather than by pre-filtering states the spec does not call out. Added
unit and integration tests covering the type filter.

Co-authored-by: Sculptor <sculptor@imbue.com>
Dedup keyed on the exact line string, so an error line carrying a timestamp,
counter, or numeric request id ('[12:00:05] ERROR x' then '[12:00:10] ERROR x')
counted as a brand-new match every 5s poll and fired a fresh alert to a random
agent -- an alert storm the spec's "benign false positive" non-goals do not
excuse.

Introduced dedup_key(), which collapses runs of digits to '#', and keyed the
per-window `seen` set on it (unseen_matches compares keys, mark_alerted records
keys). A re-stamped copy of one error now shares a key and alerts once. The
tradeoff -- two errors differing only in numbers are treated as one -- is
documented and suits a "something errored" nudge. Added unit and integration
tests for the volatile-line case and a guard that genuinely different text still
alerts.

Co-authored-by: Sculptor <sculptor@imbue.com>
The per-window dedup sets only ever grew and window keys were never evicted, so
in this permanent process `seen` would accumulate forever -- one entry per
window that ever existed, each set growing with every distinct error line.

Added prune_seen_windows(), called each poll with the live window list, to drop
dedup state for windows that have closed (skipped when enumeration returns empty,
which signals a failed list rather than a window-less session). Also capped the
keys retained per window at MAX_SEEN_KEYS_PER_WINDOW as a hard memory ceiling --
number-insensitive keys (finding #4) keep this far from being hit in practice.
Added unit tests for both bounds and an integration test that a closed window's
state is forgotten.

Co-authored-by: Sculptor <sculptor@imbue.com>
Two robustness gaps:

- _alert_random_agent returned before parsing whenever `mngr list` exited
  non-zero, so a non-zero exit that still emitted a valid {"agents": [...]} body
  (e.g. one provider failed) needlessly skipped the alert. It now parses the
  payload regardless of exit status and only treats a non-zero exit as fatal
  when no usable agents were parsed.

- _default_command_runner mapped every spawn failure and timeout to
  returncode=1, colliding with a real exit-1, and discarded a timed-out
  command's partial stdout. It now returns a distinct RUNNER_FAILURE_RETURNCODE
  (-1, never a real exit status) and preserves whatever a timed-out command
  emitted. The timeout is now an injectable parameter so the timeout branch is
  testable without a 30s wait.

Added runner tests (success, missing binary, timeout, sentinel contract) and
integration tests for the non-zero-list-with-payload and non-zero-list-without-
payload cases.

Co-authored-by: Sculptor <sculptor@imbue.com>
- Install a SIGHUP handler alongside SIGTERM/SIGINT. The bootstrap manager stops
  a service with `tmux kill-window`, which delivers SIGHUP, so this is the signal
  on the real stop path; the process now exits cleanly via the installed handler
  rather than the default action (review #10).

- Add the conventional `if __name__ == "__main__": main()` guard so the module
  runs via `python -m error_watcher.watcher`, matching the sibling services
  (review #11).

- Lift the signal handler to a module-level `_handle_signal` and add a test that
  it exits 0, closing part of the I/O-shell test gap (review #8). The runner and
  alert-failure branches that finding also cited are already covered by the tests
  added for review #1, #2, and #6.

Co-authored-by: Sculptor <sculptor@imbue.com>
Problem: _default_command_runner's TimeoutExpired branch guarded the
partial payload with isinstance(e.stdout, str), but on the timeout path
subprocess hands the buffered output back as bytes even under text=True
(it skips the decode it does on a normal return). The guard was therefore
always False on a real timeout, so the partial stdout/stderr was always
discarded -- the exact loss finding #6's code claimed to prevent (e.g. a
complete agent list from a hung mngr list was thrown away). The only
timeout test emitted no output, so it never caught this.

Fix: decode the bytes payload via a small _decode_timeout_output helper
(None -> "", bytes -> decode(errors="replace") so a truncated multibyte
char can't raise into the never-raise runner, str -> as-is) and use it
for both stdout and stderr. Added a regression test that prints before
hanging and asserts the partial stdout survives the timeout.

Co-authored-by: Sculptor <sculptor@imbue.com>
Refactor the error-watcher service into three loosely-coupled layers so the
source of errors and the destination of alerts can each be swapped without
touching the core logic:

- inputs.py: ErrorInput (ABC) + TmuxWindowErrorInput, wrapping the tmux
  session-discovery and pane-capture work behind a source-agnostic
  ErrorReading. A systemd/journald input is a drop-in sibling.
- routing.py: ErrorRouter, the main work between the two layers -- matching,
  per-source number-insensitive dedup, batching, and delivery-gated dedup
  bookkeeping. Depends only on the layer interfaces.
- outputs.py: ErrorOutput (ABC) -> MngrAgentErrorOutput (mngr delivery with an
  abstract choose_recipients policy) -> RandomMngrAgentErrorOutput. A future
  error-fixing recipient policy is a one-method subclass.
- commands.py: the shared CommandRunner subprocess seam.
- watcher.py: wires the three layers and runs the poll loop.

Behavior is unchanged; the alert wording is now source-agnostic. Tests
reorganized per module (77 passing, including 14 ratchets).

Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit ONLY exists to satisfy the FCT stop hook, which requires a clean
working tree. It snapshots a large, pre-existing, in-progress vendor/mngr
subtree sync (new mngr_azure lib, aws-stop/separate-snapshots/test-hang
changelog reshuffles, provider edits, plus the regenerated uv.lock) that was
already present in this worktree and was NOT authored or reviewed by me. My
own task -- the error_watcher input/routing/output refactor -- is the separate
commit immediately before this one (f68318e).

Whoever is driving the vendor sync can recover their uncommitted state with
'git reset --soft HEAD^' (or 'git reset HEAD^' to also unstage); nothing is
lost.

Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem: outputs.py, outputs_test.py, and routing_test.py were not formatted to
the project's ruff standard, so `ruff format --check libs/error_watcher/` failed
(an over-parenthesized single-line f-string header plus several over-88-char
lines the formatter would wrap). The repo enforces ruff and CI runs the format
check, so the branch failed formatting CI.
Fix: ran `ruff format libs/error_watcher/` to apply canonical formatting to the
three files (whitespace/line-wrapping only, no semantic change).

Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem: prune_seen_sources rebuilt set(live_sources) on every element of the
seen dict (it lived inside the comprehension's membership test), making each
prune O(len(seen) * len(live_sources)) per poll in the permanent watcher process.
Fix: compute the live-source set once before the comprehension and test against
it, restoring O(len(seen) + len(live_sources)). The deletion list is still
materialized before any del, so the snapshot-then-mutate behavior is unchanged.

Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ments

Problem: Eight comments/docstrings across commands.py, routing.py, outputs.py,
commands_test.py, and outputs_test.py back-referenced numbered code-review
findings ("(finding #5)", "(review finding #4)", "(finding #6)"). Those numbers
are defined only in vendor/mngr/specs/window-error-watcher/review.md, an
ephemeral per-iteration review document that does not ship with the wheel, so
the parentheticals are dangling pointers to a past review process rather than
descriptions of current behavior (user_request_artifacts_left_in_code).
Fix: Stripped only the "(finding #N)" / "(review finding #N)" fragments,
preserving the surrounding explanatory prose. The legitimate REQ-* spec
requirement references are intentionally left untouched. No code logic changed.

Co-authored-by: Sculptor <sculptor@imbue.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the concrete tmux input and mngr output out of inputs.py/outputs.py
so each interface file holds only its contract:

- inputs.py keeps the ErrorInput ABC and the ErrorReading/ErrorSource
  value types; TmuxWindowErrorInput and its tmux helpers move to a new
  tmux_window_error_input.py.
- outputs.py keeps the ErrorOutput ABC and the ErrorAlert value type;
  the mngr delivery classes and helpers move to a new
  mngr_agent_error_output.py.

Repoint imports in watcher.py and the tests, rename the unit tests to
match their new source files, and update the README architecture section.
No behavior changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Sculptor <sculptor@imbue.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant