feat: controller-level stuck-agent sweep (non-LLM detection) (#571)#714
Open
rileywhite wants to merge 10 commits intogastownhall:mainfrom
Open
Conversation
…tion Introduces the pure-predicate stuckTracker alongside the existing idle/crash/wispGC trackers. The tracker receives a session name, recent pane output, the wisp UpdatedAt, last-activity time, and now, and returns (stuck, reason, matched_pattern). No time.Now, no store I/O, no Peek inside the tracker — all I/O stays in the forthcoming CityRuntime wrapper. Adds five opt-in [daemon] fields — stuck_sweep, stuck_wisp_threshold, stuck_error_patterns, stuck_peek_lines, stuck_warrant_label — plus helper methods and duration validation. The SDK ships with empty pattern defaults (no vendor strings) and a configurable warrant label (default "pool:dog") so downstream formulas can route warrants without any role-name control flow in Go. Detection composes convergent evidence: stale wisp AND (regex match OR progress-mismatch axis where activity predates the wisp). Matches are sorted and joined deterministically so repeated evaluations produce identical warrant metadata. Invalid patterns fail fast at construction with an indexed error. Table-driven tests cover the disabled-noop path, empty patterns, invalid regex, pattern-match detection, progress-mismatch detection, fresh-wisp suppression, boundary equality, and deterministic match joining. CityRuntime wiring, wispFreshness plumbing, and warrant filing land in a follow-up commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…#571) Adds the controller-level stuck-agent sweep glue on top of iteration 1's pure-predicate stuckTracker: - `cmd/gc/wisp_freshness.go`: opaque `wispFreshness` snapshot and `sweepWispFreshness()` — one `bd list --json` shellout per sweep, parses `updated_at`, groups by `session_name`, applies most-recent-updated_at tiebreaker. Raw bd JSON types do not leak. - `cmd/gc/stuck_sweep.go`: `CityRuntime.runStuckSweep()` iterates running sessions, collects pane output + last-activity, consults the tracker predicate, checks idempotence via `ListByLabel` + metadata.target match, re-checks `ProcessAlive` before Create, files a warrant bead (`type=warrant`, configurable label, metadata `{target, reason, requester, matched_pattern, wisp_id, wisp_age}`), and emits `events.AgentStuck` only on successful Create. Whole-sweep fail-open on shellout failure; per-session fail-open on Peek/activity errors. - `cmd/gc/city_runtime.go`: adds `stuck` and `runner` fields; builds the tracker in `newCityRuntime` with stderr fallback to noop on regex-compile error; rebuilds on config reload; invokes `runStuckSweep` inside the halt gate, after wispGC, before order dispatch. - `internal/events/events.go`: adds `AgentStuck = "agent.stuck"` event type constant. - `engdocs/architecture/controller.md`, `engdocs/architecture/health-patrol.md`: document the new tick step, tracker key type, `[daemon]` config block, code-map entries, the `bd list --json updated_at` known-limitation, the multi-wisp tiebreaker, and the event-on-successful-Create rule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…astownhall#571) Adds cmd/gc/stuck_sweep_test.go covering five acceptance criteria: no-agents self-sufficiency (AC16/E21), idempotence across ticks (E11), halt-gate suppression (E14), whole-sweep fail-open on runner error (E24/AC18), and agent.stuck event emission on Create success (AC17). Regenerates docs/reference/config.md, docs/reference/cli.md, and docs/schema/city-schema.json via `go run ./cmd/genschema` so the stuck_* daemon keys appear in the generated reference. Operator guide tracked separately as ga-60k. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…townhall#571) M3: StuckSweepEnabled() no longer requires patterns. Zero patterns now disables the regex axis only; the progress-mismatch axis stays live. The startup log positively describes which axes are active. M2: Document the single-goroutine ownership invariant for cr.stuck and cr.runner with field comments and a Concurrency Model subsection in engdocs/architecture/health-patrol.md. M1: Add cmd/gc/wisp_freshness_test.go to pin JSON tags id, updated_at, and metadata.session_name against future refactors. Covers tiebreaker, empty/malformed/error/empty-array paths. O4: Drop unused Type/Status fields from freshnessEntry; bd query already filters --type=wisp --status=in_progress. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…townhall#571) S1: New gc.agent.stuck_warrants.total Int64Counter incremented exactly once per successful warrant Create, with low-cardinality axis attribute ("regex" or "progress_mismatch"). Emitted from runStuckSweep after the existing AgentStuck event so metric and event always agree. S2: New daemon-stuck-sweep doctor check surfaces the effective sweep configuration: disabled / misconfigured / regex-axis state / progress-mismatch state, threshold, label. Calls StuckSweepEnabled() rather than re-deriving the predicate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nts, docs (gastownhall#571) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 tasks
…oml v1.6 (gastownhall#571) BurntSushi/toml v1.6.0's isEmpty() does not handle reflect.Int — it falls through to return false, so an int=0 field is never omitted even with the omitempty tag. This caused [daemon] to appear in marshaled TOML for configs with an empty daemon section (TestMarshalOmitsEmptyDaemonSection failure). Fix: change StuckPeekLines int → *int, matching the existing *int pattern used by MaxRestarts and ProbeConcurrency. Update StuckPeekLinesOrDefault() to treat nil the same as zero (return the default). Update test helpers to use pointer literals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 tasks
gastownhall#571) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Author
|
CI failure here is the known flake tracked in #716 — This PR does not touch Marking ready for review. |
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements a
stuckTrackerin the controller reconciliation tick that detects agents whose process is alive but conversation context is broken (e.g. Claude Code sessions wedged on a 400 tool_use concurrency error). Detection is non-LLM: pattern-match on pane output + wisp staleness cross-check, with a parallel progress-mismatch axis (activity predates the wisp). On detection, files apool:dogwarrant bead into the existingmol-shutdown-dancerecovery flow rather than killing the session directly.This is the alternative-design counterpart to #659. Where #659 hashes Peek output and kills the session in-controller, #714 separates observation from action: the controller files a warrant, and the existing dog/recovery flow handles termination. Detection composes two independent axes (regex + progress-mismatch) so the sweep stays useful even with zero error patterns configured.
cmd/gc/stuck_tracker.go— pure-predicatestuckTrackerinterface +noopStuckTracker+memoryStuckTracker. Immutable compiled-regex slice. Convergent evidence: stale wisp AND (error pattern OR progress mismatch).cmd/gc/wisp_freshness.go+wisp_freshness_test.go—wispFreshnessopaque struct +sweepWispFreshness()using existingCommandRunnerpattern (onebd list --jsonper sweep, not per session). Tests pin theid,updated_at,metadata.session_nameJSON tags against future bd refactors.cmd/gc/stuck_sweep.go—CityRuntime.runStuckSweep(): halts inside the halt gate, idempotence-checked (no duplicate warrants on store errors),ProcessAliverace-check before Create,agent.stuckevent +gc.agent.stuck_warrants.totalOTEL counter on successful warrant Create only.internal/config/— five new[daemon]fields:stuck_sweep,stuck_wisp_threshold,stuck_error_patterns,stuck_peek_lines,stuck_warrant_label. Pre-start pattern validation (ValidateStuckPatterns) with indexed errors.internal/events/events.go—AgentStuck = "agent.stuck"constant.internal/telemetry/recorder.go—gc.agent.stuck_warrants.totalInt64Counter, low-cardinalityaxisattribute (regexorprogress_mismatch). Incremented exactly once per successful Create, after theAgentStuckevent so metric and event always agree.internal/doctor/checks_semantic.go+ tests —daemon-stuck-sweepdoctor check surfaces the effective configuration: disabled / misconfigured / regex-axis state / progress-mismatch state, threshold, label. CallsStuckSweepEnabled()rather than re-deriving the predicate.cmd/gc/cmd_doctor.go,cmd_start.go,cmd_supervisor.go— wire the new doctor check and ensure stuck-sweep startup logging describes which axes are active.engdocs/architecture/controller.md,engdocs/architecture/health-patrol.md(incl. new Concurrency Model subsection documenting single-goroutine ownership ofcr.stuckandcr.runner),docs/reference/config.md,docs/reference/cli.md,docs/schema/city-schema.jsonregenerated viago run ./cmd/genschema.Design decisions
stuck_sweepdefaults tofalse(opt-in) — false-positive risk is high-consequencestuck_error_patternsdefaults to empty in Go (provider-neutral); example patterns live in user TOMLstuck_warrant_labelconfigurable (default"pool:dog") for ZFC role-neutralityhasOpenStuckWarrantstore error (assume warrant exists → skip)cr.stuckandcr.runner— built innewCityRuntime, rebuilt only on config reload, accessed only from the reconciler tickWhat was tested
make fmt-check— PASSmake lint— PASSmake vet— PASScmd/gc/stuck_tracker_test.go(11 tests: detection logic, config accessors, edge cases)cmd/gc/wisp_freshness_test.go(tiebreaker, empty/malformed/error/empty-array paths, JSON tag pinning)cmd/gc/stuck_sweep_test.go(5 tests: no-agents, idempotence, halt-gate, fail-open, event-on-create)internal/doctor/checks_semantic_test.go(daemon-stuck-sweep check)Review outcome
Two-round review council (Code, UAT, Docs, Alignment, Blast Radius, Invariants, Concurrency reviewers). Round 1 found 9 blocking issues (all fixed): idempotence direction, vacuous halt test, illusory interface, operator visibility gaps, doc dissonances. Round 2 added observability and policy refinements: OTEL counter (S1), doctor check (S2), relaxed
StuckSweepEnabled()so zero patterns no longer disables the sweep entirely (M3), Concurrency Model docs (M2), JSON-tag pinning tests (M1), and dropped unusedfreshnessEntryfields (O4). Final pass: clean.Docs updated
engdocs/architecture/controller.md— tracker list, main loop, code map, config blockengdocs/architecture/health-patrol.md— ASCII diagram, data flow prose, key types, events inventory, known limitations, Concurrency Model subsectiondocs/reference/config.md,docs/reference/cli.md,docs/schema/city-schema.json— regenerated viago run ./cmd/genschemadocs/guides/stuck-agent-detection.md)Breaking changes
None. All new config fields default to disabled/empty; existing behavior unchanged.
Closes #571
🤖 Generated with Claude Code