Skip to content

feat: controller-level stuck-agent sweep (non-LLM detection) (#571)#714

Open
rileywhite wants to merge 10 commits intogastownhall:mainfrom
rileywhite:feat/571-feat-controller-level-stuck-agent-sweep-non-llm-de
Open

feat: controller-level stuck-agent sweep (non-LLM detection) (#571)#714
rileywhite wants to merge 10 commits intogastownhall:mainfrom
rileywhite:feat/571-feat-controller-level-stuck-agent-sweep-non-llm-de

Conversation

@rileywhite
Copy link
Copy Markdown
Contributor

@rileywhite rileywhite commented Apr 14, 2026

Summary

Implements a stuckTracker in the controller reconciliation tick that detects agents whose process is alive but conversation context is broken (e.g. Claude Code sessions wedged on a 400 tool_use concurrency error). Detection is non-LLM: pattern-match on pane output + wisp staleness cross-check, with a parallel progress-mismatch axis (activity predates the wisp). On detection, files a pool:dog warrant bead into the existing mol-shutdown-dance recovery flow rather than killing the session directly.

This is the alternative-design counterpart to #659. Where #659 hashes Peek output and kills the session in-controller, #714 separates observation from action: the controller files a warrant, and the existing dog/recovery flow handles termination. Detection composes two independent axes (regex + progress-mismatch) so the sweep stays useful even with zero error patterns configured.

  • cmd/gc/stuck_tracker.go — pure-predicate stuckTracker interface + noopStuckTracker + memoryStuckTracker. Immutable compiled-regex slice. Convergent evidence: stale wisp AND (error pattern OR progress mismatch).
  • cmd/gc/wisp_freshness.go + wisp_freshness_test.gowispFreshness opaque struct + sweepWispFreshness() using existing CommandRunner pattern (one bd list --json per sweep, not per session). Tests pin the id, updated_at, metadata.session_name JSON tags against future bd refactors.
  • cmd/gc/stuck_sweep.goCityRuntime.runStuckSweep(): halts inside the halt gate, idempotence-checked (no duplicate warrants on store errors), ProcessAlive race-check before Create, agent.stuck event + gc.agent.stuck_warrants.total OTEL counter on successful warrant Create only.
  • internal/config/ — five new [daemon] fields: stuck_sweep, stuck_wisp_threshold, stuck_error_patterns, stuck_peek_lines, stuck_warrant_label. Pre-start pattern validation (ValidateStuckPatterns) with indexed errors.
  • internal/events/events.goAgentStuck = "agent.stuck" constant.
  • internal/telemetry/recorder.gogc.agent.stuck_warrants.total Int64Counter, low-cardinality axis attribute (regex or progress_mismatch). Incremented exactly once per successful Create, after the AgentStuck event so metric and event always agree.
  • internal/doctor/checks_semantic.go + testsdaemon-stuck-sweep doctor check surfaces the effective configuration: disabled / misconfigured / regex-axis state / progress-mismatch state, threshold, label. Calls StuckSweepEnabled() rather than re-deriving the predicate.
  • cmd/gc/cmd_doctor.go, cmd_start.go, cmd_supervisor.go — wire the new doctor check and ensure stuck-sweep startup logging describes which axes are active.
  • Docsengdocs/architecture/controller.md, engdocs/architecture/health-patrol.md (incl. new Concurrency Model subsection documenting single-goroutine ownership of cr.stuck and cr.runner), docs/reference/config.md, docs/reference/cli.md, docs/schema/city-schema.json regenerated via go run ./cmd/genschema.

Design decisions

  • stuck_sweep defaults to false (opt-in) — false-positive risk is high-consequence
  • stuck_error_patterns defaults to empty in Go (provider-neutral); example patterns live in user TOML
  • stuck_warrant_label configurable (default "pool:dog") for ZFC role-neutrality
  • Zero patterns disables the regex axis only (not the whole sweep) — progress-mismatch axis stays live so wisp/activity divergence still triggers warrants
  • Fail-open on wisp query error (sweep skips, no warrant flood)
  • Fail-open on hasOpenStuckWarrant store error (assume warrant exists → skip)
  • Single-goroutine ownership of cr.stuck and cr.runner — built in newCityRuntime, rebuilt only on config reload, accessed only from the reconciler tick

What was tested

  • make fmt-check — PASS
  • make lint — PASS
  • make vet — PASS
  • Unit tests: cmd/gc/stuck_tracker_test.go (11 tests: detection logic, config accessors, edge cases)
  • Unit tests: cmd/gc/wisp_freshness_test.go (tiebreaker, empty/malformed/error/empty-array paths, JSON tag pinning)
  • Integration tests: cmd/gc/stuck_sweep_test.go (5 tests: no-agents, idempotence, halt-gate, fail-open, event-on-create)
  • Doctor tests: internal/doctor/checks_semantic_test.go (daemon-stuck-sweep check)

Review outcome

Two-round review council (Code, UAT, Docs, Alignment, Blast Radius, Invariants, Concurrency reviewers). Round 1 found 9 blocking issues (all fixed): idempotence direction, vacuous halt test, illusory interface, operator visibility gaps, doc dissonances. Round 2 added observability and policy refinements: OTEL counter (S1), doctor check (S2), relaxed StuckSweepEnabled() so zero patterns no longer disables the sweep entirely (M3), Concurrency Model docs (M2), JSON-tag pinning tests (M1), and dropped unused freshnessEntry fields (O4). Final pass: clean.

Docs updated

  • engdocs/architecture/controller.md — tracker list, main loop, code map, config block
  • engdocs/architecture/health-patrol.md — ASCII diagram, data flow prose, key types, events inventory, known limitations, Concurrency Model subsection
  • docs/reference/config.md, docs/reference/cli.md, docs/schema/city-schema.json — regenerated via go run ./cmd/genschema
  • Follow-up issue filed: ga-60k — operator guide (docs/guides/stuck-agent-detection.md)

Breaking changes

None. All new config fields default to disabled/empty; existing behavior unchanged.

Closes #571

🤖 Generated with Claude Code

rileywhite and others added 5 commits April 14, 2026 15:47
…tion

Introduces the pure-predicate stuckTracker alongside the existing
idle/crash/wispGC trackers. The tracker receives a session name, recent
pane output, the wisp UpdatedAt, last-activity time, and now, and
returns (stuck, reason, matched_pattern). No time.Now, no store I/O,
no Peek inside the tracker — all I/O stays in the forthcoming
CityRuntime wrapper.

Adds five opt-in [daemon] fields — stuck_sweep, stuck_wisp_threshold,
stuck_error_patterns, stuck_peek_lines, stuck_warrant_label — plus
helper methods and duration validation. The SDK ships with empty
pattern defaults (no vendor strings) and a configurable warrant label
(default "pool:dog") so downstream formulas can route warrants without
any role-name control flow in Go.

Detection composes convergent evidence: stale wisp AND (regex match OR
progress-mismatch axis where activity predates the wisp). Matches are
sorted and joined deterministically so repeated evaluations produce
identical warrant metadata. Invalid patterns fail fast at construction
with an indexed error.

Table-driven tests cover the disabled-noop path, empty patterns,
invalid regex, pattern-match detection, progress-mismatch detection,
fresh-wisp suppression, boundary equality, and deterministic match
joining.

CityRuntime wiring, wispFreshness plumbing, and warrant filing land in
a follow-up commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…#571)

Adds the controller-level stuck-agent sweep glue on top of iteration 1's
pure-predicate stuckTracker:

- `cmd/gc/wisp_freshness.go`: opaque `wispFreshness` snapshot and
  `sweepWispFreshness()` — one `bd list --json` shellout per sweep,
  parses `updated_at`, groups by `session_name`, applies
  most-recent-updated_at tiebreaker. Raw bd JSON types do not leak.
- `cmd/gc/stuck_sweep.go`: `CityRuntime.runStuckSweep()` iterates
  running sessions, collects pane output + last-activity, consults the
  tracker predicate, checks idempotence via `ListByLabel` +
  metadata.target match, re-checks `ProcessAlive` before Create, files
  a warrant bead (`type=warrant`, configurable label, metadata
  `{target, reason, requester, matched_pattern, wisp_id, wisp_age}`),
  and emits `events.AgentStuck` only on successful Create. Whole-sweep
  fail-open on shellout failure; per-session fail-open on Peek/activity
  errors.
- `cmd/gc/city_runtime.go`: adds `stuck` and `runner` fields; builds
  the tracker in `newCityRuntime` with stderr fallback to noop on
  regex-compile error; rebuilds on config reload; invokes
  `runStuckSweep` inside the halt gate, after wispGC, before order
  dispatch.
- `internal/events/events.go`: adds `AgentStuck = "agent.stuck"` event
  type constant.
- `engdocs/architecture/controller.md`,
  `engdocs/architecture/health-patrol.md`: document the new tick step,
  tracker key type, `[daemon]` config block, code-map entries, the
  `bd list --json updated_at` known-limitation, the multi-wisp
  tiebreaker, and the event-on-successful-Create rule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…astownhall#571)

Adds cmd/gc/stuck_sweep_test.go covering five acceptance criteria:
no-agents self-sufficiency (AC16/E21), idempotence across ticks (E11),
halt-gate suppression (E14), whole-sweep fail-open on runner error
(E24/AC18), and agent.stuck event emission on Create success (AC17).

Regenerates docs/reference/config.md, docs/reference/cli.md, and
docs/schema/city-schema.json via `go run ./cmd/genschema` so the
stuck_* daemon keys appear in the generated reference.

Operator guide tracked separately as ga-60k.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the status/needs-triage Inbox — we haven't looked at it yet label Apr 14, 2026
rileywhite and others added 3 commits April 14, 2026 17:23
…townhall#571)

M3: StuckSweepEnabled() no longer requires patterns. Zero patterns now
disables the regex axis only; the progress-mismatch axis stays live.
The startup log positively describes which axes are active.

M2: Document the single-goroutine ownership invariant for cr.stuck and
cr.runner with field comments and a Concurrency Model subsection in
engdocs/architecture/health-patrol.md.

M1: Add cmd/gc/wisp_freshness_test.go to pin JSON tags id, updated_at,
and metadata.session_name against future refactors. Covers tiebreaker,
empty/malformed/error/empty-array paths.

O4: Drop unused Type/Status fields from freshnessEntry; bd query already
filters --type=wisp --status=in_progress.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…townhall#571)

S1: New gc.agent.stuck_warrants.total Int64Counter incremented exactly
once per successful warrant Create, with low-cardinality axis attribute
("regex" or "progress_mismatch"). Emitted from runStuckSweep after the
existing AgentStuck event so metric and event always agree.

S2: New daemon-stuck-sweep doctor check surfaces the effective sweep
configuration: disabled / misconfigured / regex-axis state /
progress-mismatch state, threshold, label. Calls StuckSweepEnabled()
rather than re-deriving the predicate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nts, docs (gastownhall#571)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oml v1.6 (gastownhall#571)

BurntSushi/toml v1.6.0's isEmpty() does not handle reflect.Int — it falls
through to return false, so an int=0 field is never omitted even with the
omitempty tag. This caused [daemon] to appear in marshaled TOML for configs
with an empty daemon section (TestMarshalOmitsEmptyDaemonSection failure).

Fix: change StuckPeekLines int → *int, matching the existing *int pattern
used by MaxRestarts and ProbeConcurrency. Update StuckPeekLinesOrDefault()
to treat nil the same as zero (return the default). Update test helpers to
use pointer literals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rileywhite
Copy link
Copy Markdown
Contributor Author

CI failure here is the known flake tracked in #716TestACPConformance/SharedSession/Nudge_RunningSession failing with write |1: broken pipe, preceded by the same acp: readLoop exit: read |0: file already closed flood.

This PR does not touch internal/runtime/acp (changes are confined to cmd/gc/, internal/config, internal/events, internal/telemetry, internal/doctor, and docs/), so the failure is unrelated to the work in this branch.

Marking ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status/needs-triage Inbox — we haven't looked at it yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Controller-level stuck-agent sweep (non-LLM detection half of soft-recovery)

2 participants