feat: controller-level stuck-agent sweep (#571) #659

Open
rileywhite wants to merge 6 commits into gastownhall:main from rileywhite:feat/571-controller-level-stuck-agent-sweep

@rileywhite
Contributor

Summary

Adds a fifth health-patrol tracker — stuckTracker — to the controller reconciler. It detects sessions whose terminal output (SHA-256 hash of Peek output) has not changed for longer than a configurable stuck_timeout, kills them, and marks them for immediate re-wake. A built-in circuit breaker quarantines sessions that repeatedly get stuck.

Key changes

  • Core tracker (cmd/gc/stuck_tracker.go): stuckTracker interface + memoryStuckTracker with a pure doCheckStuck function, a grace period, and its own circuit breaker (quarantine after 3 kills within a 2 * timeout window)
  • Reconciler integration (cmd/gc/session_reconciler.go): Stuck check after idle check with drain-in-progress guard and CanPeek gating
  • Config (internal/config/config.go): stuck_timeout field on DaemonConfig with duration accessor and validation
  • Runtime (internal/runtime/probe.go): CanPeek capability flag (true for tmux/k8s/fake, false for subprocess/exec/acp)
  • Wake bypass (cmd/gc/session_sleep.go): configWakeSuppressed exception for stuck-timeout (matches idle-timeout behavior)
  • Telemetry (internal/telemetry/recorder.go): RecordAgentStuckKill counter + log event
  • Dashboard (cmd/gc/dashboard/helpers.go): 🧊 ice icon, category, summary for session.stuck_killed
  • Doctor (internal/doctor/checks_semantic.go): Informational daemon-stuck-timeout check
  • Docs: Updated event-bus.md (full table replacement with correct session.* names), health-patrol.md (fifth tracker), controller.md, primitive-test.md (ZFC clarification)
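
For readers who want the detection rule in concrete form, here is a minimal sketch of a hash-based check in the spirit of doCheckStuck. It is not the code from cmd/gc/stuck_tracker.go: the stuckState layout, the checkStuck signature, and the placement of the empty-output guard are assumptions made for illustration. Only the field names (hash, changedAt, firstSeen) and the SHA-256 comparison come from this PR's text.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// stuckState is a hypothetical per-session record. The field names echo the
// PR text (hash, changedAt, firstSeen); the layout itself is assumed.
type stuckState struct {
	hash      [sha256.Size]byte
	changedAt time.Time // last time the output hash changed
	firstSeen time.Time // when the tracker first observed the session
}

// checkStuck is a pure function: given the previous state and one Peek
// result, it returns the next state and whether the session is stuck.
func checkStuck(s stuckState, output string, now time.Time, grace, timeout time.Duration) (stuckState, bool) {
	if s.firstSeen.IsZero() {
		s.firstSeen = now // start the grace period at the first observation
	}
	if output == "" {
		return s, false // assumed empty-output guard: nothing to compare
	}
	h := sha256.Sum256([]byte(output))
	if s.changedAt.IsZero() || h != s.hash {
		s.hash, s.changedAt = h, now // output moved, restart the clock
		return s, false
	}
	if now.Sub(s.firstSeen) < grace {
		return s, false // still inside the grace period
	}
	return s, now.Sub(s.changedAt) > timeout
}

// demo replays four observations of one session and reports each verdict.
func demo() string {
	t0 := time.Date(2026, 4, 13, 0, 0, 0, 0, time.UTC)
	steps := []struct {
		after  time.Duration
		output string
	}{
		{0, "building..."},
		{3 * time.Minute, "building..."},
		{6 * time.Minute, "building..."},
		{7 * time.Minute, "done"},
	}
	var s stuckState
	verdicts := make([]bool, 0, len(steps))
	for _, st := range steps {
		var stuck bool
		s, stuck = checkStuck(s, st.output, t0.Add(st.after), time.Minute, 5*time.Minute)
		verdicts = append(verdicts, stuck)
	}
	return fmt.Sprint(verdicts)
}

func main() {
	fmt.Println(demo()) // [false false true false]
}
```

With a one-minute grace period and a five-minute timeout, only the third observation (six minutes of identical output) is flagged. The function never inspects what the output says, only whether its hash moved.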

Design decisions

  • Hash comparison is transport, not cognition (ZFC clean): The controller observes that output has not changed; it does not classify or interpret content
  • Opt-in, city-wide: stuck_timeout is disabled by default. When enabled, set it high enough to accommodate the slowest-running agent
  • resetDetection vs clearSession: After a stuck-kill, detection state resets but kill history is preserved for the circuit breaker
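
The preserved kill history feeds the circuit breaker. The sketch below shows one way a "3 kills within 2 * timeout" rule could be evaluated. The recordKill helper, the strict window boundary, and the timestamps in demo are assumptions chosen only to exercise the pruning logic. The PR text does not say how the real tracker treats the boundary or how kill spacing interacts with the window.

```go
package main

import (
	"fmt"
	"time"
)

// maxKills mirrors the "3 kills" threshold from the PR description.
const maxKills = 3

// recordKill (hypothetical helper) appends a kill at time now, drops kills
// that fell out of the 2*timeout window, and reports whether the session
// should be quarantined.
func recordKill(kills []time.Time, now time.Time, timeout time.Duration) ([]time.Time, bool) {
	cutoff := now.Add(-2 * timeout)
	kept := make([]time.Time, 0, len(kills)+1)
	for _, k := range kills {
		if k.After(cutoff) { // strict boundary is an assumption
			kept = append(kept, k)
		}
	}
	kept = append(kept, now)
	return kept, len(kept) >= maxKills
}

// replay records one kill per offset and collects the quarantine verdicts.
func replay(timeout time.Duration, offsets []time.Duration) []bool {
	t0 := time.Date(2026, 4, 13, 0, 0, 0, 0, time.UTC)
	var kills []time.Time
	out := make([]bool, 0, len(offsets))
	for _, off := range offsets {
		var quarantined bool
		kills, quarantined = recordKill(kills, t0.Add(off), timeout)
		out = append(out, quarantined)
	}
	return out
}

// demo contrasts kills that stay inside the window with kills that do not.
func demo() string {
	m := time.Minute
	tight := replay(10*m, []time.Duration{0, 8 * m, 16 * m})
	spread := replay(10*m, []time.Duration{0, 15 * m, 30 * m})
	return fmt.Sprintf("tight=%v spread=%v", tight, spread)
}

func main() {
	fmt.Println(demo()) // tight=[false false true] spread=[false false false]
}
```

In the spread case the first kill has been pruned by the time the third arrives, so the count never reaches the threshold.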

What was tested

  • make fmt-check — pass
  • make lint — 0 issues
  • make vet — pass
  • 19 targeted tests pass (stuck_tracker_test.go + session_reconciler_test.go stuck integration tests)

Review outcome

Five parallel reviews (adversarial, what-about, UAT, docs, alignment) + neutral judge.

Critical finding (fixed): clearSession wiped kill history after every stuck-kill, making the circuit breaker unreachable in production. Fixed by adding resetDetection() that preserves kills while resetting detection state.
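
To make the finding concrete, the sketch below replays three stuck-kills under each cleanup strategy and reports the kill count the breaker would see at each kill. The tracker and session types are simplified stand-ins, not the memoryStuckTracker from this PR. Only the method names resetDetection and clearSession, and the fields they touch, are taken from the description.

```go
package main

import (
	"fmt"
	"time"
)

// session is a simplified stand-in for the tracker's per-session state.
type session struct {
	hash      string
	changedAt time.Time
	firstSeen time.Time
	kills     []time.Time // kill history consumed by the circuit breaker
}

type tracker struct {
	sessions map[string]*session
}

// get returns the session record, creating it on first use.
func (t *tracker) get(id string) *session {
	s, ok := t.sessions[id]
	if !ok {
		s = &session{}
		t.sessions[id] = s
	}
	return s
}

// recordKill appends to the kill history and returns its new length.
func (t *tracker) recordKill(id string, now time.Time) int {
	s := t.get(id)
	s.kills = append(s.kills, now)
	return len(s.kills)
}

// resetDetection clears hash, changedAt, and firstSeen but keeps kills.
func (t *tracker) resetDetection(id string) {
	if s, ok := t.sessions[id]; ok {
		s.hash, s.changedAt, s.firstSeen = "", time.Time{}, time.Time{}
	}
}

// clearSession drops everything, including the kill history.
func (t *tracker) clearSession(id string) { delete(t.sessions, id) }

// killCounts performs three kills, applying afterKill after each one.
func killCounts(afterKill func(*tracker, string)) []int {
	tr := &tracker{sessions: map[string]*session{}}
	t0 := time.Date(2026, 4, 13, 0, 0, 0, 0, time.UTC)
	counts := make([]int, 0, 3)
	for i := 0; i < 3; i++ {
		now := t0.Add(time.Duration(i) * time.Minute)
		counts = append(counts, tr.recordKill("s1", now))
		afterKill(tr, "s1")
	}
	return counts
}

func demo() string {
	reset := killCounts((*tracker).resetDetection)
	cleared := killCounts((*tracker).clearSession)
	return fmt.Sprintf("reset=%v cleared=%v", reset, cleared)
}

func main() {
	fmt.Println(demo()) // reset=[1 2 3] cleared=[1 1 1]
}
```

With clearSession the count restarts at 1 on every kill, so a threshold of 3 can never be reached. With resetDetection it accumulates.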

Judge verdict: CLEARED. The implementation matches the design brief, adversarial findings were handled appropriately, and the change is aligned with Gas City's principles (ZFC, Bitter Lesson, Primitive Test all pass).

Review Story

The stuck-agent sweep adds the third leg of the controller's health triad (crash, idle, stuck). The design brief laid out a clear architecture; the implementation followed it faithfully. Five independent reviews converged on one critical bug — the circuit breaker was unreachable because clearSession wiped the kill history it depended on — and the fix was structurally correct. Secondary findings (CanPeek granularity, test coverage gaps, cosmetic truncation) are real but non-blocking. The implementation is aligned with Gas City's core principles: hash comparison is pure transport observation (ZFC clean), stuck detection remains necessary regardless of model capability (Bitter Lesson pass), and no role names appear in the code (zero hardcoded roles).

Residual risks (non-blocking)

  1. CanPeek granularity: The flag is provider-level, not per-session. Hybrid/auto with an ACP backend could theoretically produce false positives. Mitigated by the empty-output guard and idle-fires-first ordering.
  2. City-wide timeout: No per-agent override yet. Documented limitation.
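
Both residual risks come from how coarse the gating is. A rough sketch of that gating, under stated assumptions: the provider names are plain strings here, a zero stuck_timeout stands for "disabled", and shouldCheckStuck is a hypothetical helper rather than a function from this PR. The true/false table itself is copied from the CanPeek bullet in the summary.

```go
package main

import (
	"fmt"
	"time"
)

// canPeek mirrors the provider table in the PR description. Representing
// providers as strings is an assumption of this sketch.
var canPeek = map[string]bool{
	"tmux":       true,
	"k8s":        true,
	"fake":       true,
	"subprocess": false,
	"exec":       false,
	"acp":        false,
}

// shouldCheckStuck (hypothetical) decides whether the stuck check runs at
// all for a session on the given provider.
func shouldCheckStuck(provider string, stuckTimeout time.Duration, draining bool) bool {
	if stuckTimeout <= 0 {
		return false // opt-in: assumed that zero means "disabled"
	}
	if draining {
		return false // drain-in-progress guard
	}
	// One city-wide timeout and one flag per provider: there is no
	// per-session or per-agent input here. Unknown providers read as false.
	return canPeek[provider]
}

func demo() string {
	m := time.Minute
	return fmt.Sprint([]bool{
		shouldCheckStuck("tmux", 10*m, false),
		shouldCheckStuck("acp", 10*m, false),
		shouldCheckStuck("tmux", 0, false),
		shouldCheckStuck("k8s", 10*m, true),
	})
}

func main() {
	fmt.Println(demo()) // [true false false false]
}
```

Because the function takes no per-session argument, a provider that fronts several backends gets a single answer for all of them, which is the granularity concern in item 1.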

Closes #571

@github-actions github-actions bot added the status/needs-triage Inbox — we haven't looked at it yet label Apr 13, 2026
@rileywhite
Contributor Author

rileywhite commented Apr 13, 2026

FYI: I'm running a parallel implementation with a different formula as a comparison test. I expect the older formula's PR (this one) to be closed on delivery. I plan to compare the results.

I'm going to go ahead and open it for review. It's a significant change, though, and my formula allowed for context compactions during dev that may have affected quality. If you pick this up for review, then please keep that in mind. This may just be too much for one go, and while #571 did get labeling indicating it's available for work, I did move forward before getting a clear signal that the direction is architecturally aligned, especially with respect to ZFC. (Though I did find the argument compelling.)

@rileywhite rileywhite force-pushed the feat/571-controller-level-stuck-agent-sweep branch from d7955d4 to fea8fac April 13, 2026 03:02
@rileywhite rileywhite marked this pull request as ready for review April 13, 2026 03:08
@julianknutsen julianknutsen added priority/p3 Backlog — good idea, reviewed when there's bandwidth kind/feature New capability labels Apr 13, 2026
@julianknutsen julianknutsen modified the milestones: 1.0, 1.0+ Apr 13, 2026
@github-actions github-actions bot removed the status/needs-triage Inbox — we haven't looked at it yet label Apr 13, 2026
rileywhite and others added 5 commits April 13, 2026 06:57
Witness salvage commit: polecat-gc-yjr7xh died with uncommitted work.
All changes preserved from orphaned worktree.

Co-Authored-By: polecat-gc-yjr7xh <noreply@gc>
…sts, docs

Work salvaged by witness patrol. Polecat polecat-gc-qp9y2t was orphaned with
uncommitted changes in worktree.

Changes:
- cmd/gc/stuck_tracker.go: adjustments
- cmd/gc/dashboard/helpers.go: dashboard integration
- cmd/gc/session_reconciler_test.go: reconciler tests (+196 lines)
- cmd/gc/cmd_doctor.go: doctor check
- internal/doctor/checks_semantic.go: stuck-timeout check
- engdocs/architecture/controller.md: update tracker count
- engdocs/architecture/event-bus.md: event table corrections
- engdocs/architecture/health-patrol.md: add stuckTracker
- engdocs/contributors/primitive-test.md: ZFC note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on tests

- Add configWakeSuppressed exception for sleep_reason=stuck-timeout
  so stuck-killed sessions are always eligible for re-wake (matches
  idle-timeout behavior)
- Add TestReconcileSessionBeads_IdleFires_BeforeStuck verifying idle
  check takes priority over stuck check in reconciler ordering
- Fix drainTracker API call in stuck-draining test (set → correct API)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update controller.md: five trackers (not four), add stuck_timeout to
  DaemonConfig field list, add stuck tracker to nil-guard pattern list
- Update health-patrol.md: add stuck tracker to data flow prose, fix
  agent.suspended → session.suspended, update event names in dependency
  table, add stuck_tracker_test.go and session_reconciler_test.go to
  testing table
- Update event-bus.md: fix stale agent.* event names in dependency table,
  add session_reconciler.go row, fix JSONL example
- Fix stale doc comment in checks_semantic.go (StatusInfo → StatusOK)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resetDetection() to stuckTracker that preserves kill history
  while resetting detection state (hash, changedAt, firstSeen). The
  reconciler now calls resetDetection instead of clearSession after
  stuck-kills, so the circuit breaker can accumulate kills across
  restarts.
- Fix TestReconcileSessionBeads_StuckKill_OwnCircuitBreaker: place
  prior kills within the quarantine window so they survive pruning.
- Add RecordAgentStuckKill telemetry function (counter + log event)
  matching the pattern of RecordAgentCrash and RecordAgentIdleKill.
- Add TestStuckTracker_ResetDetection_PreservesKills verifying the
  split between detection reset and kill history preservation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rileywhite rileywhite force-pushed the feat/571-controller-level-stuck-agent-sweep branch from fea8fac to 1737fd6 April 13, 2026 10:57
…l#571)

resetDetection zeroed firstSeen, causing doCheckStuck to skip the
grace period (now.Sub(time.Time{}) is always huge). Restarted sessions
could be re-killed after just one timeout instead of the intended
grace + timeout window.

Fix: doCheckStuck re-initializes zero firstSeen to now, giving the
restarted session a real grace period. Add test proving the grace
period holds after reset.

Also fix controller.md: tracker rebuild doc was inaccurate — the stuck
tracker preserves state when its timeout is unchanged, unlike other
trackers that rebuild unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rileywhite
Contributor Author

Review Fix: Grace Period After Stuck-Kill Reset

Adversarial review discovered a critical bug: resetDetection zeroed firstSeen, causing doCheckStuck to bypass the grace period after a stuck-kill restart. A restarted session could be re-killed after just one stuck_timeout duration instead of the intended grace + timeout window.

Fix (commit 2045361):

  • doCheckStuck now re-initializes zero firstSeen to now, giving restarted sessions a real grace period
  • Added TestStuckTracker_ResetDetection_GracePeriod proving the fix
  • Fixed controller.md: tracker rebuild doc was inaccurate about stuck tracker's state-preserving behavior
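
The root cause is easy to reproduce in isolation. Go's time.Time.Sub saturates at the maximum Duration when the true difference does not fit, so subtracting a zero time.Time from any present-day timestamp always looks like "grace long since elapsed". The two helpers below are illustrative only. Their names and signatures are not from the PR, though the fixed variant follows the described approach of re-initializing a zero firstSeen to now.

```go
package main

import (
	"fmt"
	"time"
)

// graceElapsedBuggy trusts firstSeen as-is. With a zero firstSeen,
// now.Sub(firstSeen) saturates to the maximum Duration, so this returns
// true immediately.
func graceElapsedBuggy(firstSeen, now time.Time, grace time.Duration) bool {
	return now.Sub(firstSeen) >= grace
}

// graceElapsedFixed re-initializes a zero firstSeen to now before comparing
// and returns the (possibly updated) firstSeen for the caller to store.
func graceElapsedFixed(firstSeen, now time.Time, grace time.Duration) (time.Time, bool) {
	if firstSeen.IsZero() {
		firstSeen = now
	}
	return firstSeen, now.Sub(firstSeen) >= grace
}

func demo() string {
	now := time.Date(2026, 4, 13, 10, 57, 0, 0, time.UTC)
	grace := 2 * time.Minute

	buggy := graceElapsedBuggy(time.Time{}, now, grace)

	// First check after a reset: firstSeen is zero and gets re-initialized.
	firstSeen, fixed := graceElapsedFixed(time.Time{}, now, grace)
	// Three minutes later the real grace period has passed.
	_, later := graceElapsedFixed(firstSeen, now.Add(3*time.Minute), grace)

	return fmt.Sprintf("buggy=%v fixed=%v later=%v", buggy, fixed, later)
}

func main() {
	fmt.Println(demo()) // buggy=true fixed=false later=true
}
```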

Quality gates: make fmt-check, make vet, make lint — all pass, 0 issues.

Review summary:

  • Adversarial: 1 Critical (fixed), 4 Important (1 fixed doc, others assessed as acceptable)
  • UAT: PASS — all user-facing surfaces coherent
  • Docs: FINDINGS — minor gaps (telemetry counter, CanPeek capability not documented in engdocs)
  • GC Alignment: ALIGNED — passes ZFC, Bitter Lesson, Primitive Test, all product alignment checks
  • What-About: FINDINGS — 4 concerns, 1 fixed (grace period), others assessed as acceptable

Labels

kind/feature New capability priority/p3 Backlog — good idea, reviewed when there's bandwidth

Development

Successfully merging this pull request may close these issues.

Controller-level stuck-agent sweep (non-LLM detection half of soft-recovery)

2 participants