feat: controller-level stuck-agent sweep (#571)#659
feat: controller-level stuck-agent sweep (#571)#659rileywhite wants to merge 6 commits intogastownhall:mainfrom
Conversation
|
I'm going to go ahead and open it for review. It's a significant change, though, and my formula allowed for context compactions during dev that may have affected quality. If you pick this up for review, then please keep that in mind. This may just be too much for one go, and while #571 did get labeling indicating it's available for work, I did move forward before getting a clear signal that the direction is architecturally aligned, especially with respect to ZFC. (Though I did find the argument compelling.) |
d7955d4 to
fea8fac
Compare
Witness salvage commit: polecat-gc-yjr7xh died with uncommitted work. All changes preserved from orphaned worktree. Co-Authored-By: polecat-gc-yjr7xh <noreply@gc>
…sts, docs Work salvaged by witness patrol. Polecat polecat-gc-qp9y2t was orphaned with uncommitted changes in worktree. Changes: - cmd/gc/stuck_tracker.go: adjustments - cmd/gc/dashboard/helpers.go: dashboard integration - cmd/gc/session_reconciler_test.go: reconciler tests (+196 lines) - cmd/gc/cmd_doctor.go: doctor check - internal/doctor/checks_semantic.go: stuck-timeout check - engdocs/architecture/controller.md: update tracker count - engdocs/architecture/event-bus.md: event table corrections - engdocs/architecture/health-patrol.md: add stuckTracker - engdocs/contributors/primitive-test.md: ZFC note Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on tests - Add configWakeSuppressed exception for sleep_reason=stuck-timeout so stuck-killed sessions are always eligible for re-wake (matches idle-timeout behavior) - Add TestReconcileSessionBeads_IdleFires_BeforeStuck verifying idle check takes priority over stuck check in reconciler ordering - Fix drainTracker API call in stuck-draining test (set → correct API) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update controller.md: five trackers (not four), add stuck_timeout to DaemonConfig field list, add stuck tracker to nil-guard pattern list - Update health-patrol.md: add stuck tracker to data flow prose, fix agent.suspended → session.suspended, update event names in dependency table, add stuck_tracker_test.go and session_reconciler_test.go to testing table - Update event-bus.md: fix stale agent.* event names in dependency table, add session_reconciler.go row, fix JSONL example - Fix stale doc comment in checks_semantic.go (StatusInfo → StatusOK) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resetDetection() to stuckTracker that preserves kill history while resetting detection state (hash, changedAt, firstSeen). The reconciler now calls resetDetection instead of clearSession after stuck-kills, so the circuit breaker can accumulate kills across restarts. - Fix TestReconcileSessionBeads_StuckKill_OwnCircuitBreaker: place prior kills within the quarantine window so they survive pruning. - Add RecordAgentStuckKill telemetry function (counter + log event) matching the pattern of RecordAgentCrash and RecordAgentIdleKill. - Add TestStuckTracker_ResetDetection_PreservesKills verifying the split between detection reset and kill history preservation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fea8fac to
1737fd6
Compare
…l#571) resetDetection zeroed firstSeen, causing doCheckStuck to skip the grace period (now.Sub(time.Time{}) is always huge). Restarted sessions could be re-killed after just one timeout instead of the intended grace + timeout window. Fix: doCheckStuck re-initializes zero firstSeen to now, giving the restarted session a real grace period. Add test proving the grace period holds after reset. Also fix controller.md: tracker rebuild doc was inaccurate — the stuck tracker preserves state when its timeout is unchanged, unlike other trackers that rebuild unconditionally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review Fix: Grace Period After Stuck-Kill ResetAdversarial review discovered a critical bug: Fix (commit 2045361):
Quality gates: Review summary:
|
Summary
Adds a fifth health-patrol tracker —
stuckTracker— to the controller reconciler. It detects sessions whose terminal output (SHA-256 hash of Peek output) has not changed for longer than a configurablestuck_timeout, kills them, and marks them for immediate re-wake. A built-in circuit breaker quarantines sessions that repeatedly get stuck.Key changes
cmd/gc/stuck_tracker.go):stuckTrackerinterface +memoryStuckTrackerwithdoCheckStuckpure function, grace period, and own circuit breaker (quarantine after 3 kills in2 * timeoutwindow)cmd/gc/session_reconciler.go): Stuck check after idle check with drain-in-progress guard and CanPeek gatinginternal/config/config.go):stuck_timeoutfield onDaemonConfigwith duration accessor and validationinternal/runtime/probe.go):CanPeekcapability flag (true for tmux/k8s/fake, false for subprocess/exec/acp)cmd/gc/session_sleep.go):configWakeSuppressedexception forstuck-timeout(matches idle-timeout behavior)internal/telemetry/recorder.go):RecordAgentStuckKillcounter + log eventcmd/gc/dashboard/helpers.go): 🧊 ice icon, category, summary forsession.stuck_killedinternal/doctor/checks_semantic.go): Informationaldaemon-stuck-timeoutcheckDesign decisions
stuck_timeoutdisabled by default. Set to accommodate the slowest-running agentWhat was tested
make fmt-check— passmake lint— 0 issuesmake vet— passReview outcome
Five parallel reviews (adversarial, what-about, UAT, docs, alignment) + neutral judge.
Critical finding (fixed):
clearSessionwiped kill history after every stuck-kill, making the circuit breaker unreachable in production. Fixed by addingresetDetection()that preserves kills while resetting detection state.Judge verdict: CLEARED. The implementation matches the design brief, adversarial findings were handled appropriately, and the change is aligned with Gas City's principles (ZFC, Bitter Lesson, Primitive Test all pass).
Review Story
The stuck-agent sweep adds the third leg of the controller's health triad (crash, idle, stuck). The design brief laid out a clear architecture; the implementation followed it faithfully. Five independent reviews converged on one critical bug — the circuit breaker was unreachable because clearSession wiped the kill history it depended on — and the fix was structurally correct. Secondary findings (CanPeek granularity, test coverage gaps, cosmetic truncation) are real but non-blocking. The implementation is aligned with Gas City's core principles: hash comparison is pure transport observation (ZFC clean), stuck detection remains necessary regardless of model capability (Bitter Lesson pass), and no role names appear in the code (zero hardcoded roles).
Residual risks (non-blocking)
Closes #571