Skip to content

Commit 2045361

Browse files
rileywhiteclaude
andcommitted
fix: restore grace period after stuck-kill resetDetection (#571)
resetDetection zeroed firstSeen, causing doCheckStuck to skip the grace period (now.Sub(time.Time{}) is always huge). Restarted sessions could be re-killed after just one timeout instead of the intended grace + timeout window. Fix: doCheckStuck re-initializes zero firstSeen to now, giving the restarted session a real grace period. Add test proving the grace period holds after reset. Also fix controller.md: tracker rebuild doc was inaccurate — the stuck tracker preserves state when its timeout is unchanged, unlike other trackers that rebuild unconditionally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1737fd6 commit 2045361

File tree

3 files changed

+39
-3
lines changed

3 files changed

+39
-3
lines changed

cmd/gc/stuck_tracker.go

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,13 @@ func (m *memoryStuckTracker) resetDetection(sessionName string) {
160160
// This function is independently testable without any Provider mock.
161161
func doCheckStuck(entry stuckEntry, output string, now time.Time, timeout time.Duration) (stuck bool, updated stuckEntry) {
162162
updated = entry
163+
164+
// After resetDetection, firstSeen is zero — re-initialize so the
165+
// restarted session gets a real grace period from this tick.
166+
if updated.firstSeen.IsZero() {
167+
updated.firstSeen = now
168+
}
169+
163170
h := sha256.Sum256([]byte(output))
164171

165172
if h != entry.hash {

cmd/gc/stuck_tracker_test.go

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,33 @@ func TestStuckTracker_ResetDetection_PreservesKills(t *testing.T) {
125125
}
126126
}
127127

128+
func TestStuckTracker_ResetDetection_GracePeriod(t *testing.T) {
129+
st := newStuckTracker(5 * time.Minute)
130+
now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
131+
132+
// Establish session.
133+
st.checkStuck("sess1", "frozen", now)
134+
135+
// Simulate stuck-kill: reset detection state.
136+
st.resetDetection("sess1")
137+
138+
// After reset, first observation with new output — hash change, not stuck.
139+
restartTime := now.Add(1 * time.Hour)
140+
if st.checkStuck("sess1", "new output", restartTime) {
141+
t.Fatal("expected false on first observation after reset")
142+
}
143+
144+
// Same output within grace period (timeout/2 < timeout) — must NOT fire.
145+
if st.checkStuck("sess1", "new output", restartTime.Add(3*time.Minute)) {
146+
t.Fatal("expected false: should be within grace period after reset")
147+
}
148+
149+
// Same output past grace period AND past timeout — should fire.
150+
if !st.checkStuck("sess1", "new output", restartTime.Add(11*time.Minute)) {
151+
t.Fatal("expected true: past grace period and timeout after reset")
152+
}
153+
}
154+
128155
func TestStuckTracker_MultipleSessionsIndependent(t *testing.T) {
129156
st := newStuckTracker(5 * time.Minute)
130157
now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

engdocs/architecture/controller.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -164,9 +164,11 @@ indicate bugs.
164164
startup; changing it requires a controller restart.
165165

166166
- **Tracker rebuild is atomic per tick**: When config reloads, all five
167-
trackers (crash, idle, stuck, wisp GC, order) are rebuilt in the same
168-
tick before reconciliation runs. No tick ever uses a mix of old and new
169-
tracker state.
167+
trackers (crash, idle, stuck, wisp GC, order) are updated in the same
168+
tick before reconciliation runs. Most trackers are rebuilt unconditionally;
169+
the stuck tracker preserves accumulated state when its timeout value is
170+
unchanged (only rebuilt when `stuck_timeout` changes or is removed).
171+
No tick ever uses a mix of old and new tracker config.
170172

171173
- **Dirty flag is edge-triggered, not level-triggered**: The `atomic.Bool`
172174
is set by the fsnotify goroutine and cleared by `dirty.Swap(false)` at

0 commit comments

Comments
 (0)