Fix deferred-cleanup drain: tolerate `can't find session` + run reaper on 4h health check by lukstafi · Pull Request #560 · lukstafi/ludics

lukstafi · 2026-06-06T19:37:24Z

Summary

Fixes the deferred-cleanup queue (mag/cleanup-pending.json) never draining — orphan tmux sessions / worktrees / branches from long-merged tasks piled up (360-entry backlog going back to 2026-05-14). Two defects:

Drain resilience — the reaper re-queued the whole entry whenever any sub-step failed. The dominant driver was the t3code arm marking failure while integration is paused (server intentionally down, gh-ludics-539). Replaced the all-or-nothing failed flag with a per-step retryableFailure, and made the paused-t3code arm a skip, not a failure (consults t3codeIntegrationEnabled() before probing the server).
Cadence — the reaper ran from only one place (briefing precompute, ~1×/day). Added a second trigger on the executed 4h health-check path so reaping is evenly spaced; the briefing call is preserved.

The round-1 completion (this PR's delta)

The above shipped in the first commit but left a host-dependent gap: tmux kill-session -t <missing> returns can't find session: <name> when a tmux server is running (dev/worker box), which the benign matcher (no server running / session not found) did not tolerate — so a benign no-op flipped retryableFailure and pinned otherwise-complete entries forever (green in CI, red on a live host). This PR:

Extracts an exported isBenignTmuxKillStderr helper tolerating all three "session already gone" shapes as a class.
Refactors the single safeSyncOutput call in deferred-cleanup.ts to a namespace import so the non-benign retain path is spy-testable.
Adds the AC(c) health-check tests (executed path invokes the reaper once; gate-skip and peek paths do not).

Tests

isBenignTmuxKillStderr — all three benign shapes + non-benign/empty negatives
benign worktree/branch/tmux no-op drains (was failing); non-benign tmux failure retains
paused-t3code drains without probing the server; enabled+unreachable retains; explicit harnessDir threads into serverStatus({ harnessDir }); within-grace not reaped early
health-check cadence: executed → reaper called once; gate-skip / peek → not called

Full suite 2870 pass / 0 fail (was 2862 / 1). typecheck, all 15 CI lints, and bun run build clean.

Scope

No manual edit of mag/cleanup-pending.json — the backlog self-drains once the fix ships. No change to cleanup_delay_hours / 25h default / 72h cap. t3code stays paused. No data-shape change (CleanupEntry unchanged).

Proposal: docs/proposals/deferred-cleanup-drain-and-cadence.md

🤖 Generated with Claude Code

… 4h cadence reaper Defect 1 completion — the benign-tmux-no-op matcher in processDeferredCleanups only tolerated "no server running" / "session not found", so on a host with a *running* tmux server `tmux kill-session -t <missing>` returns `can't find session: <name>`, which flipped retryableFailure and pinned an otherwise-complete entry in cleanup-pending.json forever (green in CI with no tmux server, red on a dev box). Extract the three benign "session already gone" shapes into an exported `isBenignTmuxKillStderr` helper and tolerate all three as a class. Refactor the spawn import to a namespace import so the non-benign retain path is testable (single call site). Defect 2 completion — the 4h health-check reaper call already existed in resolveQueueRequestCommand; add the AC(c) tests proving the executed path ticks processDeferredCleanups exactly once, and that a gate-skipped tick and the peek path do NOT (reaping rides the executed cadence only). Tests: isBenignTmuxKillStderr (all 3 benign + non-benign/empty); non-benign tmux failure retains; explicit harnessDir threads into serverStatus({harnessDir}); AC(c) positive + gate-skip + peek negatives. Full suite 2870 pass / 0 fail (was 2862/1). No data-shape change; backlog self-drains (no manual purge). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lukstafi · 2026-06-06T19:48:45Z

@codex review Focus on bugs, correctness issues, and edge cases. Do not check adherence to a spec or plan.

chatgpt-codex-connector · 2026-06-06T19:51:40Z

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lukstafi · 2026-06-06T19:54:19Z

Suggest Refactor — task-703d0553 (deferred-cleanup drain)

What I'd do differently next time, and follow-ups worth considering.

1. A shared "benign external-CLI stderr" classifier instead of ad-hoc `includes` chains

This PR introduced isBenignTmuxKillStderr to close the can't find session
gap, but the same "is this error actually a no-op success?" pattern recurs
across the codebase (src/adapters/tmux.ts kill-session, deleteBranches /
removeWorktreeByPath in worktrees.ts, peer-sync removal). Each reimplements
its own tolerated-string set. A small shared module (e.g.
src/orchestration/benign-errors.ts) with named, individually-tested
predicates per tool — isBenignTmuxKill, isBenignBranchDelete,
isBenignWorktreeRemove — would (a) make the tolerated set discoverable in one
place, (b) force the "enumerate ALL variants" discipline at the type level, and
(c) let src/adapters/tmux.ts:44 (currently fire-and-forget) opt into the same
classification if it ever needs to gate on the result. Out of scope here; worth
a follow-up if a third site needs the same treatment.

2. The root cause was an environment-dependent test, caught only by running the baseline on the worker host

tmux kill-session returns no server running in CI (no daemon) but
can't find session: <name> on a box with a live server. The committed
implementation passed CI and was red only on the worker. Lesson for next time:
when a cleanup path shells out to a daemon-backed CLI, probe the tool's real
stderr across server-up / server-down empirically before drafting the matcher
(tmux kill-session -t nope; tmux list-sessions), and treat "CI green" as
necessary-not-sufficient. A cheap guard would be a test that fakes each known
stderr shape through the (now-extracted) predicate — which this PR added — so
the variant set is pinned regardless of the host the suite runs on.

3. Bounded-retry / hard-age ceiling was scoped out but is the real durability fix

Both defects fixed here are about which failures retain an entry. Neither
bounds how long a genuinely-stuck entry can churn: an entry that legitimately
keeps failing a retryable step (e.g. t3code enabled + a server that never comes
back) re-queues forever, re-attempting every tick. The proposal listed a
bounded-retry / age-ceiling as optional defense-in-depth; I deferred it to keep
the PR tight. Worth a follow-up task: drop an entry (with a loud
console.error) once it is older than, say, 2× the 72h cap or after N attempts,
so no permanently-unreapable step can accumulate unboundedly — closing the
backlog-growth class, not just the two instances we hit.

4. Reaper trigger duplication (briefing + health-check) is now two call sites

processDeferredCleanups() is invoked from both briefingPrecomputeContext
and the health-check branch with identical dynamic-import + try/catch
boilerplate. Cheap and fine for two sites, but if a third trigger is ever added
(e.g. slot-clear), extract a runDeferredCleanupSafely() wrapper so the
import-and-log shape lives in one place and can't drift between call sites.

Process note

The implementation was already committed when planning began; the high-leverage
move was running the baseline suite first and reading the one failing test as
a real defect rather than flaky-host noise. Trusting the green-CI signal would
have shipped a no-op-looking-but-broken matcher. Keep "run baseline on the
actual host, investigate every failure" as a hard gate, not a formality.

lukstafi and others added 2 commits June 6, 2026 21:09

[round 1] awaiting

5555824

lukstafi merged commit 727e810 into main Jun 6, 2026
1 check passed

lukstafi deleted the ludics/task-703d0553-s2/root branch June 6, 2026 19:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deferred-cleanup drain: tolerate `can't find session` + run reaper on 4h health check#560

Fix deferred-cleanup drain: tolerate `can't find session` + run reaper on 4h health check#560
lukstafi merged 2 commits into
mainfrom
ludics/task-703d0553-s2/root

lukstafi commented Jun 6, 2026

Uh oh!

lukstafi commented Jun 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 6, 2026

Uh oh!

Uh oh!

lukstafi commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lukstafi commented Jun 6, 2026

Summary

The round-1 completion (this PR's delta)

Tests

Scope

Uh oh!

lukstafi commented Jun 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 6, 2026

Uh oh!

Uh oh!

lukstafi commented Jun 6, 2026

Suggest Refactor — task-703d0553 (deferred-cleanup drain)

1. A shared "benign external-CLI stderr" classifier instead of ad-hoc includes chains

2. The root cause was an environment-dependent test, caught only by running the baseline on the worker host

3. Bounded-retry / hard-age ceiling was scoped out but is the real durability fix

4. Reaper trigger duplication (briefing + health-check) is now two call sites

Process note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. A shared "benign external-CLI stderr" classifier instead of ad-hoc `includes` chains