Skip to content

Fix deferred-cleanup drain: tolerate can't find session + run reaper on 4h health check#560

Merged
lukstafi merged 2 commits into
mainfrom
ludics/task-703d0553-s2/root
Jun 6, 2026
Merged

Fix deferred-cleanup drain: tolerate can't find session + run reaper on 4h health check#560
lukstafi merged 2 commits into
mainfrom
ludics/task-703d0553-s2/root

Conversation

@lukstafi

@lukstafi lukstafi commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes the deferred-cleanup queue (mag/cleanup-pending.json) never draining — orphan tmux sessions / worktrees / branches from long-merged tasks piled up (360-entry backlog going back to 2026-05-14). Two defects:

  1. Drain resilience — the reaper re-queued the whole entry whenever any sub-step failed. The dominant driver was the t3code arm marking failure while integration is paused (server intentionally down, gh-ludics-539). Replaced the all-or-nothing failed flag with a per-step retryableFailure, and made the paused-t3code arm a skip, not a failure (consults t3codeIntegrationEnabled() before probing the server).

  2. Cadence — the reaper ran from only one place (briefing precompute, ~1×/day). Added a second trigger on the executed 4h health-check path so reaping is evenly spaced; the briefing call is preserved.

The round-1 completion (this PR's delta)

The above shipped in the first commit but left a host-dependent gap: tmux kill-session -t <missing> returns can't find session: <name> when a tmux server is running (dev/worker box), which the benign matcher (no server running / session not found) did not tolerate — so a benign no-op flipped retryableFailure and pinned otherwise-complete entries forever (green in CI, red on a live host). This PR:

  • Extracts an exported isBenignTmuxKillStderr helper tolerating all three "session already gone" shapes as a class.
  • Refactors the single safeSyncOutput call in deferred-cleanup.ts to a namespace import so the non-benign retain path is spy-testable.
  • Adds the AC(c) health-check tests (executed path invokes the reaper once; gate-skip and peek paths do not).

Tests

  • isBenignTmuxKillStderr — all three benign shapes + non-benign/empty negatives
  • benign worktree/branch/tmux no-op drains (was failing); non-benign tmux failure retains
  • paused-t3code drains without probing the server; enabled+unreachable retains; explicit harnessDir threads into serverStatus({ harnessDir }); within-grace not reaped early
  • health-check cadence: executed → reaper called once; gate-skip / peek → not called

Full suite 2870 pass / 0 fail (was 2862 / 1). typecheck, all 15 CI lints, and bun run build clean.

Scope

No manual edit of mag/cleanup-pending.json — the backlog self-drains once the fix ships. No change to cleanup_delay_hours / 25h default / 72h cap. t3code stays paused. No data-shape change (CleanupEntry unchanged).

Proposal: docs/proposals/deferred-cleanup-drain-and-cadence.md

🤖 Generated with Claude Code

lukstafi and others added 2 commits June 6, 2026 21:09
… 4h cadence reaper

Defect 1 completion — the benign-tmux-no-op matcher in processDeferredCleanups
only tolerated "no server running" / "session not found", so on a host with a
*running* tmux server `tmux kill-session -t <missing>` returns
`can't find session: <name>`, which flipped retryableFailure and pinned an
otherwise-complete entry in cleanup-pending.json forever (green in CI with no
tmux server, red on a dev box). Extract the three benign "session already gone"
shapes into an exported `isBenignTmuxKillStderr` helper and tolerate all three
as a class. Refactor the spawn import to a namespace import so the non-benign
retain path is testable (single call site).

Defect 2 completion — the 4h health-check reaper call already existed in
resolveQueueRequestCommand; add the AC(c) tests proving the executed path ticks
processDeferredCleanups exactly once, and that a gate-skipped tick and the peek
path do NOT (reaping rides the executed cadence only).

Tests: isBenignTmuxKillStderr (all 3 benign + non-benign/empty); non-benign
tmux failure retains; explicit harnessDir threads into serverStatus({harnessDir});
AC(c) positive + gate-skip + peek negatives. Full suite 2870 pass / 0 fail
(was 2862/1). No data-shape change; backlog self-drains (no manual purge).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lukstafi

lukstafi commented Jun 6, 2026

Copy link
Copy Markdown
Owner Author

@codex review Focus on bugs, correctness issues, and edge cases. Do not check adherence to a spec or plan.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@lukstafi lukstafi merged commit 727e810 into main Jun 6, 2026
1 check passed
@lukstafi lukstafi deleted the ludics/task-703d0553-s2/root branch June 6, 2026 19:53
@lukstafi

lukstafi commented Jun 6, 2026

Copy link
Copy Markdown
Owner Author

Suggest Refactor — task-703d0553 (deferred-cleanup drain)

What I'd do differently next time, and follow-ups worth considering.

1. A shared "benign external-CLI stderr" classifier instead of ad-hoc includes chains

This PR introduced isBenignTmuxKillStderr to close the can't find session
gap, but the same "is this error actually a no-op success?" pattern recurs
across the codebase (src/adapters/tmux.ts kill-session, deleteBranches /
removeWorktreeByPath in worktrees.ts, peer-sync removal). Each reimplements
its own tolerated-string set. A small shared module (e.g.
src/orchestration/benign-errors.ts) with named, individually-tested
predicates per tool — isBenignTmuxKill, isBenignBranchDelete,
isBenignWorktreeRemove — would (a) make the tolerated set discoverable in one
place, (b) force the "enumerate ALL variants" discipline at the type level, and
(c) let src/adapters/tmux.ts:44 (currently fire-and-forget) opt into the same
classification if it ever needs to gate on the result. Out of scope here; worth
a follow-up if a third site needs the same treatment.

2. The root cause was an environment-dependent test, caught only by running the baseline on the worker host

tmux kill-session returns no server running in CI (no daemon) but
can't find session: <name> on a box with a live server. The committed
implementation passed CI and was red only on the worker. Lesson for next time:
when a cleanup path shells out to a daemon-backed CLI, probe the tool's real
stderr across server-up / server-down empirically
before drafting the matcher
(tmux kill-session -t nope; tmux list-sessions), and treat "CI green" as
necessary-not-sufficient. A cheap guard would be a test that fakes each known
stderr shape through the (now-extracted) predicate — which this PR added — so
the variant set is pinned regardless of the host the suite runs on.

3. Bounded-retry / hard-age ceiling was scoped out but is the real durability fix

Both defects fixed here are about which failures retain an entry. Neither
bounds how long a genuinely-stuck entry can churn: an entry that legitimately
keeps failing a retryable step (e.g. t3code enabled + a server that never comes
back) re-queues forever, re-attempting every tick. The proposal listed a
bounded-retry / age-ceiling as optional defense-in-depth; I deferred it to keep
the PR tight. Worth a follow-up task: drop an entry (with a loud
console.error) once it is older than, say, 2× the 72h cap or after N attempts,
so no permanently-unreapable step can accumulate unboundedly — closing the
backlog-growth class, not just the two instances we hit.

4. Reaper trigger duplication (briefing + health-check) is now two call sites

processDeferredCleanups() is invoked from both briefingPrecomputeContext
and the health-check branch with identical dynamic-import + try/catch
boilerplate. Cheap and fine for two sites, but if a third trigger is ever added
(e.g. slot-clear), extract a runDeferredCleanupSafely() wrapper so the
import-and-log shape lives in one place and can't drift between call sites.

Process note

The implementation was already committed when planning began; the high-leverage
move was running the baseline suite first and reading the one failing test as
a real defect rather than flaky-host noise. Trusting the green-CI signal would
have shipped a no-op-looking-but-broken matcher. Keep "run baseline on the
actual host, investigate every failure" as a hard gate, not a formality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant