fix(worktrees): preserve unmerged work on non-main repos and recover stale GC lock#2093
Conversation
…stale GC lock Two reliability bugs in the worktree garbage collector: - Graveyard preservation compared agent branches against a hardcoded `main`, so on a repository whose default branch is not `main` the rev-list check failed and was read as "nothing to preserve", letting cleanup_all_stale delete unmerged agent commits. The base is now resolved from the repo default (origin/HEAD, main, master, and their remote refs), and an inconclusive check preserves the branch to the graveyard instead of dropping it. - The GC lock had no stale-lock recovery: a crashed or killed GC left the lock behind and wedged every future `bernstein worktrees gc`. The lock already recorded the owner pid and start time; acquisition now reclaims a lock once when the owning process is gone or the lock is older than a generous bound, while still refusing a lock held by a live recent process.
Reviewer's GuideThis PR hardens worktree garbage collection by correctly detecting unmerged agent work on repos whose default branch is not Sequence diagram for graveyard preservation on non-main default branchessequenceDiagram
participant WorktreeGC as cleanup_all_stale
participant GitRepo as git
WorktreeGC->>GitRepo: _resolve_graveyard_base(repo_root)
GitRepo-->>WorktreeGC: base_ref
WorktreeGC->>GitRepo: _count_unmerged_commits(repo_root, branch, base_ref)
alt rev_list succeeds and output > 0
GitRepo-->>WorktreeGC: unmerged_count > 0
WorktreeGC->>WorktreeGC: preserve_branch_to_graveyard(repo_root, session_id, branch)
else rev_list fails but branch exists
GitRepo-->>WorktreeGC: -1
WorktreeGC->>WorktreeGC: preserve_branch_to_graveyard(repo_root, session_id, branch)
else branch missing or fully merged
GitRepo-->>WorktreeGC: 0
WorktreeGC->>WorktreeGC: delete stale worktree
end
Sequence diagram for recoverable GC lock acquisitionsequenceDiagram
participant GcRunner as lock_gc
participant FS as _acquire_gc_lock_fd
GcRunner->>FS: _acquire_gc_lock_fd(lock_path)
alt lock file does not exist
FS-->>GcRunner: fd
else lock file exists
FS->>FS: _read_gc_lock(lock_path)
alt _gc_lock_is_stale(meta) is False
FS-->>GcRunner: GcLockError
else _gc_lock_is_stale(meta) is True
FS->>FS: lock_path.unlink()
alt retry os.open succeeds
FS-->>GcRunner: fd
else retry os.open FileExistsError
FS-->>GcRunner: GcLockError
end
end
end
GcRunner->>GcRunner: write {pid, started_at} to lock file
GcRunner->>GcRunner: run GC and remove lock on exit
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Sonar insights (advisory, no merge-block)Snapshot of
Run This comment is a soft signal. The Sonar scan runs on push to |
📝 WalkthroughWalkthroughGC lock acquisition now reclaims stale lock files after PID and age checks. Worktree cleanup now resolves the repository’s default branch before counting unmerged commits, and preserves stale worktrees when the count is nonzero or inconclusive. ChangesGC lock recovery
Graveyard base resolution
Sequence Diagram(s)GC lock recovery sequenceDiagram
participant lock_gc
participant _acquire_gc_lock_fd
participant _gc_lock_is_stale
participant is_process_alive
participant lock_file
lock_gc->>_acquire_gc_lock_fd: acquire GC lock
_acquire_gc_lock_fd->>lock_file: read JSON payload
_acquire_gc_lock_fd->>_gc_lock_is_stale: check pid and started_at
_gc_lock_is_stale->>is_process_alive: check recorded PID
alt stale lock
_acquire_gc_lock_fd->>lock_file: remove stale lock file
_acquire_gc_lock_fd->>lock_file: retry acquisition
else live owner
_acquire_gc_lock_fd-->>lock_gc: raise GcLockError
end
Graveyard cleanup sequenceDiagram
participant cleanup_all_stale
participant _resolve_graveyard_base
participant _count_unmerged_commits
cleanup_all_stale->>_resolve_graveyard_base: resolve base_ref
_resolve_graveyard_base-->>cleanup_all_stale: default branch
cleanup_all_stale->>_count_unmerged_commits: count commits against base_ref
_count_unmerged_commits-->>cleanup_all_stale: 0, >0, or -1
alt unmerged > 0 or inconclusive
cleanup_all_stale->>cleanup_all_stale: preserve stale worktree
else unmerged == 0
cleanup_all_stale->>cleanup_all_stale: delete stale worktree
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
bernstein doctor observe for PR #2093 ( sonar -- OK (project bernstein)
code-scanning -- WARN (1 open alert(s))
Skipped backends (credentials not configured)
See docs/observability/unified-doctor.md for backend setup notes. |
Review-bot acknowledgement summary
All must-address findings are resolved or acknowledged. |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
_acquire_gc_lock_fd, when constructing theownerstring you accessmeta['pid']directly, which can raiseKeyErrorif the lock metadata dict lacks apid; usemeta.get('pid')consistently to avoid blowing up on malformed lock files. - The git subprocess invocation pattern for checking branch/base existence is duplicated between
_branch_existsand_resolve_graveyard_base; consider extracting a small helper to avoid repetition and keep the graveyard resolution logic easier to maintain.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_acquire_gc_lock_fd`, when constructing the `owner` string you access `meta['pid']` directly, which can raise `KeyError` if the lock metadata dict lacks a `pid`; use `meta.get('pid')` consistently to avoid blowing up on malformed lock files.
- The git subprocess invocation pattern for checking branch/base existence is duplicated between `_branch_exists` and `_resolve_graveyard_base`; consider extracting a small helper to avoid repetition and keep the graveyard resolution logic easier to maintain.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bernstein/cli/commands/worktrees_cmd.py`:
- Around line 140-142: Move the new GC lock age tuning constant out of
worktrees_cmd and into core/defaults.py as required by the constants policy. Add
the constant in core/defaults.py, then import and use it from the worktrees_cmd
module where the GC lock handling logic references _GC_LOCK_MAX_AGE_S, keeping
the existing behavior unchanged.
- Around line 176-186: The stale-lock reclaim path in the worktree GC lock
acquisition flow is vulnerable to a TOCTOU race because multiple callers can
evaluate the same stale metadata and then delete/recreate the same lock. Update
the lock handling in the worktrees command (around the stale reclaim branch in
the lock acquisition helper) to serialize reclaim before unlinking `lock_path`,
either by introducing a separate exclusive reclaim guard and re-reading
`_read_gc_lock` under that guard or by using an OS-level file lock before
deleting the file. Keep the existing `GcLockError`, `_gc_lock_is_stale`, and
`logger.warning` flow, but make sure only one process can reclaim and recreate
the lock at a time.
- Around line 145-168: The GC lock helpers currently use untyped dict metadata
and a type-ignore on lock_gc, which should be replaced with a private TypedDict
for the lock payload. Define a dedicated typed payload for {pid, started_at},
then update _read_gc_lock and _gc_lock_is_stale to accept/return that type
instead of dict[str, object], and change lock_gc to be annotated as
Iterator[Path] so the no-untyped-def ignore can be removed. Use the existing
helper names _read_gc_lock, _gc_lock_is_stale, and lock_gc to keep the changes
localized.
In `@src/bernstein/core/git/worktree.py`:
- Around line 912-914: Hoist the _resolve_graveyard_base call out of the
per-entry stale-worktree loop in the worktree cleanup logic: base_ref only
depends on self.repo_root, so compute it once before iterating and reuse it for
each branch. Update the surrounding code near _count_unmerged_commits to pass
the cached base_ref into each check instead of resolving it repeatedly.
In `@tests/unit/test_worktrees_cmd.py`:
- Around line 847-854: The live-owner GC lock test only checks that GcLockError
is raised, so it could miss accidental deletion or rewriting of the existing
lock. Update the test around lock_gc and GcLockError to capture the lock_path
contents before entering the context and assert the same JSON payload is still
present afterward, using the existing lock_path setup in test_worktrees_cmd.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9ea5bdc5-2d17-4b60-89d9-76e66178ab2b
📒 Files selected for processing (4)
src/bernstein/cli/commands/worktrees_cmd.pysrc/bernstein/core/git/worktree.pytests/unit/test_worktree_graveyard.pytests/unit/test_worktrees_cmd.py
| # A GC that has "owned" the lock longer than this is treated as a crashed | ||
| # leftover. Generous so a legitimately long sweep is never reclaimed under it. | ||
| _GC_LOCK_MAX_AGE_S = 6 * 3600 |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Move _GC_LOCK_MAX_AGE_S to src/bernstein/core/defaults.py.
Line 142 adds a new tuning constant inline in the CLI module. Put it in core/defaults.py and import it here.
As per coding guidelines, “New constants must go in core/defaults.py, not inline in other modules.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bernstein/cli/commands/worktrees_cmd.py` around lines 140 - 142, Move the
new GC lock age tuning constant out of worktrees_cmd and into core/defaults.py
as required by the constants policy. Add the constant in core/defaults.py, then
import and use it from the worktrees_cmd module where the GC lock handling logic
references _GC_LOCK_MAX_AGE_S, keeping the existing behavior unchanged.
Source: Coding guidelines
| def _read_gc_lock(lock_path: Path) -> dict[str, object] | None: | ||
| """Return the lock's recorded ``{pid, started_at}`` payload, or None.""" | ||
| try: | ||
| data = json.loads(lock_path.read_text(encoding="utf-8")) | ||
| except (OSError, ValueError): | ||
| return None | ||
| return data if isinstance(data, dict) else None | ||
|
|
||
|
|
||
| def _gc_lock_is_stale(meta: dict[str, object] | None) -> bool: | ||
| """True when the lock's owning process is gone or the lock is too old. | ||
|
|
||
| An unreadable / mid-write payload is NOT treated as stale, so a lock another | ||
| process just created (between ``O_EXCL`` and the write) is never reclaimed. | ||
| The crashed-GC case the recovery targets always leaves a fully written | ||
| ``{pid, started_at}`` payload, which the pid-liveness check below detects. | ||
| """ | ||
| if not isinstance(meta, dict): | ||
| return False | ||
| pid = meta.get("pid") | ||
| if isinstance(pid, int) and pid > 0 and not is_process_alive(pid): | ||
| return True | ||
| started = meta.get("started_at") | ||
| return isinstance(started, (int, float)) and (time.time() - started) > _GC_LOCK_MAX_AGE_S |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify strict typing after replacing raw dict metadata and the no-untyped-def ignore.
uv run pyright src/
uv run mypy srcRepository: sipyourdrink-ltd/bernstein
Length of output: 240
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file and related symbols.
git ls-files src/bernstein/cli/commands/worktrees_cmd.py
wc -l src/bernstein/cli/commands/worktrees_cmd.py
cat -n src/bernstein/cli/commands/worktrees_cmd.py | sed -n '120,230p'
printf '\n--- symbol search ---\n'
rg -n "\block_gc\b|_read_gc_lock|_gc_lock_is_stale|GC_LOCK" src/bernstein/cli/commands/worktrees_cmd.pyRepository: sipyourdrink-ltd/bernstein
Length of output: 6365
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file and related symbols.
git ls-files src/bernstein/cli/commands/worktrees_cmd.py
wc -l src/bernstein/cli/commands/worktrees_cmd.py
cat -n src/bernstein/cli/commands/worktrees_cmd.py | sed -n '120,230p'
printf '\n--- symbol search ---\n'
rg -n "\block_gc\b|_read_gc_lock|_gc_lock_is_stale|GC_LOCK" src/bernstein/cli/commands/worktrees_cmd.pyRepository: sipyourdrink-ltd/bernstein
Length of output: 6365
Type the GC lock helpers in src/bernstein/cli/commands/worktrees_cmd.py:145-192
- Replace the raw
dict[str, object]metadata with a privateTypedDictfor{pid, started_at}. - Update
_read_gc_lock,_gc_lock_is_stale, andlock_gcto use that type, and annotatelock_gcasIterator[Path]instead of# type: ignore[no-untyped-def].
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bernstein/cli/commands/worktrees_cmd.py` around lines 145 - 168, The GC
lock helpers currently use untyped dict metadata and a type-ignore on lock_gc,
which should be replaced with a private TypedDict for the lock payload. Define a
dedicated typed payload for {pid, started_at}, then update _read_gc_lock and
_gc_lock_is_stale to accept/return that type instead of dict[str, object], and
change lock_gc to be annotated as Iterator[Path] so the no-untyped-def ignore
can be removed. Use the existing helper names _read_gc_lock, _gc_lock_is_stale,
and lock_gc to keep the changes localized.
Sources: Coding guidelines, Path instructions
| meta = _read_gc_lock(lock_path) | ||
| if not _gc_lock_is_stale(meta): | ||
| owner = f" held by pid {meta['pid']}" if isinstance(meta, dict) and meta.get("pid") else "" | ||
| raise GcLockError(f"another worktree GC is already running ({lock_path}{owner})") from exc | ||
| logger.warning("Reclaiming stale worktree GC lock %s (previous owner gone): %s", lock_path, meta) | ||
| # A competing acquirer may win the race and re-create the lock; the | ||
| # retry below resolves that case. | ||
| with contextlib.suppress(OSError): | ||
| lock_path.unlink() | ||
| try: | ||
| return os.open(str(lock_path), os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🔴 Critical | 🏗️ Heavy lift
Serialize stale-lock reclaim before unlinking lock_path.
Line 183 is a TOCTOU: two GC processes can both read the same stale payload; after the first unlinks and recreates the lock, the second can unlink that new lock and also acquire it. Since downstream reaping relies on this single lock, this can run two reapers concurrently. Add a separate exclusive reclaim guard and re-read the payload under that guard, or switch this path to an OS-level file lock before deleting lock_path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bernstein/cli/commands/worktrees_cmd.py` around lines 176 - 186, The
stale-lock reclaim path in the worktree GC lock acquisition flow is vulnerable
to a TOCTOU race because multiple callers can evaluate the same stale metadata
and then delete/recreate the same lock. Update the lock handling in the
worktrees command (around the stale reclaim branch in the lock acquisition
helper) to serialize reclaim before unlinking `lock_path`, either by introducing
a separate exclusive reclaim guard and re-reading `_read_gc_lock` under that
guard or by using an OS-level file lock before deleting the file. Keep the
existing `GcLockError`, `_gc_lock_is_stale`, and `logger.warning` flow, but make
sure only one process can reclaim and recreate the lock at a time.
| base_ref = _resolve_graveyard_base(self.repo_root) | ||
| try: | ||
| unmerged = _count_unmerged_commits(self.repo_root, branch_name, base="main") | ||
| except Exception as exc: # defensive - never block cleanup | ||
| unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref) |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win
Hoist _resolve_graveyard_base out of the per-entry loop.
base_ref only depends on self.repo_root, which is constant across the loop, so resolving it inside the loop spawns up to 5 identical git subprocesses per stale worktree. Resolve once before the loop.
♻️ Resolve the base ref once before iterating
if not self._base_dir.exists():
return 0
cleaned = 0
+ base_ref = _resolve_graveyard_base(self.repo_root)
for entry in self._base_dir.iterdir(): branch_name = f"agent/{session_id}"
- base_ref = _resolve_graveyard_base(self.repo_root)
try:
unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| base_ref = _resolve_graveyard_base(self.repo_root) | |
| try: | |
| unmerged = _count_unmerged_commits(self.repo_root, branch_name, base="main") | |
| except Exception as exc: # defensive - never block cleanup | |
| unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref) | |
| @@ | |
| if not self._base_dir.exists(): | |
| return 0 | |
| cleaned = 0 | |
| base_ref = _resolve_graveyard_base(self.repo_root) | |
| for entry in self._base_dir.iterdir(): | |
| @@ | |
| branch_name = f"agent/{session_id}" | |
| - base_ref = _resolve_graveyard_base(self.repo_root) | |
| try: | |
| unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bernstein/core/git/worktree.py` around lines 912 - 914, Hoist the
_resolve_graveyard_base call out of the per-entry stale-worktree loop in the
worktree cleanup logic: base_ref only depends on self.repo_root, so compute it
once before iterating and reuse it for each branch. Update the surrounding code
near _count_unmerged_commits to pass the cached base_ref into each check instead
of resolving it repeatedly.
| lock_path = repo_root / GC_LOCK_RELPATH | ||
| lock_path.parent.mkdir(parents=True, exist_ok=True) | ||
| lock_path.write_text(_json.dumps({"pid": _os.getpid(), "started_at": _time.time()}), encoding="utf-8") | ||
|
|
||
| with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True): | ||
| with pytest.raises(GcLockError): | ||
| with lock_gc(repo_root): | ||
| pass |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Assert the live-owner lock remains unchanged.
Line 852 only verifies GcLockError; the test would still pass if lock_gc deleted or rewrote the live owner’s lock before raising. Preserve the written payload and assert it is still present afterward.
Concrete diff
- lock_path.write_text(_json.dumps({"pid": _os.getpid(), "started_at": _time.time()}), encoding="utf-8")
+ payload = _json.dumps({"pid": _os.getpid(), "started_at": _time.time()})
+ lock_path.write_text(payload, encoding="utf-8")
with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True):
with pytest.raises(GcLockError):
with lock_gc(repo_root):
pass
+ assert lock_path.read_text(encoding="utf-8") == payload📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| lock_path = repo_root / GC_LOCK_RELPATH | |
| lock_path.parent.mkdir(parents=True, exist_ok=True) | |
| lock_path.write_text(_json.dumps({"pid": _os.getpid(), "started_at": _time.time()}), encoding="utf-8") | |
| with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True): | |
| with pytest.raises(GcLockError): | |
| with lock_gc(repo_root): | |
| pass | |
| lock_path = repo_root / GC_LOCK_RELPATH | |
| lock_path.parent.mkdir(parents=True, exist_ok=True) | |
| payload = _json.dumps({"pid": _os.getpid(), "started_at": _time.time()}) | |
| lock_path.write_text(payload, encoding="utf-8") | |
| with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True): | |
| with pytest.raises(GcLockError): | |
| with lock_gc(repo_root): | |
| pass | |
| assert lock_path.read_text(encoding="utf-8") == payload |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_worktrees_cmd.py` around lines 847 - 854, The live-owner GC
lock test only checks that GcLockError is raised, so it could miss accidental
deletion or rewriting of the existing lock. Update the test around lock_gc and
GcLockError to capture the lock_path contents before entering the context and
assert the same JSON payload is still present afterward, using the existing
lock_path setup in test_worktrees_cmd.
Two reliability bugs in the worktree garbage collector.
Unmerged work lost on a non-
maindefault branchcleanup_all_stalepreserved an agent branch to the graveyard only whengit rev-list agent/<sid> ^mainreported commits. On a repo whose default branch is notmain, that command fails (nomainref) and the failure was read as "nothing to preserve", so the worktree and its unmerged commits were deleted. The base branch is now resolved from the repo default (origin/HEAD, thenmain/masterand their remote refs), and an inconclusive check returns-1so the caller preserves the branch rather than dropping it. Net effect: unmerged agent work is never silently lost on a non-mainrepo.Crashed GC wedged all future runs
The GC lock was created with
O_EXCLand never reclaimed, so a crashed or killedbernstein worktrees gcleft the lock file behind and every subsequent run raisedGcLockErrorforever. The lock already records{pid, started_at}; acquisition now reclaims it once when the owning process is gone or the lock is older than a generous bound, while still refusing a lock held by a live, recent process.Tests
master-default repo (the old path returned 0/delete; it now returns -1/preserve and counts accurately against the resolved base).ruffclean, 54 worktree tests pass locally.Summary by Sourcery
Improve worktree garbage collection reliability by preserving unmerged work on non-main default branch repos and reclaiming stale GC locks left by crashed processes.
Bug Fixes:
Enhancements:
Tests:
Summary by CodeRabbit
main.