fix(worktrees): preserve unmerged work on non-main repos and recover stale GC lock #2093

Merged: chernistry merged 1 commit into main from fix/worktree-gc-reliability on Jun 25, 2026

@chernistry chernistry commented Jun 25, 2026

Collaborator

Two reliability bugs in the worktree garbage collector.

Unmerged work lost on a non-main default branch

cleanup_all_stale preserved an agent branch to the graveyard only when git rev-list agent/<sid> ^main reported commits. On a repo whose default branch is not main, that command fails (no main ref) and the failure was read as "nothing to preserve", so the worktree and its unmerged commits were deleted. The base branch is now resolved from the repo default (origin/HEAD, then main/master and their remote refs), and an inconclusive check returns -1 so the caller preserves the branch rather than dropping it. Net effect: unmerged agent work is never silently lost on a non-main repo.
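
The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the name `resolve_base`, the injected `ref_exists` callable (standing in for a `git rev-parse --verify --quiet` check), and the exact ordering of the fallback candidates are all assumptions.

```python
from __future__ import annotations

from collections.abc import Callable

# Candidates tried after origin/HEAD. The exact order is an assumption.
_FALLBACK_REFS = ("main", "master", "origin/main", "origin/master")


def resolve_base(origin_head: str | None, ref_exists: Callable[[str], bool]) -> str:
    """Pick the ref that agent branches are compared against.

    origin_head is the target of refs/remotes/origin/HEAD (for example
    "origin/develop"), or None when the remote HEAD is not set.
    """
    if origin_head and ref_exists(origin_head):
        return origin_head
    for ref in _FALLBACK_REFS:
        if ref_exists(ref):
            return ref
    # Last resort. The caller must still treat a failed count against this
    # ref as inconclusive rather than as "nothing to preserve".
    return "main"
```

On a repo that only has `master`, `resolve_base(None, {"master"}.__contains__)` picks `master`, so the later rev-list check has a real ref to compare against.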

Crashed GC wedged all future runs

The GC lock was created with O_EXCL and never reclaimed, so a crashed or killed bernstein worktrees gc left the lock file behind and every subsequent run raised GcLockError forever. The lock already records {pid, started_at}; acquisition now reclaims it once when the owning process is gone or the lock is older than a generous bound, while still refusing a lock held by a live, recent process.
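
A rough sketch of the reclaim-once acquisition described above, under stated assumptions: `LockHeld` stands in for `GcLockError`, `is_alive` is an injected liveness probe, and `MAX_AGE_S` mirrors the six-hour bound quoted later in the review. It does not address concurrent reclaimers, which the review discusses separately.

```python
from __future__ import annotations

import json
import os
import time
from collections.abc import Callable
from pathlib import Path

MAX_AGE_S = 6 * 3600  # assumed bound; a legitimately long sweep must fit under it


class LockHeld(RuntimeError):
    """Hypothetical stand-in for the PR's GcLockError."""


def is_stale(meta: object, is_alive: Callable[[int], bool], now: float) -> bool:
    if not isinstance(meta, dict):
        return False  # unreadable or mid-write payload: never reclaim
    pid = meta.get("pid")
    if isinstance(pid, int) and pid > 0 and not is_alive(pid):
        return True
    started = meta.get("started_at")
    return isinstance(started, (int, float)) and now - started > MAX_AGE_S


def acquire(lock_path: Path, is_alive: Callable[[int], bool]) -> None:
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    for attempt in range(2):  # the initial try plus exactly one reclaim
        try:
            fd = os.open(lock_path, flags, 0o600)
        except FileExistsError:
            try:
                meta = json.loads(lock_path.read_text(encoding="utf-8"))
            except (OSError, ValueError):
                meta = None
            if attempt == 1 or not is_stale(meta, is_alive, time.time()):
                raise LockHeld(str(lock_path)) from None
            lock_path.unlink(missing_ok=True)
            continue
        # We own the lock: record who holds it and since when.
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"pid": os.getpid(), "started_at": time.time()}, fh)
        return
```

Bounding the loop at two attempts keeps the "reclaim once" property: if a competing process re-creates the lock between the unlink and the retry, the second failure is reported instead of looping.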

Tests

  • New graveyard tests prove preservation on a master-default repo (the old path returned 0/delete; it now returns -1/preserve and counts accurately against the resolved base).
  • New lock tests prove a dead-owner lock is reclaimed and a live-owner lock is still respected.
  • ruff clean, 54 worktree tests pass locally.

Summary by Sourcery

Improve worktree garbage collection reliability by preserving unmerged work on non-main default branch repos and reclaiming stale GC locks left by crashed processes.

Bug Fixes:

  • Ensure graveyard preservation logic respects the repository's actual default branch and never drops unmerged agent work on an inconclusive check.
  • Allow GC lock acquisition to reclaim stale locks owned by dead or overly old processes instead of wedging future runs.

Enhancements:

  • Add helper utilities for resolving the graveyard base branch and detecting branch existence to support more robust unmerged-commit counting.
  • Refine GC lock handling with structured lock metadata reading and staleness checks based on pid liveness and lock age.

Tests:

  • Add graveyard tests for non-main default branch repositories to verify correct base resolution and unmerged commit counting semantics.
  • Add GC lock tests covering stale detection, reclaiming locks from dead owners, respecting live owners, and handling unreadable lock payloads.

Summary by CodeRabbit

  • Bug Fixes
    • Improved worktree cleanup so it uses the repository’s actual default branch instead of assuming main.
    • Cleanup now preserves worktrees when it can’t confidently verify unmerged changes, avoiding accidental deletion.
    • Garbage collection is more resilient to stale locks and can recover from lock files left behind by crashed runs.

…stale GC lock

Two reliability bugs in the worktree garbage collector:

- Graveyard preservation compared agent branches against a hardcoded
  `main`, so on a repository whose default branch is not `main` the
  rev-list check failed and was read as "nothing to preserve", letting
  cleanup_all_stale delete unmerged agent commits. The base is now
  resolved from the repo default (origin/HEAD, main, master, and their
  remote refs), and an inconclusive check preserves the branch to the
  graveyard instead of dropping it.

- The GC lock had no stale-lock recovery: a crashed or killed GC left the
  lock behind and wedged every future `bernstein worktrees gc`. The lock
  already recorded the owner pid and start time; acquisition now reclaims
  a lock once when the owning process is gone or the lock is older than a
  generous bound, while still refusing a lock held by a live recent
  process.

sourcery-ai Bot commented Jun 25, 2026

Reviewer's Guide

This PR hardens worktree garbage collection by correctly detecting unmerged agent work on repos whose default branch is not main, and by making the GC lock recoverable when the owning process crashes or the lock goes stale, backed by new unit tests for both behaviors.

Sequence diagram for graveyard preservation on non-main default branches

sequenceDiagram
    participant WorktreeGC as cleanup_all_stale
    participant GitRepo as git

    WorktreeGC->>GitRepo: _resolve_graveyard_base(repo_root)
    GitRepo-->>WorktreeGC: base_ref

    WorktreeGC->>GitRepo: _count_unmerged_commits(repo_root, branch, base_ref)
    alt rev_list succeeds and output > 0
        GitRepo-->>WorktreeGC: unmerged_count > 0
        WorktreeGC->>WorktreeGC: preserve_branch_to_graveyard(repo_root, session_id, branch)
    else rev_list fails but branch exists
        GitRepo-->>WorktreeGC: -1
        WorktreeGC->>WorktreeGC: preserve_branch_to_graveyard(repo_root, session_id, branch)
    else branch missing or fully merged
        GitRepo-->>WorktreeGC: 0
        WorktreeGC->>WorktreeGC: delete stale worktree
    end
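
The three branches in the diagram above reduce to a small decision, sketched here as pure functions. The names `count_unmerged` and `should_preserve` are illustrative, and feeding in the output of `git rev-list --count <branch> ^<base>` is an assumption about how the count is obtained.

```python
from __future__ import annotations


def count_unmerged(rev_list_output: str | None, branch_exists: bool) -> int:
    """Map a rev-list result onto the 0 / >0 / -1 contract.

    rev_list_output is the command's stdout, or None when the command failed
    (for example because the base ref does not exist).
    """
    if rev_list_output is not None:
        try:
            return int(rev_list_output.strip())
        except ValueError:
            pass  # unparseable output is treated like a failed command
    # Inconclusive: report -1 while the branch still exists, and 0 only when
    # there is no branch left that could hold unmerged work.
    return -1 if branch_exists else 0


def should_preserve(unmerged: int) -> bool:
    """Preserve to the graveyard on any non-zero count, including -1."""
    return unmerged != 0
```

Treating `-1` as "preserve" means a broken or missing base ref costs an extra graveyard entry rather than lost commits.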

Sequence diagram for recoverable GC lock acquisition

sequenceDiagram
    participant GcRunner as lock_gc
    participant FS as _acquire_gc_lock_fd

    GcRunner->>FS: _acquire_gc_lock_fd(lock_path)
    alt lock file does not exist
        FS-->>GcRunner: fd
    else lock file exists
        FS->>FS: _read_gc_lock(lock_path)
        alt _gc_lock_is_stale(meta) is False
            FS-->>GcRunner: GcLockError
        else _gc_lock_is_stale(meta) is True
            FS->>FS: lock_path.unlink()
            alt retry os.open succeeds
                FS-->>GcRunner: fd
            else retry os.open FileExistsError
                FS-->>GcRunner: GcLockError
            end
        end
    end

    GcRunner->>GcRunner: write {pid, started_at} to lock file
    GcRunner->>GcRunner: run GC and remove lock on exit

File-Level Changes

Change: Graveyard unmerged-commit detection now respects the repository’s actual default branch and treats inconclusive rev-list checks as "preserve" instead of "drop".
Details:
  • Introduce helper to check if a branch exists via git rev-parse --verify --quiet with timeouts and failure handling.
  • Add _resolve_graveyard_base to determine the default-base ref using origin/HEAD and conventional branch names (local and remote) with a fallback to main.
  • Change _count_unmerged_commits to return -1 on inconclusive results when the branch still exists, 0 only when nothing is at risk, and update its docstring to reflect the semantics.
  • Update cleanup_all_stale to resolve the base ref via _resolve_graveyard_base, treat any non-zero unmerged count (including -1) as "preserve to graveyard", and improve logging to include the resolved base and unknown-count cases.
Files: src/bernstein/core/git/worktree.py

Change: GC lock acquisition now supports stale-lock recovery while still refusing locks held by live, recent processes.
Details:
  • Define _GC_LOCK_MAX_AGE_S as an upper bound on legitimate GC lock age.
  • Add _read_gc_lock to parse the lock file JSON payload into a {pid, started_at} dict or None with robust error handling.
  • Implement _gc_lock_is_stale to consider a lock stale when its pid is no longer alive or its age exceeds _GC_LOCK_MAX_AGE_S, while treating unreadable/meta-less locks as non-stale.
  • Add _acquire_gc_lock_fd to encapsulate O_EXCL lock acquisition, reclaim stale locks once (including unlink-and-retry logic), and raise GcLockError for live competing owners with informative messages.
  • Refactor lock_gc to use _acquire_gc_lock_fd, preserving its context-manager semantics and updated docstring for the new recovery behavior.
Files: src/bernstein/cli/commands/worktrees_cmd.py

Change: New tests validate graveyard behavior on non-main default branches and GC lock stale recovery semantics.
Details:
  • Add tests in test_worktree_graveyard.py that initialize repos with master as default, verify _resolve_graveyard_base picks master, ensure _count_unmerged_commits returns -1 when the base ref is missing but the branch exists, and confirm accurate unmerged counts when using the resolved base.
  • Introduce helper _init_repo_on in test_worktree_graveyard.py to set up test repos on a specified default branch with a seeded commit.
  • Add GC lock unit tests in test_worktrees_cmd.py for _gc_lock_is_stale (dead pid -> stale, live pid -> not stale, too-old lock -> stale, unreadable meta -> not stale) and for lock_gc reclaiming locks from dead owners while respecting locks held by live owners and raising GcLockError accordingly.
Files: tests/unit/test_worktree_graveyard.py, tests/unit/test_worktrees_cmd.py

@github-actions

Contributor

Sonar insights (advisory, no merge-block)

Snapshot of bernstein on the configured Sonar instance:

| Metric | Value |
| --- | --- |
| Coverage | 80.1 |
| Code smells | 0 |
| Bugs | 0 |
| Vulnerabilities | 0 |
| Security hotspots | 0 |

Run bernstein doctor sonar locally for the full surface.

This comment is a soft signal. The Sonar scan runs on push to main; the PR check itself never fails on smells.

coderabbitai Bot commented Jun 25, 2026

Walkthrough

GC lock acquisition now reclaims stale lock files after PID and age checks. Worktree cleanup now resolves the repository’s default branch before counting unmerged commits, and preserves stale worktrees when the count is nonzero or inconclusive.

Changes

GC lock recovery

| Layer / File(s) | Summary |
| --- | --- |
| Stale GC lock acquisition: src/bernstein/cli/commands/worktrees_cmd.py (28, 140-204), tests/unit/test_worktrees_cmd.py (777-854) | lock_gc now reads lock payloads, checks the recorded PID and age, removes stale locks, retries acquisition once, and the unit tests cover stale, live-owner, and reclaim cases. |

Graveyard base resolution

| Layer / File(s) | Summary |
| --- | --- |
| Default branch detection and count fallback: src/bernstein/core/git/worktree.py (339-437), tests/unit/test_worktree_graveyard.py (247-304) | _resolve_graveyard_base picks the repo’s default branch, _count_unmerged_commits returns -1 when the branch exists but the count is inconclusive, and the tests cover master and missing-main repos. |
| Cleanup preservation wiring: src/bernstein/core/git/worktree.py (912-925) | cleanup_all_stale now uses the resolved base ref, keeps stale worktrees when the unmerged count is nonzero or inconclusive, and reports the resolved base in the warning path. |

Sequence Diagram(s)

GC lock recovery

sequenceDiagram
  participant lock_gc
  participant _acquire_gc_lock_fd
  participant _gc_lock_is_stale
  participant is_process_alive
  participant lock_file

  lock_gc->>_acquire_gc_lock_fd: acquire GC lock
  _acquire_gc_lock_fd->>lock_file: read JSON payload
  _acquire_gc_lock_fd->>_gc_lock_is_stale: check pid and started_at
  _gc_lock_is_stale->>is_process_alive: check recorded PID

  alt stale lock
    _acquire_gc_lock_fd->>lock_file: remove stale lock file
    _acquire_gc_lock_fd->>lock_file: retry acquisition
  else live owner
    _acquire_gc_lock_fd-->>lock_gc: raise GcLockError
  end

Graveyard cleanup

sequenceDiagram
  participant cleanup_all_stale
  participant _resolve_graveyard_base
  participant _count_unmerged_commits

  cleanup_all_stale->>_resolve_graveyard_base: resolve base_ref
  _resolve_graveyard_base-->>cleanup_all_stale: default branch
  cleanup_all_stale->>_count_unmerged_commits: count commits against base_ref
  _count_unmerged_commits-->>cleanup_all_stale: 0, >0, or -1

  alt unmerged > 0 or inconclusive
    cleanup_all_stale->>cleanup_all_stale: preserve stale worktree
  else unmerged == 0
    cleanup_all_stale->>cleanup_all_stale: delete stale worktree
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/l, core, cli, tests

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description covers the change well, but it omits the required checklist and documentation-duty sections from the template. | Add the template's What/Why/How/Checklist sections, including the documentation-duty checkboxes and test/status items. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the two main changes: preserving non-main worktree cleanup and recovering stale GC locks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

Contributor

bernstein doctor observe for PR #2093 (fix/worktree-gc-reliability): ok=1, warn=1, fail=0, error=0, skipped=2

sonar -- OK (project bernstein)

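| metric | value | delta | threshold | status |
| --- | --- | --- | --- | --- |
| coverage_pct | 80.1% | new | 80.0% | ok |
| code_smells | 0 | new | 50 | ok |
| bugs | 0 | new | 0 | ok |
| vulnerabilities | 0 | new | 0 | ok |
| security_hotspots | 0 | new | 0 | ok |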

code-scanning -- WARN (1 open alert(s))

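| metric | value | delta | threshold | status |
| --- | --- | --- | --- | --- |
| open_alerts | 1 | new | 0 | warn |
| critical_alerts | 0 | new | 0 | ok |
| high_alerts | 1 | new | 0 | warn |
| medium_alerts | 0 | new | - | ok |
| low_alerts | 0 | new | - | ok |

Skipped backends (credentials not configured)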
  • glitchtip: BERNSTEIN_GLITCHTIP_TOKEN not set
  • dt: DTRACK_URL/TOKEN/PROJECT not set

See docs/observability/unified-doctor.md for backend setup notes.

github-actions Bot commented Jun 25, 2026

Contributor

Review-bot acknowledgement summary

  • Must-address findings: 0 (0 acknowledged, 0 open)
  • Informational findings: 5

All must-address findings are resolved or acknowledged.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In _acquire_gc_lock_fd, when constructing the owner string you access meta['pid'] directly, which can raise KeyError if the lock metadata dict lacks a pid; use meta.get('pid') consistently to avoid blowing up on malformed lock files.
  • The git subprocess invocation pattern for checking branch/base existence is duplicated between _branch_exists and _resolve_graveyard_base; consider extracting a small helper to avoid repetition and keep the graveyard resolution logic easier to maintain.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bernstein/cli/commands/worktrees_cmd.py`:
- Around line 140-142: Move the new GC lock age tuning constant out of
worktrees_cmd and into core/defaults.py as required by the constants policy. Add
the constant in core/defaults.py, then import and use it from the worktrees_cmd
module where the GC lock handling logic references _GC_LOCK_MAX_AGE_S, keeping
the existing behavior unchanged.
- Around line 176-186: The stale-lock reclaim path in the worktree GC lock
acquisition flow is vulnerable to a TOCTOU race because multiple callers can
evaluate the same stale metadata and then delete/recreate the same lock. Update
the lock handling in the worktrees command (around the stale reclaim branch in
the lock acquisition helper) to serialize reclaim before unlinking `lock_path`,
either by introducing a separate exclusive reclaim guard and re-reading
`_read_gc_lock` under that guard or by using an OS-level file lock before
deleting the file. Keep the existing `GcLockError`, `_gc_lock_is_stale`, and
`logger.warning` flow, but make sure only one process can reclaim and recreate
the lock at a time.
- Around line 145-168: The GC lock helpers currently use untyped dict metadata
and a type-ignore on lock_gc, which should be replaced with a private TypedDict
for the lock payload. Define a dedicated typed payload for {pid, started_at},
then update _read_gc_lock and _gc_lock_is_stale to accept/return that type
instead of dict[str, object], and change lock_gc to be annotated as
Iterator[Path] so the no-untyped-def ignore can be removed. Use the existing
helper names _read_gc_lock, _gc_lock_is_stale, and lock_gc to keep the changes
localized.

In `@src/bernstein/core/git/worktree.py`:
- Around line 912-914: Hoist the _resolve_graveyard_base call out of the
per-entry stale-worktree loop in the worktree cleanup logic: base_ref only
depends on self.repo_root, so compute it once before iterating and reuse it for
each branch. Update the surrounding code near _count_unmerged_commits to pass
the cached base_ref into each check instead of resolving it repeatedly.

In `@tests/unit/test_worktrees_cmd.py`:
- Around line 847-854: The live-owner GC lock test only checks that GcLockError
is raised, so it could miss accidental deletion or rewriting of the existing
lock. Update the test around lock_gc and GcLockError to capture the lock_path
contents before entering the context and assert the same JSON payload is still
present afterward, using the existing lock_path setup in test_worktrees_cmd.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9ea5bdc5-2d17-4b60-89d9-76e66178ab2b

📥 Commits

Reviewing files that changed from the base of the PR and between 7269e73 and d66d5cd.

📒 Files selected for processing (4)
  • src/bernstein/cli/commands/worktrees_cmd.py
  • src/bernstein/core/git/worktree.py
  • tests/unit/test_worktree_graveyard.py
  • tests/unit/test_worktrees_cmd.py

Comment on lines +140 to +142
# A GC that has "owned" the lock longer than this is treated as a crashed
# leftover. Generous so a legitimately long sweep is never reclaimed under it.
_GC_LOCK_MAX_AGE_S = 6 * 3600

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move _GC_LOCK_MAX_AGE_S to src/bernstein/core/defaults.py.

Line 142 adds a new tuning constant inline in the CLI module. Put it in core/defaults.py and import it here.

As per coding guidelines, “New constants must go in core/defaults.py, not inline in other modules.”

Source: Coding guidelines

Comment on lines +145 to +168
def _read_gc_lock(lock_path: Path) -> dict[str, object] | None:
    """Return the lock's recorded ``{pid, started_at}`` payload, or None."""
    try:
        data = json.loads(lock_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def _gc_lock_is_stale(meta: dict[str, object] | None) -> bool:
    """True when the lock's owning process is gone or the lock is too old.

    An unreadable / mid-write payload is NOT treated as stale, so a lock another
    process just created (between ``O_EXCL`` and the write) is never reclaimed.
    The crashed-GC case the recovery targets always leaves a fully written
    ``{pid, started_at}`` payload, which the pid-liveness check below detects.
    """
    if not isinstance(meta, dict):
        return False
    pid = meta.get("pid")
    if isinstance(pid, int) and pid > 0 and not is_process_alive(pid):
        return True
    started = meta.get("started_at")
    return isinstance(started, (int, float)) and (time.time() - started) > _GC_LOCK_MAX_AGE_S

📐 Maintainability & Code Quality | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify strict typing after replacing raw dict metadata and the no-untyped-def ignore.
uv run pyright src/
uv run mypy src

Repository: sipyourdrink-ltd/bernstein

Length of output: 240


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file and related symbols.
git ls-files src/bernstein/cli/commands/worktrees_cmd.py
wc -l src/bernstein/cli/commands/worktrees_cmd.py
cat -n src/bernstein/cli/commands/worktrees_cmd.py | sed -n '120,230p'

printf '\n--- symbol search ---\n'
rg -n "\block_gc\b|_read_gc_lock|_gc_lock_is_stale|GC_LOCK" src/bernstein/cli/commands/worktrees_cmd.py

Repository: sipyourdrink-ltd/bernstein

Length of output: 6365


Type the GC lock helpers in src/bernstein/cli/commands/worktrees_cmd.py:145-192

  • Replace the raw dict[str, object] metadata with a private TypedDict for {pid, started_at}.
  • Update _read_gc_lock, _gc_lock_is_stale, and lock_gc to use that type, and annotate lock_gc as Iterator[Path] instead of # type: ignore[no-untyped-def].
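
One way the typed payload could look is sketched below. The class name `_GcLockPayload` and the stricter field validation are assumptions rather than part of the PR; note that validating fields in the reader changes behavior slightly, because a payload with a missing or malformed `pid` becomes `None` (and therefore non-stale) instead of reaching the age check.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import TypedDict


class _GcLockPayload(TypedDict):
    """Shape of the JSON written into the GC lock file (assumed)."""

    pid: int
    started_at: float


def read_gc_lock(lock_path: Path) -> _GcLockPayload | None:
    """Return a validated payload, or None when the file is unreadable or malformed."""
    try:
        data = json.loads(lock_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    pid, started = data.get("pid"), data.get("started_at")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(pid, bool) or not isinstance(pid, int):
        return None
    if isinstance(started, bool) or not isinstance(started, (int, float)):
        return None
    return {"pid": pid, "started_at": float(started)}
```

With the payload typed this way, callers can index `meta["pid"]` without a `KeyError` risk, which also covers the Sourcery note about direct key access.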

Sources: Coding guidelines, Path instructions

Comment on lines +176 to +186
meta = _read_gc_lock(lock_path)
if not _gc_lock_is_stale(meta):
    owner = f" held by pid {meta['pid']}" if isinstance(meta, dict) and meta.get("pid") else ""
    raise GcLockError(f"another worktree GC is already running ({lock_path}{owner})") from exc
logger.warning("Reclaiming stale worktree GC lock %s (previous owner gone): %s", lock_path, meta)
# A competing acquirer may win the race and re-create the lock; the
# retry below resolves that case.
with contextlib.suppress(OSError):
    lock_path.unlink()
try:
    return os.open(str(lock_path), os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)

🗄️ Data Integrity & Integration | 🔴 Critical | 🏗️ Heavy lift

Serialize stale-lock reclaim before unlinking lock_path.

Line 183 is a TOCTOU: two GC processes can both read the same stale payload; after the first unlinks and recreates the lock, the second can unlink that new lock and also acquire it. Since downstream reaping relies on this single lock, this can run two reapers concurrently. Add a separate exclusive reclaim guard and re-read the payload under that guard, or switch this path to an OS-level file lock before deleting lock_path.
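
The first option (a separate exclusive reclaim guard with a re-read under it) might look roughly like this. It is a sketch under assumptions: the `.reclaim` guard file name, the `reclaim_serialized` helper, and the injected `is_stale` predicate are hypothetical, and a guard file left behind by a crashed reclaimer would need its own cleanup, which is not shown.

```python
from __future__ import annotations

import contextlib
import json
import os
import time
from collections.abc import Callable
from pathlib import Path

_FLAGS = os.O_CREAT | os.O_EXCL | os.O_WRONLY


def _read(path: Path) -> object:
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def reclaim_serialized(lock_path: Path, is_stale: Callable[[object], bool]) -> bool:
    """Try to take over a stale lock; True only if this process now owns it."""
    guard = lock_path.with_name(lock_path.name + ".reclaim")  # hypothetical guard file
    try:
        guard_fd = os.open(guard, _FLAGS, 0o600)
    except FileExistsError:
        return False  # another process is reclaiming right now
    try:
        # Re-check under the guard: the lock may have been replaced since the
        # caller first judged it stale.
        if not is_stale(_read(lock_path)):
            return False
        with contextlib.suppress(FileNotFoundError):
            lock_path.unlink()
        try:
            fd = os.open(lock_path, _FLAGS, 0o600)
        except FileExistsError:
            return False  # a fresh acquirer took the slot first
        # Write the payload before the guard is released, so the next
        # reclaimer sees a live owner rather than an empty file.
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"pid": os.getpid(), "started_at": time.time()}, fh)
        return True
    finally:
        os.close(guard_fd)
        with contextlib.suppress(FileNotFoundError):
            guard.unlink()
```

Because the unlink only ever happens while the guard is held and after a fresh read, a second process that judged the old payload stale can no longer delete the lock the first process just re-created.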

Comment on lines +912 to +914
base_ref = _resolve_graveyard_base(self.repo_root)
try:
unmerged = _count_unmerged_commits(self.repo_root, branch_name, base="main")
except Exception as exc: # defensive - never block cleanup
unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref)

🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Hoist _resolve_graveyard_base out of the per-entry loop.

base_ref only depends on self.repo_root, which is constant across the loop, so resolving it inside the loop spawns up to 5 identical git subprocesses per stale worktree. Resolve once before the loop.

♻️ Resolve the base ref once before iterating
         if not self._base_dir.exists():
             return 0
         cleaned = 0
+        base_ref = _resolve_graveyard_base(self.repo_root)
         for entry in self._base_dir.iterdir():
                 branch_name = f"agent/{session_id}"
-                base_ref = _resolve_graveyard_base(self.repo_root)
                 try:
                     unmerged = _count_unmerged_commits(self.repo_root, branch_name, base=base_ref)

Comment on lines +847 to +854
    lock_path = repo_root / GC_LOCK_RELPATH
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path.write_text(_json.dumps({"pid": _os.getpid(), "started_at": _time.time()}), encoding="utf-8")

    with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True):
        with pytest.raises(GcLockError):
            with lock_gc(repo_root):
                pass

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert the live-owner lock remains unchanged.

Line 852 only verifies GcLockError; the test would still pass if lock_gc deleted or rewrote the live owner’s lock before raising. Preserve the written payload and assert it is still present afterward.

Concrete diff
-    lock_path.write_text(_json.dumps({"pid": _os.getpid(), "started_at": _time.time()}), encoding="utf-8")
+    payload = _json.dumps({"pid": _os.getpid(), "started_at": _time.time()})
+    lock_path.write_text(payload, encoding="utf-8")
 
     with patch("bernstein.cli.commands.worktrees_cmd.is_process_alive", return_value=True):
         with pytest.raises(GcLockError):
             with lock_gc(repo_root):
                 pass
+    assert lock_path.read_text(encoding="utf-8") == payload

@chernistry chernistry merged commit 27da59e into main Jun 25, 2026
81 of 83 checks passed
@chernistry chernistry deleted the fix/worktree-gc-reliability branch June 25, 2026 11:24
This was referenced Jun 25, 2026