Description
Problem
When the GSD doctor runs inside postUnitPreVerification during auto-mode (with fix: true), it deletes both .gsd/auto.lock and the .gsd.lock/ OS-level lock directory, because isLockProcessAlive() returns false for the current process's own PID. This breaks the session lock: proper-lockfile fires onCompromised, and the next iteration's newSession() fails or times out. Auto-mode then stops silently, because cleanupAfterLoopExit emits no user-facing notification.
Root Cause
crash-recovery.js:74:
export function isLockProcessAlive(lock) {
  const pid = lock.pid;
  if (!Number.isInteger(pid) || pid <= 0) return false;
  if (pid === process.pid) return false; // BUG: treats own PID as "dead"
  // ...
}
The pid === process.pid guard was added in PR #362 for the startAuto() context, where a matching PID means a recycled PID from a prior crashed process. But isLockProcessAlive is also called from doctor-checks.js (lines 414, 447, 551) during live auto-mode execution, where pid === process.pid means "we are the lock holder", which is very much alive.
The doctor then:
- Concludes the lock is stale -> deletes .gsd/auto.lock (clearLock)
- Concludes .gsd.lock/ is stranded -> deletes it (rmSync)
This destroys proper-lockfile's OS-level lock. The next heartbeat update tick gets ENOENT -> setLockAsCompromised -> the "Lock heartbeat caught up after Ns" message. The subsequent loop iteration either fails newSession() (30s timeout -> stopAuto) or encounters other lock-related issues.
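The misclassification can be reproduced in isolation. The sketch below copies the guard quoted from crash-recovery.js; the liveness probe that follows the guard is an assumption (the original elides it with "// ..."), using process.kill(pid, 0) as a common existence check. The ownLock record is hypothetical and stands in for what the doctor reads from auto.lock during a live session.

```javascript
// Sketch of the guard described above. Everything after the self-PID check is
// an ASSUMPTION: the issue elides the real probe with "// ...".
function isLockProcessAlive(lock) {
  const pid = lock.pid;
  if (!Number.isInteger(pid) || pid <= 0) return false;
  if (pid === process.pid) return false; // self-PID guard quoted in the issue
  try {
    process.kill(pid, 0); // signal 0 tests for existence without signalling
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return err.code === "EPERM";
  }
}

// Hypothetical lock record, as the doctor would read it during live auto-mode.
const ownLock = { pid: process.pid };

// The running process is reported as not alive, so the lock looks stale.
const reportedAlive = isLockProcessAlive(ownLock);
console.log(`own PID ${process.pid} reported alive: ${reportedAlive}`);
```

Any caller that runs inside the lock-holding process, as the doctor does, will therefore see its own lock as stale.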
Failure Sequence (observed)
- Auto-mode runs research-slice/M018/S01 successfully (PID 62594)
- postUnitPreVerification -> runGSDDoctor(basePath, { fix: true })
- Doctor reads auto.lock -> PID 62594 -> isLockProcessAlive({ pid: 62594 }) -> 62594 === process.pid -> false
- Doctor applies fix: "cleared stale auto.lock" + "removed stranded lock directory .gsd.lock"
- proper-lockfile update timer fires -> stat(.gsd.lock/) -> ENOENT -> setLockAsCompromised
- onCompromised handler suppresses (within stale window) -> stderr: "Lock heartbeat caught up after 70s"
- Iteration 2 -> newSession() times out or lock validation fails -> auto-mode stops
- cleanupAfterLoopExit() runs with no user notification (only stopAuto shows notifications)
Doctor history confirms (2026-03-25T08:13:51Z):
errors: 2
codes: ["stale_crash_lock", "stranded_lock_directory"]
fixDescriptions: ["cleared stale auto.lock", "removed stranded lock directory .../.gsd.lock"]
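The middle of that sequence (the doctor removes the lock directory, then the holder's next heartbeat hits ENOENT) can be simulated with plain Node.js file-system calls. This is only a model under stated assumptions: it does not use proper-lockfile, the heartbeat function is a hypothetical stand-in for its update timer, and everything happens in a temporary directory rather than a real project.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Throwaway project root so the sketch never touches a real .gsd.lock/.
const basePath = fs.mkdtempSync(path.join(os.tmpdir(), "gsd-lock-demo-"));
const lockDir = path.join(basePath, ".gsd.lock");

// 1. The session acquires its OS-level lock (modelled here as a directory).
fs.mkdirSync(lockDir);

// 2. The doctor wrongly decides the directory is stranded and removes it.
fs.rmSync(lockDir, { recursive: true, force: true });

// 3. The holder's next heartbeat stats the directory and finds it gone.
//    Hypothetical stand-in for the lock library's update timer.
function heartbeat(dir) {
  try {
    fs.statSync(dir);
    return "ok";
  } catch (err) {
    return err.code; // error code reported by the failed stat
  }
}

const heartbeatResult = heartbeat(lockDir);
console.log(`heartbeat result: ${heartbeatResult}`);

// Clean up the temporary directory.
fs.rmSync(basePath, { recursive: true, force: true });
```

From the holder's point of view the lock has vanished even though the process never released it, which is the condition the compromised-lock path reacts to.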
Expected Behavior
The doctor should never delete lock files that belong to the currently running auto-mode session. When auto-mode is active and the lock PID matches process.pid, the lock is not stale — it's ours.
Suggested Fix
Option A (simplest): Skip stale_crash_lock and stranded_lock_directory checks in doctor-checks.js when auto-mode is active:
import { isAutoActive } from "./auto.js";
// In the stale crash lock check:
if (lock && !isAutoActive()) {
  const alive = isLockProcessAlive(lock);
  // ...
}
Option B (more precise): Add a context parameter to isLockProcessAlive:
export function isLockProcessAlive(lock, { allowSelfPid = false } = {}) {
  const pid = lock.pid;
  if (!Number.isInteger(pid) || pid <= 0) return false;
  // Own PID: alive only when the caller says it may be the lock holder.
  if (pid === process.pid) return allowSelfPid;
  // ...
}
Then doctor-checks.js calls it with { allowSelfPid: true }, and startAuto keeps the default of false.
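A runnable sketch of Option B, showing how the two call sites would diverge for the same lock record. The process.kill(pid, 0) probe is an assumption filling in the part the issue elides, and ownLock is a hypothetical record; only the allowSelfPid option comes from the proposal itself.

```javascript
// Option B sketch. The probe after the self-PID branch is an ASSUMPTION;
// the issue elides the real implementation with "// ...".
function isLockProcessAlive(lock, { allowSelfPid = false } = {}) {
  const pid = lock.pid;
  if (!Number.isInteger(pid) || pid <= 0) return false;
  if (pid === process.pid) return allowSelfPid;
  try {
    process.kill(pid, 0); // existence check only, no signal delivered
    return true;
  } catch (err) {
    return err.code === "EPERM"; // exists, but owned by another user
  }
}

// Hypothetical lock record written by the current session.
const ownLock = { pid: process.pid };

// Doctor context: the current process may legitimately hold the lock.
const fromDoctor = isLockProcessAlive(ownLock, { allowSelfPid: true });

// startAuto context: a matching PID is treated as recycled, as before.
const fromStartAuto = isLockProcessAlive(ownLock);

console.log({ fromDoctor, fromStartAuto });
```

Existing callers that pass no options keep the current behavior, so only the doctor call sites would need to change.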
Option A is safer — avoids changing the isLockProcessAlive contract for all callers.
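A small sketch of how Option A would behave for a lock owned by the current process. Several names here are assumptions for illustration: isAutoActive is stubbed rather than imported from auto.js, checkStaleCrashLock is a hypothetical wrapper around the doctor check, and the liveness probe is replaced by a placeholder because the issue elides it.

```javascript
// ASSUMPTION: stand-in for the isAutoActive() export of auto.js.
let autoActive = false;
const isAutoActive = () => autoActive;

// Guard as quoted in the issue, left unchanged under Option A.
function isLockProcessAlive(lock) {
  const pid = lock.pid;
  if (!Number.isInteger(pid) || pid <= 0) return false;
  if (pid === process.pid) return false;
  return true; // placeholder for the real probe, which the issue elides
}

// Hypothetical doctor check: returns the issue codes it would report.
function checkStaleCrashLock(lock) {
  const issues = [];
  if (lock && !isAutoActive()) {
    if (!isLockProcessAlive(lock)) issues.push("stale_crash_lock");
  }
  return issues;
}

const ownLock = { pid: process.pid };

// During auto-mode the check is skipped, so nothing is flagged for deletion.
autoActive = true;
const duringAuto = checkStaleCrashLock(ownLock);

// Outside auto-mode the existing stale-lock detection still applies.
autoActive = false;
const outsideAuto = checkStaleCrashLock(ownLock);

console.log({ duringAuto, outsideAuto });
```

The same guard would wrap the stranded_lock_directory check, so neither fix can run against the session's own lock.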
Related Issues
- guided-flow crash lock detection lacks PID self-check → infinite 'Interrupted Session' loop when all milestones complete #1398: Same pid === process.pid misinterpretation in guided-flow.ts (fixed there with a PID check, but isLockProcessAlive was not patched)
- Auto-mode reports session takeover after advisory verification failure and lock loss #1501: Session takeover misdiagnosis after lock loss (symptoms overlap, different root cause path)
- PR fix(auto): stale lock detection, SIGTERM handler, live-session guard #362: Introduced isLockProcessAlive with the self-PID guard
Environment
- GSD version: 2.45.0
- Model: claude-opus-4-6
- Unit: research-slice/M018/S01 (completed successfully, doctor ran post-unit)
Auto-generated by /gsd forensics