Doctor deletes own auto.lock mid-session — isLockProcessAlive treats self-PID as dead #2470

@sabal202

Problem

When the GSD doctor runs inside postUnitPreVerification during auto-mode (with fix: true), it deletes both .gsd/auto.lock and the OS-level lock directory .gsd.lock/, because isLockProcessAlive() returns false for the current process's own PID. Deleting the directory breaks the session lock: proper-lockfile fires onCompromised, and the next iteration's newSession() fails or times out. Auto-mode then stops silently, because cleanupAfterLoopExit shows no user-facing notification.

Root Cause

crash-recovery.js:74:

export function isLockProcessAlive(lock) {
    const pid = lock.pid;
    if (!Number.isInteger(pid) || pid <= 0) return false;
    if (pid === process.pid) return false;  // BUG: treats own PID as "dead"
    // ...
}

The pid === process.pid guard was added in PR #362 for the startAuto() context, where a matching PID means the PID was recycled from a prior crashed process. But isLockProcessAlive is also called from doctor-checks.js (lines 414, 447, 551) during live auto-mode execution, where pid === process.pid means "we are the lock holder", which is very much alive.

The doctor then:

  1. Concludes the lock is stale -> deletes .gsd/auto.lock (clearLock)
  2. Concludes .gsd.lock/ is stranded -> deletes it (rmSync)

This destroys proper-lockfile's OS-level lock. The next heartbeat update tick gets ENOENT, which triggers setLockAsCompromised and produces the "Lock heartbeat caught up after Ns" message. The subsequent loop iteration either fails in newSession() (30s timeout -> stopAuto) or runs into other lock-related errors.

Failure Sequence (observed)

  1. Auto-mode runs research-slice/M018/S01 successfully (PID 62594)
  2. postUnitPreVerification -> runGSDDoctor(basePath, { fix: true })
  3. Doctor reads auto.lock -> PID 62594 -> isLockProcessAlive({ pid: 62594 }) -> 62594 === process.pid -> false
  4. Doctor applies fix: "cleared stale auto.lock" + "removed stranded lock directory .gsd.lock"
  5. proper-lockfile update timer fires -> stat(.gsd.lock/) -> ENOENT -> setLockAsCompromised
  6. onCompromised handler suppresses (within stale window) -> stderr: "Lock heartbeat caught up after 70s"
  7. Iteration 2 -> newSession() times out or lock validation fails -> auto-mode stops
  8. cleanupAfterLoopExit() runs — no user notification (only stopAuto shows notifications)
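
Step 3 of the sequence can be reproduced in isolation. The sketch below is a stand-in, not the real crash-recovery.js: only the first three statements match the snippet quoted under Root Cause, and the process.kill(pid, 0) probe is an assumption filling in for the elided body.

```javascript
// Stand-in mirroring the guard quoted from crash-recovery.js.
// ASSUMPTION: the real function's remaining body is elided in this report;
// the process.kill(pid, 0) probe below is illustrative only.
function isLockProcessAlive(lock) {
    const pid = lock.pid;
    if (!Number.isInteger(pid) || pid <= 0) return false;
    if (pid === process.pid) return false; // the guard under discussion
    try {
        process.kill(pid, 0); // signal 0 sends nothing, it only tests for existence
        return true;
    } catch (err) {
        return err.code === "EPERM"; // process exists but belongs to another user
    }
}

// Simulates the doctor reading a lock that this very process wrote.
const ownLock = { pid: process.pid };
const reportedAlive = isLockProcessAlive(ownLock);
console.log(reportedAlive); // false, so the doctor would treat the lock as stale
```

With the lock holder reported as dead, both the stale_crash_lock and stranded_lock_directory fixes become eligible to run against a live session.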

Doctor history confirms (2026-03-25T08:13:51Z):

errors: 2
codes: ["stale_crash_lock", "stranded_lock_directory"]
fixDescriptions: ["cleared stale auto.lock", "removed stranded lock directory .../.gsd.lock"]

Expected Behavior

The doctor should never delete lock files that belong to the currently running auto-mode session. When auto-mode is active and the lock PID matches process.pid, the lock is not stale — it's ours.

Suggested Fix

Option A (simplest): Skip stale_crash_lock and stranded_lock_directory checks in doctor-checks.js when auto-mode is active:

import { isAutoActive } from "./auto.js";

// In the stale crash lock check:
if (lock && !isAutoActive()) {
    const alive = isLockProcessAlive(lock);
    // ...
}
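
A self-contained sketch of how the Option A guard might behave. The helper checkStaleCrashLock and the autoActive flag are hypothetical names, and both isAutoActive and isLockProcessAlive are replaced by stubs here, so this shows the control flow only, not the real doctor-checks.js.

```javascript
// HYPOTHETICAL stubs standing in for the real modules.
let autoActive = false;
const isAutoActive = () => autoActive;
// Mirrors only the self-PID guard; the real liveness probe is not modelled.
const isLockProcessAlive = (lock) => lock.pid !== process.pid;

// Hypothetical helper: returns an issue code, or null when nothing is flagged.
function checkStaleCrashLock(lock) {
    // Option A: while auto-mode is active the lock is ours, so skip the check.
    if (!lock || isAutoActive()) return null;
    return isLockProcessAlive(lock) ? null : "stale_crash_lock";
}

const ownLock = { pid: process.pid };

autoActive = true;
const duringAuto = checkStaleCrashLock(ownLock); // check skipped

autoActive = false;
const outsideAuto = checkStaleCrashLock(ownLock); // check runs as before

console.log(duringAuto, outsideAuto); // null stale_crash_lock
```

The same guard would presumably wrap the stranded_lock_directory check, since both fixes fired in the observed failure.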

Option B (more precise): Add context parameter to isLockProcessAlive:

export function isLockProcessAlive(lock, { allowSelfPid = false } = {}) {
    const pid = lock.pid;
    if (!Number.isInteger(pid) || pid <= 0) return false;
    if (pid === process.pid) return allowSelfPid;  // own PID: alive only if the caller is the lock holder
    // ...
}

doctor-checks.js would then call it with { allowSelfPid: true }, while startAuto keeps the default of false.
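
The two call sites can be compared side by side in a runnable sketch. As before, the process.kill(pid, 0) probe is an assumption standing in for the elided part of the function; only the signature and the first three statements come from this report.

```javascript
// Sketch of Option B.
// ASSUMPTION: the liveness probe after the self-PID guard is illustrative.
function isLockProcessAlive(lock, { allowSelfPid = false } = {}) {
    const pid = lock.pid;
    if (!Number.isInteger(pid) || pid <= 0) return false;
    if (pid === process.pid) return allowSelfPid;
    try {
        process.kill(pid, 0); // signal 0 only tests whether the process exists
        return true;
    } catch (err) {
        return err.code === "EPERM"; // exists, but we may not signal it
    }
}

const lock = { pid: process.pid };

// startAuto context: a matching PID is taken to be a recycled one.
const seenByStartAuto = isLockProcessAlive(lock);

// Doctor context during a live session: a matching PID is the lock holder.
const seenByDoctor = isLockProcessAlive(lock, { allowSelfPid: true });

console.log(seenByStartAuto, seenByDoctor); // false true
```

Keeping the default at false means existing callers see no behavior change unless they opt in.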

Option A is safer because it avoids changing the isLockProcessAlive contract for all callers.

Related Issues

Environment

  • GSD version: 2.45.0
  • Model: claude-opus-4-6
  • Unit: research-slice/M018/S01 (completed successfully, doctor ran post-unit)

Auto-generated by /gsd forensics
