Skip to content

Fix PID-recycling race in orphan-lock detection#1989

Merged
tyrielv merged 1 commit into
microsoft:masterfrom
tyrielv:tyrielv/fix-orphan-lock-pid-race
May 27, 2026
Merged

Fix PID-recycling race in orphan-lock detection#1989
tyrielv merged 1 commit into
microsoft:masterfrom
tyrielv:tyrielv/fix-orphan-lock-pid-race

Conversation

@tyrielv
Copy link
Copy Markdown
Contributor

@tyrielv tyrielv commented May 27, 2026

Summary

Fix a PID-recycling race in GVFSLock's orphan-lock detection. When the lock holder exits and Windows quickly recycles its PID to an unrelated process, IsProcessActive(pid) returns true and GVFS concludes the lock is still held — the next git command then either waits the full 300 s timeout or prints a spurious "Waiting for '' to release the lock" message and an empty stderr assertion fails.

This surfaced as flake in the OrphanedGVFSLockIsCleanedUp functional test on busy CI agents (two failure modes: 300 s timeout, and the "Waiting for…" assertion). The same race can affect users when the OS quickly reuses a freed PID.

The fix

Process-identity tracking via creation time:

  • When the lock is granted to an external requestor, capture the process's creation timestamp (GetProcessTimes, an opaque 100-ns FILETIME) alongside its PID.
  • On every orphan check, look up the current creation timestamp at that PID and compare. The process is gone → orphan. A different process at the same PID → start-time mismatch → orphan.

Design notes

  • No protocol change. Capture happens server-side in Mount when the lock request arrives. The hook's git.exe parent is blocked on the synchronous pipe send/receive, so it's guaranteed alive when Mount captures.
  • GetProcessTimes alone is not enough — it still returns a creation time for terminated processes whose kernel object lingers due to an outstanding handle elsewhere. The Windows implementation combines GetExitCodeProcess (STILL_ACTIVE check) with GetProcessTimes.
  • Acquisition does not fail if start-time capture fails. The existing IsProcessActiveImplementation has a Process.GetProcessById fallback for cross-integrity callers that OpenProcess(QueryLimitedInformation) cannot open; we preserve that compatibility surface. A HasStartTime flag is stored alongside the PID; when false, the orphan check falls back to the legacy IsProcessActive call (the pre-fix behavior). The fallback is recorded in telemetry (StartTimeUnavailable=true on TryAcquireLockExternal).
  • Improved telemetry. ExternalLockHolderExited now carries an ExternalHolderTerminationReason field (ProcessNotActive | PidRecycled | Unknown) for easy post-mortems.

Files changed (7, +231/-10)

File Purpose
GVFS.Common/NativeMethods.Shared.cs Add GetProcessTimes P/Invoke
GVFS.Common/GVFSPlatform.cs New abstract TryGetActiveProcessStartTime
GVFS.Platform.Windows/WindowsPlatform.Shared.cs TryGetActiveProcessStartTimeImplementation (combined active-check + start-time)
GVFS.Platform.Windows/WindowsPlatform.cs Override
GVFS.Common/GVFSLock.cs Capture start time at acquisition, identity-check on orphan detection, plumb termination reason to telemetry
GVFS.UnitTests/Mock/Common/MockPlatform.cs Mock + ProcessStartTimes dictionary for tests
GVFS.UnitTests/Common/GVFSLockTests.cs 2 new tests: WhenHolderPidRecycled (orphan detected via start-time mismatch), WhenHolderStartTimeMatches (live holder correctly seen as active)

What is intentionally NOT changed

  • No change to the named-pipe protocol (LockData wire format is unchanged — full backward compat with older clients/servers).
  • No change to GVFS.FunctionalTests.LockHolder.
  • No change to the OrphanedGVFSLockIsCleanedUp functional test. The test was correctly exposing the product bug; with the product fix, it passes naturally without any test-side workaround.
  • No change to GVFS.Hooks startup PID check (Program.cs:251). That's a one-shot check on the hook's git.exe parent at startup, not a polling loop, so no PID-reuse race.

Validation

  • dotnet build GVFS.UnitTests — clean.
  • Full scripts\Build.bat Debug — clean.
  • GVFS.UnitTests.exe815/815 pass, including the two new tests for the identity-check path and all existing tests (which now exercise the start-time-unavailable fallback path).
  • Local functional run not informative for this flake (CI-specific).

@tyrielv tyrielv force-pushed the tyrielv/fix-orphan-lock-pid-race branch from 37dd08b to 05a8273 Compare May 27, 2026 19:06
GVFS's orphan-lock check (GVFSLock.LockHolder.GetExternalHolder) only
asked the OS "is there a process at this PID?" via IsProcessActive. If
the original lock-holder exited and Windows recycled its PID to an
unrelated process before the next git command arrived, GVFS saw a live
process at that PID and concluded the lock was still held. The git
command then either waited the full 300s timeout or printed a spurious
"Waiting for '<holder>' to release the lock" message.

This is a real product bug, not just a test issue. It surfaced as
flakiness in the OrphanedGVFSLockIsCleanedUp functional test on busy CI
agents (two failure modes: 300s timeout, and assertion failure with
non-empty stderr), but the same race can affect users when a PID is
quickly reassigned by the OS scheduler.

The fix is process-identity tracking. When the lock is granted to an
external requestor, capture that process's creation timestamp
(GetProcessTimes, an opaque 100-ns FILETIME value) alongside its PID.
On every orphan check, look up the current creation timestamp at that
PID and compare. If the process is gone the lookup fails; if a
different process is now at the PID the timestamps differ. Either way
the lock is correctly recognized as orphaned.

Implementation notes:

  * New abstract GVFSPlatform.TryGetActiveProcessStartTime returns the
    raw 64-bit creation time. It is documented as identity-only; the
    value is never decoded as a date.
  * The Windows implementation combines GetExitCodeProcess (STILL_ACTIVE
    check) and GetProcessTimes. GetProcessTimes alone is not sufficient
    because it still returns a creation time for terminated processes
    whose kernel object is kept alive by an outstanding handle
    elsewhere.
  * Acquisition does not fail if start-time capture fails. The existing
    IsProcessActiveImplementation has a Process.GetProcessById fallback
    for cross-integrity callers that OpenProcess(QueryLimitedInformation)
    cannot open, and we preserve the same compatibility surface here. A
    HasStartTime flag is stored alongside the PID; when it is false the
    orphan check falls back to the legacy IsProcessActive call. The
    fallback is recorded in telemetry (StartTimeUnavailable=true on the
    TryAcquireLockExternal event) so we can see if it ever becomes
    common in the field.
  * The ExternalLockHolderExited telemetry event now carries an
    ExternalHolderTerminationReason field (ProcessNotActive | PidRecycled
    | Unknown) to make future post-mortems straightforward.
  * The OrphanedGVFSLockIsCleanedUp functional test is unchanged. It
    polls Process.GetProcessById to detect when the lock holder has
    exited, then runs git status. With this product fix, git status
    succeeds cleanly even if the PID has already been recycled to an
    unrelated process by the time mount processes the request, so the
    test passes naturally without any test-side workaround.
  * Two new unit tests cover the new identity-check path:
    TryAcquireLockForExternalRequestor_WhenHolderPidRecycled (verifies
    that an orphan is detected when start times differ at the same PID)
    and TryAcquireLockForExternalRequestor_WhenHolderStartTimeMatches
    (the inverse: a still-alive holder is correctly seen as active and
    a new acquisition is denied).

Assisted-by: Claude Opus 4.7
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
@tyrielv tyrielv force-pushed the tyrielv/fix-orphan-lock-pid-race branch from 05a8273 to abe5da6 Compare May 27, 2026 20:15
@tyrielv tyrielv marked this pull request as ready for review May 27, 2026 20:40
@tyrielv tyrielv requested a review from KeithIsSleeping May 27, 2026 21:04
@tyrielv tyrielv enabled auto-merge May 27, 2026 21:04
Copy link
Copy Markdown
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tyrielv tyrielv merged commit 9d0cb3c into microsoft:master May 27, 2026
51 checks passed
@tyrielv tyrielv mentioned this pull request May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants