Skip to content

[2.7] Fix FilePipe TOCTOU#4296

Open
YuanTingHsieh wants to merge 6 commits intoNVIDIA:2.7from
YuanTingHsieh:fix_file_pipe_toctou
Open

[2.7] Fix FilePipe TOCTOU#4296
YuanTingHsieh wants to merge 6 commits intoNVIDIA:2.7from
YuanTingHsieh:fix_file_pipe_toctou

Conversation

@YuanTingHsieh
Copy link
Collaborator

@YuanTingHsieh YuanTingHsieh commented Mar 11, 2026

Description

Fix FilePipe TOCTOU race condition causing false PEER_GONE during heavy I/O

Problem

FilePipe has a time-of-check/time-of-use (TOCTOU) race in its heartbeat read path that causes false PEER_GONE detection, killing active training runs.

Root cause — two failure points in _get_next:

  1. files.sort(key=os.path.getmtime) raises FileNotFoundError if a file disappears between listdir and the sort
  2. _read_file raises BrokenPipeError("pipe closed") if the file disappears between the sort and the rename

Both exceptions propagate uncaught through _get_next -> _get_from_dir -> receive() -> _try_read() -> _read(), where they are caught as generic exceptions and converted to PEER_GONE — even though the subprocess is still alive.

How the race is triggered:

T+0.0s  subprocess:  creates y/REQ._HEARTBEAT_.xxx  (5s send timeout)
        executor:    _read thread delayed by heavy I/O (e.g. P2P model streaming)
T+5.0s  subprocess:  send timeout -> os.remove(y/REQ._HEARTBEAT_.xxx)
T+5.0s  executor:    os.listdir(y/) -> sees filename in directory cache
T+5.0s  executor:    os.path.getmtime(file) -> FileNotFoundError  <- crash
         -> BrokenPipeError propagates -> _read thread dies -> false PEER_GONE

Observed in CCWF Swarm with large models (Llama 1B/8B) where the aggregator node's _read thread is delayed >5s by concurrent P2P model streaming. The 8B case has a higher reproduction rate due to heavier I/O (~15GB vs ~2.8GB per round).

Fix

file_pipe.py — fix _get_next (root cause):

  • Replace os.path.getmtime sort key with a _safe_mtime wrapper that returns float("inf") on FileNotFoundError, preventing the sort from crashing
  • Iterate over all candidate files and catch BrokenPipeError per-file — a vanished file is skipped, not fatal; return None so the read thread continues its normal polling loop

pipe_handler.py — harden _try_read (defense in depth):

  • Wrap p.receive() in a try/except BrokenPipeError — any BrokenPipeError that escapes FilePipe is a genuine failure (os.listdir failed or pipe already closed), not a transient TOCTOU race, so the read loop exits with break
  • PEER_GONE is only emitted when not self.asked_to_stop, matching the heartbeat-timeout guard — prevents a spurious PEER_GONE in the narrow race where stop() closes the pipe just as _try_read is about to call receive()

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 11, 2026 20:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a TOCTOU race in the FilePipe receive/heartbeat path that can incorrectly surface as PEER_GONE under heavy I/O, prematurely killing active runs.

Changes:

  • Make FilePipe._get_next() robust to disappearing files by using a safe mtime sort key and skipping per-file read failures.
  • Add a defense-in-depth BrokenPipeError catch around p.receive() in PipeHandler._try_read() to keep the reader loop alive on transient errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
nvflare/fuel/utils/pipe/file_pipe.py Avoid crashing on TOCTOU during sort/read by tolerating FileNotFoundError and skipping vanished files.
nvflare/fuel/utils/pipe/pipe_handler.py Attempt to prevent transient BrokenPipeError from killing the reader thread by retrying next cycle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR fixes a TOCTOU (time-of-check/time-of-use) race condition in FilePipe that was causing false PEER_GONE events during heavy I/O (e.g. large model aggregation in CCWF Swarm), and adds a defense-in-depth guard in PipeHandler to prevent spurious PEER_GONE on graceful shutdown.

Key changes:

  • file_pipe.py_get_next now uses a _safe_mtime wrapper that returns float("inf") on FileNotFoundError (preventing sort crashes), and iterates over all candidate files catching BrokenPipeError per-file instead of propagating it fatally.
  • pipe_handler.py_try_read wraps p.receive() in an explicit BrokenPipeError catch; PEER_GONE is only emitted when not self.asked_to_stop, preventing a spurious event in the narrow window where stop() closes the pipe just as _try_read is about to call receive().
  • New unit tests cover all four TOCTOU scenarios for FilePipe._get_next, plus three PipeHandler scenarios for BrokenPipeError handling and graceful-stop suppression.

Confidence Score: 5/5

  • This PR is safe to merge; changes are narrowly scoped to error handling paths and cannot regress normal operation.
  • Both production changes are purely additive error-handling guards (a fallback mtime, a per-file BrokenPipeError catch, and an asked_to_stop check). Happy-path behavior is completely unchanged. The fix is well-targeted at the documented race, backed by four new FilePipe tests and three new PipeHandler tests. The only finding is a trivial unused variable in test code.
  • No files require special attention.

Important Files Changed

Filename Overview
nvflare/fuel/utils/pipe/file_pipe.py Replaces single-file read with a per-file loop that skips vanished files; adds _safe_mtime wrapper to prevent sort crash on FileNotFoundError. Logic is sound and correctly handles both TOCTOU failure modes.
nvflare/fuel/utils/pipe/pipe_handler.py Adds explicit BrokenPipeError catch around p.receive() with asked_to_stop guard to prevent spurious PEER_GONE during graceful shutdown. Clean defense-in-depth layer complementing the file_pipe.py fix.
tests/unit_test/fuel/utils/pipe/file_pipe_test.py Adds four well-targeted TOCTOU tests for _get_next; one test contains an unused variable (original_read_file) that is assigned but never referenced.
tests/unit_test/fuel/utils/pipe/pipe_handler_test.py New test file with three targeted tests covering BrokenPipeError propagation and graceful-stop suppression in PipeHandler. Tests are well-structured and the timing relies on the 10 ms read_interval buffer which is sufficient in practice.

Sequence Diagram

sequenceDiagram
    participant S as Subprocess
    participant FS as Filesystem
    participant FP as FilePipe._get_next
    participant PH as PipeHandler._try_read

    S->>FS: create y/REQ._HEARTBEAT_.xxx
    Note over FP: read thread delayed by heavy I/O

    S->>FS: send timeout, os.remove file
    FP->>FS: os.listdir sees filename in cache
    FP->>FS: _safe_mtime raises FileNotFoundError, returns inf
    Note over FP: sort succeeds, file sorted to end
    FP->>FS: _read_file raises BrokenPipeError
    Note over FP: BrokenPipeError caught per-file, continue loop
    FP-->>PH: return None, not PEER_GONE

    Note over PH: genuine failure: receive() raises BrokenPipeError
    PH->>PH: catch BrokenPipeError

    alt asked_to_stop is False
        PH->>PH: _add_message PEER_GONE
    else asked_to_stop is True
        Note over PH: PEER_GONE suppressed
    end
    PH->>PH: break, self.reader = None
Loading

Last reviewed commit: 9ad50fe

@YuanTingHsieh
Copy link
Collaborator Author

@greptile review again thanks

@YuanTingHsieh
Copy link
Collaborator Author

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants