[2.7] Fix FilePipe TOCTOU by YuanTingHsieh · Pull Request #4296 · NVIDIA/NVFlare

YuanTingHsieh · 2026-03-11T20:30:01Z

Description

Fix FilePipe TOCTOU race condition causing false PEER_GONE during heavy I/O

Problem

FilePipe has a time-of-check/time-of-use (TOCTOU) race in its heartbeat read path that causes false PEER_GONE detection, killing active training runs.

Root cause — two failure points in _get_next:

files.sort(key=os.path.getmtime) raises FileNotFoundError if a file disappears between listdir and the sort
_read_file raises BrokenPipeError("pipe closed") if the file disappears between the sort and the rename

Both exceptions propagate uncaught through _get_next -> _get_from_dir -> receive() -> _try_read() -> _read(), where they are caught as generic exceptions and converted to PEER_GONE — even though the subprocess is still alive.

How the race is triggered:

T+0.0s  subprocess:  creates y/REQ._HEARTBEAT_.xxx  (5s send timeout)
        executor:    _read thread delayed by heavy I/O (e.g. P2P model streaming)
T+5.0s  subprocess:  send timeout -> os.remove(y/REQ._HEARTBEAT_.xxx)
T+5.0s  executor:    os.listdir(y/) -> sees filename in directory cache
T+5.0s  executor:    os.path.getmtime(file) -> FileNotFoundError  <- crash
         -> BrokenPipeError propagates -> _read thread dies -> false PEER_GONE

Observed in CCWF Swarm with large models (Llama 1B/8B) where the aggregator node's _read thread is delayed >5s by concurrent P2P model streaming. The 8B case has a higher reproduction rate due to heavier I/O (~15GB vs ~2.8GB per round).

Fix

file_pipe.py — fix _get_next (root cause):

Replace os.path.getmtime sort key with a _safe_mtime wrapper that returns float("inf") on FileNotFoundError, preventing the sort from crashing
Iterate over all candidate files and catch BrokenPipeError per-file — a vanished file is skipped, not fatal; return None so the read thread continues its normal polling loop

pipe_handler.py — harden _try_read (defense in depth):

Wrap p.receive() in a try/except BrokenPipeError — any BrokenPipeError that escapes FilePipe is a genuine failure (os.listdir failed or pipe already closed), not a transient TOCTOU race, so the read loop exits with break
PEER_GONE is only emitted when not self.asked_to_stop, matching the heartbeat-timeout guard — prevents a spurious PEER_GONE in the narrow race where stop() closes the pipe just as _try_read is about to call receive()

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

Copilot

Pull request overview

This PR addresses a TOCTOU race in the FilePipe receive/heartbeat path that can incorrectly surface as PEER_GONE under heavy I/O, prematurely killing active runs.

Changes:

Make FilePipe._get_next() robust to disappearing files by using a safe mtime sort key and skipping per-file read failures.
Add a defense-in-depth BrokenPipeError catch around p.receive() in PipeHandler._try_read() to keep the reader loop alive on transient errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
nvflare/fuel/utils/pipe/file_pipe.py	Avoid crashing on TOCTOU during sort/read by tolerating `FileNotFoundError` and skipping vanished files.
nvflare/fuel/utils/pipe/pipe_handler.py	Attempt to prevent transient `BrokenPipeError` from killing the reader thread by retrying next cycle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nvflare/fuel/utils/pipe/pipe_handler.py

nvflare/fuel/utils/pipe/file_pipe.py

greptile-apps · 2026-03-11T20:33:35Z

Greptile Summary

This PR fixes a TOCTOU (time-of-check/time-of-use) race condition in FilePipe that was causing false PEER_GONE events during heavy I/O (e.g. large model aggregation in CCWF Swarm), and adds a defense-in-depth guard in PipeHandler to prevent spurious PEER_GONE on graceful shutdown.

Key changes:

file_pipe.py — _get_next now uses a _safe_mtime wrapper that returns float("inf") on FileNotFoundError (preventing sort crashes), and iterates over all candidate files catching BrokenPipeError per-file instead of propagating it fatally.
pipe_handler.py — _try_read wraps p.receive() in an explicit BrokenPipeError catch; PEER_GONE is only emitted when not self.asked_to_stop, preventing a spurious event in the narrow window where stop() closes the pipe just as _try_read is about to call receive().
New unit tests cover all four TOCTOU scenarios for FilePipe._get_next, plus three PipeHandler scenarios for BrokenPipeError handling and graceful-stop suppression.

Confidence Score: 5/5

This PR is safe to merge; changes are narrowly scoped to error handling paths and cannot regress normal operation.
Both production changes are purely additive error-handling guards (a fallback mtime, a per-file BrokenPipeError catch, and an asked_to_stop check). Happy-path behavior is completely unchanged. The fix is well-targeted at the documented race, backed by four new FilePipe tests and three new PipeHandler tests. The only finding is a trivial unused variable in test code.
No files require special attention.

Important Files Changed

Filename	Overview
nvflare/fuel/utils/pipe/file_pipe.py	Replaces single-file read with a per-file loop that skips vanished files; adds `_safe_mtime` wrapper to prevent sort crash on `FileNotFoundError`. Logic is sound and correctly handles both TOCTOU failure modes.
nvflare/fuel/utils/pipe/pipe_handler.py	Adds explicit `BrokenPipeError` catch around `p.receive()` with `asked_to_stop` guard to prevent spurious `PEER_GONE` during graceful shutdown. Clean defense-in-depth layer complementing the `file_pipe.py` fix.
tests/unit_test/fuel/utils/pipe/file_pipe_test.py	Adds four well-targeted TOCTOU tests for `_get_next`; one test contains an unused variable (`original_read_file`) that is assigned but never referenced.
tests/unit_test/fuel/utils/pipe/pipe_handler_test.py	New test file with three targeted tests covering `BrokenPipeError` propagation and graceful-stop suppression in `PipeHandler`. Tests are well-structured and the timing relies on the 10 ms `read_interval` buffer which is sufficient in practice.

Sequence Diagram

sequenceDiagram
    participant S as Subprocess
    participant FS as Filesystem
    participant FP as FilePipe._get_next
    participant PH as PipeHandler._try_read

    S->>FS: create y/REQ._HEARTBEAT_.xxx
    Note over FP: read thread delayed by heavy I/O

    S->>FS: send timeout, os.remove file
    FP->>FS: os.listdir sees filename in cache
    FP->>FS: _safe_mtime raises FileNotFoundError, returns inf
    Note over FP: sort succeeds, file sorted to end
    FP->>FS: _read_file raises BrokenPipeError
    Note over FP: BrokenPipeError caught per-file, continue loop
    FP-->>PH: return None, not PEER_GONE

    Note over PH: genuine failure: receive() raises BrokenPipeError
    PH->>PH: catch BrokenPipeError

    alt asked_to_stop is False
        PH->>PH: _add_message PEER_GONE
    else asked_to_stop is True
        Note over PH: PEER_GONE suppressed
    end
    PH->>PH: break, self.reader = None

_{Last reviewed commit: 9ad50fe}

nvflare/fuel/utils/pipe/pipe_handler.py

YuanTingHsieh · 2026-03-11T23:03:19Z

@greptile review again thanks

YuanTingHsieh · 2026-03-11T23:10:53Z

/build

Fix FilePipe TOCTOU

972e400

Copilot AI review requested due to automatic review settings March 11, 2026 20:30

Copilot started reviewing on behalf of YuanTingHsieh March 11, 2026 20:30 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

nvflare/fuel/utils/pipe/pipe_handler.py Outdated Show resolved Hide resolved

nvflare/fuel/utils/pipe/file_pipe.py Show resolved Hide resolved

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

nvflare/fuel/utils/pipe/pipe_handler.py Outdated Show resolved Hide resolved

YuanTingHsieh added 4 commits March 11, 2026 15:24

Address review comments

0f972b6

fix format

17d8c4c

address comments

442f2cf

Merge branch '2.7' into fix_file_pipe_toctou

fe36516

Merge branch '2.7' into fix_file_pipe_toctou

9ad50fe

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.7] Fix FilePipe TOCTOU#4296

[2.7] Fix FilePipe TOCTOU#4296
YuanTingHsieh wants to merge 6 commits intoNVIDIA:2.7from
YuanTingHsieh:fix_file_pipe_toctou

YuanTingHsieh commented Mar 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

YuanTingHsieh commented Mar 11, 2026

Uh oh!

YuanTingHsieh commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

YuanTingHsieh commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Fix

Types of changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

YuanTingHsieh commented Mar 11, 2026

Uh oh!

YuanTingHsieh commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

YuanTingHsieh commented Mar 11, 2026 •

edited

Loading

greptile-apps bot commented Mar 11, 2026 •

edited

Loading