Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses a TOCTOU race in the FilePipe receive/heartbeat path that can incorrectly surface as PEER_GONE under heavy I/O, prematurely killing active runs.
Changes:
- Make
FilePipe._get_next()robust to disappearing files by using a safe mtime sort key and skipping per-file read failures. - Add a defense-in-depth
BrokenPipeErrorcatch aroundp.receive()inPipeHandler._try_read()to keep the reader loop alive on transient errors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| nvflare/fuel/utils/pipe/file_pipe.py | Avoid crashing on TOCTOU during sort/read by tolerating FileNotFoundError and skipping vanished files. |
| nvflare/fuel/utils/pipe/pipe_handler.py | Attempt to prevent transient BrokenPipeError from killing the reader thread by retrying next cycle. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Contributor
Greptile SummaryThis PR fixes a TOCTOU (time-of-check/time-of-use) race condition in Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant S as Subprocess
participant FS as Filesystem
participant FP as FilePipe._get_next
participant PH as PipeHandler._try_read
S->>FS: create y/REQ._HEARTBEAT_.xxx
Note over FP: read thread delayed by heavy I/O
S->>FS: send timeout, os.remove file
FP->>FS: os.listdir sees filename in cache
FP->>FS: _safe_mtime raises FileNotFoundError, returns inf
Note over FP: sort succeeds, file sorted to end
FP->>FS: _read_file raises BrokenPipeError
Note over FP: BrokenPipeError caught per-file, continue loop
FP-->>PH: return None, not PEER_GONE
Note over PH: genuine failure: receive() raises BrokenPipeError
PH->>PH: catch BrokenPipeError
alt asked_to_stop is False
PH->>PH: _add_message PEER_GONE
else asked_to_stop is True
Note over PH: PEER_GONE suppressed
end
PH->>PH: break, self.reader = None
Last reviewed commit: 9ad50fe |
Collaborator
Author
|
@greptile review again thanks |
Collaborator
Author
|
/build |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fix FilePipe TOCTOU race condition causing false PEER_GONE during heavy I/O
Problem
FilePipehas a time-of-check/time-of-use (TOCTOU) race in its heartbeat read path that causes falsePEER_GONEdetection, killing active training runs.Root cause — two failure points in
_get_next:files.sort(key=os.path.getmtime)raisesFileNotFoundErrorif a file disappears betweenlistdirand the sort_read_fileraisesBrokenPipeError("pipe closed")if the file disappears between the sort and the renameBoth exceptions propagate uncaught through
_get_next->_get_from_dir->receive()->_try_read()->_read(), where they are caught as generic exceptions and converted toPEER_GONE— even though the subprocess is still alive.How the race is triggered:
Observed in CCWF Swarm with large models (Llama 1B/8B) where the aggregator node's
_readthread is delayed >5s by concurrent P2P model streaming. The 8B case has a higher reproduction rate due to heavier I/O (~15GB vs ~2.8GB per round).Fix
file_pipe.py— fix_get_next(root cause):os.path.getmtimesort key with a_safe_mtimewrapper that returnsfloat("inf")onFileNotFoundError, preventing the sort from crashingBrokenPipeErrorper-file — a vanished file is skipped, not fatal; returnNoneso the read thread continues its normal polling looppipe_handler.py— harden_try_read(defense in depth):p.receive()in atry/except BrokenPipeError— anyBrokenPipeErrorthat escapesFilePipeis a genuine failure (os.listdirfailed or pipe already closed), not a transient TOCTOU race, so the read loop exits withbreakPEER_GONEis only emitted whennot self.asked_to_stop, matching the heartbeat-timeout guard — prevents a spuriousPEER_GONEin the narrow race wherestop()closes the pipe just as_try_readis about to callreceive()Types of changes
./runtest.sh.