[2.7] Fix pt_init_client data_path to avoid concurrent CIFAR10 download corruption#4297
YuanTingHsieh wants to merge 1 commit into NVIDIA:2.7 from
Greptile Summary
This PR fixes a data corruption bug in the
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Setup as Test Setup Step
participant Client1 as Client 1 (Trainer/Validator)
participant Client2 as Client 2 (Trainer/Validator)
participant FS as /tmp/nvflare/cifar10_data
Setup->>FS: Download CIFAR-10 (once, sequentially)
Note over Setup,FS: python -c "CIFAR10(root='/tmp/nvflare/cifar10_data', download=True)"
Client1->>FS: Read dataset (data_path now set — no download)
Client2->>FS: Read dataset (data_path now set — no download)
Note over Client1,Client2: Before fix: both fell through to ~/data<br/>and downloaded concurrently → corruption
Last reviewed commit: f0ec644
Pull request overview
Fixes an integration-test runtime data corruption issue where multiple PyTorch clients may concurrently download CIFAR-10 into the same directory because the client config didn’t pass an explicit dataset path.
Changes:
- Passes data_path="/tmp/nvflare/cifar10_data" to Cifar10Trainer in the pt_init_client integration app config.
- Passes data_path="/tmp/nvflare/cifar10_data" to Cifar10Validator in the same config to ensure validation uses the pre-downloaded dataset as well.
Data corruption: CIFAR-10 dataset downloaded concurrently by two clients at runtime
Problem
The data_path argument is not passed in the trainer/validator job configs. The integration test suite pre-downloads CIFAR-10 to /tmp/nvflare/cifar10_data in a setup step, but because the configs omit that path, both client processes fall through to the default download logic at runtime. Two processes writing to the same directory simultaneously causes data corruption.
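The fall-through described above can be modelled in a few lines. This is only a sketch: `resolve_data_root` and `clients_that_would_download` are hypothetical helpers, not NVFlare or torchvision functions, and the `~/data` fallback is taken from the sequence-diagram note rather than from the trainer source.

```python
import os

# Fallback directory per the sequence-diagram note (an assumption, not verified
# against the trainer/validator source).
DEFAULT_ROOT = os.path.expanduser("~/data")
# Directory the integration test setup step pre-downloads CIFAR-10 into.
PREDOWNLOADED_ROOT = "/tmp/nvflare/cifar10_data"


def resolve_data_root(data_path=None):
    """Hypothetical helper: directory a client reads CIFAR-10 from."""
    return data_path if data_path else DEFAULT_ROOT


def clients_that_would_download(client_data_paths, populated_roots):
    """Indices of clients whose resolved root has no dataset yet.

    Each such client would trigger its own download at runtime; two or more
    of them resolving to the same root is the race described above.
    """
    return [
        index
        for index, data_path in enumerate(client_data_paths)
        if resolve_data_root(data_path) not in populated_roots
    ]
```

With only `/tmp/nvflare/cifar10_data` populated, two clients configured without `data_path` both land in the returned list, while two clients configured with it do not.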
Fix
Pass data_path="/tmp/nvflare/cifar10_data" in the trainer and validator configs so the pre-downloaded dataset is used directly and no runtime download occurs.
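A small check like the one below could guard against the path being dropped again. It is a sketch under stated assumptions: the `client_config` dictionary is a simplified, hypothetical stand-in for the job config (the real file layout may differ), and `components_missing_data_path` is an illustrative helper, not part of NVFlare.

```python
EXPECTED_DATA_PATH = "/tmp/nvflare/cifar10_data"

# Hypothetical, simplified view of the client config; only the component
# names and the data_path value come from this PR.
client_config = {
    "components": [
        {"name": "Cifar10Trainer", "args": {"data_path": EXPECTED_DATA_PATH}},
        {"name": "Cifar10Validator", "args": {"data_path": EXPECTED_DATA_PATH}},
    ]
}


def components_missing_data_path(config, expected=EXPECTED_DATA_PATH):
    """Return names of components whose args lack the expected data_path."""
    return [
        component["name"]
        for component in config["components"]
        if component.get("args", {}).get("data_path") != expected
    ]
```

An empty result means every listed component points at the pre-downloaded dataset; any name returned identifies a component that would fall back to the default download logic.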
Types of changes
./runtest.sh.