Skip to content

[2.7] Fix pt_init_client data_path to avoid concurrent CIFAR10 download corruption#4297

Open
YuanTingHsieh wants to merge 1 commit intoNVIDIA:2.7from
YuanTingHsieh:fix_ci_data_path_27
Open

[2.7] Fix pt_init_client data_path to avoid concurrent CIFAR10 download corruption#4297
YuanTingHsieh wants to merge 1 commit intoNVIDIA:2.7from
YuanTingHsieh:fix_ci_data_path_27

Conversation

@YuanTingHsieh
Copy link
Collaborator

Data corruption: CIFAR-10 dataset downloaded concurrently by two clients at runtime

Problem

The data_path argument is not passed in the trainer/validator job configs. The integration test suite pre-downloads CIFAR-10 to /tmp/nvflare/cifar10_data in a setup step, but because the configs omit that path, both client processes fall through to the default download logic at runtime. Two processes writing to the same directory simultaneously causes data corruption.

Fix

Pass data_path="/tmp/nvflare/cifar10_data" in the trainer and validator configs so the pre-downloaded dataset is used directly and no runtime download occurs.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 11, 2026 21:28
@YuanTingHsieh YuanTingHsieh added the cicd continuous integration/continuous development label Mar 11, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR fixes a data corruption bug in the pt_init_client integration test by explicitly passing data_path="/tmp/nvflare/cifar10_data" to both Cifar10Trainer and Cifar10Validator in the client config, matching the path where CIFAR-10 is pre-downloaded in the test setup step and preventing both client processes from falling through to the default ~/data download path at runtime.

  • The fix is minimal and correct: the path /tmp/nvflare/cifar10_data exactly matches the setup command in pt_app.yml (line 47), which already pre-downloads CIFAR-10 to that location before the test runs.
  • The change brings pt_init_client in line with the existing pt_use_path app config, which already used the same data_path value.
  • No logic changes are introduced — only the missing config argument is added.

Confidence Score: 5/5

  • This PR is safe to merge — it is a minimal, targeted config fix with no logic changes or risk of regression.
  • The change adds a single missing config argument (data_path) to two executor entries in one JSON file. The path /tmp/nvflare/cifar10_data is verified to match the pre-download step in the test suite's setup block. The fix is consistent with the already-correct pt_use_path app config, and no new code or dependencies are introduced.
  • No files require special attention.

Important Files Changed

Filename Overview
tests/integration_test/data/apps/pt_init_client/app/config/config_fed_client.json Adds data_path="/tmp/nvflare/cifar10_data" to both Cifar10Trainer and Cifar10Validator args, matching the path pre-downloaded in the test setup step and eliminating concurrent download corruption.

Sequence Diagram

sequenceDiagram
    participant Setup as Test Setup Step
    participant Client1 as Client 1 (Trainer/Validator)
    participant Client2 as Client 2 (Trainer/Validator)
    participant FS as /tmp/nvflare/cifar10_data

    Setup->>FS: Download CIFAR-10 (once, sequentially)
    Note over Setup,FS: python -c "CIFAR10(root='/tmp/nvflare/cifar10_data', download=True)"

    Client1->>FS: Read dataset (data_path now set — no download)
    Client2->>FS: Read dataset (data_path now set — no download)
    Note over Client1,Client2: Before fix: both fell through to ~/data<br/>and downloaded concurrently → corruption
Loading

Last reviewed commit: f0ec644

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an integration-test runtime data corruption issue where multiple PyTorch clients may concurrently download CIFAR-10 into the same directory because the client config didn’t pass an explicit dataset path.

Changes:

  • Passes data_path="/tmp/nvflare/cifar10_data" to Cifar10Trainer in the pt_init_client integration app config.
  • Passes data_path="/tmp/nvflare/cifar10_data" to Cifar10Validator in the same config to ensure validation uses the pre-downloaded dataset as well.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cicd continuous integration/continuous development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants