
Fix: stabilize test after distributed training completion #375

Open

stop1one wants to merge 1 commit into roboflow:develop from stop1one:fix/distributed-training-test-failure

Conversation

stop1one commented Oct 1, 2025

Description

Related Issues

Fixes #316 #374

Summary of Changes

This PR fixes two problems in distributed training when run_test=True:

  1. Synchronization before testing

    • Added torch.distributed.barrier() when args.distributed is enabled, right after training and before testing.
    • Ensures all processes wait until rank 0 has finished saving checkpoint_best_total.pth before moving on to testing.
  2. Correct checkpoint loading in DDP

    • Changed from model.load_state_dict(best_state_dict) to model_without_ddp.load_state_dict(best_state_dict).
    • Prevents key-mismatch errors when loading checkpoints in distributed environments (see the sketch below).
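
For reference, here is a minimal sketch of the intended post-training flow. The names args, model_without_ddp, best_state_dict, and the checkpoint file come from the change description; the checkpoint's internal "model" key and the hard-coded path are illustrative assumptions, not the repository's exact code.

import torch
import torch.distributed as dist

# ... training has finished; rank 0 has saved checkpoint_best_total.pth ...

if args.distributed:
    # Make every process wait until rank 0 has finished writing the best
    # checkpoint before any process tries to load it for testing.
    dist.barrier()

checkpoint = torch.load("checkpoint_best_total.pth", map_location="cpu")
best_state_dict = checkpoint["model"]  # key name assumed; adjust to the actual checkpoint layout

# Load into the unwrapped module rather than the DDP wrapper so the
# state-dict keys match (no "module." prefix mismatch).
model_without_ddp.load_state_dict(best_state_dict)

# ... run the test / evaluation loop ...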

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

train.py:

from rfdetr import RFDETRLarge

model = RFDETRLarge()

model.train(
    dataset_dir=<DATASET_PATH>,
    epochs=30,
    batch_size=8,
    grad_accum_steps=2,
    lr=1e-4,
    output_dir=<OUTPUT_PATH>,
)

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
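
For reference, newer PyTorch releases deprecate torch.distributed.launch in favor of torchrun, which always passes rank information through environment variables (so --use_env is implied). The equivalent invocation should presumably be:

torchrun --nproc_per_node=4 train.py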

CLAassistant commented Oct 1, 2025

CLA assistant check
All committers have signed the CLA.

stop1one (Author) commented Nov 6, 2025

Hi @probicheaux ,
I just realized I hadn’t signed the CLA earlier — that’s done now ✅
Would you mind taking another look at this PR when you have a moment?
Thanks again for your time!

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes distributed training synchronization issues when running tests after training completion. The changes ensure proper checkpoint loading and process synchronization in multi-GPU environments.

Changes:

  • Added a distributed barrier before test execution to ensure all processes wait for rank 0 to complete checkpoint saving
  • Corrected checkpoint loading to use model_without_ddp instead of model to avoid DDP wrapper key mismatches


Borda added the bug label on Jan 22, 2026
Borda requested a review from isaacrob as a code owner on February 11, 2026
Borda force-pushed the develop branch 4 times, most recently from 60b16c1 to 523f9df on February 14, 2026

Labels

bug (Something isn't working), has conflicts

Development

Successfully merging this pull request may close these issues.

Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
