Fix wandb error on multi-GPU training #234

Merged
erinuclkwon merged 5 commits into main from eq-train on Feb 27, 2026

Conversation

@erinuclkwon
Contributor

When training with multiple GPUs, every rank after rank 0 fails with FileNotFoundError when it tries to save model_config.yaml to a files/ subdirectory that is never created; only rank 0 has the correct wandb object and directory. This PR adds model_config_path.parent.mkdir(parents=True, exist_ok=True) to create the directory before writing. It also adds an is_global_zero check that restricts config saving to rank 0, which eliminates the error on the other ranks and works unchanged on a single GPU, where the only process is rank 0.
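
For readers who want to see the shape of the fix, here is a minimal sketch of the two changes described above. It is not the code from this PR: the function name `save_model_config`, the `config_text` argument, and the way `is_global_zero` is passed in are all hypothetical. In a PyTorch Lightning setup the flag would typically be read from `trainer.is_global_zero`, but the actual call site is not shown in this conversation.

```python
import tempfile
from pathlib import Path


def save_model_config(config_text: str, model_config_path: Path, *, is_global_zero: bool) -> bool:
    """Write the config on global rank 0 only; return True if a file was written.

    Hypothetical helper for illustration; the PR's real function is not shown here.
    """
    if not is_global_zero:
        # Other ranks skip the write, so they never touch a missing directory.
        return False
    # Create files/ (and any missing parents) before writing into it.
    model_config_path.parent.mkdir(parents=True, exist_ok=True)
    model_config_path.write_text(config_text)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "run" / "files" / "model_config.yaml"
        for rank in range(2):  # pretend there are two ranks
            wrote = save_model_config("hidden_dim: 64\n", path, is_global_zero=(rank == 0))
            print(f"rank {rank}: wrote={wrote}")
```

Either change alone would avoid the FileNotFoundError in this sketch. Combining them means rank 0 never depends on the directory already existing, and the other ranks do no redundant writes.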

@erinuclkwon erinuclkwon requested a review from a team February 27, 2026 10:37
@erinuclkwon erinuclkwon changed the title Wandb error on multi-GPU training Fix wandb error on multi-GPU training Feb 27, 2026
@github-actions

github-actions Bot commented Feb 27, 2026

Coverage report

Columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing

icenet_mp/model_service.py: lines missing 230-235
icenet_mp/models/base_model.py

This report was generated by python-coverage-comment-action

  dataset.end_date,
  )
- return DataLoader(dataset, shuffle=True, **self._common_dataloader_kwargs)
+ return DataLoader(dataset, shuffle=False, **self._common_dataloader_kwargs)
Member

Do we still need to disable shuffle?

Contributor Author

Actually, not anymore. Thanks for pointing it out.

Comment thread icenet_mp/models/base_model.py
Comment thread icenet_mp/model_service.py
Member

@jemrobinson jemrobinson left a comment

LGTM

@erinuclkwon erinuclkwon merged commit 6eefed7 into main Feb 27, 2026
3 checks passed
@erinuclkwon erinuclkwon deleted the eq-train branch February 27, 2026 11:43

2 participants