Conversation

@kohya-ss (Owner) commented Mar 16, 2025

This may solve a potential issue and prevent the loss becoming NaN during long training runs. Further testing is needed to confirm that existing training works as well as before.

@kohya-ss added the help wanted label Mar 16, 2025
@kohya-ss (Owner, Author)

This fix appears to enable multi-GPU training with DDP, but further testing is required.

@FurkanGozukara (Contributor)

> This fix appears to enable multi-GPU training with DDP, but further testing is required.

With FLUX, we were never able to do multi-GPU training with block swap.

So is it possible now?

@kohya-ss (Owner, Author)

> So is it possible now?

I think so. I will add a new branch to support this in the sd-scripts repo today.

@FurkanGozukara (Contributor)

> So is it possible now?
>
> I think so. I will add a new branch to support this in the sd-scripts repo today.

awesome
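
For anyone who wants to try this, below is a minimal sketch of how a multi-GPU DDP run is typically wired up with Hugging Face Accelerate. It is an illustration under assumptions only, not the actual sd-scripts training loop; block swapping itself happens inside sd-scripts' model code and is not shown here.

```python
# Minimal DDP sketch with Hugging Face Accelerate (illustration only, not sd-scripts code).
# Launch with e.g.:  accelerate launch --multi_gpu --num_processes 2 this_script.py
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # process group and device placement come from `accelerate launch`

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() wraps the model in DistributedDataParallel when more than one process is running
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # handles gradient synchronization across ranks
    optimizer.step()
```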

@kohya-ss (Owner, Author)

> This fix appears to enable multi-GPU training with DDP, but further testing is required.

Unfortunately, this doesn't seem to work with the NCCL backend for now.

@kohya-ss (Owner, Author)

This seems to be working fine in a Windows environment.
Testing in a Linux environment is welcome!
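
Since DDP on Windows falls back to the gloo backend (NCCL is not available there), one thing Linux testers might experiment with is requesting gloo explicitly. This is only a sketch, assuming Accelerate's InitProcessGroupKwargs handler accepts a backend argument; whether it actually sidesteps the NCCL problem reported above is untested here.

```python
# Hypothetical workaround sketch: request the gloo backend instead of NCCL.
# Untested for this PR; shown only as something Linux testers could try.
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(backend="gloo", timeout=timedelta(minutes=30))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
print(f"process {accelerator.process_index}: requested gloo backend")
```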

@xzuyn (Contributor) commented Sep 30, 2025

I've been using this on a single AMD GPU on Ubuntu for a bit, training Qwen Image, and I haven't noticed any problems.
