Fixing multiple GPU Qwen Image Fine tuning training #674
base: main
Conversation
@kohya-ss Good news: I tested training on this (1x GPU) and it works perfectly, no issues.
It appears that this file has been unintentionally modified.
src/musubi_tuner/hv_train_network.py
Outdated
if args.ddp_gradient_as_bucket_view or args.ddp_static_graph
else None
DistributedDataParallelKwargs(
    find_unused_parameters=True,
According to the PyTorch documentation, specifying find_unused_parameters=True when it is not necessary will slow down the training:
https://docs.pytorch.org/docs/stable/notes/ddp.html#internal-design
Therefore, as with other DDP-related options, it would be preferable to be able to specify it as an argument (for example, --ddp_find_unused_parameters).
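A minimal sketch of what such an opt-in flag could look like; the argument name follows the comment above, but the wiring (parser, kwargs handler list) is illustrative and assumed, not the repository's actual code:

```python
import argparse

from accelerate import Accelerator, DistributedDataParallelKwargs

parser = argparse.ArgumentParser()
# Hypothetical flag mirroring the suggestion above; off by default so the
# slower unused-parameter search is only enabled when actually needed.
parser.add_argument(
    "--ddp_find_unused_parameters",
    action="store_true",
    help="pass find_unused_parameters=True to DistributedDataParallel",
)
args = parser.parse_args()

ddp_kwargs = (
    DistributedDataParallelKwargs(find_unused_parameters=True)
    if args.ddp_find_unused_parameters
    else None
)
# Drop the None entry so Accelerator only receives real kwargs handlers.
accelerator = Accelerator(kwargs_handlers=[k for k in [ddp_kwargs] if k is not None])
```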
src/musubi_tuner/hv_train_network.py
Outdated
# Ensure DDP is properly configured for models with unused parameters
if hasattr(transformer, 'module') and hasattr(transformer.module, 'find_unused_parameters'):
    transformer.module.find_unused_parameters = True
elif hasattr(transformer, 'find_unused_parameters'):
    transformer.find_unused_parameters = True
There seems to be no point in overriding find_unused_parameters here; it will already be True if configured correctly.
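For reference, a small sketch of where the flag actually takes effect (assuming a multi-GPU Accelerate run; the model here is a stand-in, not the repository's transformer): the setting is passed to Accelerator before prepare(), so by the time prepare() returns the DDP-wrapped model it is already baked in, and re-assigning the attribute afterwards adds nothing.

```python
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs

# find_unused_parameters is configured here, before prepare()...
accelerator = Accelerator(
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)]
)
model = accelerator.prepare(torch.nn.Linear(8, 8))
# ...so on multi-GPU runs the DDP wrapper returned by prepare() already carries
# find_unused_parameters=True; setting the attribute again here is redundant.
```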
src/musubi_tuner/hv_train_network.py
Outdated
unwrapped_transformer = accelerator.unwrap_model(transformer)
logger.info(f"DiT dtype: {unwrapped_transformer.dtype}, device: {unwrapped_transformer.device}")
Defining a new local variable unwrapped_transformer may prevent garbage collection later; it is better to call it directly: accelerator.unwrap_model(transformer).dtype and accelerator.unwrap_model(transformer).device.
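A minimal sketch of the suggested pattern (the helper name is hypothetical; logger, accelerator, and transformer stand for the objects in the surrounding code):

```python
import logging

logger = logging.getLogger(__name__)

def log_dit_info(accelerator, transformer):
    # Call unwrap_model at the point of use instead of binding the unwrapped
    # model to a local variable that would keep the reference alive afterwards.
    logger.info(
        f"DiT dtype: {accelerator.unwrap_model(transformer).dtype}, "
        f"device: {accelerator.unwrap_model(transformer).device}"
    )
```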
…ddp_find_unused_parameters argument to hv_train.py and hv_train_network.py
- Remove unnecessary override of find_unused_parameters after accelerator.prepare()
- Fix unwrapped_transformer variable to prevent garbage collection issues
- Revert unintentional changes to qwen_image_train_network.py
- Make DDP configuration consistent across all training scripts
@kohya-ss I just tried to make the changes you requested. I am OK with any way of fixing it, thank you.
…ddp_find_unused_parameters argument to hv_train.py and hv_train_network.py
- Remove unnecessary override of find_unused_parameters after accelerator.prepare()
- Fix unwrapped_transformer variable to prevent garbage collection issues
- Keep intentional fix in qwen_image_train_network.py for multi-GPU training
- Make DDP configuration consistent across all training scripts
Thank you for the update. Ruff's lint check is reporting an error, so please format it using […]. Also, it seems that parts of […]
Sure, I will work on that, but have you seen this error? Error 2
The modification on […]
I am not 100% sure if this is the correct way, but after this I was able to train.
I haven't tested the result of the training yet.
This fixes the issue mentioned here: #672