
Conversation

@FurkanGozukara (Contributor)

I am not 100% sure this is the correct approach, but after this change I was able to train.

I haven't tested the training results yet.

This fixes the issue mentioned here: #672

@FurkanGozukara (Contributor, Author)

@kohya-ss Good news: I tested training with this (1x GPU) and it works perfectly, no issues.

@kohya-ss (Owner)

It appears that this file has been unintentionally modified.

Diff excerpt (lines reordered for readability):

DistributedDataParallelKwargs(
    find_unused_parameters=True,
    ...
)
if args.ddp_gradient_as_bucket_view or args.ddp_static_graph
else None
@kohya-ss (Owner)

According to the PyTorch documentation, specifying find_unused_parameters=True when it is not necessary will slow down the training:
https://docs.pytorch.org/docs/stable/notes/ddp.html#internal-design

Therefore, as with other DDP-related options, it would be preferable to be able to specify it as an argument (for example, --ddp_find_unused_parameters).
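A minimal sketch of how such an opt-in flag could be wired up. The argument name `--ddp_find_unused_parameters` follows the suggestion above; the plain dict stands in for accelerate's `DistributedDataParallelKwargs`, so this is an illustration of the pattern rather than the repository's actual code:

```python
import argparse

# Hypothetical sketch: expose find_unused_parameters as an opt-in CLI flag
# instead of hard-coding it, mirroring the other --ddp_* options.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--ddp_find_unused_parameters",
    action="store_true",
    help="enable DDP unused-parameter detection (slows training when not needed)",
)
args = parser.parse_args(["--ddp_find_unused_parameters"])

# Stand-in for DistributedDataParallelKwargs(find_unused_parameters=...).
# Only built when the user opted in, so the default fast path is unchanged.
ddp_kwargs = {"find_unused_parameters": True} if args.ddp_find_unused_parameters else None
print(ddp_kwargs)
```

With the flag omitted, `ddp_kwargs` stays `None` and DDP keeps its default (faster) behavior, which is the point of making it an argument.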

Comment on lines 1882 to 1886
# Ensure DDP is properly configured for models with unused parameters
if hasattr(transformer, 'module') and hasattr(transformer.module, 'find_unused_parameters'):
transformer.module.find_unused_parameters = True
elif hasattr(transformer, 'find_unused_parameters'):
transformer.find_unused_parameters = True
@kohya-ss (Owner)

There seems to be no point in overriding find_unused_parameters here; it will already be True if configured correctly.
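To illustrate why a post-construction override is a no-op: like DDP, the toy wrapper below consumes the flag once in `__init__` to set up its internal reducer state, so writing the attribute afterwards changes nothing that is actually read. This is a hypothetical stand-in, not `torch.nn.parallel.DistributedDataParallel`:

```python
class ToyDDP:
    """Toy stand-in for a DDP wrapper: the flag is consumed at construction."""

    def __init__(self, find_unused_parameters=False):
        self.find_unused_parameters = find_unused_parameters
        # Internal state is derived once, here; later attribute writes are
        # never read again (analogous to DDP building its reducer in __init__).
        self._reducer_detects_unused = find_unused_parameters

    def detects_unused(self):
        return self._reducer_detects_unused


model = ToyDDP(find_unused_parameters=True)
model.find_unused_parameters = True   # redundant: already True at construction
print(model.detects_unused())         # True

late = ToyDDP(find_unused_parameters=False)
late.find_unused_parameters = True    # too late: "reducer" was already built
print(late.detects_unused())          # False
```

This is why the override in the diff above has no effect: the correct place to set the flag is the `DistributedDataParallelKwargs` passed to the accelerator, before `accelerator.prepare()` wraps the model.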

Comment on lines 2123 to 2124
unwrapped_transformer = accelerator.unwrap_model(transformer)
logger.info(f"DiT dtype: {unwrapped_transformer.dtype}, device: {unwrapped_transformer.device}")
@kohya-ss (Owner)

Defining a new local variable unwrapped_transformer may prevent garbage collection later; it is better to call it directly: accelerator.unwrap_model(transformer).dtype and accelerator.unwrap_model(transformer).device.
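The garbage-collection concern can be demonstrated in plain Python with `weakref`: an extra local binding keeps an object reachable until that name itself is dropped. This is a generic illustration with a stand-in class, not the musubi-tuner code:

```python
import gc
import weakref


class Unwrapped:
    """Stand-in for the object returned by accelerator.unwrap_model()."""
    dtype = "torch.bfloat16"


obj = Unwrapped()
probe = weakref.ref(obj)  # lets us observe whether the object is still alive

alias = obj    # like `unwrapped_transformer = accelerator.unwrap_model(...)`
del obj
gc.collect()
print(probe() is None)  # False: the alias alone keeps the object alive

del alias      # calling unwrap_model(...) inline avoids creating this binding
gc.collect()
print(probe() is None)  # True: nothing references the object any more
```

In a long training loop, an unneeded binding like this lives until the enclosing scope ends, which is why the inline call is preferable.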

…ddp_find_unused_parameters argument to hv_train.py and hv_train_network.py
- Remove unnecessary override of find_unused_parameters after accelerator.prepare()
- Fix unwrapped_transformer variable to prevent garbage collection issues
- Revert unintentional changes to qwen_image_train_network.py
- Make DDP configuration consistent across all training scripts
@FurkanGozukara (Contributor, Author)

@kohya-ss I just tried to make the changes you requested.

I am OK with any way of fixing this, thank you.

…ddp_find_unused_parameters argument to hv_train.py and hv_train_network.py
- Remove unnecessary override of find_unused_parameters after accelerator.prepare()
- Fix unwrapped_transformer variable to prevent garbage collection issues
- Keep intentional fix in qwen_image_train_network.py for multi-GPU training
- Make DDP configuration consistent across all training scripts
@kohya-ss (Owner)

Thank you for the update. Ruff's lint check is reporting an error, so please format it using ruff format.

Also, it seems that parts of qwen_image_train_network.py that are unrelated to this fix have been updated. Is there any reason for this?

@FurkanGozukara (Contributor, Author)

> Thank you for the update. Ruff's lint check is reporting an error, so please format it using ruff format.
>
> Also, it seems that parts of qwen_image_train_network.py that are unrelated to this fix have been updated. Is there any reason for this?

Sure, I will work on that, but have you seen this error?

Error 2

/workspace/SECourses_Musubi_Trainer/venv/bin/accelerate launch --dynamo_backend no --gpu_ids 0,1 --mixed_precision bf16 --multi_gpu --num_processes 2 --num_machines 1 --num_cpu_threads_per_process 2 /workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py --config_file /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml --ddp_gradient_as_bucket_view

11:04:41-006719 INFO     Executing command: bash /tmp/tmpstm435s8.sh                                                                                                                                                                                                                              
Starting text encoder output caching...
Trying to import sageattention
Successfully imported sageattention
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
  is_image_dataset: True
  resolution: (1328, 1328)
  batch_size: 1
  num_repeats: 1
  caption_extension: ".txt"
  enable_bucket: False
  bucket_no_upscale: False
  cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
  debug_dataset: False
    image_directory: "/workspace/28_imgs_1328/1_ohwx"
    image_jsonl_file: "None"
    fp_latent_window_size: 9
    fp_1f_clean_indices: None
    fp_1f_target_index: None
    fp_1f_no_post: False
    flux_kontext_no_resize_control: False
    qwen_image_edit_no_resize_control: False
    qwen_image_edit_control_resolution: None


INFO:__main__:Loading Qwen2.5-VL: /workspace/Training_Models_Qwen/qwen_2.5_vl_7b_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loading state dict from /workspace/Training_Models_Qwen/qwen_2.5_vl_7b_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loaded Qwen2.5-VL: <All keys matched successfully>
INFO:musubi_tuner.qwen_image.qwen_image_utils:Setting Qwen2.5-VL to dtype: torch.bfloat16
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loading tokenizer from Qwen/Qwen-Image
INFO:__main__:Encoding with Qwen2.5-VL
INFO:musubi_tuner.cache_text_encoder_outputs:Encoding dataset [0]
28it [00:00, 46.32it/s]
Text encoder caching completed successfully!
Starting training...
Trying to import sageattention
Successfully imported sageattention
Trying to import sageattention
Successfully imported sageattention
INFO:musubi_tuner.hv_train_network:Loading settings from /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml...
INFO:musubi_tuner.hv_train_network:/workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
  is_image_dataset: True
  resolution: (1328, 1328)
  batch_size: 1
  num_repeats: 1
  caption_extension: ".txt"
  enable_bucket: False
  bucket_no_upscale: False
  cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
  debug_dataset: False
    image_directory: "/workspace/28_imgs_1328/1_ohwx"
    image_jsonl_file: "None"
    fp_latent_window_size: 9
    fp_1f_clean_indices: None
    fp_1f_target_index: None
    fp_1f_no_post: False
    flux_kontext_no_resize_control: False
    qwen_image_edit_no_resize_control: False
    qwen_image_edit_control_resolution: None


INFO:musubi_tuner.dataset.image_video_dataset:bucket: (1328, 1328), count: 28
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 28
INFO:__main__:preparing accelerator
INFO:musubi_tuner.hv_train_network:Loading settings from /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml...
INFO:musubi_tuner.hv_train_network:/workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
  is_image_dataset: True
  resolution: (1328, 1328)
  batch_size: 1
  num_repeats: 1
  caption_extension: ".txt"
  enable_bucket: False
  bucket_no_upscale: False
  cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
  debug_dataset: False
    image_directory: "/workspace/28_imgs_1328/1_ohwx"
    image_jsonl_file: "None"
    fp_latent_window_size: 9
    fp_1f_clean_indices: None
    fp_1f_target_index: None
    fp_1f_no_post: False
    flux_kontext_no_resize_control: False
    qwen_image_edit_no_resize_control: False
    qwen_image_edit_control_resolution: None


INFO:musubi_tuner.dataset.image_video_dataset:bucket: (1328, 1328), count: 28
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 28
INFO:__main__:preparing accelerator
accelerator device: cuda:1
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:Loading DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
accelerator device: cuda:0
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:Loading DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
INFO:__main__:Loading weights from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:Loading weights from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:Loaded DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors, info=<All keys matched successfully>
QwenModel: Gradient checkpointing enabled. Activation CPU offloading: False
INFO:musubi_tuner.hv_train_network:use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
WARNING:musubi_tuner.hv_train_network:constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
INFO:__main__:Loaded DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors, info=<All keys matched successfully>
QwenModel: Gradient checkpointing enabled. Activation CPU offloading: False
prepare optimizer, data loader etc.
number of trainable parameters: 20430401088
INFO:musubi_tuner.hv_train_network:use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
WARNING:musubi_tuner.hv_train_network:constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
override steps. steps for 200 epochs is / 指定エポックまでのステップ数: 2800
enable full bf16 training.
INFO:__main__:set DiT model name for metadata: /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:set VAE model name for metadata: /workspace/Training_Models_Qwen/qwen_train_vae.safetensors
INFO:__main__:DiT dtype: torch.bfloat16, device: cuda:1
running training / 学習開始
  num train items / 学習画像、動画数: 28
  num batches per epoch / 1epochのバッチ数: 14
  num epochs / epoch数: 200
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 2800
INFO:__main__:set DiT model name for metadata: /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:set VAE model name for metadata: /workspace/Training_Models_Qwen/qwen_train_vae.safetensors
steps:   0%|                                                                                     | 0/2800 [00:00<?, ?it/s]INFO:__main__:DiT dtype: torch.bfloat16, device: cuda:0

epoch 1/200
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
steps:   0%|                                                           | 1/2800 [00:04<3:13:23,  4.15s/it, avr_loss=0.125][rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 724, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 720, in main
[rank1]:     trainer.train(args)
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 550, in train
[rank1]:     model_pred, target = self.call_dit(
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train_network.py", line 427, in call_dit
[rank1]:     model_pred = model(
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1644, in forward
[rank1]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1533, in _pre_forward
[rank1]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank1]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank1]: making sure all `forward` function outputs participate in calculating loss. 
[rank1]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank1]: Parameter indices which did not receive grad for rank 1: 1915 1916 1925 1926 1927 1928
[rank1]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 724, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 720, in main
[rank0]:     trainer.train(args)
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 550, in train
[rank0]:     model_pred, target = self.call_dit(
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train_network.py", line 427, in call_dit
[rank0]:     model_pred = model(
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1644, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1533, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank0]: making sure all `forward` function outputs participate in calculating loss. 
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 1915 1916 1925 1926 1927 1928
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
steps:   0%|                                                           | 1/2800 [00:04<3:24:49,  4.39s/it, avr_loss=0.125]
[rank0]:[W1023 11:05:10.493945548 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1023 11:05:11.598000 4232 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4376 closing signal SIGTERM
E1023 11:05:11.813000 4232 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 4377) of binary: /workspace/SECourses_Musubi_Trainer/venv/bin/python
Traceback (most recent call last):
  File "/workspace/SECourses_Musubi_Trainer/venv/bin/accelerate", line 7, in <module>
    sys.exit(main())
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1204, in launch_command
    multi_gpu_launcher(args)
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
    distrib_run.run(args)
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-10-23_11:05:11
  host      : c4a3d3e2a9ac
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4377)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@kohya-ss (Owner)

> Sure, I will work on that, but have you seen this error?

The modification on qwen_image_train_network.py appears to be just a change in the order of the code.
