I am trying to run multi-GPU Qwen fine-tuning, @kohya-ss. I have updated my pull request to fix both of the errors below.
Error 1
```
INFO:__main__:set DiT model name for metadata: /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:set VAE model name for metadata: /workspace/Training_Models_Qwen/qwen_train_vae.safetensors
steps: 0%| | 0/2800 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 722, in <module>
[rank0]: main()
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 718, in main
[rank0]: trainer.train(args)
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 518, in train
[rank0]: logger.info(f"DiT dtype: {transformer.dtype}, device: {transformer.device}")
[rank0]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
[rank0]: raise AttributeError(
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'. Did you mean: 'type'?
steps: 0%| | 0/2800 [00:00<?, ?it/s]
[rank0]:[W1023 10:55:50.996410765 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1023 10:55:51.715000 2901 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3030 closing signal SIGTERM
E1023 10:55:52.079000 2901 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 3031) of binary: /workspace/SECourses_Musubi_Trainer/venv/bin/python
Traceback (most recent call last):
File "/workspace/SECourses_Musubi_Trainer/venv/bin/accelerate", line 7, in <module>
sys.exit(main())
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1204, in launch_command
multi_gpu_launcher(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
distrib_run.run(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-10-23_10:55:51
host : c4a3d3e2a9ac
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3031)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
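For Error 1, the trainer reads `transformer.dtype` after Accelerate has wrapped the model in `DistributedDataParallel`, which does not proxy attribute access through to the wrapped module. Below is a minimal sketch of the kind of change that avoids this; `unwrap` is an illustrative helper name, and in this codebase `accelerator.unwrap_model(transformer)` should accomplish the same thing.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def unwrap(model: nn.Module) -> nn.Module:
    # DDP keeps the original model in .module; non-wrapped models pass through.
    return model.module if isinstance(model, DDP) else model

# In qwen_image_train.py the logging line would then read attributes from
# the unwrapped module (a sketch, not the exact code from my PR):
# dit = unwrap(transformer)
# logger.info(f"DiT dtype: {dit.dtype}, device: {dit.device}")
```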
Error 2
```
/workspace/SECourses_Musubi_Trainer/venv/bin/accelerate launch --dynamo_backend no --gpu_ids 0,1 --mixed_precision bf16 --multi_gpu --num_processes 2 --num_machines 1 --num_cpu_threads_per_process 2 /workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py --config_file /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml --ddp_gradient_as_bucket_view
11:04:41-006719 INFO Executing command: bash /tmp/tmpstm435s8.sh
Starting text encoder output caching...
Trying to import sageattention
Successfully imported sageattention
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (1328, 1328)
batch_size: 1
num_repeats: 1
caption_extension: ".txt"
enable_bucket: False
bucket_no_upscale: False
cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
debug_dataset: False
image_directory: "/workspace/28_imgs_1328/1_ohwx"
image_jsonl_file: "None"
fp_latent_window_size: 9
fp_1f_clean_indices: None
fp_1f_target_index: None
fp_1f_no_post: False
flux_kontext_no_resize_control: False
qwen_image_edit_no_resize_control: False
qwen_image_edit_control_resolution: None
INFO:__main__:Loading Qwen2.5-VL: /workspace/Training_Models_Qwen/qwen_2.5_vl_7b_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loading state dict from /workspace/Training_Models_Qwen/qwen_2.5_vl_7b_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loaded Qwen2.5-VL: <All keys matched successfully>
INFO:musubi_tuner.qwen_image.qwen_image_utils:Setting Qwen2.5-VL to dtype: torch.bfloat16
INFO:musubi_tuner.qwen_image.qwen_image_utils:Loading tokenizer from Qwen/Qwen-Image
INFO:__main__:Encoding with Qwen2.5-VL
INFO:musubi_tuner.cache_text_encoder_outputs:Encoding dataset [0]
28it [00:00, 46.32it/s]
Text encoder caching completed successfully!
Starting training...
Trying to import sageattention
Successfully imported sageattention
Trying to import sageattention
Successfully imported sageattention
INFO:musubi_tuner.hv_train_network:Loading settings from /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml...
INFO:musubi_tuner.hv_train_network:/workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (1328, 1328)
batch_size: 1
num_repeats: 1
caption_extension: ".txt"
enable_bucket: False
bucket_no_upscale: False
cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
debug_dataset: False
image_directory: "/workspace/28_imgs_1328/1_ohwx"
image_jsonl_file: "None"
fp_latent_window_size: 9
fp_1f_clean_indices: None
fp_1f_target_index: None
fp_1f_no_post: False
flux_kontext_no_resize_control: False
qwen_image_edit_no_resize_control: False
qwen_image_edit_control_resolution: None
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (1328, 1328), count: 28
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 28
INFO:__main__:preparing accelerator
INFO:musubi_tuner.hv_train_network:Loading settings from /workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441.toml...
INFO:musubi_tuner.hv_train_network:/workspace/Lora/My_Qwen_Fine_Tuned_Model_20251023-110441
INFO:__main__:Load dataset config from /workspace/Lora/dataset_config_20251023_105449.toml
INFO:musubi_tuner.dataset.image_video_dataset:glob images in /workspace/28_imgs_1328/1_ohwx
INFO:musubi_tuner.dataset.image_video_dataset:found 28 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (1328, 1328)
batch_size: 1
num_repeats: 1
caption_extension: ".txt"
enable_bucket: False
bucket_no_upscale: False
cache_directory: "/workspace/28_imgs_1328/1_ohwx/cache_dir"
debug_dataset: False
image_directory: "/workspace/28_imgs_1328/1_ohwx"
image_jsonl_file: "None"
fp_latent_window_size: 9
fp_1f_clean_indices: None
fp_1f_target_index: None
fp_1f_no_post: False
flux_kontext_no_resize_control: False
qwen_image_edit_no_resize_control: False
qwen_image_edit_control_resolution: None
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (1328, 1328), count: 28
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 28
INFO:__main__:preparing accelerator
accelerator device: cuda:1
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:Loading DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
accelerator device: cuda:0
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:Loading DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
INFO:__main__:Loading weights from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:Loading weights from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:Loaded DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors, info=<All keys matched successfully>
QwenModel: Gradient checkpointing enabled. Activation CPU offloading: False
INFO:musubi_tuner.hv_train_network:use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
WARNING:musubi_tuner.hv_train_network:constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
INFO:__main__:Loaded DiT model from /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors, info=<All keys matched successfully>
QwenModel: Gradient checkpointing enabled. Activation CPU offloading: False
prepare optimizer, data loader etc.
number of trainable parameters: 20430401088
INFO:musubi_tuner.hv_train_network:use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
WARNING:musubi_tuner.hv_train_network:constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
override steps. steps for 200 epochs is / 指定エポックまでのステップ数: 2800
enable full bf16 training.
INFO:__main__:set DiT model name for metadata: /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:set VAE model name for metadata: /workspace/Training_Models_Qwen/qwen_train_vae.safetensors
INFO:__main__:DiT dtype: torch.bfloat16, device: cuda:1
running training / 学習開始
num train items / 学習画像、動画数: 28
num batches per epoch / 1epochのバッチ数: 14
num epochs / epoch数: 200
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 2800
INFO:__main__:set DiT model name for metadata: /workspace/Training_Models_Qwen/Qwen_Image_Edit_Plus_2509_bf16.safetensors
INFO:__main__:set VAE model name for metadata: /workspace/Training_Models_Qwen/qwen_train_vae.safetensors
steps: 0%| | 0/2800 [00:00<?, ?it/s]INFO:__main__:DiT dtype: torch.bfloat16, device: cuda:0
epoch 1/200
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
steps: 0%| | 1/2800 [00:04<3:13:23, 4.15s/it, avr_loss=0.125][rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 724, in <module>
[rank1]: main()
[rank1]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 720, in main
[rank1]: trainer.train(args)
[rank1]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 550, in train
[rank1]: model_pred, target = self.call_dit(
[rank1]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train_network.py", line 427, in call_dit
[rank1]: model_pred = model(
[rank1]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1644, in forward
[rank1]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1533, in _pre_forward
[rank1]: if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank1]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank1]: making sure all `forward` function outputs participate in calculating loss.
[rank1]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank1]: Parameter indices which did not receive grad for rank 1: 1915 1916 1925 1926 1927 1928
[rank1]: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 724, in <module>
[rank0]: main()
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 720, in main
[rank0]: trainer.train(args)
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 550, in train
[rank0]: model_pred, target = self.call_dit(
[rank0]: File "/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train_network.py", line 427, in call_dit
[rank0]: model_pred = model(
[rank0]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1644, in forward
[rank0]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]: File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1533, in _pre_forward
[rank0]: if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 1915 1916 1925 1926 1927 1928
[rank0]: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
steps: 0%| | 1/2800 [00:04<3:24:49, 4.39s/it, avr_loss=0.125]
[rank0]:[W1023 11:05:10.493945548 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1023 11:05:11.598000 4232 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4376 closing signal SIGTERM
E1023 11:05:11.813000 4232 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 4377) of binary: /workspace/SECourses_Musubi_Trainer/venv/bin/python
Traceback (most recent call last):
File "/workspace/SECourses_Musubi_Trainer/venv/bin/accelerate", line 7, in <module>
sys.exit(main())
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1204, in launch_command
multi_gpu_launcher(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
distrib_run.run(args)
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/SECourses_Musubi_Trainer/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/SECourses_Musubi_Trainer/musubi-tuner/src/musubi_tuner/qwen_image_train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-10-23_11:05:11
host : c4a3d3e2a9ac
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 4377)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
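For Error 2, the traceback itself names the standard mitigation: pass `find_unused_parameters=True` to DDP. With Accelerate this is done through a kwargs handler rather than on the launch command line. A minimal sketch, assuming the trainer constructs its own `Accelerator` (the parameter indices 1915-1928 receiving no grad suggest some blocks genuinely do not contribute to the loss on every step):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# Let DDP tolerate parameters that receive no gradient in a given step.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```

Since gradient checkpointing is enabled in this run, `DistributedDataParallelKwargs(static_graph=True)` may be worth trying instead; the cleaner long-term fix is to ensure every `forward` output participates in the loss, as the error message says.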