Attempting to run Qwen-Image fine-tuning on 2xRTX 4090 machine and getting OOM errors. #498

jferments · 2025-08-25T15:18:45Z

jferments
Aug 25, 2025

I am trying to run a full fine tune on my 2xRTX4090 machine. It has 512GB system RAM so I am fine to use as much memory as needed there to offload layers. But I keep getting CUDA OOM errors when trying to use accelerate to run the Qwen-Image training script from PR #492

I tried to run it both with and without FSDP enabled, and with/without "--blocks_to_swap 24" (I also tried 36, 48, and 59 blocks)... I am getting OOM errors for both of them. I tried both FSDP v1 and v2 .. neither worked. This is my first time using musubi-tuner, so I very well might be doing something stupid in my command:

Here is the command that I am trying to run for my test dataset:

export CUDA_VISIBLE_DEVICES=0,1
export TF_CPP_MIN_LOG_LEVEL=2
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
ACC_CFG=~/.cache/huggingface/accelerate/default_config.yaml

LOGDIR=./logs/musubi
TOML=./data/qwen-training/tomls/qwen_warmup_256.toml

accelerate launch \
  --config_file "$ACC_CFG" \
  ../musubi-tuner/src/musubi_tuner/qwen_image_train.py \
  --dit ~/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors \
  --text_encoder ~/code/ComfyUI/models/text_encoders/qwen2.5-vl-7b.safetensors \
  --vae ~/code/ComfyUI/models/vae/qwen_image.safetensors \
  --dataset_config "$TOML" \
  --mixed_precision bf16 --full_bf16 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --max_grad_norm 0 \
  --lr_scheduler constant_with_warmup --lr_warmup_steps 10 \
  --timestep_sampling qwen_shift --weighting_scheme none \
  --gradient_checkpointing \
  --sdpa --split_attn \
  --fused_backward_pass \
  --max_data_loader_n_workers 1 \
  --max_train_epochs 1 \
  --save_every_n_epochs 1 --save_last_n_epochs_state 1 \
  --mem_eff_save --save_state \
  --logging_dir "$LOGDIR" --log_prefix "qwenimg_fullft_fsdpv1_" \
  --output_dir /checkpoints/qwen-image-fullft-fsdp \
  --output_name qwenimg-fullft-fsdp \
  --log_with tensorboard

Here is my "accelerate config" choices:

(.venv) jferments@ML-tower:~/code/phunt/data-prep$ accelerate config
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]: yes
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------What should be your FSDP version? [2]:
1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------What should be your sharding strategy?
FULL_SHARD
Do you want to offload parameters and gradients to CPU? [yes/NO]: yes
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------What should be your auto wrap policy?
SIZE_BASED_WRAP
What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: 1000000
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------What should be your FSDP's backward prefetch policy?
BACKWARD_PRE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------What should be your FSDP's state dict type?
FULL_STATE_DICT
Do you want to enable FSDP's forward prefetch policy? [yes/NO]: no
Do you want to enable FSDP's `use_orig_params` feature? [YES/no]: no
Do you want to enable CPU RAM efficient model loading? Only applicable for 🤗 Transformers models. [YES/no]: no
Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start? [YES/no]: yes
Do you want to enable FSDP activation checkpointing? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Do you wish to use mixed precision?
bf16
accelerate configuration saved at /home/jferments/.cache/huggingface/accelerate/default_config.yaml

(Side note: one question I had here was whether I should used "TRANSFORMER_BASED_WRAP" instead of "SIZE_BASED_WRAP", but I wasn't sure how to answer the question: "Specify the comma-separated list of transformer layer class names (case-sensitive) to wrap ,e.g, :BertLayer, GPTJBlock, T5Block, BertLayer,BertEmbeddings,BertSelfOutput ...?")

And here is the error I'm getting when I run the command above:

(.venv) jferments@ML-tower:~/code/phunt/data-prep$ export CUDA_VISIBLE_DEVICES=0,1
export TF_CPP_MIN_LOG_LEVEL=2
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
ACC_CFG=~/.cache/huggingface/accelerate/default_config.yaml

LOGDIR=./logs/musubi
TOML=./data/qwen-training/tomls/qwen_warmup_256.toml

accelerate launch \
  --config_file "$ACC_CFG" \
  ../musubi-tuner/src/musubi_tuner/qwen_image_train.py \
  --dit ~/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors \
  --text_encoder ~/code/ComfyUI/models/text_encoders/qwen2.5-vl-7b.safetensors \
  --vae ~/code/ComfyUI/models/vae/qwen_image.safetensors \
  --dataset_config "$TOML" \
  --mixed_precision bf16 --full_bf16 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --max_grad_norm 0 \
  --lr_scheduler constant_with_warmup --lr_warmup_steps 10 \
  --timestep_sampling qwen_shift --weighting_scheme none \
  --gradient_checkpointing \
  --sdpa --split_attn \
  --fused_backward_pass \
  --max_data_loader_n_workers 1 \
  --max_train_epochs 1 \
  --save_every_n_epochs 1 --save_last_n_epochs_state 1 \
  --mem_eff_save --save_state \
  --logging_dir "$LOGDIR" --log_prefix "qwenimg_fullft_fsdpv1_" \
  --output_dir /checkpoints/qwen-image-fullft-fsdp \
  --output_name qwenimg-fullft-fsdp \
  --log_with tensorboard
Trying to import sageattention
Failed to import sageattention
Trying to import sageattention
Failed to import sageattention
2025-08-25 08:12:26.666414: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756134746.677197 1106784 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756134746.680162 1106784 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756134746.689717 1106784 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.689735 1106784 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.689737 1106784 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.689740 1106784 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-08-25 08:12:26.740587: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756134746.754978 1106783 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756134746.758614 1106783 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756134746.768070 1106783 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.768095 1106783 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.768098 1106783 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756134746.768103 1106783 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
INFO:__main__:Load dataset config from data/qwen-training/tomls/qwen_warmup_256.toml
INFO:musubi_tuner.dataset.image_video_dataset:load image jsonl from /home/jferments/code/phunt/data-prep/data/qwen-training/256px/images.jsonl
INFO:musubi_tuner.dataset.image_video_dataset:loaded 5000 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
  is_image_dataset: True
  resolution: (256, 256)
  batch_size: 1
  num_repeats: 1
  caption_extension: "None"
  enable_bucket: True
  bucket_no_upscale: True
  cache_directory: "/home/jferments/code/phunt/data-prep/data/qwen-training/256px/cache_train"
  debug_dataset: False
    image_directory: "None"
    image_jsonl_file: "/home/jferments/code/phunt/data-prep/data/qwen-training/256px/images.jsonl"
    fp_latent_window_size: 9
    fp_1f_clean_indices: None
    fp_1f_target_index: None
    fp_1f_no_post: False
    flux_kontext_no_resize_control: False
    qwen_image_edit_no_resize_control: False
    qwen_image_edit_control_resolution: None


INFO:musubi_tuner.dataset.image_video_dataset:bucket: (176, 336), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (176, 352), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 288), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 304), count: 34
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 320), count: 15
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 336), count: 29
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (208, 288), count: 240
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (208, 304), count: 2520
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (224, 272), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (224, 288), count: 4
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (240, 256), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (256, 256), count: 5
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (272, 224), count: 5
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (288, 208), count: 199
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (288, 224), count: 4
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (304, 192), count: 334
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (304, 208), count: 1317
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (320, 192), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (336, 176), count: 61
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (336, 192), count: 205
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (352, 176), count: 3
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (368, 176), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 5000
INFO:__main__:preparing accelerator
INFO:__main__:Load dataset config from data/qwen-training/tomls/qwen_warmup_256.toml
INFO:musubi_tuner.dataset.image_video_dataset:load image jsonl from /home/jferments/code/phunt/data-prep/data/qwen-training/256px/images.jsonl
INFO:musubi_tuner.dataset.image_video_dataset:loaded 5000 images
INFO:musubi_tuner.dataset.config_utils:[Dataset 0]
  is_image_dataset: True
  resolution: (256, 256)
  batch_size: 1
  num_repeats: 1
  caption_extension: "None"
  enable_bucket: True
  bucket_no_upscale: True
  cache_directory: "/home/jferments/code/phunt/data-prep/data/qwen-training/256px/cache_train"
  debug_dataset: False
    image_directory: "None"
    image_jsonl_file: "/home/jferments/code/phunt/data-prep/data/qwen-training/256px/images.jsonl"
    fp_latent_window_size: 9
    fp_1f_clean_indices: None
    fp_1f_target_index: None
    fp_1f_no_post: False
    flux_kontext_no_resize_control: False
    qwen_image_edit_no_resize_control: False
    qwen_image_edit_control_resolution: None


INFO:musubi_tuner.dataset.image_video_dataset:bucket: (176, 336), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (176, 352), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 288), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 304), count: 34
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 320), count: 15
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (192, 336), count: 29
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (208, 288), count: 240
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (208, 304), count: 2520
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (224, 272), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (224, 288), count: 4
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (240, 256), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (256, 256), count: 5
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (272, 224), count: 5
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (288, 208), count: 199
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (288, 224), count: 4
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (304, 192), count: 334
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (304, 208), count: 1317
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (320, 192), count: 7
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (336, 176), count: 61
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (336, 192), count: 205
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (352, 176), count: 3
INFO:musubi_tuner.dataset.image_video_dataset:bucket: (368, 176), count: 1
INFO:musubi_tuner.dataset.image_video_dataset:total batches: 5000
INFO:__main__:preparing accelerator
accelerator device: cuda:0
accelerator device: cuda:1
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:DiT precision: torch.bfloat16
INFO:__main__:Loading DiT model from /home/jferments/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors
INFO:__main__:Loading DiT model from /home/jferments/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
INFO:musubi_tuner.qwen_image.qwen_image_model:Creating QwenImageTransformer2DModel
INFO:__main__:Loading weights from /home/jferments/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors
INFO:__main__:Loading weights from /home/jferments/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 729, in <module>
[rank0]:     main()
[rank0]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 725, in main
[rank0]:     trainer.train(args)
[rank0]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 223, in train
[rank0]:     transformer = self.load_transformer(accelerator, args, args.dit, attn_mode, args.split_attn, loading_device, dit_dtype)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 99, in load_transformer
[rank0]:     state_dict = load_safetensors(dit_path, device=loading_device, disable_mmap=True, dtype=dit_weight_dtype)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jferments/code/phunt/musubi-tuner/src/musubi_tuner/utils/safetensors_utils.py", line 183, in load_safetensors
[rank0]:     state_dict[key] = f.get_tensor(key).to(device, dtype=dtype)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 23.43 GiB of which 25.25 MiB is free. Including non-PyTorch memory, this process has 23.31 GiB memory in use. Of the allocated memory 22.93 GiB is allocated by PyTorch, and 7.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W825 08:12:50.031683269 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 729, in <module>
[rank1]:     main()
[rank1]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 725, in main
[rank1]:     trainer.train(args)
[rank1]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 223, in train
[rank1]:     transformer = self.load_transformer(accelerator, args, args.dit, attn_mode, args.split_attn, loading_device, dit_dtype)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jferments/code/phunt/data-prep/../musubi-tuner/src/musubi_tuner/qwen_image_train.py", line 99, in load_transformer
[rank1]:     state_dict = load_safetensors(dit_path, device=loading_device, disable_mmap=True, dtype=dit_weight_dtype)
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/jferments/code/phunt/musubi-tuner/src/musubi_tuner/utils/safetensors_utils.py", line 183, in load_safetensors
[rank1]:     state_dict[key] = f.get_tensor(key).to(device, dtype=dtype)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 1 has a total capacity of 23.43 GiB of which 114.56 MiB is free. Process 822912 has 17.41 MiB memory in use. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.66 GiB is allocated by PyTorch, and 3.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0825 08:12:51.796000 1106684 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1106784 closing signal SIGTERM
E0825 08:12:52.060000 1106684 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 1106783) of binary: /home/jferments/code/phunt/.venv/bin/python
Traceback (most recent call last):
  File "/home/jferments/code/phunt/.venv/bin/accelerate", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1198, in launch_command
    multi_gpu_launcher(args)
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jferments/code/phunt/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../musubi-tuner/src/musubi_tuner/qwen_image_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-25_08:12:51
  host      : ML-tower
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1106783)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Any ideas how to prevent OOM errors on my machine?

jferments · 2025-09-03T22:02:29Z

jferments
Sep 3, 2025
Author

I am able to get it to run on a single GPU on my machine with a dataset of 256px test images using the following command:

accelerate launch  --num_processes 1  --num_cpu_threads_per_process 16 \
  ../musubi-tuner/src/musubi_tuner/qwen_image_train.py \
  --dit ~/code/ComfyUI/models/diffusion_models/qwen_image_bf16.safetensors \
  --text_encoder ~/code/ComfyUI/models/text_encoders/qwen2.5-vl-7b.safetensors \
  --vae ~/code/ComfyUI/models/vae/qwen_image.safetensors \
  --dataset_config "$TOML" \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --learning_rate 5e-7 \
  --timestep_sampling qwen_shift \
  --weighting_scheme none \
  --fused_backward_pass \
  --sdpa \
  --max_data_loader_n_workers 2 \
  --persistent_data_loader_workers \
  --max_train_epochs 1 \
  --save_every_n_epochs 1 \
  --save_last_n_epochs_state 1 \
  --save_state \
  --logging_dir "$LOGDIR" \
  --log_prefix "qwenimg_fullft_" \
  --output_dir ./checkpoints/qwen-image-fullft \
  --output_name qwenimg-fullft \
  --log_with tensorboard \
  --blocks_to_swap 32 \
  --max_grad_norm 0 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 10 \
  --full_bf16 \
  --split_attn

Unfortunately, this is only utilizing one of my two GPUs, and ~80GB out of 512GB system RAM.

Note that even with 32 blocks swapped my GPU VRAM is completely full on tiny 256px images:

And I am only getting about 4.6 iterations/second:

So I am doubtful that even with all blocks swapped, I will be able to move on to the next stages of training with higher resolution images.

I am happy that at least I was able to get it to run at all, and wanted to share that here for anyone trying to train on a 4090, but I really need to figure out a way to get multi-GPU running on this machine because I'm essentially running training at 50% of the speed it's capable of right now by not being able to use the second GPU for parallel training.

Even if I had to load a complete second copy of everything into system RAM, I'd have plenty left over since it's only using 80GB of 512GB. But I can't figure out how to make this happen.

0 replies

Oruli · 2025-09-16T00:41:25Z

Oruli
Sep 16, 2025

@jferments don't use --full_bf16 !!

Use --fp8_base and --fp8_scaled.

Though I would add that I'm not confident with the multigpu implementation in general here, using diffusion-pipe I get way less VRAM usage and seemingly faster speeds.

I'm not technical enough to know why, but this uses accelerate and diffusion-pipe uses deep speed.

1 reply

kohya-ss Sep 16, 2025
Maintainer

For full finetuning, full_bf16 is useful in environments with limited VRAM (approx. <100~120GB).

Musubi tuner supports multi-GPU training, but only with DDP. Full finetuning is difficult with DDP and requires FSDP or DeepSpeed, but we have not yet confirmed that this repository works.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Attempting to run Qwen-Image fine-tuning on 2xRTX 4090 machine and getting OOM errors. #498

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Attempting to run Qwen-Image fine-tuning on 2xRTX 4090 machine and getting OOM errors. #498

Uh oh!

Uh oh!

jferments Aug 25, 2025

Replies: 2 comments · 1 reply

Uh oh!

Uh oh!

jferments Sep 3, 2025 Author

Uh oh!

Uh oh!

Oruli Sep 16, 2025

Uh oh!

kohya-ss Sep 16, 2025 Maintainer

jferments
Aug 25, 2025

Replies: 2 comments 1 reply

jferments
Sep 3, 2025
Author

Oruli
Sep 16, 2025

kohya-ss Sep 16, 2025
Maintainer