Replies: 2 comments
Tried different batch size
Bigger batch size clearly makes training faster. Not sure whether it will affect final quality.
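For a rough sense of why a bigger batch makes an epoch faster, here is a small sketch (the dataset size and repeat count below are hypothetical, not from my runs):

```python
# Hypothetical arithmetic: a larger batch size means fewer optimizer steps per epoch,
# so wall-clock time per epoch drops as long as each step still fits in VRAM.
import math

num_images = 200    # hypothetical dataset size
num_repeats = 1     # hypothetical repeat count
for batch_size in (1, 2, 4):
    steps = math.ceil(num_images * num_repeats / batch_size)
    print(f"batch_size={batch_size}: {steps} optimizer steps per epoch")
```

Each step processes more images, so there are proportionally fewer steps per epoch; the open question is only whether that changes the final quality.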
Although the training results have not been verified, it is likely that training quality will also improve, judging from the inference quality. You may have already seen it, but please also see #564.
Did some testing with Qwen Image (not Edit) LoRA args. I have an RTX 3090 Ti with 24GB VRAM.
Qwen training docs suggest using
`--blocks_to_swap 16 --fp8_base --fp8_scaled`

I measured VRAM usage and iteration time with the following argument combinations:

- `--blocks_to_swap 16 --fp8_base --fp8_scaled`
- `--blocks_to_swap 8 --fp8_base --fp8_scaled`
- `--blocks_to_swap 8 --fp8_base`
- `--fp8_base`
- `--fp8_base --fp8_scaled`

Note that ~750MB of VRAM was already being used by Windows.
So I am able to train on 24GB with no block swapping.
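Incidentally, here is a minimal sketch of how I watch VRAM from a separate process while training runs (assuming the nvidia-ml-py / pynvml package, which is not part of musubi-tuner):

```python
# Minimal VRAM sampler; run in a separate terminal alongside training.
# Assumes the nvidia-ml-py (pynvml) package: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

peak_mib = 0
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        used_mib = info.used // (1024 * 1024)
        peak_mib = max(peak_mib, used_mib)
        print(f"used: {used_mib} MiB  peak: {peak_mib} MiB", end="\r")
        time.sleep(1)
except KeyboardInterrupt:
    print(f"\npeak VRAM observed: {peak_mib} MiB")
finally:
    pynvml.nvmlShutdown()
```

This reports total memory in use on the GPU, so it includes the ~750MB that Windows already holds.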
`--fp8_base` seems to be fastest if you consider sampling.

Should I always be using `--fp8_scaled` with `--fp8_base` because of some quality implications? It seems to be adding ~700MB instead of reducing VRAM usage.

Full command:
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 src/musubi_tuner/qwen_image_train_network.py --dit "F:\musubi-tuner\TRAINING\models\qwen_image_bf16.safetensors" --vae "F:\musubi-tuner\TRAINING\models\diffusion_pytorch_model.safetensors" --text_encoder "F:\musubi-tuner\TRAINING\models\qwen_2.5_vl_7b.safetensors" --dataset_config "F:\musubi-tuner\TRAINING\dataset.toml" --output_dir F:\musubi-tuner\TRAINING\output --output_name my_qwen_lora --network_module networks.lora_qwen_image --mixed_precision bf16 --gradient_checkpointing --optimizer_type adamw8bit --network_dim 16 --max_train_epochs 16 --save_every_n_epochs 1 --max_data_loader_n_workers 2 --persistent_data_loader_workers --seed 42 --sample_prompts "F:\musubi-tuner\TRAINING\prompts.txt" --sample_every_n_epochs 1 --learning_rate 2e-4 --sdpa --network_weights "F:\musubi-tuner\TRAINING\output\820.safetensors" <insert specific args from table above here>