Description
I don't know why, but sometimes the training speed decreases considerably out of nowhere. It starts out fine for the first steps, then takes a nosedive.
INFO:musubi_tuner.dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
steps: 7%|███▎ | 250/3646 [2:01:03<27:24:21, 29.05s/it, avr_loss=0.00565]
saving checkpoint: O:/SD/DATASETS/ANIFY/musubi_output\qwen_edit_anify_1024x1024-step00000250.safetensors
steps: 14%|██████▋ | 500/3646 [6:03:45<38:08:48, 43.65s/it, avr_loss=0.00564]
saving checkpoint: O:/SD/DATASETS/ANIFY/musubi_output\qwen_edit_anify_1024x1024-step00000500.safetensors
steps: 21%|█████████▊ | 750/3646 [15:59:42<61:45:46, 76.78s/it, avr_loss=0.00589]
saving checkpoint: O:/SD/DATASETS/ANIFY/musubi_output\qwen_edit_anify_1024x1024-step00000750.safetensors
steps: 22%|██████████▌ | 805/3646 [18:10:12<64:07:33, 81.26s/it, avr_loss=0.00583]
The log above is for a Qwen-Edit LoRA being trained at 1024x1024, which I've done many times (without the speed loss) with pretty much the exact same config.
But every so often, the speed decrease you see here happens for no apparent reason: VRAM usage suddenly increases and overflows into system RAM, which causes the slowdown. Why does it happen sometimes and not every time? How can I avoid it?
It's not as though I launched a game or opened another AI model that consumed VRAM while it was training, or anything like that. The log above is from overnight training, so no external factors were involved.
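To pin down exactly when the spill starts, a minimal watchdog script (hypothetical, separate from musubi_tuner, assuming the nvidia-ml-py / pynvml package is installed) could run alongside training and timestamp dedicated VRAM usage:

import time
from datetime import datetime

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 = the 5060 Ti

try:
    while True:
        # nvmlDeviceGetMemoryInfo reports dedicated VRAM in bytes
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        used_gb = mem.used / 1024**3
        total_gb = mem.total / 1024**3
        print(f"{datetime.now():%Y-%m-%d %H:%M:%S}  "
              f"VRAM {used_gb:.2f} / {total_gb:.2f} GiB", flush=True)
        time.sleep(30)
finally:
    pynvml.nvmlShutdown()

Correlating those timestamps with the step timings in the log would show whether the it/s drop lines up with a jump in VRAM usage.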
EDIT:
I should probably show you my config:
accelerate launch ^
--num_cpu_threads_per_process 1 ^
--mixed_precision bf16 src/musubi_tuner/qwen_image_train_network.py ^
--dit "F:/SD/Qwen-Image-Edit/qwen_image_edit_bf16.safetensors" ^
--vae "F:/SD/Qwen-Image-Edit/qwen_image_vae.safetensors" ^
--text_encoder "F:/SD/Qwen-Image-Edit/qwen_2.5_vl_7b.safetensors" ^
--dataset_config "O:/SD/DATASETS/ANIFY/config_1024.toml" ^
--edit ^
--fp8_base ^
--fp8_scaled ^
--fp8_vl ^
--blocks_to_swap 55 ^
--sdpa ^
--mixed_precision bf16 ^
--timestep_sampling qwen_shift ^
--weighting_scheme none ^
--optimizer_type adamw8bit ^
--learning_rate 0.0002 ^
--gradient_checkpointing ^
--max_data_loader_n_workers 2 ^
--persistent_data_loader_workers ^
--network_module networks.lora_qwen_image ^
--network_dim 64 ^
--max_train_epochs 2 ^
--save_every_n_epochs 1 ^
--save_every_n_steps 250 ^
--seed 42 ^
--logging_dir=logs ^
--network_weights "O:/SD/DATASETS/ANIFY/musubi_output/qwen_edit_anify_640x640-step00006000.safetensors" ^
--output_dir "O:/SD/DATASETS/ANIFY/musubi_output" ^
--output_name "qwen_edit_anify_1024x1024"
I used to train successfully at rank 16 or 32 with 45 blocks swapped at 1024x1024. I can't remember whether I ever had this issue with those settings.
At some point I increased the number of swapped blocks to 55 to leave some spare VRAM for safety. I'm not sure whether that was when the issue started or whether it came from updating the program, but from then on I was facing it even when training at rank 16. Right now I'm training at rank 64, but I don't think that's related.
EDIT2: Windows 10, RTX 5060 Ti 16 GB, 64 GB RAM
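To rule out another process quietly grabbing VRAM during the overnight run, a rough logger around plain nvidia-smi (nothing extra to install; note that per-process used_memory can show as N/A under Windows/WDDM, but the process list itself is still useful) could look like this:

import subprocess
import time
from datetime import datetime

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]

while True:
    # Dump every compute process currently holding GPU memory
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
    print(f"--- {datetime.now():%Y-%m-%d %H:%M:%S} ---")
    print(out if out else "(no compute processes reported)")
    time.sleep(300)  # every 5 minutes is enough for an overnight run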