Wan 2.2 training and general discussion -- advice, ideas, questions #455
-
I just started with Wan 2.2 myself! As for your questions, the only one I really have an answer for is 3, and in my opinion it's a solid no. The final size of a LoRA is a function of its rank/dimension and the number of parameters in the base model. Wan 2.2 is much larger than SDXL, so a LoRA of a given rank is MUCH bigger and more capable. I personally almost always train Wan at rank 16/alpha 16, sometimes even less for a simple character LoRA. A smaller rank is not only easier to share, it plays nicer with other LoRA and can be less prone to learning parts of the training data you aren't targeting. With smaller models like SDXL or SD, bumping the rank was beneficial to increase the learning capacity, but it's just not really necessary here. This applies to all the "larger" diffusion transformers like Wan 2.1/2.2, Hunyuan, Flux, etc. If you were using the 5B Wan 2.2 or the 1.3B Wan 2.1, then it can be beneficial, but not for the biggies! Of course, feel free to experiment!

So anyway, I have 16GB of VRAM on my 4070 Ti Super. For my first foray last night, I used the same exact settings I'd been using for Wan 2.1 to train Wan 2.2 high. The training ran for about 8 hours; the optimized compile I'm working on really helps! Loss looks great, and initial results are quite promising, though unrefined without the low noise LoRA. I need to train the low noise model tonight to fully see how it turns out. A few of those options are unique to my Blissful Tuner (https://github.com/Sarania/blissful-tuner/); specifically, I used multiple resolutions on the same video dataset.

For the low noise model tonight, I was thinking of removing the lower-res 480x272x65f bucket and adding an 848x480x21f bucket to show a little more detail to it. The idea is to play to the model's way of working: the high noise model decides overall structure and layout, while the low noise model refines it into something aesthetically pleasing and adds fine details. I don't know if this is beneficial, but I'm just following my intuition as always. That's where I am currently! I'll definitely come back to this thread and share more when I have more results, cheers!
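To put a rough number on the size point above, here is a quick back-of-the-envelope sketch. The layer shapes are made-up placeholders, not Wan's or SDXL's actual dimensions; the takeaway is just that adapter size scales linearly with rank and with how large and numerous the wrapped layers are:

```python
# Back-of-the-envelope LoRA size estimate: bytes scale linearly with rank
# and with the shapes/count of the layers the adapter wraps.
def lora_param_count(layer_shapes, rank):
    # Each adapted weight W (out x in) gets two low-rank factors:
    # A with shape (rank x in) and B with shape (out x rank).
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)

# Hypothetical layer lists, NOT the real Wan architectures.
big_model = [(5120, 5120)] * 160     # stand-in for a 14B-class DiT
small_model = [(1536, 1536)] * 60    # stand-in for a 1.3B-class model

for rank in (16, 32, 64):
    big_mb = lora_param_count(big_model, rank) * 2 / 1e6    # 2 bytes per bf16 weight
    small_mb = lora_param_count(small_model, rank) * 2 / 1e6
    print(f"rank {rank}: ~{big_mb:.0f} MB (14B-class) vs ~{small_mb:.0f} MB (1.3B-class)")
```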
-
So that run didn't really work out. I tried a new run last night where I did 3e-5 with a loraplus ratio of 4 for 1600 steps with a shift of 12. This lets me train high and low back to back in a single night; it took about 12 hours! It's looking pretty good too. I did 360x360x65 and 512x512x33 for the high noise model, and for the low noise model I added in 640x640x21 to show some more detail. The rest was the same as above. It's looking promising, and I'm especially pleased that I can train both high and low noise in a single night on 16GB!
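For anyone wondering what the loraplus ratio refers to: in LoRA+ the B (up) matrices get a higher learning rate than the A (down) matrices. A minimal sketch of the idea, with placeholder parameter names rather than whatever naming your trainer actually uses:

```python
import torch

def loraplus_param_groups(named_params, base_lr=3e-5, ratio=4.0):
    # LoRA+ idea: give the B/up matrices a learning rate that is ratio times
    # higher than the A/down matrices. The "lora_B"/"lora_up" substrings are
    # illustrative; check how your trainer names its LoRA parameters.
    a_params, b_params = [], []
    for name, p in named_params:
        (b_params if ("lora_B" in name or "lora_up" in name) else a_params).append(p)
    return [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * ratio},
    ]

# Usage sketch, assuming `network` holds the LoRA modules:
# optimizer = torch.optim.AdamW(loraplus_param_groups(network.named_parameters()))
```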
-
I don't know what a typical configuration example for the Schedule Free Optimizer looks like. I kind of understand that the other settings aren't necessary, but what should I do about --learning_rate?
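For what it's worth, schedule-free optimizers drop the LR scheduler, not the learning rate: you still need to pass a learning rate as usual. A minimal sketch using the schedulefree package directly (the lr value is a placeholder, and how a given trainer wires this up via its CLI flags may differ):

```python
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(128, 128)        # stand-in for the trainable LoRA params
opt = schedulefree.AdamWScheduleFree(model.parameters(), lr=2e-4)  # lr is a placeholder

opt.train()                              # required before training steps
for _ in range(10):
    loss = model(torch.randn(4, 128)).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
opt.eval()                               # required before evaluating or saving
```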
-
Everyone is training in fp16, but bf16 normally behaves better for training since it has much more dynamic range. Has anyone tested bf16? The original Wan models are fp32, so I think they can be converted to bf16.
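If anyone wants to try it, casting fp32 weights down to bf16 is straightforward. A minimal sketch with PyTorch and safetensors, using made-up file names (whether bf16 actually trains better than fp16 here is exactly the open question):

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical file names; adjust to your checkpoints.
state = load_file("wan2.2_t2v_high_noise_fp32.safetensors")
state_bf16 = {
    k: v.to(torch.bfloat16) if v.is_floating_point() else v  # leave integer buffers alone
    for k, v in state.items()
}
save_file(state_bf16, "wan2.2_t2v_high_noise_bf16.safetensors")
```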
-
Advice and general discussion are appreciated, both on the config provided and more broadly! Hopefully we can share knowledge here to improve everyone's runs.
A few questions to start a discussion:
- `--timestep_sampling shift --discrete_flow_shift 12.0`? (There's a small illustration of what the shift does at the end of this post.)

I'll start by sharing my current config, just doing a test run on images only on t2v 14B high noise. This is with a 24GB VRAM card, running on Linux Mint. The images are batch size 4 with resolution [768, 768]. I managed to squeeze it all into VRAM by running
`export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` in the terminal (only works on Linux, I think).
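If you'd rather not export that in every shell, the same tweak can be set at the top of a launcher script. A minimal sketch; the only requirement is that it's set before torch initializes CUDA, and as noted it seems to be Linux-only:

```python
import os

# Must be set before torch initializes CUDA, so set it before importing torch.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var on purpose

print(torch.cuda.is_available())
```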
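On the `--discrete_flow_shift` question above, here is a rough illustration of what the shift does to the sampled timesteps, assuming the usual flow-matching shift formula t' = s*t / (1 + (s - 1)*t); check your trainer's source for the exact sampling it implements:

```python
import torch

def shift_timesteps(t, s):
    # Usual flow-matching timestep shift; s > 1 pushes samples toward t = 1 (high noise).
    return (s * t) / (1 + (s - 1) * t)

t = torch.rand(100_000)  # uniform draws in [0, 1)
for s in (1.0, 3.0, 12.0):
    frac_high = (shift_timesteps(t, s) > 0.9).float().mean().item()
    print(f"shift={s:>4}: {frac_high:.1%} of samples land above t = 0.9")
```

With a shift of 12, most of the training signal concentrates at the high-noise end, which seems to line up with why a large shift is used when training the high noise expert.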