WAN training - The rules of the trade #182
Replies: 10 comments 17 replies
-
One thing we know is that fp16 precision is generally better than bf16 for Wan: https://blog.comfy.org/p/updates-for-wan-21-and-hunyuan-image so you might consider switching to that. Also, you shouldn't need --fp8_t5 on a 4090; I don't need it on a 4070 Ti SUPER with 16GB (though you'd need the full-size model, of course).

Wan, like Hunyuan, also seems to benefit from LoraPlus ("--network_args loraplus_lr_ratio=X", where X is the multiplier for the LR on the lora_b blocks; 2 or 4 seems good for Wan). That's all I can tell you about Wan, I've only just begun with it! I'm glad we have a discussion place now!

Oh, one more thing: --fp8_scaled can be used in combination with --fp8_base to employ a scaling algorithm that kohya created/ported from one based on HunyuanVideo. It lets the model keep higher accuracy in fp8 precision by being more thoughtful about the conversion back and forth, basically. An fp16 base model plus --fp8_base and --fp8_scaled is likely the way to go in my experiments. You can inference with scaled too; it's implemented for wan_generate_video.py, and I've also ported it to WanVideoWrapper (and to HunyuanVideoWrapper, which it's a HUGE boon for), although I think kijai wants to do his own implementation. It took me a sec to get mine working right.
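In case a concrete command helps, here is a rough sketch of how those pieces fit together. Paths are placeholders; apart from --fp8_base, --fp8_scaled and the loraplus_lr_ratio network arg, everything else (script name, dim, the remaining flags) is an assumption of mine, and learning rate/epochs/precision are omitted, so check the musubi-tuner docs for the exact arguments:

```bash
# Sketch only: placeholder paths; only --fp8_base, --fp8_scaled and
# loraplus_lr_ratio come from the notes above, the rest is assumed.
accelerate launch --num_cpu_threads_per_process 1 wan_train_network.py \
  --dit /path/to/wan2.1_t2v_14B_fp16.safetensors \
  --dataset_config dataset.toml \
  --fp8_base --fp8_scaled \
  --network_module networks.lora_wan --network_dim 32 \
  --network_args "loraplus_lr_ratio=4" \
  --output_dir output --output_name my_wan_lora
```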
-
Hi, does anybody know what is wrong here? It gave me a very hard time. The run drops straight back to the prompt:

(hun_env) C:\ai\musubi-tuner>accelerate launch --num_cpu_threads_per_process 1 ^
INFO:dataset.image_video_dataset:bucket: (320, 320, 129), count: 1
epoch 1/1600
(hun_env) C:\ai\musubi-tuner>
-
Thank you for replying; it was driving me crazy, any help appreciated.

Here is my toml file:

[[datasets]]

Training command:

accelerate launch --num_cpu_threads_per_process 1 ^

Output:

(hun_env) C:\ai\musubi-tuner>accelerate launch --num_cpu_threads_per_process 1 ^
INFO:dataset.image_video_dataset:bucket: (320, 320, 129), count: 1
epoch 1/1600
(hun_env) C:\ai\musubi-tuner>
-
I use flow shift = 5.0 for Wan and 7.0 for Hunyuan. I haven't trained for I2V specifically yet.

Basic tips for saving VRAM/avoiding OOM:

Advanced:

Super Advanced (Linux only):
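For illustration, here is roughly how a flow shift of 5.0 would be passed to the trainer, along with the usual VRAM savers. The flag names (--timestep_sampling, --discrete_flow_shift, --gradient_checkpointing, --blocks_to_swap) are taken from my memory of the repo's options rather than quoted from this thread, so verify them against the docs:

```bash
# Sketch only: placeholder paths, flag names assumed from the trainer's options.
# --discrete_flow_shift sets the flow shift mentioned above;
# --blocks_to_swap and --gradient_checkpointing trade speed for VRAM.
accelerate launch --num_cpu_threads_per_process 1 wan_train_network.py \
  --dit /path/to/wan2.1_t2v_14B_fp16.safetensors \
  --dataset_config dataset.toml \
  --timestep_sampling shift --discrete_flow_shift 5.0 \
  --gradient_checkpointing --blocks_to_swap 20 \
  --network_module networks.lora_wan --network_dim 32 \
  --output_dir output --output_name my_wan_lora
```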
-
I just noticed that for samples during Wan (I2V) training, Musubi is using a discrete_flow_shift value of 14.5, which is way higher than the 5 I'm training with. I didn't change that value or set it specifically in the prompt.txt. Should I change this value to 5 in the hv_train_network.py file:

discrete_flow_shift = sample_parameter.get("discrete_flow_shift", 14.5)

Or is that value with 20 steps a fair representation of how the LoRA will behave later under normal Comfy workflows? It looks like it's set to 14.5 by default because of the HunyuanVideo paper. There's a mention of doing testing on the inference example page, but also this about HYV's paper: musubi-tuner/docs/sampling_during_training.md, line 101 in c8fea74.

Can someone please advise whether I should change this for the most accurate representation of where my training is at?
-
If you don't already know, the description of this model published on Civitai is very informative: https://civitai.com/models/1404755/studio-ghibli-style-wan21-t2v-14b
Many thanks to the author.
-
Some more things I can add now that I have more experience. With Wan I've had really good luck using an LR of 2e-5 with a LoraPlus ratio of 4, sticking to low resolutions of around 480x272 for video. Converting the video to 16fps beforehand DEFINITELY helps (I've been creating my datasets at 24fps for Hunyuan and then using a script that employs ffmpeg to convert the dataset to 16fps for training Wan).

I find Wan notably easier to train than Hunyuan (possibly because Wan uses CFG and Hunyuan has embedded guidance? I had more trouble training Flux with its embedded guidance too, until the CFG de-distilled version came out to train on!). It's worth noting that Wan LoRAs tend to work for both T2V and I2V mode; I think training on T2V is better if you want to go for both, while training on I2V produces better results for I2V but worse for T2V.

Let's see what else... still-shot samples (1 frame) are more useful when training motion/video than you'd think, so if you don't want to splurge for a video sample I highly recommend those instead. I think that's all I've got to add!
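For anyone wanting to do the same conversion, a minimal ffmpeg call covers it; this is a generic sketch with placeholder filenames, not my exact script:

```bash
# Re-time a 24fps clip to 16fps for Wan training (placeholder filenames).
# The fps filter drops frames to hit the target rate; -an strips audio,
# which the trainer doesn't need anyway.
ffmpeg -i input_24fps.mp4 -vf "fps=16" -c:v libx264 -crf 18 -an output_16fps.mp4
```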
-
Y'all, I've been having EXCEPTIONAL results training Wan with the following config: LR 2e-5 with a LoraPlus ratio of 4, and for my dataset.toml, videos preprocessed into ~5 second clips showing the subject of interest, at 16fps. This has produced such stellar results that I've gone back and retrained all my Wan models with these settings!

You can get away with just the low-res 480x272 bucket, but it will reflect in the quality of the learned material. Including the higher-res shorter clips allows showing the detail, while including the lower-res longer clips allows showing the progression of an action or scene.

Also, some of you have likely realized this already, but you can use width x height x frames, which I've dubbed "framepixels", as a guide for mentally computing VRAM usage. For instance, 480x272x65 is 8,486,400, or ~8.5 megaframepixels. That means any other bucket, such as 848x480x21, that is also ~8.5 megaframepixels will use a very similar amount of VRAM! If you know that you can do 480x272x65 with 20 blocks swapped, for instance, then you know you can also do the others with that same number of blocks, since they are the same amount of framepixels or less. Hope it helps!
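If you'd rather script the framepixel arithmetic than do it in your head, here is a tiny sketch (the bucket values are just the ones discussed above):

```bash
# Compute "framepixels" (width * height * frames) for candidate buckets.
# Buckets with similar totals should need a similar amount of VRAM / block swap.
for bucket in "480 272 65" "848 480 21"; do
  set -- $bucket
  printf '%sx%sx%s = %d framepixels\n' "$1" "$2" "$3" "$(( $1 * $2 * $3 ))"
done
```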
-
Hey, has anyone stumbled upon this one: if I train with network dim and network alpha set to the same value, my LoRA comes out all garbled up, just messy noise. If I use the default 32/1 it comes out fine. Any ideas?
-
What is the general consensus on captioning video for Wan2.2 training? Any writing style that is working well, or lessons learned? The Wan paper seems to suggest very simple captions, e.g. "the woman pours coffee". Sometimes my videos have more than one action that could be described, something like "while the camera pans down, the man fishes with dynamite". I'm wondering if I should be captioning video something similar to this: https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y
-
Let's kick off this board with some general "rules" and concepts to keep in mind when training WAN LoRAs.
This should cover things like:
I'll start:
You know how sometimes a t2v LoRA works surprisingly well for i2v tasks too — and sometimes it does absolutely nothing?
And then there are times where even an i2v LoRA doesn't work for i2v, even though you're sure you did everything right.
So... has anyone figured out the "rules" behind this?
Here’s my current default config:
dataset.toml
This config maxes out my 4090, and I'm pretty happy with it and its results!