WAN training - The rules of the trade #182
Replies: 10 comments 17 replies
-
One thing we know is that fp16 precision is generally better than bf16 for Wan: https://blog.comfy.org/p/updates-for-wan-21-and-hunyuan-image so you might consider switching to that. Also, you shouldn't need --fp8_t5 on a 4090; I don't need it on a 4070 Ti SUPER with 16GB (though you'd need the full-size model, of course).

Wan, like Hunyuan, also seems to benefit from LoraPlus ("--network_args loraplus_lr_ratio=X", where X is the multiplier for the LR on the lora_b blocks; 2 or 4 seems good for Wan). That's all I can tell you about Wan, I've only just begun with it! I'm glad we have a discussion place now!

Oh, one more thing: --fp8_scaled can be used in combination with --fp8_base to employ a scaling algorithm that kohya created/ported from one based on HunyuanVideo. It lets the model keep higher accuracy in fp8 precision by being more thoughtful about the conversion back and forth, basically. An fp16 base model plus --fp8_base and --fp8_scaled is likely the way to go in my experiments. You can inference with scaled too; it's implemented for wan_generate_video.py, and I've also ported it to WanVideoWrapper (and to HunyuanVideoWrapper, which it's a HUGE boon for), although I think kijai wants to do his own implementation. It took me a sec to get mine working right.
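In case a concrete command helps, here is a rough sketch of how those pieces fit together. Paths are placeholders; apart from --fp8_base, --fp8_scaled and the loraplus_lr_ratio network arg, everything else (script name, dim, the remaining flags) is an assumption of mine, and learning rate/epochs/precision are omitted, so check the musubi-tuner docs for the exact arguments:

```bash
# Sketch only: placeholder paths; only --fp8_base, --fp8_scaled and
# loraplus_lr_ratio come from the notes above, the rest is assumed.
accelerate launch --num_cpu_threads_per_process 1 wan_train_network.py \
  --dit /path/to/wan2.1_t2v_14B_fp16.safetensors \
  --dataset_config dataset.toml \
  --fp8_base --fp8_scaled \
  --network_module networks.lora_wan --network_dim 32 \
  --network_args "loraplus_lr_ratio=4" \
  --output_dir output --output_name my_wan_lora
```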
-
Hi, does anybody know what is wrong here? It gave me a very hard time. The run drops straight back to the prompt:

(hun_env) C:\ai\musubi-tuner>accelerate launch --num_cpu_threads_per_process 1 ^
INFO:dataset.image_video_dataset:bucket: (320, 320, 129), count: 1
epoch 1/1600
(hun_env) C:\ai\musubi-tuner>
-
Thank you for replying; it was driving me crazy, any help appreciated.

Here is my toml file:

[[datasets]]

Training command:

accelerate launch --num_cpu_threads_per_process 1 ^

Output:

(hun_env) C:\ai\musubi-tuner>accelerate launch --num_cpu_threads_per_process 1 ^
INFO:dataset.image_video_dataset:bucket: (320, 320, 129), count: 1
epoch 1/1600
(hun_env) C:\ai\musubi-tuner>
-
I use flow shift = 5.0 for Wan and 7.0 for Hunyuan. I haven't trained for I2V specifically yet.

Basic tips for saving VRAM/avoiding OOM:

Advanced:

Super Advanced (Linux only):
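For illustration, here is roughly how a flow shift of 5.0 would be passed to the trainer, along with the usual VRAM savers. The flag names (--timestep_sampling, --discrete_flow_shift, --gradient_checkpointing, --blocks_to_swap) are taken from my memory of the repo's options rather than quoted from this thread, so verify them against the docs:

```bash
# Sketch only: placeholder paths, flag names assumed from the trainer's options.
# --discrete_flow_shift sets the flow shift mentioned above;
# --blocks_to_swap and --gradient_checkpointing trade speed for VRAM.
accelerate launch --num_cpu_threads_per_process 1 wan_train_network.py \
  --dit /path/to/wan2.1_t2v_14B_fp16.safetensors \
  --dataset_config dataset.toml \
  --timestep_sampling shift --discrete_flow_shift 5.0 \
  --gradient_checkpointing --blocks_to_swap 20 \
  --network_module networks.lora_wan --network_dim 32 \
  --output_dir output --output_name my_wan_lora
```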
-
I just noticed that for samples during Wan (I2V) training, Musubi is using a discrete_flow_shift value of 14.5, which is way higher than the 5 I'm training with. I didn't change that value or set it specifically in the prompt.txt. Should I change this value to 5 in the hv_train_network.py file:

discrete_flow_shift = sample_parameter.get("discrete_flow_shift", 14.5)

Or is that value with 20 steps a fair representation of how the LoRA will behave later under normal Comfy workflows? It looks like it's set to 14.5 by default because of the HunyuanVideo paper. There's a mention of doing testing on the inference example page, but also this about HYV's paper: musubi-tuner/docs/sampling_during_training.md, line 101 in c8fea74.

Can someone please advise whether I should change this for the most accurate representation of where my training is at?
-
If you don't already know, the description of this model published on Civitai is very informative: https://civitai.com/models/1404755/studio-ghibli-style-wan21-t2v-14b
Many thanks to the author.
-
Some more things I can add now that I have more experience. With Wan I've had really good luck using an LR of 2e-5 with a LoraPlus ratio of 4, sticking to low resolutions of around 480x272 for video. Converting the video to 16fps beforehand DEFINITELY helps (I've been creating my datasets at 24fps for Hunyuan and then using a script that employs ffmpeg to convert the dataset to 16fps for training Wan).

I find Wan notably easier to train than Hunyuan (possibly because Wan uses CFG and Hunyuan has embedded guidance? I had more trouble training Flux with its embedded guidance too, until the CFG de-distilled version came out to train on!). It's worth noting that Wan LoRAs tend to work for both T2V and I2V mode; I think training on T2V is better if you want to go for both, while training on I2V produces better results for I2V but worse for T2V.

Let's see what else... still-shot samples (1 frame) are more useful when training motion/video than you'd think, so if you don't want to splurge for a video sample I highly recommend those instead. I think that's all I've got to add!
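For anyone wanting to do the same conversion, a minimal ffmpeg call covers it; this is a generic sketch with placeholder filenames, not my exact script:

```bash
# Re-time a 24fps clip to 16fps for Wan training (placeholder filenames).
# The fps filter drops frames to hit the target rate; -an strips audio,
# which the trainer doesn't need anyway.
ffmpeg -i input_24fps.mp4 -vf "fps=16" -c:v libx264 -crf 18 -an output_16fps.mp4
```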
-
Y'all, I've been having EXCEPTIONAL results training Wan with the following config: LR 2e-5 with a LoraPlus ratio of 4, and for my dataset.toml, videos preprocessed into ~5 second clips showing the subject of interest, at 16fps. This has produced such stellar results that I've gone back and retrained all my Wan models with these settings!

You can get away with just the low-res 480x272 bucket, but it will reflect in the quality of the learned material. Including the higher-res shorter clips allows showing the detail, while including the lower-res longer clips allows showing the progression of an action or scene.

Also, some of you have likely realized this already, but you can use width x height x frames, which I've dubbed "framepixels", as a guide for mentally computing VRAM usage. For instance, 480x272x65 is 8,486,400, or ~8.5 megaframepixels. That means any other bucket, such as 848x480x21, that is also ~8.5 megaframepixels will use a very similar amount of VRAM! If you know that you can do 480x272x65 with 20 blocks swapped, for instance, then you know you can also do the others with that same number of blocks, since they are the same amount of framepixels or less. Hope it helps!
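If you'd rather script the framepixel arithmetic than do it in your head, here is a tiny sketch (the bucket values are just the ones discussed above):

```bash
# Compute "framepixels" (width * height * frames) for candidate buckets.
# Buckets with similar totals should need a similar amount of VRAM / block swap.
for bucket in "480 272 65" "848 480 21"; do
  set -- $bucket
  printf '%sx%sx%s = %d framepixels\n' "$1" "$2" "$3" "$(( $1 * $2 * $3 ))"
done
```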
-
Hey, has anyone stumbled upon this one: if I train with network dim and network alpha set to the same value, my LoRA comes out all garbled up, just messy noise. If I use the default 32/1 it comes out fine. Any ideas?
-
What is the general consensus on captioning video for Wan2.2 training? Any writing style that is working well, or lessons learned? The Wan paper seems to suggest very simple captions, e.g. "the woman pours coffee". Sometimes my videos have more than one action that could be described, something like "while the camera pans down, the man fishes with dynamite". I'm wondering if I should be captioning video something similar to this: https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y
-
Let's kick off this board with some general "rules" and concepts to keep in mind when training WAN LoRAs.
This should cover things like:
I'll start:
You know how sometimes a t2v LoRA works surprisingly well for i2v tasks too — and sometimes it does absolutely nothing?
And then there are times where even an i2v LoRA doesn't work for i2v, even though you're sure you did everything right.
So... has anyone figured out the "rules" behind this?
Here’s my current default config:
dataset.toml
This config maxes out my 4090, and I'm pretty happy with it and its results!