Wan 2.2 Complete Training Tutorial - Text to Image, Text to Video, Image to Video, Windows & Cloud #352
FurkanGozukara
announced in
Tutorials
Wan 2.2 Complete Training Tutorial - Text to Image, Text to Video, Image to Video, Windows & Cloud
Full tutorial: https://www.youtube.com/watch?v=ocEkhAsPOs4
Wan 2.2 training is now so easy. I have done over 64 different unique Wan 2.2 trainings to prepare the very best working training configurations for you. The configurations work fully locally on GPUs with as little as 6 GB of VRAM, so you will be able to train your awesome Wan 2.2 image or video generation LoRAs on your Windows computer with ease. Moreover, I have shown how to train on the cloud platforms RunPod and Massed Compute, so even if you have no GPU, or you want faster training, you can train on the cloud at very low cost, fully privately.
📂 Resources & Links:
Download the One-Click Installer & Configs: [ https://www.patreon.com/posts/Musubi-Tuner-Trainer-App-Configs-137551634 ]
Qwen Image Model Training Tutorial (Prerequisite): [ https://youtu.be/DPX3eBTuO_Y ]
SwarmUI & ComfyUI Setup Guide for Windows: [ https://youtu.be/c3gEoAyL2IE ]
SwarmUI Installer and Model Downloader: [ https://www.patreon.com/posts/SwarmUI-Install-Download-Models-114517862 ]
ComfyUI Installer: [ https://www.patreon.com/posts/ComfyUI-Installers-105023709 ]
SwarmUI & ComfyUI Setup Guide for RunPod & Massed Compute: [ https://youtu.be/bBxgtVD3ek4 ]
Upload / Download Big Files Guide for RunPod & Massed Compute: [ https://youtu.be/X5WVZ0NMaTg ]
⏱️ Video Chapters:
00:00:00 Introduction to Wan 2.2 Training & Capabilities
00:00:56 Installing & Updating Musubi Tuner Locally
00:02:20 Explanation of Optimized Presets & Research Logic
00:04:00 Differences Between T2I, T2V, and I2V Configs
00:05:36 Extracting Files & Running Update Batch File
00:06:14 Downloading Wan 2.2 Training Models via Script
00:07:30 Loading Configs: Selecting GPU & VRAM Options
00:09:33 Using nvitop to Monitor RAM & VRAM Usage
00:10:28 Preparing Image Dataset & Trigger Words
00:11:17 Generating Dataset Config & Resolution Logic
00:12:55 Calculating Epochs & Checkpoint Save Frequency
00:13:40 Troubleshooting: Fixing Missing VAE Path Error
00:15:12 VRAM Cache Behavior & Training Speed Analysis
00:15:51 Trade-offs: Learning Rate vs Resolution vs Epochs
00:16:29 Installing SwarmUI & Updating ComfyUI Backend
00:18:13 Importing Latest Presets into SwarmUI
00:19:25 Downloading Inference Models via Script
00:20:33 Generating Images with Trained Low Noise LoRA
00:22:22 Upscaling Workflow for High-Fidelity Results
00:24:15 Increasing Base Resolution to 1280x1280
00:27:26 Text-to-Video Generation with Lightning LoRA
00:30:12 Image-to-Video Generation Workflow & Settings
00:31:35 Restarting Backend to Clear VRAM for Model Switching
00:33:45 Fixing RAM Crashes with Cache-None Argument
00:35:13 Dual Model (High & Low Noise) Training Setup
00:36:54 Preparing Hybrid Datasets (Images + Videos)
00:37:40 Manually Editing Dataset TOML for Resolution Control
00:39:53 Setting High Noise Model Paths for Dual Training
00:41:50 Optimization: Block Swap vs CPU Offload
00:43:10 Generating Video with Dual-Model Trained LoRA
00:45:35 Massed Compute: Server Setup & Coupon Code
00:47:00 Connecting via ThinLinc & File Transfer Methods
00:49:12 Massed Compute: Fast UV Installation & Downloads
00:50:27 Loading Configurations on Massed Compute
00:52:18 Troubleshooting: Fixing Config Version Error
00:53:20 Dual Model Training Speed Analysis on Cloud
00:55:40 RunPod: Selecting the Correct Template & GPU
00:57:45 RunPod: Uploading Files & Extracting Archive
00:58:38 RunPod: Terminal Installation & Model Downloads
01:00:26 RunPod: Correct Pathing Syntax & Backslash Fix
01:01:28 Setting Dataset Paths on RunPod
01:03:34 Installing nvitop on RunPod Terminal
01:03:54 Speed Hack: Disabling Numpy Memory Mapping
01:06:00 Terminating Instances & Final Remarks
Greetings everyone! Today I am presenting an epic tutorial on how to train the Wan 2.2 model to generate extremely high-quality, realistic images and videos. This is currently the most advanced model for generating life-like textures and details.
In this comprehensive guide, I cover everything you need to know to train Wan 2.2 on your local Windows computer, as well as on cloud platforms like RunPod and Massed Compute. We utilize the SECourses Musubi Tuner with fully optimized, 1-click presets designed for every GPU range (from 6GB to 192GB VRAM).
🚀 What You Will Learn in This Tutorial:
Wan 2.2 Text-to-Image Training: How to train the Low Noise model for massive detail and realism.
Wan 2.2 Text-to-Video Training: Mastering Dual Model training (Low Noise + High Noise) for superior video consistency.
Image-to-Video Workflow: How to use your trained LoRAs to animate static images.
Cloud Training: Step-by-step guides for Massed Compute (ultra-fast disk speeds) and RunPod.
Performance Optimization: Using FP8 scaling, Block Swapping, and CPU offloading to train on consumer GPUs.
Inference & Upscaling: Using SwarmUI and ComfyUI to generate and upscale content to 4K resolution.
💡 Key Features of Our Workflow:
Auto-Resume & Speed: New UV package installers for lightning-fast setup.
Presets for All GPUs: Configurations included for 6GB, 12GB, 24GB, 48GB, and 80GB+ cards.
Dataset Automation: Auto-resizing and captioning for both image and video datasets.
Video Transcription
00:00:00 Greetings everyone. Today I am going to show you Wan 2.2 trainings. This will be an epic
00:00:08 tutorial. I will cover so many topics, so check out the video description to see them all. So,
00:00:14 what am I going to show you today? I will show you how to train the Wan 2.2 model to generate
00:00:22 images like these ones. You see, these are extremely high quality and extremely realistic,
00:00:28 with a massive amount of details, textures, amazing quality, and amazing realism.
00:00:35 Wan 2.2 is currently the most realistic and most advanced model for generating images.
00:00:44 But I will not only show how to train images. I will also show how to generate videos,
00:00:50 how to train the video model, so many things. I will show how to train on your local computer
00:00:56 with our SECourses Musubi Tuner application. I have prepared configurations for every GPU,
00:01:04 starting from 6 GB GPUs to 192 GB GPUs. The configurations are all set. They are separated
00:01:12 into qualities, so you can pick your configuration according to your GPU and start training right
00:01:19 away and get the most amazing results without doing all the research that I did for you.
00:01:25 I have literally done over 64 separate trainings to find out the best workflow,
00:01:33 find out the best parameters, and prepare these presets. I have used a cloud machine with 8 B200
00:01:41 GPUs to complete these trainings and analyze them. Moreover, I have developed this Gradio based
00:01:46 application in which you will be able to load the configuration and start training right away.
00:01:52 Furthermore, I will show how to do training on RunPod. So if you don't have a powerful GPU,
00:01:57 you will be able to train on RunPod. And I will show how to do training on Massed Compute. Again,
00:02:02 if you don't have a powerful GPU, you will be able to train on Massed Compute. For example,
00:02:07 you see, with this GPU we are able to train in about 70 minutes, at only 2 dollars per hour.
00:02:14 Everything is ready. Some people were asking me how to train the Qwen image models on RunPod
00:02:19 and Massed Compute. This tutorial covers Wan 2.2 training. However,
00:02:25 all of the models that we support in SECourses Musubi Tuner work exactly the same. So training
00:02:31 the Qwen image models, Wan models, or future models that will hopefully get added, like Flux 2
00:02:37 or the Z Image base model, will be exactly the same as in this tutorial.
00:02:42 Moreover, I have updated all of our presets to generate the highest quality images and
00:02:49 videos in the lowest amount of time. So all the image generation and video
00:02:56 generation presets are updated and ready for you to use. To update these presets,
00:03:01 I have done so many different tests and compared all of their results for
00:03:08 generating videos and images. So many tests have been made. This tutorial is the product of massive
00:03:15 research: so much parameter testing, so much workflow testing, and all of it is ready
00:03:20 for you to use right away with 1 click installers. So you see, from bad videos to good videos.
00:03:26 This tutorial covers text to image training using only images,
00:03:30 text to video training using only images or images plus videos (videos are not necessary),
00:03:36 or image to video training. I have done so much testing, and training only text to video yields the
00:03:43 best results for text to video generation or image to video generation. The presets are separated,
00:03:49 so if you are not interested in video generation, you can use only the text to image
00:03:55 workflow. Still, it works for text to video generation or image to video generation as well.
00:04:01 So all these configs are interchangeable. There
00:04:04 are several tricks. If you want to generate the highest quality images,
00:04:08 you use the text to image configs. If you want to generate the highest quality videos,
00:04:13 you use the Wan 2.2 text to video configs. If you want to generate image to video, I still recommend to
00:04:20 use the text to video configs, because they work better than image to video. It is weird, I know,
00:04:25 but these are my findings. How do I know? Because I have done so many different tests, as you
00:04:32 are seeing. This is a week of research: a lot of comparisons, a lot of testing,
00:04:38 experimentation, and all the presets are ready for you to use right away with the highest quality.
00:04:45 So I will show everything: how to install, how to set up, how to start training on Windows,
00:04:51 RunPod, and Massed Compute. And everything is literally a 1 click install. Let's begin.
00:04:56 So as usual, I have prepared an amazing post where you will find everything to follow this
00:05:02 tutorial. However, before starting this tutorial, you need to watch this main tutorial. So open
00:05:08 this tutorial. This tutorial is our Qwen image models training tutorial. This is a masterpiece.
00:05:16 This tutorial covers so many different topics as you are seeing right now. So I recommend
00:05:22 you to watch this tutorial and learn how to use SECourses Kohya Musubi Tuner premium application.
00:05:30 So I will download the latest zip file. This zip file includes our latest installers and
00:05:36 configurations. I will move it into my existing installation folder. Then I will extract the
00:05:43 files here and I will overwrite existing files. Don't forget that you can also do a
00:05:48 fresh installation. After extraction, double click Windows install and update.bat file.
00:05:55 This will update the application to the latest version. Moreover, now we are using uv package
00:06:01 installer along with pip, therefore it is ultra fast. So you see my update already completed.
00:06:08 Then double click the Windows download training models.bat file. This file will ask you which
00:06:14 training models to download. For this tutorial, we need Wan 2.2 text to video training. If you want to
00:06:21 train Wan 2.2 image to video, then you need to download it with option 5. Since I have already
00:06:27 downloaded the models, it will just verify their hash values, and once the verification is completed,
00:06:33 we will be ready. This way we are ensuring that our models are 100 percent accurately
00:06:39 downloaded, so we will never have any weird bugs, issues, or problems. If the files were missing,
00:06:46 it would have downloaded them at the maximum speed your network connection supports.
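The verify-then-download behavior described here can be sketched in Python. This is a minimal illustration of the idea, not the installer's actual code; the function names and the chunked SHA-256 approach are my assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MB chunks so huge model files never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_download(path: Path, expected_sha256: str) -> bool:
    """(Re)download when the file is missing or its hash mismatches the expected value."""
    return not path.exists() or sha256_of(path) != expected_sha256
```

Checking the hash of an already downloaded file and skipping the download on a match is exactly why verified files "never have any weird bugs" from truncated or corrupted downloads.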
00:06:51 So now we are ready to begin training. I will double click Windows start up.bat file. This
00:06:58 will start the latest version of the SECourses Kohya Musubi Tuner premium application. You see
00:07:05 currently we are at the version 24. This is where you see the versioning. Then go to Wan
00:07:11 models training. So we are going to use this tab. Whenever you are going to reload a model,
00:07:16 I recommend refreshing, going to the correct tab, and then loading. Then click this icon
00:07:23 and go back to your installation folder, enter inside Wan 2.2 training configs. Now there are
00:07:29 3 configuration folders: Wan 2.2 image to video, Wan 2.2 text to image, and Wan 2.2 text to video.
00:07:37 So what are the differences between these 3? From the naming as you can see, Wan 2.2 image to video
00:07:44 trains the image to video model. However, there is only a single configuration because if you
00:07:49 want to only generate images, it is not the model that you need. This model is for only
00:07:54 generating videos. However, I don't recommend using this model for training even if you are going
00:07:59 to use image to video models because training with Wan 2.2 text to video works better than
00:08:06 training with Wan 2.2 image to video, even for image to video presets. Even for image
00:08:13 to video generation. So for video generation, I recommend to always use Wan 2.2 text to video,
00:08:20 and for only image generation, I recommend to use Wan 2.2 text to image.
00:08:24 So let's begin with Wan 2.2 text to image. Now you will see a bunch of configurations like this.
00:08:30 The first number is the GPU VRAM. So if you have a 6 GB GPU,
00:08:36 you pick that one. If you have a 24 GB GPU, then you pick that one. It also shows the
00:08:42 quality. Quality 1 is the maximum quality; as the number increases, the
00:08:48 quality gets worse. So quality 1 is better than quality 2, quality 2 is better than quality 3,
00:08:54 and so on. Since I have RTX 5090, I am going to use 32 GB VRAM configuration. However,
00:09:02 you will see that there are 2 versions: 32 GB of FP8 scaled and no FP8 scaled. The FP8
00:09:09 scaled has a big advantage: it is usually faster because it uses less VRAM. Moreover, it uses
00:09:16 much less RAM. So if your RAM is limited, then always go with the FP8 scaled version.
00:09:23 So let's open nvitop to see our GPU and RAM usage: run pip install nvitop, then nvitop. And these are
00:09:33 my current usages. So I have 60 GB of empty RAM memory. My first GPU is fully empty. Therefore,
00:09:42 I will pick the very best configuration inside Wan 2.2 trainings. Wan 2.2 text to image 32 GB.
00:09:50 And it is loaded. You can also click this icon to be sure it is loaded. Then it is the same as in
00:09:57 our Qwen image training tutorial. What I am going to change here is the training dataset.
00:10:02 For text to image training, you only need to use the images. For text to video training, if you
00:10:09 only use images, it still works perfectly, as I have shown in the beginning of the tutorial.
00:10:15 But you can also include videos. I will show that later. So I will pick my training dataset,
00:10:21 but first let's prepare it. This is exactly the same logic as with the Qwen images, so this is my folder.
00:10:28 I will delete the other files to show you what happens, as a quick recap. But watch the Qwen
00:10:35 image training tutorial. So this is my trigger word ohwx and this is my parent path. Don't worry,
00:10:42 it will automatically resize them. So this is it. I copy pasted, I can also pick from here.
00:10:48 The best working resolution is 960x960. How do I know? I have literally tested so
00:10:56 many different configurations. So when you see these configurations, 1024, 1280, 1328,
00:11:05 these are the resolutions that I have tested. And the best yielding resolution is 960x960.
00:11:12 Therefore, I recommend that. Then don't forget to click generate dataset configuration, and
00:11:17 it is ready. So when you return back to your images folder, you should see the txt files.
00:11:23 These are the captions, and when you open one of them, you will see that only the folder
00:11:27 name is used. If you want to use custom captions, you need to edit these files.
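For reference, the generated dataset configuration is a TOML file along these lines. This is a hedged sketch: the exact keys Musubi Tuner expects can differ by version, and the paths here are illustrative, so always check the file the app actually generates.

```toml
# Illustrative Musubi Tuner dataset config - keys and paths are assumptions,
# compare against the TOML your version of the app generates.
[general]
resolution = [960, 960]      # best-performing training resolution per the tutorial
caption_extension = ".txt"   # one caption file per image, e.g. containing "ohwx"
batch_size = 1

[[datasets]]
image_directory = "C:/train_data/ohwx"       # your images plus caption txt files
cache_directory = "C:/train_data/ohwx/cache" # latents are cached here on first run
num_repeats = 1
```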
00:11:32 Okay, our training dataset is ready. Now we need to set our output and other settings the same as in the
00:11:38 Qwen image training tutorial. So I will save them into my test trainings folder like this;
00:11:44 this is where the models will be saved. This is the LoRA model file name. Currently fine tuning is
00:11:50 not supported, so we can only train LoRAs. Okay, then the model settings. You need to pick the
00:11:55 files. For text to image training, we only use the low noise model. There are 2 versions of Wan 2.2:
00:12:02 high noise and low noise. High noise is the initial generation which roughly generates
00:12:08 the base of the image or video. The low noise is the details. For generating only images,
00:12:14 we are only using low noise. This is the higher quality one. So I will pick the correct models from
00:12:21 downloaded models training models Wan. So this is the low noise text to video model.
00:12:27 Like this. Then since other files are there, I will just copy paste the folder paths like this,
00:12:34 like this. But you can pick them individually. We are not going to use Clip Vision; it is not
00:12:38 used for Wan 2.2 training. And you see there is high noise. This is used for dual model training
00:12:44 which I will explain after this. So you don't need to change anything else here. Don't change them.
00:12:50 In the training settings, you need to set your training epochs as I have explained. Since I
00:12:55 have 28 images, I am going to use 200 epochs. If you have 100 images, then you can reduce this to
00:13:02 100. That is the logic. But for up to 50 images, I recommend training for at least 200 epochs. Moreover,
00:13:08 there is save every N epochs. So we are saving every 20 epochs. It will save 10 checkpoints that
00:13:15 I can compare later. Then don't forget to save your configuration. So let's save this as like
00:13:22 this. LoRA tutorial text to image TOML. Okay, I will save it. Then I will click start training.
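The epoch and checkpoint math used here can be sketched as follows. It assumes batch size 1 and one repeat per image, which are my assumptions, not something the video states.

```python
def training_plan(num_images: int, epochs: int, save_every_n_epochs: int,
                  batch_size: int = 1) -> tuple[int, int]:
    """Return (total optimizer steps, number of saved checkpoints)."""
    steps_per_epoch = -(-num_images // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    checkpoints = epochs // save_every_n_epochs
    return total_steps, checkpoints

# 28 images, 200 epochs, saving every 20 epochs -> 5600 steps, 10 checkpoints to compare
print(training_plan(28, 200, 20))
```

This is how 28 images at 200 epochs with "save every 20 epochs" yields the 10 comparable checkpoints mentioned in the video.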
00:13:30 Then follow the CMD window. If you get any errors, you need to report them to me.
00:13:36 So currently I have an error. Let's see where we have the error. Maybe we forgot something. Yes,
00:13:41 I have forgotten the VAE path. This can happen. So let's open all the panels like this and
00:13:49 search for VAE. Okay, the VAE is set here. Okay, I can see the error that I have made,
00:13:55 so I need to fix it. It will be like this. And this is also an error. Okay, it will be like this.
00:14:00 So pay attention to the model paths, then save. You can use the folder icon. Start training. And you see
00:14:07 we also support torch compile. It automatically uses your Visual Studio installation.
00:14:14 Therefore, it is super important for you to follow the requirements tutorial, which you
00:14:20 must. The requirements tutorial is listed here in the post. Follow this requirements
00:14:26 tutorial first. When you follow the Qwen image models training tutorial, you will know this already.
00:14:31 Okay, let's follow our training. So now it is loading model into the RAM memory,
00:14:36 then it will load into the GPU memory. If you don't have a sufficient amount of RAM,
00:14:42 training both models, both low and high noise, is harder. And it is not even mandatory. You can
00:14:48 also train only low noise model and even generate videos. It will be a little bit lower quality,
00:14:55 but it will work. Don't worry. So the training is about to start. Yes, training starting. Okay,
00:15:01 training started. It is not using my entire GPU memory. This is good. You need to have some free
00:15:07 memory. You will also see that your VRAM usage is not static like this. This is because of how
00:15:12 Kohya developed the Wan 2.2 model training. It clears the VRAM cache after every step because
00:15:19 it is designed to work with both models at the same time during training. So the speed
00:15:25 is around 5 seconds per iteration. It will get a little bit slower; this is how Kohya displays it currently.
00:15:31 But it is pretty decent. It is taking like 8 hours on my GPU. If I need faster training,
00:15:37 then I can reduce the resolution, or I can increase the learning rate and do fewer epochs. So
00:15:44 let's say you want 100 epochs; then you need to multiply your learning rate by 2.5 and reduce
00:15:51 the maximum epochs number. However, you will lose some quality. So the preset learning rate
00:15:58 and number of epochs are the best overall. It is a choice. It is a trade off. Either you need to
00:16:04 reduce the resolution to speed up or you need to lower the epoch count and increase your learning
00:16:12 rate. So this will train text to image LoRAs for us which you can use for text to video as well.
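The speed/quality trade-off described here, halve the epochs and multiply the learning rate by 2.5, can be written as a tiny helper. The 2.5 factor is the tutorial's rule of thumb for going from 200 to 100 epochs; treat any other combination as an untested assumption, and the example learning rate below is illustrative, not the preset's actual value.

```python
def faster_schedule(base_lr: float, base_epochs: int,
                    epoch_factor: float = 0.5, lr_factor: float = 2.5):
    """Tutorial rule of thumb: fewer epochs, higher learning rate (some quality loss)."""
    return base_lr * lr_factor, int(base_epochs * epoch_factor)

# e.g. a hypothetical preset at 200 epochs with lr 1e-4 becomes 100 epochs at lr 2.5e-4
new_lr, new_epochs = faster_schedule(1e-4, 200)
```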
00:16:18 How are we going to use the trained models? I will stop the training. So for using the
00:16:23 trained models, we are going to use SwarmUI. The SwarmUI link is here. So let's also proceed with
00:16:29 it. If you don't know how to use SwarmUI already, you need to follow this tutorial.
00:16:34 You see this is the tutorial link in the top of the post. When you open it,
00:16:40 copy the zip file there. So let's download the latest zip file from here. Then I will move it
00:16:46 into my existing SwarmUI installation like this. Copy the file there. Right click and extract files
00:16:54 here and overwrite all. Then as a next step, we will update our ComfyUI installation. So go to
00:17:00 ComfyUI link from here. Download the latest zip file. You see there are 2 versions, one of them
00:17:06 password protected. If your antivirus causes issues (this happens with Windows Defender),
00:17:13 download the password protected one, then move your ComfyUI zip file into your folder, or you can
00:17:19 make a fresh installation. As usual, extract the files and overwrite all. Then, to update or
00:17:27 install, run the Windows install or update comfyui.bat file. We have moved the installation to uv
00:17:34 packages. So now it is much faster than before. Both update and installation. So you see it is
00:17:40 all updated. Then return back to your SwarmUI. In the SwarmUI, we have Windows update swarmui. So
00:17:48 double click it to run. It will update SwarmUI and start your application. Everything was already
00:17:54 explained in other tutorials in more detail, so you should watch them. Okay, the SwarmUI update
00:18:00 succeeded and it started. You will also get this error. It is not important; just ignore it.
00:18:06 First of all, you need to get the latest presets. So you can use import, choose file, go back to
00:18:13 your SwarmUI installation and select the latest preset from here. Then overwrite existing presets.
00:18:20 Alternatively, while the application is running, you can use the preset delete import. This will
00:18:26 delete your existing presets and just refresh it. Then click refresh and everything is here. The
00:18:33 presets are important because I have updated the presets for video generation for Wan 2.2 and also
00:18:40 image generation. So use the latest presets. Now, how are we going to use our trained Wan 2.2 image
00:18:48 generation LoRAs. So first of all, quick tools reset params to default. Then from the presets,
00:18:53 the preset that you need to use is Wan 2.2 generate realistic images. Click here and
00:18:58 direct apply. This will let you generate 960x960 images. So let's type our prompt. Select
00:19:07 your aspect ratio whichever you want. For example, this one. And then you need to select your LoRA.
00:19:14 But there is one more thing. You need to have models downloaded for this preset to work. So
00:19:19 how are you going to download the correct models? You see the Windows start download models app.bat
00:19:25 file. This will start the model downloader latest version. We have upgraded this. Also
00:19:30 it is using uv package. So now it is even faster to start. Okay, the downloader started. So which
00:19:37 models do you need to download? You can download the complete image generation and editing bundle.
00:19:42 Or you can download the Wan 2.2 core 4 steps bundle. This is my recommended bundle. So
00:19:49 go to bottom and download all models inside this bundle. This way you will have all the
00:19:56 models that you need. Even the upscaler models that you need to use to get maximum quality.
00:20:03 You can follow the download process from the CMD window. So if I am missing any models,
00:20:09 they will get downloaded. If I already have some models, they will be verified and it will skip
00:20:14 the download. So currently it is downloading some of my missing models, but I have the
00:20:19 necessary models for image generation which is Wan 2.2 text to video low noise. So it is set.
00:20:26 Now I need to set my LoRA. So these are the LoRAs that it finds that are compatible with Wan 2.2. My
00:20:33 trained LoRA is inside here: Wan 2.2 low noise. This is my trained LoRA, the 200 epoch
00:20:42 training result. As in the Qwen image training tutorial, you should compare your checkpoints
00:20:47 with the grid system and verify which model, which checkpoint, is the best, like 20 epochs,
00:20:54 40 epochs, 60, 80. So this is 200 epochs. And let's generate 4 random images. Currently this
00:21:01 is using our ComfyUI installation. It was updated. It is using Sage Attention,
00:21:07 my first GPU. Everything was previously explained in other tutorials. Moreover,
00:21:12 you should follow your VRAM usage from nvitop to see what is happening. I have the sufficient
00:21:17 amount of RAM and VRAM. If your GPU is weak, like a 6 GB GPU, don't worry. Since
00:21:23 we are using the ComfyUI, it is automatically doing block swapping and also CPU offloading,
00:21:29 RAM offloading. Therefore, no matter what GPU you have, it will work. This is the beauty of
00:21:35 the SwarmUI with ComfyUI backend. So it is doing all the optimizations that you need.
00:21:41 The image generation in the base resolution will be pretty fast as you are seeing right now. Okay,
00:21:46 this was the first image. The face is not that great. We should do either face inpainting or
00:21:51 we are going to do upscaling which I will show in a moment. However, this took only 45 seconds.
00:21:58 You can also reduce the number of steps if you want. But this model is not a small model. This
00:22:04 is a big model. 14 billion parameters. This model is much more powerful than the Z Image
00:22:11 turbo model, in my opinion. Okay, this is another image. So generate a few images like this.
00:22:16 Okay. Let's say I liked this image. So I will click reuse parameters. It will set its seed.
00:22:22 Then I will go to presets and apply our upscale. You see, there is 2x upscale; direct apply.
00:22:30 And I will regenerate. Let's see the difference. Furthermore, you can change the base resolution
00:22:36 of this model, up to 1280x1280. At 1280x1280, this model is able to generate images. Sometimes
00:22:46 it may add some hallucinations to the right and left borders. But you can increase the
00:22:52 base resolution to 1280x1280. I will show how to do that after this. But let's
00:22:59 see the upscaled result. As a reminder, let me show you where to put your trained LoRAs.
00:23:04 They go inside SwarmUI, inside the models folder, inside the LoRA folder. This is where you need to put your
00:23:10 trained LoRAs. Inside LoRA folder. Okay, now it is upscaling. The upscaling will take more
00:23:15 time obviously because we are doubling the image resolution. So as I said, don't worry if you don't
00:23:22 have a powerful GPU, it will still work with automatic block swapping and CPU offloading.
00:23:28 Everything will be handled by the ComfyUI. No matter what GPU you have, it will work.
00:23:33 And here it is. So let's make a comparison. This was the base image. You see this was the
00:23:38 base image. Then this is the upscaled image. As you can see, it fixed the face. It added
00:23:44 a huge amount of detail and quality. There is a significant difference between the base image
00:23:50 and the generated upscaled image. Again, as I said, you can increase the base resolution even
00:23:58 up to actually 1920x1080. However, it may add some hallucinations. Let's try this to see
00:24:06 the difference. But in my experimentation what I did was I selected my model which is low noise.
00:24:14 So here low noise. So click this hamburger menu, edit metadata, and change this to 1280 1280. Okay,
00:24:23 like this. Save. Then select another model and select your model. It will update resolutions
00:24:29 accordingly. So when you select your aspect ratio now, it will set it accordingly to 1280 to 1280
00:24:37 base resolution. These are all optional. You can do whichever you want. And generate another image
00:24:43 with 1280 to 1280 total resolution. The aspect ratio is like this. But let's see also 1920 to
00:24:51 1080 result as well. Okay. Yeah, not bad. Not as good as upscale but not bad at all.
00:24:58 Now with this 1280 to 1280 base resolution, if I also do upscale which I did in the example images,
00:25:07 it will become much better. For example, let me show you from the history row and Wan 2.2.
00:25:15 I have shown in the beginning of the tutorial but let's see again. So you see these images are huge,
00:25:21 massive. If I open, this is 3360 to 1920 pixels. Therefore, it has huge amount of details. I mean
00:25:34 look at these details. So you can generate massive images as well with this strategy.
00:25:40 Set the base resolution and upscale. Okay, this is 1280 to 1280. Now I will reuse parameters
00:25:46 and enable my upscale preset one more time. It is here direct apply and generate. Now it will
00:25:53 upscale this image into twice resolution. Of course, it will take time. However,
00:25:58 it will be the maximum quality. So this image full HD was generated in 73 seconds. This image
00:26:07 1280 to 1280 was generated in 55 seconds. This big image was generated in 2 minute 25 seconds.
00:26:16 I mean these are the expected times because we are generating really high quality images,
00:26:21 really high resolution images with perfect accuracy. And this is literally the first
00:26:26 generation that we did. So I can generate more and pick the better ones and as you
00:26:30 do more generations you will understand how the model works, what are the basics.
00:26:35 Okay, so the upscale of the 1280x1280 image completed, and let's look at the difference
00:26:41 one more time. So this is upscaled and this was the base. So from this base,
00:26:47 we upscaled it into this masterpiece. The resolution is just amazing. The details,
00:26:53 the realism, everything is mind blowing compared to the other models. So can you use this model,
00:27:01 this low noise model trained only on images, for video generation? Yes. It will be
00:27:07 a little bit lower quality, but yes. Let me demonstrate that. So I will copy this prompt.
00:27:13 You can also use this for image to video, don't worry. Everything is the same. So let's refresh.
00:27:19 Reset params to default. And there are several options that you can use. For the highest quality,
00:27:26 you need to use Wan 2.2 high quality text to video 20 steps. This will take some time. On my GPU it
00:27:34 takes like 6 to 7 minutes. Or you can use the lightning LoRA based one. You see Wan 2.2 text to
00:27:41 video. So let's see the lightning LoRA first, then let's see the other one. Then let's proceed to
00:27:47 dual model training. So direct apply. You see it is going to set the low and high noise LoRAs like
00:27:54 this automatically for you. Copy paste your prompt and select your trained LoRA. So this was the low
00:28:00 noise LoRA. Then generate. This will be pretty fast because it is only total 4 steps. And again,
00:28:09 for this preset to work flawlessly, you need to download this bundle. Otherwise you will have issues
00:28:16 finding and setting the models. I am doing everything automatically for you.
00:28:21 Okay, so the generation started. First it is starting to use the high noise Wan 2.2 model.
00:28:28 Then it will use the low noise model. However, we should change the prompt to start with "a video
00:28:35 of", so it will be a little bit better. When you switch presets like this, since your base model
00:28:41 will change, it will take some time to load and reload models, but everything is handled
00:28:46 automatically and properly by ComfyUI. So you shouldn't have any problems. And as you can see, it is able
00:28:54 to utilize my GPU at 100 percent. It is drawing 575 watts, so it is maximum power
00:29:05 usage. Okay, it is starting to generate. Since we trained our LoRA only on the low noise model, this
00:29:12 will give a little bit lower quality than training on dual models, which I will show. However,
00:29:18 you can use this LoRA on image to video presets as well. I will show that too after this. Okay,
00:29:24 so the video has been generated. Let's see the video. Yes, I can see the resemblance
00:29:30 and accuracy. It is pretty good as you can see. This is the base video. If I upscale it,
00:29:35 it would be better, but I can certainly see the resemblance, especially as it gets closer. Let me
00:29:42 show you the final frame, perhaps like this. And how long did it take? It only took 111
00:29:49 seconds. So it takes less than 2 minutes. If I want better quality, then I need to switch to this
00:29:56 preset. This will do real 20 steps with real CFG scale and it will be much higher quality.
00:30:04 What about using this on image to video? Can I do that? Yes. So I will do reset params to default.
00:30:12 Go to presets. With image to video again we have 2 quality presets. You can use this high quality or
00:30:20 you can use Wan 2.2 image to video 4 steps. This is also really good quality. So direct apply. Then
00:30:27 let's select our prompt like this. Of course, I don't need to copy this part this time. Just copy paste
00:30:33 the prompt. Then as an image, for example, let's use this image. So I will go to init image, choose file,
00:30:41 then choose this, then for resolution use closest aspect ratio. And the preset was not selected.
00:30:48 Yes, I need to select the preset so refresh. Reset params to default. Then refresh models to be sure
00:30:55 you have everything. Then presets and let's go to presets. Okay this one direct apply. Yes,
00:31:02 you see it has selected the accurate model. This is important. Apply. Then let's go to init image
00:31:08 choose file select our photo closest aspect ratio. Okay. Now I also need to select my LoRA. So my
00:31:15 LoRA was, let's see, this one. This is only trained with the images on low noise, and generate. This is
00:31:24 also 4 steps. So therefore it will be pretty good. However, since we are switching a lot of models,
00:31:30 it may leave some leftover memory. Therefore what I am going to do is I will cancel this first.
00:31:35 Then server, backend and restart all backends. This will refresh my memory usage like this.
00:31:43 You see it cleared everything. If you are going to do a lot of model switching, you can restart your
00:31:48 backend to clear your VRAM and RAM memory and then generate. Now it should work better. Otherwise you
00:31:56 need to trust the ComfyUI RAM management. So I recommend restarting backends when you are going
00:32:01 to do big model switching like this. Because we switched from LoRA text to video to image
00:32:08 to video. Okay, it is nice. It is loading into RAM then starting generation. This should also take
00:32:15 less than 2 minutes, especially on the second and third generations. By the way, this was an image
00:32:21 prompt. Therefore for video prompting you need to do better. But I am just showing an example.
00:32:28 So this prompt is not made for video generation. This was made for image generation. Therefore it
00:32:35 may not be perfect for video generation. However, we should still see the quality of the
00:32:42 video. I have tested that training a text to video model LoRA improves image to video as
00:32:49 well. Training an image to video LoRA doesn't improve it that well. It is weird, I know. Maybe it is because
00:32:55 I have trained with only static images, not with videos. But this is really improving
00:33:00 image to video as well. So you can train an image LoRA on the text to video model with just images and
00:33:07 it will improve your image to video generations as well. Okay, it failed for some reason maybe.
00:33:14 Let's see the logs. No, I don't see any error. Why is it showing an error? Okay, I think I
00:33:21 was out of RAM memory, so ComfyUI probably crashed because I am already using 2 GPUs and a lot of
00:33:29 RAM memory on my computer with all the other stuff that is open. Okay, it crashed again due to out of
00:33:37 RAM memory. Therefore, I have added this argument: --cache-none. When you add this into your backend,
00:33:45 it will cache none of the models. So this is really useful when you are working with Wan 2.2
00:33:52 low and high noise models together at the same time if you don't have sufficient amount of RAM
00:33:58 memory, especially system RAM memory. So this way, it will use a model, generate,
00:34:04 then offload it, completely delete it, and load the next model. This will minimize your RAM usage,
00:34:11 also VRAM usage, with the delay of loading models from the disk. But since my disks are fast,
00:34:18 it is not an issue. So it fixed the issue. There is no more RAM leakage
00:34:23 or RAM accumulation and we are about to get the video ready. It is pretty fast.
00:34:30 Okay, we got the video. So this was from image to video. As the subject gets more distant from the
00:34:37 camera, the resemblance decreases, but this was the model trained only on static images,
00:34:44 not the dual (auto) model. However, it is decent and the generation took only 2.4 minutes. So,
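The `--cache-none` switch mentioned above is a ComfyUI launch argument, so it has to reach ComfyUI's command line. A sketch of the two usual places to put it (the SwarmUI field name and the paths here are from memory, so double-check them in your own install):

```
# Launching ComfyUI directly, with model caching disabled:
#   python main.py --cache-none
#
# Or, when ComfyUI runs as a SwarmUI self-starting backend:
#   Server -> Backends -> ComfyUI Self-Starting -> ExtraArgs: --cache-none
```

With this flag ComfyUI keeps no model cached after use: each model is loaded, run, and fully released before the next one loads, trading extra disk-load time for minimal RAM and VRAM usage.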
00:34:52 how are we going to train on dual models? Not only with the single model. If you want
00:34:58 to generate videos, this is what you should do. So I will just disable my backend to minimize my
00:35:04 VRAM usage. Then go back to your SECourses Musubi Tuner. So for dual model training,
00:35:13 we are just going to change our preset. So I will open a new tab here. Go to 1 Models Training,
00:35:20 open folder. Go back to Wan 2.2 training configurations. So we are going to use
00:35:24 text to video configs. These configurations train both high noise and low noise models at the same
00:35:32 time. However, this will use more VRAM and this will use more RAM memory since you need to have
00:35:38 both of the models loaded into your RAM memory. Therefore, be careful. I recommend to begin with
00:35:45 fp8 scaled versions. Then once you verify it is working, you can move to non-fp8. So let's
00:35:52 begin with fp8 scaled. Again, always save your configurations like this. Then we need to set
00:36:00 the output directory. Don't forget that. So I can just copy paste the directories here.
00:36:07 Training dataset. Again, I am only using static images. If you use videos, the videos will use
00:36:13 more VRAM memory. Therefore, be very careful. If you use videos, then you need to have more
00:36:20 VRAM and RAM memory. You need to do more block swapping. You may need to reduce your
00:36:25 training resolution. Everything changes. So first, train with only static images, verify results,
00:36:33 get some results, get some test done, then you can include videos into your training dataset. You can
00:36:39 have multiple subfolders for different training datasets and for each dataset have different
00:36:47 resolutions. So let me demonstrate an example. So what I did is I generated another folder like
00:36:54 this, 1_ohwx_video. However, I will need to set different captions, not automatic ones.
00:37:00 So I will copy paste it here. As I said, first I recommend to train with only static images then
00:37:06 include videos. But I will show how to include videos. And generate dataset configuration. Now,
00:37:13 it will generate datasets like this. You need to edit this TOML file. Why? Because you will see
00:37:22 that it is using the same resolution for both images and videos. What does this mean? This
00:37:28 means that if your images are high resolution, the videos will be too high resolution. Either
00:37:32 you need to auto downscale your videos to lower resolution or you need to edit this TOML file.
00:37:40 So I will open this TOML file. It is saved inside this folder. You need to do this manually.
00:37:47 So this is the generated TOML file. You can see that there are 2 different datasets. The first
00:37:53 one is the image dataset and the second one is the video dataset. And in the video dataset,
00:37:58 it is going to use some frame extraction method and it is going to use some methods like frame
00:38:03 extraction head, target frames, number of repeats, max frames. So I am going to set resolution here
00:38:09 as well and change the resolution to like 480 by 480, so that it will use this resolution for
00:38:17 this dataset. You can also change the resolution of the image dataset from here. So you can have
00:38:23 different dataset resolutions. And if you are wondering what these other options for
00:38:28 video datasets are, they are all written here in detail. When you scroll down, you will see
00:38:33 dataset preparation details. Open this and read here. However, as I said multiple times now,
00:38:41 train with only images then verify your results then do training with video dataset. Moreover,
00:38:49 now since we have 2 different folders, you will see that it generated the captions like this
00:38:54 for the video dataset. So you need to change the caption of the video as well. I didn't test this.
00:39:01 Probably this will work best with just trigger word or you can describe the video action as
00:39:07 well in these captions. However, for using the static images, I recommend only trigger word
00:39:13 as a caption. So I will delete this and regenerate the dataset configuration. Yes. Now it will
00:39:20 use only my images dataset. It shows 56, double the count, but it is okay. There are actually 28 images.
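For reference, the two-dataset TOML described above, with the per-dataset resolution override for the video folder, looks roughly like this. The directory paths are placeholders and the exact key set should be checked against the Musubi Tuner dataset preparation documentation the video points to:

```toml
[general]
resolution = [1024, 1024]     # default resolution for datasets that don't override it
caption_extension = ".txt"
batch_size = 1
enable_bucket = true

# Image dataset: uses the general resolution above
[[datasets]]
image_directory = "C:/datasets/1_ohwx"
num_repeats = 1

# Video dataset: overrides resolution so videos train at 480x480
[[datasets]]
video_directory = "C:/datasets/1_ohwx_video"
resolution = [480, 480]
frame_extraction = "head"     # take frames from the start of each clip
target_frames = [1, 25, 45]   # frame counts to extract per video
num_repeats = 1
```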
00:39:27 Okay, we did set the dataset. This is how you prepare your video training
00:39:32 datasets. Then we set the other things same as the previous training. So in the models,
00:39:39 I will just copy paste the paths here, here, here. And additionally, for this model,
00:39:46 you need to set the high noise model as well. So it is here. So I will set it like this. You see,
00:39:53 high noise. And that's it. Now I will save my configuration and click training. It should
00:40:01 start and work right away. However, you need to have more RAM memory for this to work and it
00:40:08 will use a little bit more VRAM memory as well. So let's see what happens. The training window
00:40:15 is here. You will see that it will load both low noise and high noise models into your system.
00:40:24 You should see them in the CMD window. So first it is loading the low noise then it is loading
00:40:30 the high noise. Low noise is for details, high noise is for the initial generation,
00:40:35 the base structure. We train only low noise for image generation. However, if you want to generate
00:40:42 videos, you should train on both of them. You can also train them individually one by one. However,
00:40:48 its quality is lower than training this dual model configuration. So you can load the image
00:40:54 configuration, image training configuration, change low noise model to high noise model. It
00:40:59 will work exactly the same and train the high noise model. If your system is not able to train
00:41:05 both models at the same time, you can follow that strategy. But what I recommend is renting
00:41:11 a machine from RunPod or Massed Compute, which I will show, and training and using it there.
00:41:16 Okay, it started. As you can see, both of the models are running at the same time now. You will
00:41:23 see the VRAM usage fluctuate as it switches models; it is at like 87 percent. Okay, you see
00:41:31 the RAM usage increased because now it is loading the other model and it is switching between them.
00:41:37 Okay, you see. It is working nice. It is using almost my entire RAM memory right now so I should
00:41:43 decrease my RAM memory usage. I can also reduce my block swap count because I still have some VRAM.
00:41:50 So where do you increase or decrease the block swap? So let's open all panels, search for swap and
00:41:57 you will see that there is block swap. So you can change block swaps. Currently since it is fitting
00:42:04 into my VRAM entirely with fp8 scaled, I am not doing any block swap. But if it doesn't fit in
00:42:10 your case, you need to increase this like 10, 20, 15. Moreover, I can use offload inactive DiT to
00:42:18 CPU. This requires more RAM memory and it is not compatible with blocks to swap. So when you do
00:42:24 block swap, you cannot use this unfortunately. Therefore, it is the fastest training that I
00:42:30 can get right now. Its speed is very good as you can see. It is also faster than training
00:42:34 only the image model right now because I am doing fp8 scaled training. If I were not doing fp8 scaling,
00:42:41 I would need to do block swapping, and block swapping reduces your speed. So this dual model
00:42:47 training takes like 6 hours on my GPU, maybe even faster. This is how you train dual model.
00:42:54 And how do you use the dual model? Exactly the same as the single image training. I only change my LoRA to
00:43:01 use it. So let's enable back our backend and let's open a new tab. Copy our prompt. Quick tools,
00:43:10 reset params to default presets. And let's use this high quality preset this time. Direct apply.
00:43:16 Type our prompt like this. Okay. Make sure to not include the LoRAs in your prompt when you
00:43:22 copy paste. Okay, it is like this. And I need to select my LoRA. So my LoRA is this one. High noise
00:43:30 low noise auto. So this was trained exactly as I have just shown you. This is the highest quality
00:43:36 and let's generate. So this generation will take significantly more time compared to the
00:43:43 fast generation, which is Wan 2.2 text to video 8 steps. Actually this is 4 steps. I forgot to
00:43:50 change its name. Yes, this is 4 steps. So it will take more time but this is the highest quality
00:43:57 that you are going to get with video generation, with text to video generation. For image to video,
00:44:02 it doesn't make that much of a difference. You can perfectly use this preset. But for text to video
00:44:09 generation, it still makes a really big difference because none of the recent
00:44:15 lightning LoRAs for text to video are as good as the image to video lightning LoRAs. I am using the latest LoRA which
00:44:22 was published only 3 days ago. You see, they are 17 December 2025 LoRAs, but it is as I say.
00:44:30 So the generation started. As you can see, it should be done in like 6 minutes if I remember
00:44:37 correctly. We will see. Okay, so we got our video generated. It is not the best video. The quality
00:44:43 is good but the prompt is bad. So therefore I need to edit my prompt and generate a new one.
00:44:50 Prompting matters a lot. So play with your prompts to get the best results. Remember, all the
00:44:58 generations will be inside your SwarmUI, inside output, inside local, inside raw folder. So you
00:45:06 will see all of your generations here. They will be saved by date, in the latest date folder. You will
00:45:13 see and find your images and videos in this folder if you need them later. Furthermore,
00:45:19 you will see them in the history tab as well from the raw and with the folders like this.
00:45:25 So as a next step, I will show how to train on Massed Compute then I will show on RunPod. So
00:45:32 for Massed Compute, enter inside the extracted zip file folder and open Massed Compute instructions
00:45:38 read txt. Reading this file and watching the Windows tutorial part are
00:45:43 mandatory. Please use this link to register. I appreciate that very much. Then go to billing
00:45:50 and set up some credits. They accept crypto payments as well. Then go to deploy. In here,
00:45:57 I recommend you to use RTX Pro 6000 Blackwell GPU. This is working really great. From the category,
00:46:04 select creator, select SECourses. Then enter your coupon. This is important. We are going
00:46:10 to use Category Creator, Image SECourses and discount coupon SECourses. You see,
00:46:16 this very powerful GPU is 1.8 dollars per hour. When I apply my coupon,
00:46:23 it becomes 1.35 dollars. If you want even more powerful GPU, faster training,
00:46:29 select H200 NVL. This is even more powerful than RTX Pro 6000. It is 2.6 dollars. With our coupon,
00:46:38 it becomes 1.95 dollars. This is the fastest GPU. So you can use either of them. Let's use H200 for
00:46:46 this demonstration, for this tutorial. Deploy. Now we need to wait for the machine to initialize.
00:46:53 To connect our machine, we are going to use ThinLinc Client. Download it. If you
00:46:57 followed other tutorials, you already know. After downloading, just next, next, next. Then you need
00:47:03 to set your shared folder from local devices. You see clipboard synchronization and drives.
00:47:09 And in here I have a shared folder with read and write permission. So I can copy my small
00:47:14 files and I can connect. For big files, you need to watch this tutorial. This will show
00:47:21 you how to upload and download big files to Hugging Face. You can also use Google Drive
00:47:26 or OneDrive. So other cloud services as well. You know how to use them. But you cannot use ThinLinc
00:47:33 Client to transfer big files. So let's just wait for initialization to be complete. Okay,
00:47:39 now it is running. You may refresh the page from time to time to update, but it was auto updated for me.
00:47:45 Then click here and copy login URL. Open ThinLinc Client. Copy username for first time users and
00:47:52 copy password and connect. Then continue. Then click start. It will start the machine and
00:48:00 connect from the ThinLinc Client. Remember this is running on a remote machine, not on my machine.
00:48:05 Now I need to transfer my file, my installer file. You can use Patreon login from this browser or you
00:48:13 can use the shared folder which I prefer. So I will copy my file into my shared folder. You can
00:48:21 also put your training images here. Go back to home from here. Go to Thin Drives. This is your
00:48:26 shared folder. Do not run anything inside here. Run everything from downloads folder. Whatever it
00:48:33 is. It doesn't matter. The transfer speeds are not great with ThinLinc Client. Therefore you should
00:48:38 use the other method that I have just shown you, the link. You see this link for big files. Okay,
00:48:44 I will copy paste my images. Drag and drop or copy paste, both work. And our latest SECourses Musubi
00:48:53 Tuner Premium Installer. So it is here. Drag and drop into downloads folder. Wait for files to be
00:48:59 transferred. You see it is being transferred right now. Then go to downloads folder. Extract the
00:49:05 Musubi Tuner. Extract here. And I will open the Massed Compute instructions. Copy this command.
00:49:12 While I am inside this folder, open in terminal. This is important. And paste. You need to be
00:49:18 in the accurate folder. Then it will start the installation. You will see that installation is
00:49:23 lightning fast. Actually let's see in real time. So it is going to use uv package installation.
00:49:30 I mean look at the speed. Look at the installation speed. It will take like
00:49:33 30 seconds. Maybe 1 minute maximum. So it is almost done. Also we need to download training
00:49:40 models. So in here you see there is this command. python3 download_train_models. Again while I am
00:49:46 inside this folder, open in terminal. Paste. And it will ask you which models to download. So let's
00:49:53 download text to video option 4. The download will begin. The downloader is also optimized. It will
00:50:00 be pretty fast. You see 500 megabytes per second. So it is 4 gigabits per second. 4 gigabits. So it
00:50:08 will be done almost in no time. The installation also completed and the trainer started. You see
00:50:15 installation took like 60 seconds. The rest is exactly the same as in the Windows tutorial.
00:50:22 You will open your configuration. And there is one advantage because configuration is
00:50:27 set for Massed Compute folders directly. So I will download the Wan 2.2 training configs,
00:50:34 text to video configs. Whether you want to use dual GPU or whichever GPU. So this one
00:50:41 is using 140 gigabytes of GPU memory. Let's see our VRAM: nvidia-smi. Okay, we have 140 gigabytes.
00:50:49 Therefore I am going to use this config. I hope it fits. You see it is maximum. Then
00:51:00 I need to set my output folder and other things. I think the output folder is already set. So I need
00:51:00 to set my training images or videos. So extract here. Okay. So copy this path and
00:51:08 copy paste here. Then don't forget to click generate dataset configuration. And let's see
00:51:14 if the model download is completed. Yes almost completed. You see 3 over 4. It is verifying.
00:51:22 So literally with like a 5 minute setup, you can begin training on Massed Compute. Like 5
00:51:29 minutes. Maybe 4 minutes. Once you do this 1 or 2 times, you will get used to it. You will be
00:51:35 able to immediately set up a machine and start training right away. Okay, the high noise model is getting
00:51:42 verified. I am also verifying files. Therefore you will never have any issues with downloaded
00:51:47 models. This is super important because sometimes models get corrupted and that causes a lot of
00:51:52 issues. Okay, it is downloading the last model. There are 50 seconds remaining. Okay, so all
00:51:58 the downloads have been completed. Now I did set the training dataset and start training.
00:52:05 It should work if the folder paths are accurate, if I remember, but we probably need to change one
00:52:11 more thing. Not ready yet. So in the 1 Models Settings, yes. You see this is version 23, so I
00:52:18 need to make this 24. You also need to change that. The output folder should be okay. Yes,
00:52:26 it is okay. And start training. Now let's see. Oh by the way if this was a 2 model training,
00:52:33 yes it was. I need to also change this. So dual model. Okay. Now it will give another
00:52:38 error and I will click training again. So first it is caching the training dataset. Okay it
00:52:45 is loading. Okay I will stop and click start again. Yes now it should work perfectly fine.
00:52:51 Let's follow both of them. So let's zoom in. Another zoom. Okay. Let's see in real time. So
00:52:59 starting training. Loading models. You see, loading the models is amazingly fast. Blazing fast on Massed
00:53:06 Compute. This is why I like Massed Compute over RunPod. The disk speed on Massed Compute is like
00:53:12 10x, maybe 20x faster than the RunPod. Okay 77 gigabytes. Okay now it is training the low noise
00:53:20 then it will begin the high noise. Okay training is starting. 4 steps done on the first model then
00:53:27 it will load the second model. The logic of dual training is that it switches between the models. Okay,
00:53:33 why is it taking time? I think the first step is taking time. Okay. Now it has begun. Yes. So the
00:53:43 training started. We will see the training speed gets better. Okay it is using 80 gigabytes right
00:53:50 now. But the training speed is getting better. You see 2.2 second. It will get faster than 1 IT
00:53:57 per second. But it is still switching models. So what we can do, we can disable this and
00:54:04 see if it works. So let's disable this to get even faster. Did we load the wrong model?
00:54:10 Yeah this should be disabled. I will update this configuration for you so you won't have
00:54:15 this issue. Now it should be even faster than before because it will not offload each model.
00:54:21 It will load both of the models into the GPU. Let's see the speed. Of course you can enable
00:54:27 fp8 scaled. It will take even less VRAM so you can load it into like 80 gigabyte GPUs as well.
00:54:35 Okay, training started. You see it is 1.5 IT per second, using only 106 gigabytes of
00:54:45 GPU memory. I will update the configuration so you won't have this issue. But this is
00:54:49 the training speed. It is going to take only 67 minutes. So with like
00:54:54 2 dollars, you will be able to train the very best Wan 2.2 model. And it's a dual model training,
00:55:02 not single model. You are seeing the speed right now. 1.35 IT per second. It is amazing. And it
00:55:09 will take only 68 minutes and we are doing 5600 steps. Not a low number of steps. So it is working
00:55:17 just amazing. Yes, I have updated the config as 140 gigabytes and it is using 107 gigabytes of GPU memory.
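As a sanity check on those numbers: 5600 steps is exactly what you get from 28 images at 1 repeat over 200 epochs at batch size 1 (the epoch count here is my assumption to make the arithmetic concrete, not a value stated in the video), and at 1.35 iterations per second that is roughly 69 minutes, matching the on-screen estimate:

```python
def total_steps(num_images: int, num_repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Total optimizer steps: (images * repeats / batch size) per epoch, times epochs."""
    steps_per_epoch = (num_images * num_repeats) // batch_size
    return steps_per_epoch * epochs

steps = total_steps(28, 1, 200)   # 28 images, 1 repeat, assumed 200 epochs
minutes = steps / 1.35 / 60       # at 1.35 iterations per second
print(steps, round(minutes))      # 5600 steps, about 69 minutes
```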
00:55:25 So now I will show how to train on RunPod. The rest on Massed Compute is the same. You just
00:55:32 need to download your LoRA files and use on your computer or you can use in Massed
00:55:37 Compute as well. If you wonder how to use on Massed Compute, on our channel SECourses,
00:55:44 we have a video for this. The video is this one. Generate AI art 10x faster. The ultimate SwarmUI
00:55:50 and ComfyUI cloud tutorial. You can watch this tutorial to learn how to use SwarmUI on Massed
00:55:57 Compute or on RunPod. This tutorial covers both of them. So for training on RunPod, we have RunPod
00:56:05 instructions. Open that file. You always need to follow this file for RunPod. Please register on
00:56:12 RunPod from this link. After registering and logging in, go to billing and set up some credits.
00:56:20 Then once you have the credits, go to pods. If you want to use permanent storage, I have
00:56:26 a tutorial for that. It is here. If you want to learn how to upload and download big files,
00:56:31 the tutorial is here. And if you want to learn RunPod usage, there is another tutorial here.
00:56:36 So I will show on RunPod with an RTX 5090, one of the most commonly used ones. But you can use bigger
00:56:45 GPUs like B200 for even faster training. So this is the disk type and this is the RAM GPU filters
00:56:51 that I use. Let's select RTX 5090. Now a lot of people are getting confused here. You are not
00:56:58 going to use this template. You are always going to use the template that I write in this file.
00:57:05 Always. Because my installers are optimized for whatever it is written here. Otherwise you will
00:57:10 get errors. So change template. Select PyTorch 2.2.0. This is the official template. Click edit
00:57:18 and increase the disk size like 200. It is up to you. If you want to save more checkpoints, you
00:57:24 need to increase this. Set overrides and deploy on demand. Now we need to wait for the machine to become
00:57:32 ready but it should be very fast because this is official template. This is very lightweight
00:57:37 template. So let's see. Click here from time to time. Okay, it is ready. Jupyter Lab. If this doesn't
00:57:43 open, you need to refresh and click several times. If it still doesn't work, get a new machine.
00:57:49 Then I will upload my downloaded zip file. Wait for the upload to complete. You can see it
00:57:54 at the bottom. Then right click and extract archive. Click refresh. Then open the RunPod instructions
00:58:00 read txt file. Copy this command. Open a new terminal from here. And terminal. Make sure
00:58:07 that you are in the workspace and copy paste like this. Then it will start the installation.
00:58:12 So run the commands inside workspace folder. Open new terminal. Make sure that you are in the
00:58:18 workspace wherever you have extracted the files. The installation will be extremely fast compared
00:58:25 to before. Why? Because we are using uv package. While installation is continuing, let's download
00:58:31 the model because model download may be slow. Open another terminal. Copy paste. Select the
00:58:38 model text to video. It will start downloading. This is also super optimized. Let's see. You
00:58:44 see it is downloading with like 300 megabytes per second. Pretty good. The uv installation is also
00:58:50 ultra fast. Before it was taking sometimes 10 minutes, 20 minutes, 30 minutes to install. Now
00:58:56 it will take only a few minutes to install. This is our latest optimization. I will slowly move all my
00:59:03 installers to uv installation, hopefully. So you will always install my applications much
00:59:10 faster now compared to before. Also, the models are downloading in the meantime. So after
00:59:15 installation it will auto start the application. This warning is not important. You will get
00:59:20 this in all installers. The model merging and verification on RunPod is sadly slow compared
00:59:27 to Massed Compute because its disk speed is very slow compared to Massed Compute.
00:59:32 Okay application installation completed and started. I will open the Gradio live link.
00:59:38 However model downloading is still continuing. Meanwhile let's also upload our training images.
00:59:44 So for upload click this icon. Select your training images as a zip file. Upload. I
00:59:50 prefer this way of uploading. So we can set up the configuration while models are getting downloaded.
00:59:56 Wait for upload to be completed. You see it is slow. You can also use runpodctl for uploading.
01:00:02 I explain that in another tutorial video. For example you can watch this one. But it is not
01:00:08 mandatory to learn or you can use this tutorial to upload and download big files very fast. This
01:00:14 is the recommended tutorial. And interface is here. This is running on RunPod. 1 Models
01:00:20 Training. So from the configurations, let's pick the configuration we want to use. Let's
01:00:26 use the text to image configuration. So you see we have 32 gigabytes. Copy path. Right click and
01:00:32 copy path. Put a forward slash at the beginning; on RunPod you always need a forward slash at the beginning of the path, and
01:00:39 paste it. Then click this. It will auto load the config. You see it is loaded successfully. Then
01:00:46 set your output folder. Let's save them inside workspace a new folder: Trained LoRAs. Right
01:00:54 click and copy path. Then paste it here. Put a forward slash at the beginning. This is important.
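The rule behind all this path editing can be stated once: JupyterLab's right-click copy gives a path relative to the server root (e.g. `workspace/...`), and the trainer needs an absolute Linux path, so you prepend a forward slash `/`. A tiny illustration of that rule (the helper and the example folder name are mine, not part of the installer):

```python
def to_absolute(copied_path: str) -> str:
    """Prepend '/' to a JupyterLab-copied relative path; leave absolute paths alone."""
    return copied_path if copied_path.startswith("/") else "/" + copied_path

# Hypothetical folder name, for illustration only:
print(to_absolute("workspace/Trained_LoRAs"))   # /workspace/Trained_LoRAs
print(to_absolute("/workspace/Trained_LoRAs"))  # already absolute, unchanged
```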
01:01:00 If you want to do Qwen training, it is the same. We only load the Qwen training config here. So this is how
01:01:06 you train on RunPod. You can change your save name or save every n epochs. And the training
01:01:12 dataset. So I have uploaded my images. Right click and extract archive. And they are named like this.
01:01:19 Remember, this is mandatory: 1_ohwx. ohwx will be our caption. 1 is the repeat count. So I am going to
01:01:28 give the path of the parent folder. Right click and copy path. Put a forward slash, then the path,
01:01:35 then click generate dataset configuration. It is done. Then the 1 Models Settings,
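The folder name encodes both values the trainer reads: the number before the underscore is the repeat count, and the rest becomes the trigger word used as the caption. The expected layout (file names are placeholders):

```
training_images/          <- point the dataset field at this parent folder
└── 1_ohwx/               <- 1 repeat, trigger word "ohwx"
    ├── photo_001.jpg
    ├── photo_002.jpg
    └── ...
```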
01:01:41 we need to set the folder paths. So the models are downloaded inside Training Models 1. So copy path
01:01:49 and change them like this. And a forward slash. Okay. No, not like this. It needs to be like this. Yes.
01:01:57 So this is the accurate pathing on RunPod. So like this. Yes. Verify the paths. And like this. Okay
01:02:06 all paths are set. This is single model so there is no high noise. And what else is left? Nothing
01:02:12 else is left. Save your configuration always. So you can load later if any error happens,
01:02:18 if you restart. Okay the models are still being downloaded. We need to wait for them to complete.
01:02:24 If you use RunPod's permanent network storage system, you won't need to wait. However,
01:02:30 we are not using it currently. But it is also very slow. I mean unbearably slow. And you
01:02:36 see only 260 megabyte per second disk speed. On Massed Compute this is 2 gigabytes per second.
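To put those disk speeds in context, here is a rough back-of-the-envelope read-time estimate for the roughly 32 GB of model files, using the two speeds quoted on screen. This assumes purely sequential reads, so treat it as an order-of-magnitude sketch only:

```python
# Hedged estimate: sequential read time for ~32 GB of model files at the
# two quoted disk speeds (260 MB/s on RunPod, 2 GB/s on Massed Compute).
size_gb = 32
runpod_seconds = size_gb * 1000 / 260   # total MB / (MB per second)
massed_seconds = size_gb / 2            # total GB / (GB per second)
print(f"RunPod: ~{runpod_seconds:.0f} s, Massed Compute: ~{massed_seconds:.0f} s")
```

That factor of roughly eight in raw disk speed is a big part of why model loading and caching feel so much slower on RunPod in the rest of this walkthrough.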
01:02:43 Okay all models downloaded. It is verifying the last one. So we can start training. So make sure
01:02:50 to save and click start training. Then we will follow what is happening on CMD window. This
01:02:58 is where we see the actual logs. If there is any error, you need to report it from here. Unfortunately,
01:03:04 setting up RunPod and starting training is like 10x slower than on Massed Compute. But it has
01:03:11 some other advantages like permanent storage or more GPU options. However this is the reality.
01:03:18 You see I click start training and I am still waiting for it to read from the disk to start
01:03:24 caching, loading models. Okay let's also install nvitop. pip install nvitop and just type nvitop.
01:03:34 So we can monitor the GPU usage as well. If you don't see GPU here, that means that your
01:03:41 machine is broken. It is going to start caching the text encoder outputs. It is started. Nice.
01:03:47 Now it will load model. Oh one more thing. We can increase the model loading speed. So I will stop
01:03:54 training. How? Open all panels, search for "loading", and disable NumPy memory mapping. This
01:04:02 will increase your loading speed significantly on RunPod. So start training. Don't worry it
01:04:08 will just resume from wherever it left off. So we didn't lose any time. This is really speeding up
01:04:14 the model loadings on RunPod. Okay now we need to wait again. Okay it is loading the model.
01:04:20 Pretty fast. This optimization is making huge difference on RunPod. This has been implemented
01:04:27 after I reported this issue to Kohya. But it is still not as fast as Massed Compute.
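A hedged illustration of why disabling memory mapping can matter on slow cloud disks: a memory-mapped load defers disk reads until pages are actually touched, which can be costly on network-backed storage, while an eager load pulls the whole file in one sequential pass. This is a generic NumPy sketch, not Musubi Tuner's actual loading code:

```python
import os
import tempfile

import numpy as np

# Generic sketch, not Musubi Tuner's code: a memory-mapped load pages data
# in lazily on first access; an eager load reads the file in one pass.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.zeros((256, 256), dtype=np.float32))

mapped = np.load(path, mmap_mode="r")  # lazy: pages read on first access
eager = np.load(path)                  # eager: one sequential read
print(type(mapped).__name__, type(eager).__name__)  # memmap ndarray
```

On fast local NVMe the difference is usually negligible; it is the slow, network-backed cloud disks where the eager read wins.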
01:04:34 Okay training starting. You can monitor the GPU from nvitop. Yes started. You see the first step
01:04:41 speed was very slow. Now 1.46, then 2 seconds per iteration, now 2.15 s/it. This is slower than the other GPUs
01:04:50 because we are also doing some block swapping since this is not fp8 scaled. So when we go
01:04:57 to swap, let's see how many blocks we are swapping. We are doing 10 block swaps. If you want even faster
01:05:03 training, you can do fp8 scaled training and you can make the block swap zero and it will be
01:05:09 faster. But this is an okay speed. It is going to take like 4-5 hours, maybe a
01:05:16 little bit more, so it is bearable. But this is how you do training on RunPod. The models will be saved inside my
01:05:23 Trained LoRAs folder. You see it is already generated and the dataset TOML file is saved
01:05:29 there. The training TOML file is also saved there when you click start training. So you will download
01:05:33 LoRAs from here and use them. Or if you want to use them on RunPod, we have a tutorial for how
01:05:39 to use SwarmUI and ComfyUI on RunPod. It is here. This is a very recent tutorial. Fully up to date.
01:05:45 So you can follow this tutorial to learn how to use SwarmUI and ComfyUI on RunPod and Massed
01:05:52 Compute. So this is it. I hope you have enjoyed. Don't forget to stop your machine, terminate your
01:05:59 machine. On Massed Compute, you need to delete your machine. Stopping your machine will not
01:06:04 stop your credit spending. Don't forget that. But before deleting, make sure to backup all of your
01:06:11 data. On RunPod, you can stop your machine and it will keep your data; then you need to terminate
01:06:17 your machine to completely avoid credit usage. But a stopped machine only incurs a very minimal amount
01:06:24 of credit usage. If you have any questions, always ask me. We have Z Image Turbo LoRA training, very
01:06:31 up to date, working amazingly, very lightweight. We have Flux SRPO training updated. We have Qwen
01:06:40 training that you need to watch. Hopefully more videos are coming, more applications are coming.
01:06:45 I am literally working 7 days a week, like 10 hours a day. So stay subscribed, leave a comment,
01:06:53 ask questions, join our Discord, or message me on Patreon. Hopefully see you later.