feat: support HunyuanImage-2.1 #2198
Conversation
Block swap would be nice. I could use it then. Nice work.
LoRA training is not working yet.
Sample command:
@kohya-ss Amazing work! Do you think it could be possible to make the block swap count automatic, based on available VRAM somehow? That would make this repo much stronger.
Sample training command:
Approximate VRAM and main RAM usage for 2048x2048 resolution training (batch size of 1) with
The learning rate may be a little too high.
…/kohya-ss/sd-scripts into feat-hunyuan-image-2.1-inference
I added a conversion script to use the trained LoRA in ComfyUI. Please use it like this:
1. The model download repository in the document is incorrect; it seems they have switched to the latest tencent/HunyuanImage-2.1.
…s, and flow_shift in argument parser
…ructions and VRAM optimization settings (by Claude)
Thank you, I gave instructions to Claude and fixed it.
The weights seem to be per-tensor scaled, so it may be better to use per-channel dynamic scaling from bf16 weights.
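For reference, a minimal sketch of what per-channel dynamic fp8 scaling from bf16 weights could look like, assuming 2D linear weights scaled per output channel (none of these function names come from this PR):

```python
import torch

def quantize_fp8_per_channel(weight: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    # One scale per output channel (dim 0), computed dynamically from the
    # bf16 weight's absmax, instead of a single per-tensor scale.
    fp8_max = torch.finfo(fp8_dtype).max
    absmax = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = absmax / fp8_max
    quantized = (weight / scale).clamp(-fp8_max, fp8_max).to(fp8_dtype)
    return quantized, scale.to(torch.float32)

def dequantize_fp8(quantized: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    # Multiply back by the per-channel scale to recover an approximate weight.
    return quantized.to(dtype) * scale.to(dtype)
```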
I've considered ways of doing this for blissful-tuner actually, but it's not exactly trivial. You have a couple of options: you can directly calculate the amount of VRAM, but there are a lot of variables you'd have to account for, different for each model and possibly even differing by PyTorch version, OS, etc. Or you could use a try/except approach where you try to load the model with N blocks swapped and, if it works, great; if not, try more until you find something that works. But that's messy IMO. The truth is that with these large models, once you're swapping any number of blocks, the speed difference between, say, 10 blocks swapped and 20 is not that big (caveat: once you go over about 90% of blocks swapped it tanks speed, but that's very rarely necessary). I usually aim to hit about 90% of my VRAM so I have a little headroom to avoid OOM. The point is you don't have to get the number exact, and it's okay to err on the safer side to avoid OOMing.
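As a rough illustration of the "aim for ~90% of VRAM" heuristic (purely a sketch; `bytes_per_block` would have to be measured or estimated per model, and none of these names exist in this repo):

```python
import torch

def estimate_blocks_to_swap(total_blocks: int, bytes_per_block: int,
                            target_fraction: float = 0.9) -> int:
    # Free/total VRAM on the current device, in bytes.
    free, total = torch.cuda.mem_get_info()
    used = total - free
    # VRAM still available before hitting the target fraction of the card.
    budget = int(total * target_fraction) - used
    # How many transformer blocks fit in that budget; the rest get swapped.
    resident = max(0, budget // bytes_per_block)
    return max(0, total_blocks - int(resident))
```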
A small suggestion I might make for HyImage is the option to run the LLM on CPU in FP32 as an alternative to fp8. It takes about 15 seconds per encode on my 4.2 GHz, 8-core, 16-thread i7-6900K, which is a ten-year-old CPU, so it's definitely a feasible option for those who have the system RAM! Unless you're encoding like... tons XD.
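A hedged sketch of that idea, assuming an HF-style text encoder; `text_encoder` and `tokenize_fn` are placeholders, not names from this PR:

```python
import torch

def encode_prompt_on_cpu(text_encoder, tokenize_fn, prompt: str, device="cuda"):
    # Keep the LLM on CPU in fp32; only the resulting embeddings go to the GPU.
    text_encoder = text_encoder.to("cpu", dtype=torch.float32)
    with torch.no_grad():
        tokens = tokenize_fn(prompt)                       # dict of CPU tensors
        embeds = text_encoder(**tokens).last_hidden_state  # HF-style output assumed
    return embeds.to(device, dtype=torch.bfloat16)
```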
…ext encoders on CPU for training
A couple of things I noticed about the minimal inference. In hunyuan_image_utils you seem to have inherited the old defaults from the repo:
But it was updated, from hunyuan_image_pipeline:
which makes sense; the default of 75 for the OCR start step makes no sense when the default number of steps is 50, and I also inherited this issue into my workspace until I noticed the change! Also, in my personal experimentation, the step at which you switch from CFG to APG makes a large difference in the aesthetics and quality of the output, text, etc. I can't give you perfect numbers, and I think it might differ depending on what you're doing, so my suggestion would be to let the user optionally choose the switch step. Also, there doesn't seem to be any functional difference between cfg_guider_ocr and cfg_guider_general? I know it's like that in the source repo too, but aren't they both just instances of the same APG class with the same params? Are both necessary, or could we dispatch with one and simplify? And lastly, when using regular CFG for a significant portion of steps, guidance rescaling definitely seems beneficial to avoid oversaturation/overexposure. Just for reference, I'm not referring to the rescale that's part of APG, but
And I find a default guidance_rescale of 0.7 seems good, though it's only necessary when using raw CFG for more than about 10 steps, I'd say, so it should be optional (though it doesn't seem to hurt otherwise, tbh)! Anyway, these are my thoughts after having a look, based on my own experimentation over the last week! Sorry if I missed anything that's already there! Also, lastly, a small note: I know the refiner is not part of this repo right now and may never be, which is fine; as we discussed, I think it's optional. But if you ever do mess with it, it's worth noting that the refiner VAE DOES seem to support spatial tiling, unlike the base VAE! Just thought I'd share that too, cheers!
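For readers unfamiliar with the guidance rescale being referred to (the std-matching rescale from "Common Diffusion Noise Schedules and Sample Steps Are Flawed", not APG's internal rescale), a sketch of the usual formulation:

```python
import torch

def rescale_noise_cfg(noise_cfg: torch.Tensor, noise_cond: torch.Tensor,
                      guidance_rescale: float = 0.7) -> torch.Tensor:
    # Rescale the CFG result so its per-sample std matches the conditional
    # prediction, then blend with the unrescaled result by guidance_rescale.
    dims = list(range(1, noise_cfg.ndim))
    std_cond = noise_cond.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    rescaled = noise_cfg * (std_cond / std_cfg)
    return guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg
```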
I note that the inference script in this repo runs at about half the speed of my version of the source repo (which is just their repo plus offloading everything and added block swap, really, plus a few small fixes) and engages my GPU less (~170 watts versus 270), and I wonder if this is because they are using batched_cfg? I'm working on testing that theory, but I'm running into the issue that our embed and negative embed aren't the same size in dimension 1, guessing unpadded or something, so I'm looking into that!
A new option As a result, the |
Thank you for letting me know! It's annoying that the official script contains errors... apg_start_step and guidance_rescale can now be specified as options😀 Regarding cfg_guider_ocr and cfg_guider_general, it's true that the same thing is currently being used. I think it's possible to make them common, but I'll leave them as they are as I may want to change some parameters in the future. I haven't checked refiner yet, but it's great that VAE supports tiling. It will be worth checking out the VAE implementation. EDIT: |
I don't know why it's efficient, but it may be due to changes in the attention mechanism. Additionally, a "forward only" mode has been added to the block swap feature (which already existed in Musubi Tuner). This may improve the efficiency of block swapping during inference. |
Pull Request Overview
This PR adds support for HunyuanImage-2.1, including inference and LoRA training functionality. It incorporates fp8 scaling and dynamic merging during LoRA inference from Musubi Tuner, and implements a unified attention method. The implementation allows using official weights for DiT models and ComfyUI weights for other components.
- Implements HunyuanImage-2.1 DiT model with multimodal double-stream and single-stream transformer blocks
- Adds comprehensive LoRA training support with regex-based learning rates and dimensions
- Integrates fp8 optimization for memory efficiency and performance improvements
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
Summary per file:

| File | Description |
|---|---|
| train_network.py | Adds lazy loading, casting control methods, and model initialization improvements |
| networks/lora_hunyuan_image.py | Complete LoRA implementation for HunyuanImage with qkv/mlp splitting |
| networks/lora_flux.py | Refactors hardcoded dimensions to use class methods for better extensibility |
| library/hunyuan_image_*.py | Core model implementation including VAE, text encoders, utilities, and modules |
| library/fp8_optimization_utils.py | FP8 quantization system with block-wise and channel-wise modes |
| library/attention.py | Unified attention interface supporting multiple backends |
| hunyuan_image_train_network.py | Training script with sampling, caching, and optimization features |
Comments suppressed due to low confidence (2)
library/lora_utils.py:1
- Remove commented code that appears to be old implementation. If this is needed for reference, move it to a separate comment block with explanation.
import os
library/hunyuan_image_vae.py:1
- This TODO comment should be updated to explain the specific conditions when float() conversion is needed and provide a plan for removing this workaround.
from typing import Optional, Tuple
Co-authored-by: Copilot <[email protected]>
I thought about this as well; I experimented with using different values for the different guiders myself. It's not a huge burden to keep both anyway, so that makes sense to me.
I pulled the new changes but the speed isn't really any different.
I agree. It also had an issue where OCR guidance was negated when using a negative prompt, because the same encode_glyph is called for both prompts and it sets a bool self.ocr_mask on the class; since it's called in the order positive, then negative, that bool is overwritten unless there's text in your negative prompt too, and why would there be? I almost suspect the text features were a last-minute addition in response to Qwen Image, though that's complete speculation on my part. I can almost imagine some poor researcher getting yelled at by the higher-ups: "Qwen has text! We can't ship without text and we ship in less than a week! Make it happen!" Regardless of the small issues, this model is awesome, and I get more excited about it with every new generation I make, every new thing I try. I know I'm engaging a lot and typing up big posts, but that's just my excitement showing; please don't let it overwhelm you 🙂 Edit: Oh, and thank you for implementing my suggestions!
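As a toy illustration of the state-sharing pattern described above (the class and helpers here are placeholders, not the actual source), keeping the mask per call avoids the negative-prompt call overwriting it:

```python
def _prompt_contains_text(prompt: str) -> bool:
    # Hypothetical stand-in for the real glyph/text detection.
    return '"' in prompt

class GlyphEncoderSketch:
    def encode_buggy(self, prompt: str):
        # Instance state: the second (negative-prompt) call overwrites the flag
        # set by the positive-prompt call.
        self.ocr_mask = _prompt_contains_text(prompt)
        return prompt  # placeholder for the real glyph embeddings

    def encode_fixed(self, prompt: str):
        # Return the mask alongside the embeddings so each prompt keeps its own.
        ocr_mask = _prompt_contains_text(prompt)
        return prompt, ocr_mask
```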
Update: I was able to implement batched uncond by having GPT help me pad the Qwen outputs to match length, and yeah, it improved the speed by a LOT (6.8 s/it with CFG versus over 11 before! No-CFG is 5.5 s/it with my current settings, for reference) while costing very little VRAM relatively (about a gig when making 3072x2304). This is in bf16 with 26 blocks swapped. This is excellent, because now I can make this script my base and start working on it rather than my hacked-up version of the base repo that I've been using until now! This is the padding code; it's just GPT's though, in honesty, b/c I wasn't sure how XD:
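(The commenter's actual snippet isn't reproduced here; the following is just a sketch of the idea, zero-padding the shorter embedding and its attention mask along the sequence dim so cond/uncond can share one batch.)

```python
import torch
import torch.nn.functional as F

def pad_to_same_length(cond, uncond, cond_mask, uncond_mask):
    target = max(cond.shape[1], uncond.shape[1])

    def pad(embeds: torch.Tensor, mask: torch.Tensor):
        extra = target - embeds.shape[1]
        if extra > 0:
            embeds = F.pad(embeds, (0, 0, 0, extra))  # zero-pad sequence dim of (B, L, D)
            mask = F.pad(mask, (0, extra))            # padded positions stay masked out
        return embeds, mask

    cond, cond_mask = pad(cond, cond_mask)
    uncond, uncond_mask = pad(uncond, uncond_mask)
    return cond, uncond, cond_mask, uncond_mask
```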
and then just something like (which I'm sure you know, just sharing mine for reference):
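(Again just an illustrative snippet, with `model`, `latents`, `timestep`, and `guidance_scale` as placeholders, showing the cond/uncond pair run in one forward pass and then recombined.)

```python
# Stack cond and uncond into a single batch, run the DiT once, then split.
embeds = torch.cat([cond, uncond], dim=0)
masks = torch.cat([cond_mask, uncond_mask], dim=0)
latent_in = torch.cat([latents, latents], dim=0)

noise_both = model(latent_in, timestep, embeds, masks)
noise_cond, noise_uncond = noise_both.chunk(2, dim=0)
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```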
It seems to just make a really big difference here, unlike with other recent models, so I definitely think adding it as an option would be good, as it's not very heavy either and the output is identical. Cheers, and thanks for this script that will now become my base of operations! Edit: Small laugh, I see at one point you added regular single-quote support too... and then later found out why that might have been left out, just like I did 😂 So many of us walk the same paths without realizing.
It's great that batching cond and uncond improved the speed. It's certainly strange that there's almost no difference in speed, but that's possible if the bottleneck is somewhere else. In this repository, I've kept cond and uncond separate to minimize memory usage. When I looked at the official repository, the single quote had been fixed, but the APG step had not yet been fixed😂 |
Totally fair if you want to keep it that way to minimize resources and keep it simple; I thought I'd share my results either way for others who may come across this. My suspicion is the gain might be significantly less (relatively) when not using block swap. This is a BIG model in full bf16, so I'm swapping 26 blocks. With no batched uncond, I see around 150 watts of usage on my GPU; with batched, around 220. This makes sense, since we're only doing one round of block swapping with these large blocks instead of two, and I think that's where the biggest gain lies, tbh. I tested the outputs and they are identical whether uncond is batched or not, so the computation is ultimately the same, and this is my theory of how it's that much faster 🙂 Edit: Yeah, my guess is the 1.3s-per-iteration increase with batched uncond is probably what my single-batch iteration speed would be if I wasn't swapping like 60% of the model. But since we've already paid the ~4.2s cost to swap the blocks once, when batched it only adds that much: 1.3s for cond, 1.3s for uncond, 4.2s for swap = 6.8s per it! Add another 4.2s to swap again when unbatched = 11s, which is exactly what I see. Edit 2: Yeah, that sounds like the same time I first looked at the repo. But I added single quotes back in and then learned why they were removed XD
Thanks for the detailed explanation. That makes sense! I'm sure the bottleneck is transfer speed, not computation speed. |
We tested LoRA training on existing SDXL, FLUX.1, Chroma, and Lumina, and the loss trends seem to be roughly consistent. |
Inference and LoRA training function for HunyuanImage-2.1.
It also incorporates the fp8 scaling function and dynamic merging during LoRA inference from Musubi Tuner, as well as the unified attention method.
The official weights can be used for the DiT weights, and ComfyUI weights for all other components.
The arguments for `hunyuan_image_minimal_inference.py` are almost the same as the inference script for Qwen-Image in Musubi Tuner. Block swap is not supported yet, so it uses about 24GB VRAM even with `--fp8_scaled`.