First of all, please don't feel any pressure from this post. I know there are new models and tech every day and it's hard to keep up, but I've been playing with HunyuanImage 2.1 recently and I personally think it's better than Qwen. Qwen has kind of an artificial dreaminess to its realistic-style outputs; HyImage doesn't suffer from that. Also, like HyVideo, HyImage is uncensored. I wouldn't say it has seen a ton of "below the waist" anatomy out of the box, but it doesn't seem artificially borked or censored either, so it should be quite trainable, and the upper anatomy is decent from the get-go. It also has really lovely text capabilities thanks to an optional ByT5 token-free text encoder that operates directly on characters, used alongside an MLLM text encoder. I was making posters and illustrations with it all day yesterday and didn't have a single text error! It also generates natively at 2048x2048 (2K) resolution, though the internal token length is the same as 1K models, so I wouldn't say the raw detail is that much better.
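For anyone unfamiliar with ByT5: it's "token-free" in the sense that it consumes raw UTF-8 bytes rather than subword tokens, so every character of the prompt reaches the encoder intact, which is presumably why spelling holds up so well. Here's a minimal sketch of the idea using the public google/byt5-small weights as a stand-in (HunyuanImage bundles its own ByT5 variant, so the actual checkpoint and hidden size differ):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# google/byt5-small is just a stand-in; HunyuanImage 2.1 ships its own
# ByT5-style encoder, so the real checkpoint and dimensions differ.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")

prompt = 'A poster titled "GRAND OPENING"'
# The "tokenizer" maps each UTF-8 byte to an id (plus a small offset for
# special tokens) -- no subword merges, so no character is ever mangled.
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs.input_ids.shape)  # one id per byte, plus the EOS token

with torch.no_grad():
    byte_level_embeddings = encoder(**inputs).last_hidden_state
print(byte_level_embeddings.shape)  # [1, num_bytes + 1, 1472] for byt5-small
```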
Downsides: their repo code uses a customized pipeline structure that wouldn't be easy to adapt (at least it wasn't easy for me to work with). Case in point: when I initially tried loading their checkpoint with the load_safetensors_with_fp8 function and initializing the model myself, outside of their "lazy instantiate" function, I got all kinds of mismatched keys, because their checkpoint is saved with combined QKV blocks while the model as inferenced wants split QKV blocks. So in order to fast-load it with our traditional functions, I had to initialize the model their way once, save it out in the correct format, and then use that from then on. Eventually I got it to the point of inferencing with block swap and optional fp8 so I could play with it, but it was challenging compared to other models I've worked with, and my implementation is currently far from clean or integrated; it's more hacked together just to get it working. I don't think it's within my capability to fully implement it for inference and training, or I would attempt to do so myself.
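For reference, the fused-to-split remap I mean looks roughly like this. This is a minimal sketch with hypothetical key names (the real checkpoint uses its own naming, and in practice I got the same result by loading through their instantiation path once and re-saving):

```python
from safetensors.torch import load_file, save_file

def split_fused_qkv(state_dict):
    """Split fused QKV tensors into separate q/k/v entries.

    The ".qkv." key pattern here is hypothetical -- the actual
    HunyuanImage checkpoint uses its own key names; this only
    illustrates the kind of remap needed.
    """
    out = {}
    for key, tensor in state_dict.items():
        if ".qkv." in key:
            # Fused weights are [3 * hidden, hidden] and biases [3 * hidden],
            # so chunking along dim 0 recovers the three projections.
            # clone() forces real copies so safetensors can serialize them
            # without complaining about shared storage.
            q, k, v = tensor.chunk(3, dim=0)
            out[key.replace(".qkv.", ".q.")] = q.clone()
            out[key.replace(".qkv.", ".k.")] = k.clone()
            out[key.replace(".qkv.", ".v.")] = v.clone()
        else:
            out[key] = tensor
    return out

sd = load_file("hunyuanimage2.1_base.safetensors")  # path is a placeholder
save_file(split_fused_qkv(sd), "hunyuanimage2.1_base_split_qkv.safetensors")
```

After a one-time conversion like this, the re-saved file loads fine with the usual fast-load functions.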
Lastly, let me note they released several checkpoints: a base 17B model, a distilled 17B model, and a distilled 14B refiner with a different, heavier VAE. I've only really played with the base model (not the distilled one) and, a little bit, the refiner. The refiner can be helpful for cleaning up small issues but also tends to smooth things a bit, so I'd consider it optional. Anyway, at the very least I wanted to make people aware of what I think is a cool but quirky model that not a lot of people have noticed. If it can be added at some point, that's awesome; if not, maybe when I get a little more familiar with the training side of things I can attempt to do so myself. Cheers!
Edit: Also, if you are interested, I don't know whether it would be preferred to put it in Musubi or sd-scripts; that would be up to you, of course. There seems to be a small code overlap with HyVideo, but not a great deal.