Replies: 49 comments 32 replies
-
In my tests the prompt weighting is definitely useful for video inference too, most notably for overriding LoRAs that like to push certain concepts too hard in a given direction. It seems to work well even with weights as heavy as 2.0 or more! I should be pushing it to my branch later today if anyone is interested.
-
I feel like I tried this in the positive prompts for Wan and it worked... I wonder if this is a false positive where "(red:0.7)" is being treated as a completely unique token instead of the weights actually being applied like SD does. Might be something to consider.
-
If you just put the numbers in there without an actual weighting algorithm you will get some effect, almost like maybe that syntax was in the training data? But it's not consistent or powerful - I tested that (and furthermore, if you look into the code you can see weighting isn't considered by the tokenizer/encoder, at least not that I could find). My approach directly modifies the attention assigned to the tokens, like other weighting methods do. It's been very useful in my testing today! I've also got a better implementation of CFGZero* which is definitely useful. But see below for how we modify the attention a token receives: the weighting factors are stripped from what the model sees, so the effect comes purely from modifying the token weight. Like I said, it's adapted from sd_embed, specifically the SD3/Flux implementation.
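The general idea, stripped down (a minimal sketch, not the actual sd_embed-derived code; parse_weighted_prompt and apply_weights are illustrative names):

```python
import re
import torch

def parse_weighted_prompt(prompt: str) -> tuple[str, list[tuple[str, float]]]:
    """Split '(text:1.4)' spans out of a prompt.
    Returns the cleaned prompt (markup removed) and (fragment, weight) pairs."""
    pattern = re.compile(r"\((.+?):([\d.]+)\)")
    fragments: list[tuple[str, float]] = []
    last = 0
    for m in pattern.finditer(prompt):
        if m.start() > last:
            fragments.append((prompt[last:m.start()], 1.0))
        fragments.append((m.group(1), float(m.group(2))))
        last = m.end()
    if last < len(prompt):
        fragments.append((prompt[last:], 1.0))
    clean_prompt = "".join(text for text, _ in fragments)
    return clean_prompt, fragments

def apply_weights(token_embeds: torch.Tensor, token_weights: torch.Tensor) -> torch.Tensor:
    """Scale each token's embedding by its weight after encoding the *clean* prompt,
    so the weighting markup never reaches the tokenizer. Mapping fragments to token
    indices (via tokenizer offsets) is omitted here for brevity.
    token_embeds: (seq_len, dim), token_weights: (seq_len,)"""
    return token_embeds * token_weights.unsqueeze(-1)
```

The key point is that the "(large:1.4)" markup is parsed out before tokenization and the weights are applied to the encoder output afterwards.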
-
I didn't quite get it pushed today because I was doing a LOT of cleanup stuff. I also improved my scheduled CFG parser so you can specify per-step weights now, e.g. `--cfg_schedule "e~2:7.0, 1-10:6.0, 47-50:6.5"` gives:

INFO:main:Guidance scale per step: 1:6.0, 2:6.0, 3:6.0, 4:6.0, 5:6.0, 6:6.0, 7:6.0, 8:6.0, 9:6.0, 10:6.0, 12:7.0, 14:7.0, 16:7.0, 18:7.0, 20:7.0, 22:7.0, 24:7.0, 26:7.0, 28:7.0, 30:7.0, 32:7.0, 34:7.0, 36:7.0, 38:7.0, 40:7.0, 42:7.0, 44:7.0, 46:7.0, 47:6.5, 48:6.5, 49:6.5, 50:6.5

I feel like this can be extra useful for running a higher guidance scale when you're skipping CFG on some steps versus when you're not, and potentially other shenanigans as well!
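For reference, a toy parser that reproduces the behaviour in that log - later entries override earlier ones, e~N hits every step divisible by N, and ranges or single steps are supported (this is my reading of the example, not the actual parser):

```python
def parse_cfg_schedule(schedule: str, total_steps: int, default_scale: float) -> dict[int, float]:
    """Map each 1-indexed step to a guidance scale. Unlisted steps keep default_scale."""
    scales = {step: default_scale for step in range(1, total_steps + 1)}
    for entry in schedule.split(","):
        spec, value_str = entry.strip().rsplit(":", 1)
        value = float(value_str)
        if spec.startswith("e~"):            # every step divisible by N
            n = int(spec[2:])
            steps = [s for s in scales if s % n == 0]
        elif "-" in spec:                    # inclusive range like 1-10
            start, end = (int(x) for x in spec.split("-"))
            steps = list(range(start, end + 1))
        else:                                # single step
            steps = [int(spec)]
        for s in steps:
            scales[s] = value                # later entries win
    return scales

# parse_cfg_schedule("e~2:7.0, 1-10:6.0, 47-50:6.5", 50, 6.0)
# -> steps 1-10 at 6.0, even steps 12-46 at 7.0, 47-50 at 6.5, remaining steps at the default
```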
-
I pushed it to my branch finally! If you want to use weighted prompts you need `--prompt_weighting`, otherwise it takes the original TE path. Prompt weighting is currently only available for the Wan T5 encoder. Experiments are still ongoing, but definitely note that upweighting one token also effectively downweights the others, since there's only so much attention to go around, so results can be harder to predict than with simpler models. Edit: I think the implementation might still need a little work, I'll look into it more tomorrow XD

There's a lot of extra features in my branch at this point. For both Wan and Hunyuan: fp16 accumulation, CFG scheduling with guidance per step, latent previews via either latent2rgb or taehv, optionally saving video as Apple ProRes (perceptually lossless) in MKV with the option to keep intermediate frames, and loading non-Musubi LoRA without preconversion. For Hunyuan: scaled fp8 for inference and training, rescaling the CLIP/LLM ratio, providing a different prompt to CLIP than to the LLM, and exposing more LLM settings. For Wan: CFGZero* with zero init, RifleX, prompt weighting, Comfy ROPE (which doesn't use complex numbers and so is faster and more memory efficient when torch.compiled), and noise_aug_strength for I2V. Some of the Wan things only work in one-shot (i.e. non-interactive) mode since I haven't ported them to that section yet. Also I made a port of GIMM-VFI framerate interpolation.

All of this was done for my personal use - I really like Musubi and wanted to extend it - and I just share it in case it's useful to anyone. It's completely UNOFFICIAL and I hope to eventually make extensions for these features so others can more effectively use them! Please, no one bother kohya to merge these things before then; it's not my intention to create that kind of issue, I just like to share my work.
-
I finally gave up on my own implementation and ported Kijai's from WanVideoWrapper. My initial implementation worked, it's just that there were side effects from tokenizing the prompt in chunks. Anyway, I think it works now, fwiw. I also added upscaling with SwinIR or ESRGAN and face restoration with GFPGAN or Codeformer, as well as improving VFI with GIMM-VFI. If anyone is interested in using these things: GIMM-VFI: Upscaler: or
Note that for GIMM-VFI you specify the folder as the model, but for the upscalers you specify their safetensors directly! The face restoration stuff I haven't quite got ready for primetime, but soon. You can specify an output filename, but if not, one is chosen for you. I've optimized for speed without sacrificing quality as much as possible. I use Musubi as my main video generative workflow now, hence why I've created my own versions of these tools. Oh, and while it should accept most types of video files as input, it currently writes out Apple ProRes into an MKV (perceptually lossless, to avoid generational loss), but if anyone would like, I can add support for MP4 or others - just let me know!
-
How much do the previews slow down inference? And which one do you prefer?
-
I added support for FramePack, including latent previews (the preview is every N sections and previews the total generated length), fp16 accumulation, advanced saving options, CFG scheduling, string seeds, etc., as well as enabling torch.compile and fp8 fast, which were already there but disabled, likely for lack of testing. They seem to work in my tests today!
-
Today I've added wildcard prompts for all models and modes: --prompt_wildcards /path/to/wildcards/directory enables wildcards in prompts, negative prompts, and even prompt_2. For instance __color__ in a prompt will look for wildcards in color.txt in that directory. Wildcard files should have one possible replacement per line, optionally with a relative weight attached (like red:2.0), and they also support nesting. Internal paths are supported too, e.g. a wildcard pointing at Autumn.txt in the colors subdirectory of the wildcards folder.

I also added V2V inference for Wan: --video_path /path/to/input/video --v2v_noise noise_strength, where noise_strength is a value 0.0-1.0 that controls how strong the noise added to the source video will be. Note it supports scaling, padding, and truncation, so the input doesn't have to be the same res as the output or even the same length! If --video_length is shorter than the input, the input will be truncated to include only the first --video_length frames. If --video_length is longer than the input, the first or last frame will be repeated to pad the length, depending on --v2v_pad_mode. Lastly, note that V2V uses the T2V "--task" modes and models, not the I2V ones.

And lastly I extended --te_multiplier support to FramePack as well. These changes are currently in my add-features branch and should be merged into main shortly once I confirm stability. Cheers!
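For the curious, wildcard expansion boils down to recursive, weighted random substitution - a rough sketch (illustrative only; the real parser handles more edge cases):

```python
import random
import re
from pathlib import Path

WILDCARD = re.compile(r"__(.+?)__")

def expand_wildcards(prompt: str, wildcard_dir: Path, max_depth: int = 50) -> str:
    """Replace __name__ with a weighted random line from <wildcard_dir>/<name>.txt.
    Nested wildcards are resolved recursively, up to max_depth substitutions."""
    for _ in range(max_depth):
        match = WILDCARD.search(prompt)
        if match is None:
            return prompt
        lines = (wildcard_dir / f"{match.group(1)}.txt").read_text().splitlines()
        choices, weights = [], []
        for line in lines:
            text, _, weight = line.rpartition(":")
            if text and weight.replace(".", "", 1).isdigit():
                choices.append(text)            # "red:2.0" -> "red" with weight 2.0
                weights.append(float(weight))
            else:
                choices.append(line)            # plain line, default weight
                weights.append(1.0)
        picked = random.choices(choices, weights=weights, k=1)[0]
        prompt = prompt[:match.start()] + picked + prompt[match.end():]
    return prompt

# expand_wildcards("a __color__ ball", Path("/path/to/wildcards"))
```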
-
I've added a better implementation of extra latent noise for Wan than what I had before. It's usable in Wan V2V/I2V mode. Low values (< 0.04) are best and it can help with fine detail/texture, especially when upscaling (output res > input res). I've also improved V2V and added a "direct" noise mode. My next addition is likely to be a simple little post-processing UI, where you can load a video and adjust the color temp/saturation/brightness/contrast/gamma/white balance etc. with a preview.
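Conceptually the extra latent noise is just a small amount of Gaussian noise mixed into the encoded source latents before denoising starts - something along these lines (a sketch, not the exact code; add_extra_latent_noise is an illustrative name):

```python
import torch

def add_extra_latent_noise(latents: torch.Tensor, strength: float,
                           generator: torch.Generator | None = None) -> torch.Tensor:
    """Perturb V2V/I2V source latents slightly to encourage fresh fine detail.
    strength is kept small (roughly 0.01-0.04); larger values cause artifacts."""
    if strength <= 0.0:
        return latents
    noise = torch.randn(latents.shape, generator=generator,
                        device=latents.device, dtype=latents.dtype)
    return latents + strength * noise
```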
-
So I extended the V2V today to work in --task i2v-14b mode too. If no --image_path is provided, it will use the first frame of the video as the conditioning image for i2v. Or you can specify your own (say, if you've improved the details of the first frame with i2i - but using something completely different might create strange results). Why is this beneficial? Well, my thought was that the I2V models are more used to working with pre-existing content, so they might work better for a task where we're always starting part way through the diffusion process, plus we get extra conditioning context from CLIP vision to boost the model's understanding of the existing content. In my testing so far, at least if your goal is just refinement/upscaling and you're starting from mid or better quality, it's definitely better to use the I2V model than the T2V model! Fine details like eyes/clothing definitely turn out better. More tests will be run; I'm currently testing which upscales from low res better, the T2V model or the I2V model.
Edit: After more tests, I have some interesting results comparing V2V with the T2V model against the I2V model (with the input video's first frame as I2V conditioning). When starting from medium/high resolution, the I2V model tends to give me better results; when starting with lower res, the T2V model does. T2V adds more details but also changes more. I2V adds fewer details but changes less. Depends on what you want - I think having both is nice!
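For reference, "use the first frame of the video as the conditioning image" boils down to something like this (a sketch using OpenCV; the actual code may differ):

```python
import cv2
from PIL import Image

def first_frame_as_image(video_path: str) -> Image.Image:
    """Grab frame 0 of the input video to use as the I2V conditioning image."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame from {video_path}")
    # OpenCV decodes as BGR; convert to RGB before handing off to the pipeline
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
```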
-
I've been doing some big refactoring lately and I finally got it pushed to add-features; after a bit more testing I'm going to merge it to main and make a tagged release. It turns out 'max-autotune-no-cudagraphs' mode for torch.compile subtly breaks determinism, so I reverted the default mode to "default", which is actually faster anyway and doesn't have that issue.

I also added support for perpendicular/orthogonal negative prompting for Wan/Hunyuan. This takes a third model pass per step so it increases your time by about 33%, but it provides a much more powerful, surgical form of CFG. For instance, if you prompt "beautiful nature photography" but use the negative "tree", you will likely still have trees in your output unless you crank the guidance scale (and thus hurt quality), because the concept "nature" includes the concept "tree" within itself. There will be fewer trees than before, but still some. With perp neg, the entire concept of "tree" is subtracted from the output. Anything vaguely tree-like will be eliminated with extreme prejudice. For those curious how we do this, see the sketch below, where cond is your positive conditioning, uncond is your negative, and nocond is with literally a blank prompt (ground truth). You can use it with --perp_neg.

Edit: Oh, I also added saving of various metadata to mkv/png by default (not mp4 because it's more strict about metadata) and improved the metadata saved with latents. It can all still be disabled with --no-metadata.
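Here's a minimal sketch of the perp-neg combine step (standard Perp-Neg math, with the variable names from above; the code in the branch differs in details):

```python
import torch

def perp_neg_guidance(cond: torch.Tensor, uncond: torch.Tensor, nocond: torch.Tensor,
                      guidance_scale: float, neg_strength: float) -> torch.Tensor:
    """Combine three predictions from the same step: positive prompt (cond),
    negative prompt (uncond), and blank prompt (nocond). Only the component of the
    negative direction that is perpendicular to the positive direction is subtracted,
    so concepts shared with the positive prompt aren't dragged down with it."""
    e_pos = cond - nocond    # direction toward the positive prompt
    e_neg = uncond - nocond  # direction toward the negative prompt
    # projection of e_neg onto e_pos (done over the flattened tensor here; per-sample in practice)
    parallel = (torch.sum(e_neg * e_pos) / torch.sum(e_pos * e_pos)) * e_pos
    perpendicular = e_neg - parallel
    return nocond + guidance_scale * (e_pos - neg_strength * perpendicular)
```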
-
I've created a tagged release: https://github.com/Sarania/blissful-tuner/releases/tag/0.6.6 More testing has gone into this than usual to make sure the common paths through all the main files are solid (I know I break stuff from time to time >_<), so if you wanna use my version with a more stable reference point, that's what this is for! Cheers!
-
I've added a script for blurring out faces in images/videos. This is useful for training LoRA that don't affect faces when combined with appropriate captioning! Also, I finalized the face restoration script and uploaded those models as well!
-
I've added a lil metadata viewer at blissful-tuner/blissful_tuner/metaview.py. It takes a single bare argument, which is an MKV saved with Blissful Tuner with metadata enabled, and displays all the Blissful Tuner related metadata with a convenient "Copy" button next to each datapoint. This makes it easy to inspect what parameters created a given video and copy the prompt or whatever you like to use again! On my Linux system I installed the requirements at the system level and created a .desktop file so I could right click -> Open with Blissful Tuner Metadata Viewer on MKV files! It requires PySide6 and the mediainfo CLI.
P.S. Simple .desktop file I used in ~/.local/share/applications after making metaview.py executable:
-
So I've been working on improving 16GB VRAM, 720x720x81, fp8_scaled, 36 blocks swapped, SageAttn, compile, lazy_loading:
Edit: I've made these available for training as well, but they are as of yet untested and might cause quality loss of unknown degree, so be mindful!
Edit: Pushed!
-
In my testing
-
Today I went through and extended the Rich features (logging, traceback, help) to Flux Kontext and Qwen Image, as well as a couple of scripts I'd somehow missed. This should help homogenize the Blissful Tuner experience - even though I haven't added any features to those beyond that yet, at least they'll match and be pretty! I also started working with training Wan 2.2 last night. It's worth noting that if you train one DiT at a time, it's treated pretty much like Wan 2.1 except for the min/max timestep etc., which also means it only does simple modulation - so maybe that IS viable for training? We'll see; my high noise training looks like it went well, but I can't be certain until I train the low noise tonight and have both to test. I will of course report!

BTW, someone requested a UI again the other day... I mentioned before that I might, and even made a cool prototype, but the truth is that DOUBLES the maintenance work, because every new thing has to not only be added or merged but also wired into the UI, and that's just not a workload I'm looking to take on right now. I'm sure it would make Blissful more widely appealing, and if someone wants to work on it I'm definitely willing to add it in if it's done well, but I'm personally happy working in the CLI and I'd rather focus my limited dev energy on interesting new tech and things like that over maintenance work. I hope that's understandable, cheers!
-
I've implemented
Also I removed
-
YUSS, I got NAG working with Wan, let's go! That means negative prompt/guidance WITHOUT CFG, so it can be used with distilled models etc. I'm currently testing with the Wan 2.1 lightx2v LoRA and it's sweet. It will work for distilled Wan 2.2 as well! I still need to clean it up a bit and do I2V as well... but HYPE. The first time I tackled it I got a bit overwhelmed, but since diving in headfirst to optimize the attention blocks for Wan 2.2, when I came back to it, it wasn't that bad. I should have it ready to go live within a couple of days at most!
-
NAG is live for Wan 2.1/2.2! Enable it by specifying --nag_scale (e.g. --nag_scale 3.0) and providing a negative prompt. Anyway, copied from the implementation repo:
It's not perfect at suppressing concepts all the time when used without CFG, but it's definitely useful to have around. Cheers!
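For those who just want the gist: as I understand the NAG approach, the cross-attention output computed with the negative prompt is extrapolated away from, clamped by an L1-norm ratio, and blended back into the positive output. A rough sketch (tau and alpha here are illustrative values, not necessarily the exact defaults in the code):

```python
import torch

def nag_attention_output(z_pos: torch.Tensor, z_neg: torch.Tensor,
                         nag_scale: float, tau: float = 2.5, alpha: float = 0.25) -> torch.Tensor:
    """Guide cross-attention output away from the negative prompt without CFG.
    z_pos/z_neg: attention outputs computed with the positive and negative prompt context."""
    z_guided = z_pos + nag_scale * (z_pos - z_neg)        # extrapolate away from the negative
    # clamp how far the guided output may drift, measured by per-token L1 norm ratio
    norm_ratio = z_guided.norm(p=1, dim=-1, keepdim=True) / z_pos.norm(p=1, dim=-1, keepdim=True)
    z_guided = torch.where(norm_ratio > tau, z_guided * (tau / norm_ratio), z_guided)
    return alpha * z_guided + (1.0 - alpha) * z_pos        # blend back toward the positive output
```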
-
I fixed a pretty significant bug with NAG; it should work better now, but it consumes more VRAM on than off now (like it should, lol - around a gig or so on my system). Also I refactored the Wan script, fixed a massive merge oversight in i2v mode, and made perpendicular negative work for i2v as well now. Also I added
-
@Sarania Thank you for working so much on this!
-
For the last couple days I've been working heavily in the Flux Kontext region. I've thoroughly implemented Rich (what the f did I think I was doing before, lmao), implemented CFG, plus brought many Blissful features like fp16 accumulation, fp8 fast, compile, prompt wildcards, CFGZero* init and scale, CFG schedule, and the optimized config to it. I'm going to implement latent previews today and then hopefully get that pushed. It seems to be able to inference normal Flux as well? @kohya-ss is this intended or just a happy side effect? The only real incompatibility I've noted is that LoRA only apply to the Unet, not the TE, which is often more optimal anyway, but I could easily add a switch to enable it if desired.
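For anyone unfamiliar, the CFG being wired up here is the standard classifier-free guidance combine at each denoising step - a minimal sketch:

```python
import torch

def cfg_combine(cond_pred: torch.Tensor, uncond_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditional (negative/empty
    prompt) prediction toward the conditional one by guidance_scale."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```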
-
I've added support for latent previews (both TAE and latent2rgb) for Flux. They are a little slower than you'd expect due to needing to rearrange x to produce a previewable latent every time, but it's unfortunately unavoidable. I also allowed loading non-bfloat16 Flux models, brought back power_seed to help with determinism, and fixed various bugs. Lastly, I also added support for loading Dedistilled Flux models, but those aren't quite working right for reasons I've been unable to suss out. That's all for now, cheers!
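The rearrange in question: Flux works on 2x2-patchified, flattened latents, so each preview first has to unpack them back into a spatial layout that latent2rgb or a TAE decoder can handle - roughly this (a sketch assuming the usual Flux packing; h and w are the latent height/width before packing):

```python
import torch
from einops import rearrange

def unpack_flux_latents(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Undo Flux's 2x2 patch packing: (B, (h/2)*(w/2), C*4) -> (B, C, h, w),
    giving a spatial latent that can be previewed."""
    return rearrange(x, "b (hh ww) (c p q) -> b c (hh p) (ww q)",
                     hh=h // 2, ww=w // 2, p=2, q=2)
```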
-
Heads up: I've discovered a bug I created in Wan where LoRAs past the first weren't loaded properly, even if it said they were. The work I pushed tonight fixes that; my apologies for any inconvenience. Beyond that, I integrated further into Flux/Qwen space, supporting conversion of foreign Qwen LoRA, and improved the Rich features. Also I added the DPM++ scheduler to Flux as well as
-
I've enabled batch processing for all the media processing scripts (GIMMVFI, upscaler, facefix, yoloblur): if you specify a bare directory as the input path, it will now locate all files in said directory with applicable extensions (mkv, mp4 for GIMMVFI; mkv, mp4, png, jpg, jpeg for the others) and operate on those, asking once for confirmation after displaying the file list unless --yes is enabled, in which case it just goes. I also upgraded the model utility some more recently, with the ability to fully exclude keys when converting models as well as stripping specific prefixes etc. And I've cleaned up the integration with Flux and Qwen a lot, and the implementation of various features under interactive/batch modes should work more reliably now. Cheers!
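The batch handling itself is nothing fancy - conceptually it's just extension-filtered directory listing plus one confirmation, along these lines (illustrative sketch, not the exact code; gather_inputs is a made-up helper name):

```python
from pathlib import Path

def gather_inputs(input_path: str, extensions: tuple[str, ...], assume_yes: bool = False) -> list[Path]:
    """If input_path is a directory, collect every file with an applicable extension
    and ask once for confirmation (skipped with --yes); otherwise return the single file."""
    path = Path(input_path)
    if not path.is_dir():
        return [path]
    files = sorted(p for p in path.iterdir() if p.suffix.lower().lstrip(".") in extensions)
    print("\n".join(str(p) for p in files))
    if not assume_yes and input(f"Process these {len(files)} files? [y/N] ").lower() != "y":
        return []
    return files

# gather_inputs("/some/dir", ("mkv", "mp4", "png", "jpg", "jpeg"))
```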
-
I've opened a separate Discussions section for Blissful now, since it's started to gain some traction of its own. I will still be hanging around the usual places too, but I figured it was time. I'll probably move my devlog there soon as well. Thanks to everyone for their continued support; your gratitude means more to me than you might imagine 🙏
Furthermore, I've integrated kohya's new fp8 optimizations (yay, thank you!) which improve accuracy of
I'm also considering absorbing HyImage2.1 from sdscripts and incorporating it, but we'll see how practical this is before we make any promises. I'd like to, anyway. Also on that front, I've been experimenting with a merge of Qwen2.5 VLM I made incorporating this model, which shows promise: it improves diversity of backgrounds, quality of texture, and quality of NSFW, at least for realistic styles, which is all I've tested so far. I'm gonna try some further tweaking and if it ends up being useful I'll share that and post about it. Cheers!
-
I've made a small tweak to
Side note that optimized_compile doesn't work well under native Windows according to users, so for anyone who decides to experiment, WSL2 or native Linux might be the best way! My apologies for that; it just seems that compile in general is a lot trickier on Windows, and I don't have a Windows environment myself nor the extra time to do separate testing/maintenance for separate concerns thereof, unfortunately. WSL2 is quite good and not hard to set up though, so maybe it's not too bad!
-
Random thought bumping around in my head - at this point I've realized that my naming of
-
https://github.com/Sarania/blissful-tuner/
Blissful extension of Musubi Tuner by Blyss Sarania
Here you will find an extended version of Musubi Tuner with advanced and experimental features focused on creating a full suite of tools for working with generative video models. Preview videos as they generate, increase inference speed, make longer videos, gain more control over your creations, and enhance them with VFI, upscaling and more! If you wanna get even more out of Musubi then you've come to the right place! Note: for best performance and compatibility, Python 3.12 with PyTorch 2.7.0 or later is recommended! While development is done in Python 3.12, efforts are made to maintain compatibility back to 3.10 as well.
Super epic thanks to kohya-ss for his tireless work on Musubi Tuner, kijai for HunyuanVideoWrapper and WanVideoWrapper from which significant code is ported, and all other devs in the open source generative AI community! Please note that due to the experimental nature of many changes, some things might not work as well as the unmodified Musubi! If you find any issues please let me know and I'll do my best to fix them. Please do not post about issues with this version on the main Musubi Github repo but rather use this repo's issues section!
In order to keep this section maintainable as the project grows, each feature will be listed once, along with a legend indicating which models in the project currently have support for that feature. Most features pertain to inference; if a feature is available for training, that will be specifically noted. Many smaller optimizations and features too numerous to list have been done as well. For the latest updates, I maintain something of a devlog here
Legend of current models: Hunyuan Video: (HY), Wan 2.1/2.2: (WV), Framepack: (FP), Flux (FX), Qwen Image (QI), Available for training: (T)
Blissful Features:
- Wildcard prompts (`--prompt_wildcards /path/to/wildcard/directory`, for instance `__color__` in your prompt would look for color.txt in that directory. The wildcard file format is one potential replacement string per line, with an optional relative weight attached like `red:2.0` or `"some longer string:0.5"` - wildcards can also contain wildcards themselves, the recursion limit is 50 steps!) (HY) (WV) (FP) (FX) (QI)
- Latent previews (`--preview_latent_every N` where N is a number of steps (or sections for FramePack). By default uses latent2rgb; TAE can be enabled with `--preview_vae /path/to/model`, models: https://huggingface.co/Blyss/BlissfulModels/tree/main/taehv) (HY) (WV) (FP) (FX)
- `--optimized`: enables various optimizations and settings based on the model. Requires SageAttention, Triton, PyTorch 2.7.0 or higher. (HY) (WV) (FP) (FX)
- fp16 accumulation (`--fp16_accumulation`, works best with Wan FP16 models (but works with Hunyuan bf16 too!) and requires PyTorch 2.7.0 or higher, but significantly accelerates inference speeds; especially with `--compile` it's almost as fast as fp8_fast/mmscaled without the loss of precision! And it works with fp8 scaled mode too!) (HY) (WV) (FP) (FX)
- Advanced video saving (`--codec codec --container container`, can save Apple ProRes (`--codec prores`, super high bitrate, perceptually lossless) into `--container mkv`, or either of h264, h265 into mp4 or mkv) (HY) (WV) (FP)
- Generation metadata, saved with `--container mkv` and when saving PNG, disable with `--no-metadata`, not available with `--container mp4`. You can conveniently view/copy such metadata with `src/blissful_tuner/metaview.py some_video.mkv`; the viewer requires the mediainfo CLI. (HY) (WV) (FP)
- CFGZero* (`--cfgzerostar_scaling`, `--cfgzerostar_init_steps N` where N is the total number of steps to zero out at the start. 2 is good for T2V, 1 for I2V, but it's better for T2V in my experience. Support for Hunyuan is HIGHLY experimental and only available with CFG enabled.) (HY) (WV) (FX)
- CFG scheduling (`--cfg_schedule`, please see the `--help` for usage. Can specify guidance scale down to individual steps if you like!) (HY) (WV) (FX)
- RifleX for longer videos (`--riflex_index N` where N is the RifleX frequency. 6 is good for Wan, can usually go to ~115 frames instead of just 81, requires `--rope_func comfy` with Wan; 4 is good for Hunyuan and you can make at least double length!) (HY) (WV)
- Perpendicular negative guidance (`--perp_neg neg_strength`, where neg_strength is a float that controls the strength of the negative prompt. See `--help` for more!) (HY) (WV)
- NAG, negative guidance without CFG (enable with `--nag_scale 3.0` and provide a negative prompt!) (WV)
- Extra samplers for distilled models (use `--sample_solver lcm` or `--sample_solver dpm++sde` with distilled Wan models/LoRA like https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill - plus I made a convenient LoRA: https://huggingface.co/Blyss/BlissfulModels/tree/main/wan_lcm ) (WV)
- V2V inference (`--video_path /path/to/input/video --denoise_strength amount` where amount is a float 0.0-1.0 that controls how strong the noise added to the source video will be. If `--noise_mode traditional`, it will run the last (amount * 100) percent of the timestep schedule like other implementations. If `--noise_mode direct`, it will directly control the amount of noise added as closely as possible by starting from wherever in the timestep schedule is closest to that value and proceeding from there. Supports scaling, padding, and truncation so the input doesn't have to be the same res as the output or even the same length! If `--video_length` is shorter than the input, the input will be truncated to include only the first `--video_length` frames. If `--video_length` is longer than the input, the first or last frame will be repeated to pad the length depending on `--v2v_pad_mode`. You can use either T2V or I2V `--task` modes and models (i2v mode produces better quality in my opinion)! In I2V mode, if `--image_path` is not specified, the first frame of the video will be used to condition the model instead. `--infer_steps` should be the same amount it would be for a full denoise, e.g. by default 50 for T2V or 40 for I2V, because we need to modify from a full schedule. Actual steps will depend on `--noise_mode`.) (WV)
- I2I (`--i2i_path /path/to/image` - use with a T2V model in T2I mode, specify strength with `--denoise_strength`. Supports `--i2_extra_noise` for latent noise augmentation as well) (WV)
- Prompt weighting (enable with `--prompt_weighting` and then in your prompt you can do like "a cat playing with a (large:1.4) red ball" to upweight the effect of "large". Note that [this] or (this) isn't supported, only (this:1.0)) (WV)
- Comfy ROPE, faster and more memory efficient with `--compile` for inference or `--optimized_compile` for training! (`--rope_func comfy`) (WV) (T)
- Extra latent noise (`--v2_extra_noise 0.02 --i2_extra_noise 0.02`, values less than 0.04 are recommended. This can improve fine detail and texture, but too much will cause artifacts and moving shadows. I use around 0.01-0.02 for V2V and 0.02-0.04 for I2V) (WV)
- Mixed precision transformer (`--mixed_precision_transformer` for inference or training, see Blissful Tuner (unofficial, experimental extensions to Musubi Tuner) #232 (comment) for how to create such a transformer and why you might wanna) (WV) (T)
- Extra LLM text encoder settings (`--hidden_state_skip_layer N`, `--apply_final_norm`, please see the `--help` for explanations!) (HY)
- Scaled fp8 (`--fp8_scaled`, HIGHLY recommended both for inference and training. It's just better fp8, that's all you need to know!) (HY) (T)
- Separate CLIP prompt (`--prompt_2 "second prompt goes here"`, provides a different prompt to CLIP since it's used to simpler text) (HY)
- Rescale text encoder influence (`--te_multiplier llm clip` such as `--te_multiplier 0.9 1.2` to downweight the LLM slightly and upweight the CLIP slightly) (HY)

Non model specific extras:
(Please make sure to install the project into your venv with `--group postprocess` (e.g. `pip install -e . --group postprocess --group dev` to fully install all requirements) if you want to use the below scripts!)

- Framerate interpolation with GIMM-VFI (`src/blissful_tuner/GIMMVFI.py`, please see its `--help` for usage. Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/VFI )
- Upscaling (`src/blissful_tuner/upscaler.py`, please see its `--help` for usage. Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/upscaling )
- Face blurring (`blissful_tuner/yolo_blur.py`, please see its `--help` for usage. Recommended model: https://huggingface.co/Blyss/BlissfulModels/tree/main/yolo )
- Face restoration (`src/blissful_tuner/facefix.py`, per usual please have a look at the `--help`! Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/face_restoration )

Also a related project of mine ( https://github.com/Sarania/Envious ) is useful for managing Nvidia GPUs from the terminal on Linux. It requires nvidia-ml-py and supports realtime monitoring, over/underclocking, power limit adjustment, fan control, profiles, and more. It also has a little process monitor for the GPU VRAM! Basically it's like nvidia-smi except not bad 😂