Replies: 49 comments 32 replies
-
In my tests the prompt weighting is definitely useful for video inference too, most notably for overriding LoRAs that like to push certain concepts too hard in a given direction. It seems to work well even with weights as heavy as 2.0 or more! I should be pushing it to my branch later today if anyone is interested.
-
I feel like I tried this in the positive prompts for Wan and it worked... I wonder if this is a false positive where "(red:0.7)" is being treated as a completely unique token instead of the weights actually being applied like SD does. Might be something to consider.
-
If you just put the numbers in there without an actual weighting algorithm you will get some effect, almost like maybe that syntax was in the training data? But it's not consistent or powerful - I tested that (and furthermore, if you look into the code you can see weighting isn't considered by the tokenizer/encoder, at least not that I could find). My approach directly modifies the attention assigned to the tokens, like other weighting methods do. It's been very useful in my testing today! I've also got a better implementation of CFGZero* which is definitely useful. But see below for how we modify the attention a token receives: the weighting factors are stripped from what the model sees, so the effect comes purely from modifying the token weight. Like I said, it's adapted from sd_embed, specifically the SD3/Flux implementation.
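The general idea, stripped down (a minimal sketch, not the actual sd_embed-derived code; parse_weighted_prompt and apply_weights are illustrative names):

```python
import re
import torch

def parse_weighted_prompt(prompt: str) -> tuple[str, list[tuple[str, float]]]:
    """Split '(text:1.4)' spans out of a prompt.
    Returns the cleaned prompt (markup removed) and (fragment, weight) pairs."""
    pattern = re.compile(r"\((.+?):([\d.]+)\)")
    fragments: list[tuple[str, float]] = []
    last = 0
    for m in pattern.finditer(prompt):
        if m.start() > last:
            fragments.append((prompt[last:m.start()], 1.0))
        fragments.append((m.group(1), float(m.group(2))))
        last = m.end()
    if last < len(prompt):
        fragments.append((prompt[last:], 1.0))
    clean_prompt = "".join(text for text, _ in fragments)
    return clean_prompt, fragments

def apply_weights(token_embeds: torch.Tensor, token_weights: torch.Tensor) -> torch.Tensor:
    """Scale each token's embedding by its weight after encoding the *clean* prompt,
    so the weighting markup never reaches the tokenizer. Mapping fragments to token
    indices (via tokenizer offsets) is omitted here for brevity.
    token_embeds: (seq_len, dim), token_weights: (seq_len,)"""
    return token_embeds * token_weights.unsqueeze(-1)
```

The key point is that the "(large:1.4)" markup is parsed out before tokenization and the weights are applied to the encoder output afterwards.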
-
I didn't quite get it pushed today because I was doing a LOT of cleanup stuff. I also improved my scheduled CFG parser so you can specify per-step weights now, e.g. `--cfg_schedule "e~2:7.0, 1-10:6.0, 47-50:6.5"` gives:

INFO:main:Guidance scale per step: 1:6.0, 2:6.0, 3:6.0, 4:6.0, 5:6.0, 6:6.0, 7:6.0, 8:6.0, 9:6.0, 10:6.0, 12:7.0, 14:7.0, 16:7.0, 18:7.0, 20:7.0, 22:7.0, 24:7.0, 26:7.0, 28:7.0, 30:7.0, 32:7.0, 34:7.0, 36:7.0, 38:7.0, 40:7.0, 42:7.0, 44:7.0, 46:7.0, 47:6.5, 48:6.5, 49:6.5, 50:6.5

I feel like this can be extra useful for running a higher guidance scale when you're skipping CFG on some steps versus when you're not, and potentially other shenanigans as well!
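For reference, a toy parser that reproduces the behaviour in that log - later entries override earlier ones, e~N hits every step divisible by N, and ranges or single steps are supported (this is my reading of the example, not the actual parser):

```python
def parse_cfg_schedule(schedule: str, total_steps: int, default_scale: float) -> dict[int, float]:
    """Map each 1-indexed step to a guidance scale. Unlisted steps keep default_scale."""
    scales = {step: default_scale for step in range(1, total_steps + 1)}
    for entry in schedule.split(","):
        spec, value_str = entry.strip().rsplit(":", 1)
        value = float(value_str)
        if spec.startswith("e~"):            # every step divisible by N
            n = int(spec[2:])
            steps = [s for s in scales if s % n == 0]
        elif "-" in spec:                    # inclusive range like 1-10
            start, end = (int(x) for x in spec.split("-"))
            steps = list(range(start, end + 1))
        else:                                # single step
            steps = [int(spec)]
        for s in steps:
            scales[s] = value                # later entries win
    return scales

# parse_cfg_schedule("e~2:7.0, 1-10:6.0, 47-50:6.5", 50, 6.0)
# -> steps 1-10 at 6.0, even steps 12-46 at 7.0, 47-50 at 6.5, remaining steps at the default
```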
-
I pushed it to my branch finally! If you want to use weighted prompts you need `--prompt_weighting`, otherwise it takes the original TE path. Prompt weighting is currently only available for the Wan T5 encoder. Experiments are still ongoing, but definitely note that upweighting one token also effectively downweights the others, since there's only so much attention to go around, so results can be harder to predict than with simpler models. Edit: I think the implementation might still need a little work, I'll look into it more tomorrow XD

There's a lot of extra features in my branch at this point. For both Wan and Hunyuan: fp16 accumulation, CFG scheduling with guidance per step, latent previews via either latent2rgb or taehv, optionally saving video as Apple ProRes (perceptually lossless) in MKV with the option to keep intermediate frames, and loading non-Musubi LoRA without preconversion. For Hunyuan: scaled fp8 for inference and training, rescaling the CLIP/LLM ratio, providing a different prompt to CLIP than to the LLM, and exposing more LLM settings. For Wan: CFGZero* with zero init, RifleX, prompt weighting, Comfy ROPE (which doesn't use complex numbers and so is faster and more memory efficient when torch.compiled), and noise_aug_strength for I2V. Some of the Wan things only work in one-shot (i.e. non-interactive) mode since I haven't ported them to that section yet. Also I made a port of GIMM-VFI framerate interpolation.

All of this was done for my personal use - I really like Musubi and wanted to extend it - and I just share it in case it's useful to anyone. It's completely UNOFFICIAL and I hope to eventually make extensions for these features so others can more effectively use them! Please, no one bother kohya to merge these things before then; it's not my intention to create that kind of issue, I just like to share my work.
-
I finally gave up on my own implementation and ported Kijai's from WanVideoWrapper. My initial implementation worked, it's just that there were side effects from tokenizing the prompt in chunks. Anyway, I think it works now, fwiw. I also added upscaling with SwinIR or ESRGAN and face restoration with GFPGAN or Codeformer, as well as improving VFI with GIMM-VFI. If anyone is interested in using these things: GIMM-VFI: Upscaler: or
Note that for GIMM-VFI you specify the folder as the model, but for the upscalers you specify their safetensors directly! The face restoration stuff I haven't quite got ready for primetime, but soon. You can specify an output filename, but if not, one is chosen for you. I've optimized for speed without sacrificing quality as much as possible. I use Musubi as my main video generative workflow now, hence why I've created my own versions of these tools. Oh, and while it should accept most types of video files as input, it currently writes out Apple ProRes into an MKV (perceptually lossless, to avoid generational loss), but if anyone would like, I can add support for MP4 or others - just let me know!
-
How much do the previews slow down inference? And which one do you prefer?
-
I added support for FramePack, including latent previews (the preview is every N sections and previews the total generated length), fp16 accumulation, advanced saving options, CFG scheduling, string seeds, etc., as well as enabling torch.compile and fp8 fast, which were already there but disabled, likely for lack of testing. They seem to work in my tests today!
-
Today I've added wildcard prompts for all models and modes: --prompt_wildcards /path/to/wildcards/directory enables wildcards in prompts, negative prompts, and even prompt_2. For instance __color__ in a prompt will look for wildcards in color.txt in that directory. Wildcard files should have one possible replacement per line, optionally with a relative weight attached (like red:2.0), and they also support nesting. Internal paths are supported too, e.g. a wildcard pointing at Autumn.txt in the colors subdirectory of the wildcards folder.

I also added V2V inference for Wan: --video_path /path/to/input/video --v2v_noise noise_strength, where noise_strength is a value 0.0-1.0 that controls how strong the noise added to the source video will be. Note it supports scaling, padding, and truncation, so the input doesn't have to be the same res as the output or even the same length! If --video_length is shorter than the input, the input will be truncated to include only the first --video_length frames. If --video_length is longer than the input, the first or last frame will be repeated to pad the length, depending on --v2v_pad_mode. Lastly, note that V2V uses the T2V "--task" modes and models, not the I2V ones.

And lastly I extended --te_multiplier support to FramePack as well. These changes are currently in my add-features branch and should be merged into main shortly once I confirm stability. Cheers!
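For the curious, wildcard expansion boils down to recursive, weighted random substitution - a rough sketch (illustrative only; the real parser handles more edge cases):

```python
import random
import re
from pathlib import Path

WILDCARD = re.compile(r"__(.+?)__")

def expand_wildcards(prompt: str, wildcard_dir: Path, max_depth: int = 50) -> str:
    """Replace __name__ with a weighted random line from <wildcard_dir>/<name>.txt.
    Nested wildcards are resolved recursively, up to max_depth substitutions."""
    for _ in range(max_depth):
        match = WILDCARD.search(prompt)
        if match is None:
            return prompt
        lines = (wildcard_dir / f"{match.group(1)}.txt").read_text().splitlines()
        choices, weights = [], []
        for line in lines:
            text, _, weight = line.rpartition(":")
            if text and weight.replace(".", "", 1).isdigit():
                choices.append(text)            # "red:2.0" -> "red" with weight 2.0
                weights.append(float(weight))
            else:
                choices.append(line)            # plain line, default weight
                weights.append(1.0)
        picked = random.choices(choices, weights=weights, k=1)[0]
        prompt = prompt[:match.start()] + picked + prompt[match.end():]
    return prompt

# expand_wildcards("a __color__ ball", Path("/path/to/wildcards"))
```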
-
I've added a better implementation of extra latent noise for Wan than what I had before. It's usable in Wan V2V/I2V mode. Low values (< 0.04) are best and it can help with fine detail/texture, especially when upscaling (output res > input res). I've also improved V2V and added a "direct" noise mode. My next addition is likely to be a simple little post-processing UI, where you can load a video and adjust the color temp/saturation/brightness/contrast/gamma/white balance etc. with a preview.
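Conceptually the extra latent noise is just a small amount of Gaussian noise mixed into the encoded source latents before denoising starts - something along these lines (a sketch, not the exact code; add_extra_latent_noise is an illustrative name):

```python
import torch

def add_extra_latent_noise(latents: torch.Tensor, strength: float,
                           generator: torch.Generator | None = None) -> torch.Tensor:
    """Perturb V2V/I2V source latents slightly to encourage fresh fine detail.
    strength is kept small (roughly 0.01-0.04); larger values cause artifacts."""
    if strength <= 0.0:
        return latents
    noise = torch.randn(latents.shape, generator=generator,
                        device=latents.device, dtype=latents.dtype)
    return latents + strength * noise
```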
-
So I extended the V2V today to work in --task i2v-14b mode too. If no --image_path is provided, it will use the first frame of the video as the conditioning image for i2v. Or you can specify your own (say, if you've improved the details of the first frame with i2i - but using something completely different might create strange results). Why is this beneficial? Well, my thought was that the I2V models are more used to working with pre-existing content, so they might work better for a task where we're always starting part way through the diffusion process, plus we get extra conditioning context from CLIP vision to boost the model's understanding of the existing content. In my testing so far, at least if your goal is just refinement/upscaling and you're starting from mid or better quality, it's definitely better to use the I2V model than the T2V model! Fine details like eyes/clothing definitely turn out better. More tests will be run; I'm currently testing which upscales from low res better, the T2V model or the I2V model.
Edit: After more tests, I have some interesting results comparing V2V with the T2V model against the I2V model (with the input video's first frame as I2V conditioning). When starting from medium/high resolution, the I2V model tends to give me better results; when starting with lower res, the T2V model does. T2V adds more details but also changes more. I2V adds fewer details but changes less. Depends on what you want - I think having both is nice!
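For reference, "use the first frame of the video as the conditioning image" boils down to something like this (a sketch using OpenCV; the actual code may differ):

```python
import cv2
from PIL import Image

def first_frame_as_image(video_path: str) -> Image.Image:
    """Grab frame 0 of the input video to use as the I2V conditioning image."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame from {video_path}")
    # OpenCV decodes as BGR; convert to RGB before handing off to the pipeline
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
```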
-
I've been doing some big refactoring lately and I finally got it pushed to add-features; after a bit more testing I'm going to merge it to main and make a tagged release. It turns out 'max-autotune-no-cudagraphs' mode for torch.compile subtly breaks determinism, so I reverted the default mode to "default", which is actually faster anyway and doesn't have that issue.

I also added support for perpendicular/orthogonal negative prompting for Wan/Hunyuan. This takes a third model pass per step so it increases your time by about 33%, but it provides a much more powerful, surgical form of CFG. For instance, if you prompt "beautiful nature photography" but use the negative "tree", you will likely still have trees in your output unless you crank the guidance scale (and thus hurt quality), because the concept "nature" includes the concept "tree" within itself. There will be fewer trees than before, but still some. With perp neg, the entire concept of "tree" is subtracted from the output. Anything vaguely tree-like will be eliminated with extreme prejudice. For those curious how we do this, see the sketch below, where cond is your positive conditioning, uncond is your negative, and nocond is with literally a blank prompt (ground truth). You can use it with --perp_neg.

Edit: Oh, I also added saving of various metadata to mkv/png by default (not mp4 because it's more strict about metadata) and improved the metadata saved with latents. It can all still be disabled with --no-metadata.
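Here's a minimal sketch of the perp-neg combine step (standard Perp-Neg math, with the variable names from above; the code in the branch differs in details):

```python
import torch

def perp_neg_guidance(cond: torch.Tensor, uncond: torch.Tensor, nocond: torch.Tensor,
                      guidance_scale: float, neg_strength: float) -> torch.Tensor:
    """Combine three predictions from the same step: positive prompt (cond),
    negative prompt (uncond), and blank prompt (nocond). Only the component of the
    negative direction that is perpendicular to the positive direction is subtracted,
    so concepts shared with the positive prompt aren't dragged down with it."""
    e_pos = cond - nocond    # direction toward the positive prompt
    e_neg = uncond - nocond  # direction toward the negative prompt
    # projection of e_neg onto e_pos (done over the flattened tensor here; per-sample in practice)
    parallel = (torch.sum(e_neg * e_pos) / torch.sum(e_pos * e_pos)) * e_pos
    perpendicular = e_neg - parallel
    return nocond + guidance_scale * (e_pos - neg_strength * perpendicular)
```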
-
I've created a tagged release: https://github.com/Sarania/blissful-tuner/releases/tag/0.6.6 More testing has gone into this than usual to make sure the common paths through all the main files are solid (I know I break stuff from time to time >_<), so if you wanna use my version with a more stable reference point, that's what this is for! Cheers!
-
I've added a script for blurring out faces in images/videos. This is useful for training LoRA that don't affect faces when combined with appropriate captioning! Also, I finalized the face restoration script and uploaded those models as well!
-
I've added a lil metadata viewer at blissful-tuner/blissful_tuner/metaview.py. It takes a single bare argument, which is an MKV saved with Blissful Tuner with metadata enabled, and displays all the Blissful Tuner related metadata with a convenient "Copy" button next to each datapoint. This makes it easy to inspect what parameters created a given video and copy the prompt or whatever you like to use again! On my Linux system I installed the requirements at the system level and created a .desktop file so I could right click -> Open with Blissful Tuner Metadata Viewer on MKV files! It requires PySide6 and the mediainfo CLI.
P.S. Simple .desktop file I used in ~/.local/share/applications after making metaview.py executable:
-
So I've been working on improving 16GB VRAM, 720x720x81, fp8_scaled, 36 blocks swapped, SageAttn, compile, lazy_loading:
Edit: I've made these available for training as well, but they are as of yet untested and might cause quality loss of unknown degree, so be mindful!
Edit: Pushed!
-
In my testing
-
Today I went through and extended the Rich features (logging, traceback, help) to Flux Kontext and Qwen Image, as well as a couple of scripts I'd somehow missed. This should help homogenize the Blissful Tuner experience - even though I haven't added any features to those beyond that yet, at least they'll match and be pretty! I also started working with training Wan 2.2 last night. It's worth noting that if you train one DiT at a time, it's treated pretty much like Wan 2.1 except for the min/max timestep etc., which also means it only does simple modulation - so maybe that IS viable for training? We'll see; my high noise training looks like it went well, but I can't be certain until I train the low noise tonight and have both to test. I will of course report!

BTW, someone requested a UI again the other day... I mentioned before that I might, and even made a cool prototype, but the truth is that DOUBLES the maintenance work, because every new thing has to not only be added or merged but also wired into the UI, and that's just not a workload I'm looking to take on right now. I'm sure it would make Blissful more widely appealing, and if someone wants to work on it I'm definitely willing to add it in if it's done well, but I'm personally happy working in the CLI and I'd rather focus my limited dev energy on interesting new tech and things like that over maintenance work. I hope that's understandable, cheers!
-
I've implemented
Also I removed
-
YUSS, I got NAG working with Wan, let's go! That means negative prompt/guidance WITHOUT CFG, so it can be used with distilled models etc. I'm currently testing with the Wan 2.1 lightx2v LoRA and it's sweet. It will work for distilled Wan 2.2 as well! I still need to clean it up a bit and do I2V as well... but HYPE. The first time I tackled it I got a bit overwhelmed, but since diving in headfirst to optimize the attention blocks for Wan 2.2, when I came back to it, it wasn't that bad. I should have it ready to go live within a couple of days at most!
-
NAG is live for Wan 2.1/2.2! Enable it by specifying --nag_scale (e.g. --nag_scale 3.0) and providing a negative prompt. Anyway, copied from the implementation repo:
It's not perfect at suppressing concepts all the time when used without CFG, but it's definitely useful to have around. Cheers!
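For those who just want the gist: as I understand the NAG approach, the cross-attention output computed with the negative prompt is extrapolated away from, clamped by an L1-norm ratio, and blended back into the positive output. A rough sketch (tau and alpha here are illustrative values, not necessarily the exact defaults in the code):

```python
import torch

def nag_attention_output(z_pos: torch.Tensor, z_neg: torch.Tensor,
                         nag_scale: float, tau: float = 2.5, alpha: float = 0.25) -> torch.Tensor:
    """Guide cross-attention output away from the negative prompt without CFG.
    z_pos/z_neg: attention outputs computed with the positive and negative prompt context."""
    z_guided = z_pos + nag_scale * (z_pos - z_neg)        # extrapolate away from the negative
    # clamp how far the guided output may drift, measured by per-token L1 norm ratio
    norm_ratio = z_guided.norm(p=1, dim=-1, keepdim=True) / z_pos.norm(p=1, dim=-1, keepdim=True)
    z_guided = torch.where(norm_ratio > tau, z_guided * (tau / norm_ratio), z_guided)
    return alpha * z_guided + (1.0 - alpha) * z_pos        # blend back toward the positive output
```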
-
I fixed a pretty significant bug with NAG; it should work better now, but it consumes more VRAM on than off now (like it should, lol - around a gig or so on my system). Also I refactored the Wan script, fixed a massive merge oversight in i2v mode, and made perpendicular negative work for i2v as well now. Also I added
-
@Sarania Thank you for working so much on this!
-
For the last couple days I've been working heavily in the Flux Kontext region. I've thoroughly implemented Rich (what the f did I think I was doing before, lmao), implemented CFG, plus brought many Blissful features like fp16 accumulation, fp8 fast, compile, prompt wildcards, CFGZero* init and scale, CFG schedule, and the optimized config to it. I'm going to implement latent previews today and then hopefully get that pushed. It seems to be able to inference normal Flux as well? @kohya-ss is this intended or just a happy side effect? The only real incompatibility I've noted is that LoRA only apply to the Unet, not the TE, which is often more optimal anyway, but I could easily add a switch to enable it if desired.
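For anyone unfamiliar, the CFG being wired up here is the standard classifier-free guidance combine at each denoising step - a minimal sketch:

```python
import torch

def cfg_combine(cond_pred: torch.Tensor, uncond_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditional (negative/empty
    prompt) prediction toward the conditional one by guidance_scale."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```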
-
I've added support for latent previews (both TAE and latent2rgb) for Flux. They are a little slower than you'd expect due to needing to rearrange x to produce a previewable latent every time, but it's unfortunately unavoidable. I also allowed loading non-bfloat16 Flux models, brought back power_seed to help with determinism, and fixed various bugs. Lastly, I also added support for loading Dedistilled Flux models, but those aren't quite working right for reasons I've been unable to suss out. That's all for now, cheers!
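The rearrange in question: Flux works on 2x2-patchified, flattened latents, so each preview first has to unpack them back into a spatial layout that latent2rgb or a TAE decoder can handle - roughly this (a sketch assuming the usual Flux packing; h and w are the latent height/width before packing):

```python
import torch
from einops import rearrange

def unpack_flux_latents(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Undo Flux's 2x2 patch packing: (B, (h/2)*(w/2), C*4) -> (B, C, h, w),
    giving a spatial latent that can be previewed."""
    return rearrange(x, "b (hh ww) (c p q) -> b c (hh p) (ww q)",
                     hh=h // 2, ww=w // 2, p=2, q=2)
```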
-
Heads up: I've discovered a bug I created in Wan where LoRAs past the first weren't loaded properly, even if it said they were. The work I pushed tonight fixes that; my apologies for any inconvenience. Beyond that, I integrated further into Flux/Qwen space, supporting conversion of foreign Qwen LoRA, and improved the Rich features. Also I added the DPM++ scheduler to Flux as well as
-
I've enabled batch processing for all the media processing scripts (GIMMVFI, upscaler, facefix, yoloblur): if you specify a bare directory as the input path, it will now locate all files in said directory with applicable extensions (mkv, mp4 for GIMMVFI; mkv, mp4, png, jpg, jpeg for the others) and operate on those, asking once for confirmation after displaying the file list unless --yes is enabled, in which case it just goes. I also upgraded the model utility some more recently, with the ability to fully exclude keys when converting models as well as stripping specific prefixes etc. And I've cleaned up the integration with Flux and Qwen a lot, and the implementation of various features under interactive/batch modes should work more reliably now. Cheers!
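The batch handling itself is nothing fancy - conceptually it's just extension-filtered directory listing plus one confirmation, along these lines (illustrative sketch, not the exact code; gather_inputs is a made-up helper name):

```python
from pathlib import Path

def gather_inputs(input_path: str, extensions: tuple[str, ...], assume_yes: bool = False) -> list[Path]:
    """If input_path is a directory, collect every file with an applicable extension
    and ask once for confirmation (skipped with --yes); otherwise return the single file."""
    path = Path(input_path)
    if not path.is_dir():
        return [path]
    files = sorted(p for p in path.iterdir() if p.suffix.lower().lstrip(".") in extensions)
    print("\n".join(str(p) for p in files))
    if not assume_yes and input(f"Process these {len(files)} files? [y/N] ").lower() != "y":
        return []
    return files

# gather_inputs("/some/dir", ("mkv", "mp4", "png", "jpg", "jpeg"))
```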
-
I've opened a separate Discussions section for Blissful now, since it's started to gain some traction of its own. I will still be hanging around the usual places too, but I figured it was time. I'll probably move my devlog there soon as well. Thanks to everyone for their continued support; your gratitude means more to me than you might imagine 🙏
Furthermore, I've integrated kohya's new fp8 optimizations (yay, thank you!) which improve accuracy of
I'm also considering absorbing HyImage2.1 from sdscripts and incorporating it, but we'll see how practical this is before we make any promises. I'd like to, anyway. Also on that front, I've been experimenting with a merge of Qwen2.5 VLM I made incorporating this model, which shows promise: it improves diversity of backgrounds, quality of texture, and quality of NSFW, at least for realistic styles, which is all I've tested so far. I'm gonna try some further tweaking and if it ends up being useful I'll share that and post about it. Cheers!
-
I've made a small tweak to
Side note that optimized_compile doesn't work well under native Windows according to users, so for anyone who decides to experiment, WSL2 or native Linux might be the best way! My apologies for that; it just seems that compile in general is a lot trickier on Windows, and I don't have a Windows environment myself nor the extra time to do separate testing/maintenance for separate concerns thereof, unfortunately. WSL2 is quite good and not hard to set up though, so maybe it's not too bad!
-
Random thought bumping around in my head - at this point I've realized that my naming of
-
https://github.com/Sarania/blissful-tuner/
Blissful extension of Musubi Tuner by Blyss Sarania
Here you will find an extended version of Musubi Tuner with advanced and experimental features focused on creating a full suite of tools for working with generative video models. Preview videos as they generate, increase inference speed, make longer videos, gain more control over your creations, and enhance them with VFI, upscaling and more! If you wanna get even more out of Musubi then you've come to the right place! Note: for best performance and compatibility, Python 3.12 with PyTorch 2.7.0 or later is recommended! While development is done in Python 3.12, efforts are made to maintain compatibility back to 3.10 as well.
Super epic thanks to kohya-ss for his tireless work on Musubi Tuner, kijai for HunyuanVideoWrapper and WanVideoWrapper from which significant code is ported, and all other devs in the open source generative AI community! Please note that due to the experimental nature of many changes, some things might not work as well as the unmodified Musubi! If you find any issues please let me know and I'll do my best to fix them. Please do not post about issues with this version on the main Musubi Github repo but rather use this repo's issues section!
In order to keep this section maintainable as the project grows, each feature will be listed once, along with a legend indicating which models in the project currently have support for that feature. Most features pertain to inference; if a feature is available for training, that will be specifically noted. Many smaller optimizations and features too numerous to list have been done as well. For the latest updates, I maintain something of a devlog here
Legend of current models: Hunyuan Video: (HY), Wan 2.1/2.2: (WV), Framepack: (FP), Flux (FX), Qwen Image (QI), Available for training: (T)
Blissful Features:
- Wildcard prompts (`--prompt_wildcards /path/to/wildcard/directory`, for instance `__color__` in your prompt would look for color.txt in that directory. The wildcard file format is one potential replacement string per line, with an optional relative weight attached like `red:2.0` or `"some longer string:0.5"` - wildcards can also contain wildcards themselves, the recursion limit is 50 steps!) (HY) (WV) (FP) (FX) (QI)
- Latent previews (`--preview_latent_every N` where N is a number of steps (or sections for FramePack). By default uses latent2rgb; TAE can be enabled with `--preview_vae /path/to/model`, models: https://huggingface.co/Blyss/BlissfulModels/tree/main/taehv) (HY) (WV) (FP) (FX)
- `--optimized`: enables various optimizations and settings based on the model. Requires SageAttention, Triton, PyTorch 2.7.0 or higher. (HY) (WV) (FP) (FX)
- fp16 accumulation (`--fp16_accumulation`, works best with Wan FP16 models (but works with Hunyuan bf16 too!) and requires PyTorch 2.7.0 or higher, but significantly accelerates inference speeds; especially with `--compile` it's almost as fast as fp8_fast/mmscaled without the loss of precision! And it works with fp8 scaled mode too!) (HY) (WV) (FP) (FX)
- Advanced video saving (`--codec codec --container container`, can save Apple ProRes (`--codec prores`, super high bitrate, perceptually lossless) into `--container mkv`, or either of h264, h265 into mp4 or mkv) (HY) (WV) (FP)
- Generation metadata, saved with `--container mkv` and when saving PNG, disable with `--no-metadata`, not available with `--container mp4`. You can conveniently view/copy such metadata with `src/blissful_tuner/metaview.py some_video.mkv`; the viewer requires the mediainfo CLI. (HY) (WV) (FP)
- CFGZero* (`--cfgzerostar_scaling`, `--cfgzerostar_init_steps N` where N is the total number of steps to zero out at the start. 2 is good for T2V, 1 for I2V, but it's better for T2V in my experience. Support for Hunyuan is HIGHLY experimental and only available with CFG enabled.) (HY) (WV) (FX)
- CFG scheduling (`--cfg_schedule`, please see the `--help` for usage. Can specify guidance scale down to individual steps if you like!) (HY) (WV) (FX)
- RifleX for longer videos (`--riflex_index N` where N is the RifleX frequency. 6 is good for Wan, can usually go to ~115 frames instead of just 81, requires `--rope_func comfy` with Wan; 4 is good for Hunyuan and you can make at least double length!) (HY) (WV)
- Perpendicular negative guidance (`--perp_neg neg_strength`, where neg_strength is a float that controls the strength of the negative prompt. See `--help` for more!) (HY) (WV)
- NAG, negative guidance without CFG (enable with `--nag_scale 3.0` and provide a negative prompt!) (WV)
- Extra samplers for distilled models (use `--sample_solver lcm` or `--sample_solver dpm++sde` with distilled Wan models/LoRA like https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill - plus I made a convenient LoRA: https://huggingface.co/Blyss/BlissfulModels/tree/main/wan_lcm ) (WV)
- V2V inference (`--video_path /path/to/input/video --denoise_strength amount` where amount is a float 0.0-1.0 that controls how strong the noise added to the source video will be. If `--noise_mode traditional`, it will run the last (amount * 100) percent of the timestep schedule like other implementations. If `--noise_mode direct`, it will directly control the amount of noise added as closely as possible by starting from wherever in the timestep schedule is closest to that value and proceeding from there. Supports scaling, padding, and truncation so the input doesn't have to be the same res as the output or even the same length! If `--video_length` is shorter than the input, the input will be truncated to include only the first `--video_length` frames. If `--video_length` is longer than the input, the first or last frame will be repeated to pad the length depending on `--v2v_pad_mode`. You can use either T2V or I2V `--task` modes and models (i2v mode produces better quality in my opinion)! In I2V mode, if `--image_path` is not specified, the first frame of the video will be used to condition the model instead. `--infer_steps` should be the same amount it would be for a full denoise, e.g. by default 50 for T2V or 40 for I2V, because we need to modify from a full schedule. Actual steps will depend on `--noise_mode`.) (WV)
- I2I (`--i2i_path /path/to/image` - use with a T2V model in T2I mode, specify strength with `--denoise_strength`. Supports `--i2_extra_noise` for latent noise augmentation as well) (WV)
- Prompt weighting (enable with `--prompt_weighting` and then in your prompt you can do like "a cat playing with a (large:1.4) red ball" to upweight the effect of "large". Note that [this] or (this) isn't supported, only (this:1.0)) (WV)
- Comfy ROPE, faster and more memory efficient with `--compile` for inference or `--optimized_compile` for training! (`--rope_func comfy`) (WV) (T)
- Extra latent noise (`--v2_extra_noise 0.02 --i2_extra_noise 0.02`, values less than 0.04 are recommended. This can improve fine detail and texture, but too much will cause artifacts and moving shadows. I use around 0.01-0.02 for V2V and 0.02-0.04 for I2V) (WV)
- Mixed precision transformer (`--mixed_precision_transformer` for inference or training, see Blissful Tuner (unofficial, experimental extensions to Musubi Tuner) #232 (comment) for how to create such a transformer and why you might wanna) (WV) (T)
- Extra LLM text encoder settings (`--hidden_state_skip_layer N`, `--apply_final_norm`, please see the `--help` for explanations!) (HY)
- Scaled fp8 (`--fp8_scaled`, HIGHLY recommended both for inference and training. It's just better fp8, that's all you need to know!) (HY) (T)
- Separate CLIP prompt (`--prompt_2 "second prompt goes here"`, provides a different prompt to CLIP since it's used to simpler text) (HY)
- Rescale text encoder influence (`--te_multiplier llm clip` such as `--te_multiplier 0.9 1.2` to downweight the LLM slightly and upweight the CLIP slightly) (HY)

Non model specific extras:
(Please make sure to install the project into your venv with `--group postprocess` (e.g. `pip install -e . --group postprocess --group dev` to fully install all requirements) if you want to use the below scripts!)

- Framerate interpolation with GIMM-VFI (`src/blissful_tuner/GIMMVFI.py`, please see its `--help` for usage. Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/VFI )
- Upscaling (`src/blissful_tuner/upscaler.py`, please see its `--help` for usage. Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/upscaling )
- Face blurring (`blissful_tuner/yolo_blur.py`, please see its `--help` for usage. Recommended model: https://huggingface.co/Blyss/BlissfulModels/tree/main/yolo )
- Face restoration (`src/blissful_tuner/facefix.py`, per usual please have a look at the `--help`! Models: https://huggingface.co/Blyss/BlissfulModels/tree/main/face_restoration )

Also a related project of mine ( https://github.com/Sarania/Envious ) is useful for managing Nvidia GPUs from the terminal on Linux. It requires nvidia-ml-py and supports realtime monitoring, over/underclocking, power limit adjustment, fan control, profiles, and more. It also has a little process monitor for the GPU VRAM! Basically it's like nvidia-smi except not bad 😂