feat: add LoRA support for Z-Image turbo models #739
base: dev
Conversation
- Add NunchakuZImageLoraLoader node for applying LoRAs to Z-Image models
- Add ComfyZImageWrapper with _LoRALinear for SVDQuant linear layers
- Add _fuse_qkv_lora and _fuse_w13_lora for fusion order parity
- Add reset_lora for robust weight management between runs
- Add test workflow for vintage cartoon style LoRA
- Register vintage cartoon LoRA in models.yaml
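For readers unfamiliar with the approach, here is a minimal, illustrative sketch of applying LoRA deltas on top of a frozen (e.g. SVDQuant-quantized) linear layer; the class and attribute names are placeholders, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class LoRALinearSketch(nn.Module):
    """Illustrative only: adds LoRA deltas on top of a frozen base linear.

    Follows the (A, B) convention quoted later in this thread:
    delta = (x @ A.T) @ B.T, with A of shape (rank, in) and B of shape (out, rank).
    """

    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base  # e.g. an SVDQuant W4A4 linear; treated as a black box here
        self.loras: list[tuple[torch.Tensor, torch.Tensor, float]] = []

    def add_lora(self, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> None:
        self.loras.append((A, B, scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        for A, B, scale in self.loras:
            out = out + scale * ((x @ A.T) @ B.T)
        return out
```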
Tested and it works.

@flybirdxx

LoRAs seem to stay on the GPU when the
- Store adaLN LoRA A/B as buffers on the underlying nn.Linear so ModelPatcher leaf-module .to() moves them in lowvram mode
- Keep Z-Image wrapper .to()/to_safely returning self
- Ensure ModelPatcher is created with offload_device when copying Z-Image models
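A rough illustration of the buffer trick described in that commit, using a plain `nn.Linear` as a stand-in for the adaLN projection (the buffer names here are invented for the example):

```python
import torch
import torch.nn as nn

linear = nn.Linear(64, 64)

# Registering the LoRA factors as buffers (instead of plain Python attributes)
# lets ModelPatcher's per-leaf-module .to() calls move them together with the
# layer's own weight/bias in lowvram mode.
linear.register_buffer("lora_A", torch.zeros(8, 64), persistent=False)
linear.register_buffer("lora_B", torch.zeros(64, 8), persistent=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
linear.to(device)
print(linear.lora_A.device)  # follows the module, unlike a plain attribute
```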
@ssitu
The last commit gives the same savings as my quick hack, but there is still VRAM being retained. I tried some things out, and the diff below fixes the issue for me. I'd guess the buffers are not registered correctly, but I'm not sure. Any ideas? I am on torch 2.9.0 if that has any impact.

+++ b/wrappers/zimage.py
@@ -29,6 +29,7 @@ from nunchaku.lora.flux.nunchaku_converter import (
unpack_lowrank_weight,
)
from nunchaku.models.linear import SVDQW4A4Linear
+from ..model_patcher import NunchakuModelPatcher
from nunchaku.utils import load_state_dict_in_safetensors
logger = logging.getLogger(__name__)
@@ -77,6 +78,7 @@ class _LoRALinear(nn.Module):
def __init__(self, base: nn.Linear):
super().__init__()
self.base = base
+ self.loras: List[Tuple[torch.Tensor, torch.Tensor]] = [] # (A, B) where delta = (x @ A.T) @ B.T
@property
def in_features(self) -> int:
@@ -93,6 +95,13 @@ class _LoRALinear(nn.Module):
@property
def bias(self) -> Optional[torch.Tensor]:
return self.base.bias
+
+ def _apply(self, fn):
+ """Override _apply to also apply to LoRA weights when model is moved."""
+ super()._apply(fn)
+ if self.loras:
+ self.loras = [(fn(A), fn(B)) for A, B in self.loras]
+ return self
@staticmethod
def _register_or_set_buffer(module: nn.Module, name: str, tensor: torch.Tensor) -> None:
@@ -671,5 +680,5 @@ def copy_with_ctx(model_wrapper: ComfyZImageWrapper) -> Tuple[ComfyZImageWrapper
device_id = ctx_for_copy.get("device_id", 0)
offload_device = ctx_for_copy.get("offload_device", torch.device("cpu"))
- ret_model = ModelPatcher(model_base, load_device=device, offload_device=offload_device)
+ ret_model = NunchakuModelPatcher(model_base, load_device=device, offload_device=offload_device)
return ret_model_wrapper, ret_model
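For context on why the `_apply` override above helps: `.to()`, `.cuda()`, and `.half()` all route through `nn.Module._apply`, which only visits registered parameters and buffers, so tensors kept in a plain Python list are left behind unless something re-maps them explicitly. A tiny sketch of that default behavior:

```python
import torch
import torch.nn as nn


class Holder(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(4, 4))  # registered: moved by .to()
        self.extra = torch.zeros(4, 4)                 # plain attribute: ignored by .to()


m = Holder()
if torch.cuda.is_available():
    m.to("cuda")
    print(m.weight.device)  # cuda:0
    print(m.extra.device)   # cpu -- this is what the override compensates for
```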
I’m having the same issue as @flybirdxx. Could there be differences between FP4 and INT4 models? Maybe those who got it working aren’t using RTX 50-series GPUs? I run ComfyUI on a 5070 Ti. Supplement: I can successfully generate images with the FP4 Z-Image model using the official ComfyUI-Nunchaku, but with this PR, even without a LoRA, it shows the error messages above and produces noisy output instead. model_type FLOW
@ssitu Thanks for digging into this. I think the main difference with the NunchakuModelPatcher approach is that it bypasses ComfyUI’s standard ModelPatcher.load() bookkeeping (e.g. model.model_loaded_weight_memory). That can change ComfyUI’s smart-memory decisions and makes the logs/VRAM behavior harder to interpret. Patching NunchakuModelPatcher to fix that is outside the scope of this PR, I think, so I opted to keep the standard ModelPatcher for now. Also, nvidia-smi reports the allocator’s “reserved” VRAM; a real leak would show up as torch.cuda.memory_allocated() staying high after offload/unload. I ran a small CUDA memory report to compare allocated vs. reserved. On my side:

So I’m no longer seeing persistent allocations consistent with LoRA weights stuck on the GPU, and as far as I can tell any remaining “VRAM used” in nvidia-smi is allocator caching (reserved), not live tensors (allocated).
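For reference, a minimal version of the kind of allocated-vs-reserved report mentioned above (not the exact script used; assumes a CUDA device):

```python
import torch


def cuda_memory_report(tag: str) -> None:
    """Compare live tensor allocations against the caching allocator's reserve."""
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")


cuda_memory_report("after generation")
# ... trigger ComfyUI's model offload/unload here ...
cuda_memory_report("after offload")
torch.cuda.empty_cache()  # release cached blocks so nvidia-smi tracks live tensors more closely
cuda_memory_report("after empty_cache")
```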
@flybirdxx @phatwila I consulted an AI assistant regarding this issue, and it suggested a modification. After applying the changes below, I successfully ran generation using LoRA on an RTX 5070 Ti. I am sharing this for the author's reference, though I am not entirely sure whether this modification is sufficient for all use cases. Modifications in
Introduces a new function '_apply_nvfp4_scale_keys()' to handle Nunchaku SVDQW4A4Linear scale parameters correctly:
- Fill missing 'wcscales' with ones (stable default behavior)
- Remap per-channel scales stored under 'wtscale' to 'wcscales' if the shape matches
- Pop float 'wtscale' from the state dict and assign it to the module attribute
Also: robust 'attention.to_out' key replacement for broader checkpoint formats.
Fixes noisy output / "unexpected key" errors on RTX 50-series GPUs.
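Based only on the commit message above, a hedged sketch of what such scale-key handling could look like; the real `_apply_nvfp4_scale_keys()` in the PR may differ in signature and details:

```python
import torch
import torch.nn as nn


def apply_nvfp4_scale_keys_sketch(module: nn.Module, prefix: str, state_dict: dict) -> None:
    """Illustrative normalization of SVDQW4A4Linear scale keys ('wcscales'/'wtscale')."""
    wc_key, wt_key = f"{prefix}wcscales", f"{prefix}wtscale"
    wcscales = getattr(module, "wcscales", None)

    # 1. Fill a missing 'wcscales' entry with ones (stable default behavior).
    if wc_key not in state_dict and wcscales is not None:
        state_dict[wc_key] = torch.ones_like(wcscales)

    wt = state_dict.get(wt_key)
    if wt is None:
        return
    # 2. Per-channel scales stored under 'wtscale': remap them to 'wcscales' if the shape matches.
    if wcscales is not None and wt.shape == wcscales.shape:
        state_dict[wc_key] = state_dict.pop(wt_key)
    # 3. Scalar float 'wtscale': pop it from the state dict and keep it on the module instead.
    elif wt.numel() == 1:
        module.wtscale = float(state_dict.pop(wt_key))
```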
@flybirdxx @judian17
Yes, the latest version works very well. Thank you!
Ah gotcha, I didn't have the smart memory turned off so it wasn't clear to me if it was a leak or not. Turning it off does make it clear that it is fixed, thanks! |




Enable LoRA support for Z-Image (NextDiT/Lumina2) models, including multi-LoRA fusion, state reset, and test workflows.
Summary of Changes
SVDQuant Linear Wrapper
Implemented `_LoRALinear` in `wrappers/zimage.py` to correctly apply LoRA weights to W4A4 SVDQuant linear modules.
Fusion Order Parity
Added `_fuse_qkv_lora` and `_fuse_w13_lora` logic to preserve parity with Z-Image fused projection layouts, ensuring LoRA composition matches the model’s native weight fusion order (a sketch of the idea follows this summary).
Robust Weight Management
Implemented `reset_lora` to safely restore original unpatched modules between workflow executions, preventing LoRA state leakage across runs.
Node Integration
Fully integrated `ComfyZImageWrapper` into `NunchakuZImageDiTLoader`, enabling automatic LoRA patching through standard ComfyUI workflows.
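The fusion-order sketch referenced above, with hypothetical sizes: per-projection LoRA deltas for a fused QKV linear have to be concatenated in exactly the order the model uses when it fuses the q/k/v weights.

```python
import torch

# Hypothetical sizes: hidden = 64, LoRA rank = 8; the fused QKV output is 3 * hidden.
hidden, rank = 64, 8
x = torch.randn(2, hidden)

# Per-projection LoRA factors, one (A, B) pair each for q, k, v.
A = {n: torch.randn(rank, hidden) for n in ("q", "k", "v")}
B = {n: torch.randn(hidden, rank) for n in ("q", "k", "v")}

# The fused projection stacks q, k, v along the output dimension, so the LoRA
# deltas must be concatenated in that same order to line up with the fused weight.
delta_fused = torch.cat([(x @ A[n].T) @ B[n].T for n in ("q", "k", "v")], dim=-1)
print(delta_fused.shape)  # torch.Size([2, 192])
```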
Motivation
Z-Image turbo models (NextDiT/Lumina2) are commonly deployed in Nunchaku-quantized form for performance reasons, but this previously prevented the use of LoRAs due to incompatible weight formats and fused projection layers.
This PR enables full LoRA support for these models within ComfyUI, allowing users to apply stylistic and domain LoRAs (for example anime, illustration, realism or vintage cartoon styles) while retaining the performance and memory advantages of W4A4 quantization.
The goal is functional parity with standard ComfyUI LoRA workflows, without requiring users to load or maintain full-precision model variants.
Modifications
`wrappers/zimage.py`
- Added `_LoRALinear`, a wrapper for applying LoRA deltas to SVDQuant linear layers.
- Added `_fuse_qkv_lora` and `_fuse_w13_lora` to correctly merge LoRA weights into Z-Image fused projection matrices.
- Added `reset_lora` to restore original module state between executions (see the sketch after this section).
- Added `compose_loras` to manage multi-LoRA composition when standard patching is not applicable.
- Added `ComfyZImageWrapper`, which manages LoRA state and applies composition during forward passes when required.
`nodes/lora/zimage.py`
- Added `NunchakuZImageLoraLoader`, a dedicated LoRA loader node for Z-Image models.
- Added `copy_with_ctx` to clone model wrappers with attached LoRA state, preserving standard ComfyUI model cloning semantics.
`nodes/models/zimage.py`
- Updated `NunchakuZImageDiTLoader` to wrap the diffusion model with `ComfyZImageWrapper`, enabling transparent LoRA support at load time.
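A minimal sketch of the reset idea (restoring the original modules between runs); the attribute names follow the illustrative wrapper earlier on this page, not necessarily the PR's code:

```python
import torch.nn as nn


def reset_lora_sketch(model: nn.Module) -> None:
    """Swap every LoRA wrapper back for the module it wraps.

    Assumes wrappers keep the original layer in a `.base` attribute and their
    LoRA state in `.loras`, as in the sketch near the top of this page.
    """
    replacements = []
    for parent in model.modules():
        for name, child in parent.named_children():
            if hasattr(child, "base") and hasattr(child, "loras"):
                replacements.append((parent, name, child.base))
    for parent, name, original in replacements:
        setattr(parent, name, original)  # restore the unpatched module in place
```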
Tests
- Added a test workflow under `tests/workflows/nunchaku-z-image-turbo-lora/` demonstrating a vintage cartoon style LoRA.
- Seed `88888`, steps `9`, shift `7`, Euler sampler.
- Registered the test case in `test_cases.json`.
Compatibility Fix
Adjusted `model_patcher.load()` to return `(self,)`, ensuring correct state tracking by ComfyUI’s execution engine.
Checklist
- Pre-commit checks pass (`pre-commit run --all-files`)
- Test workflow added under `tests/workflows`
- Test case registered in `test_cases.json`
- Model entries added to `scripts/download_models.py` and `test_data/models.yaml`