reland [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support #22672

Open

BBuf wants to merge 5 commits into main from codex/flux1-modelopt-nvfp4-resubmit

Conversation

@BBuf
Collaborator

@BBuf BBuf commented Apr 13, 2026

Summary

  • add a FLUX.1-dev ModelOpt NVFP4 mixed-transformer builder for SGLang diffusion
  • make NVFP4 loading configurable for nibble swapping and preserve validated FLUX.1-dev export layout
  • fix FLUX attention/single-block quant prefixes so FLUX.1 fallback excludes match the intended modules
  • add unit coverage for the new NVFP4 config and FLUX prefix behavior
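The nibble-swapping option mentioned above refers to the byte layout of packed FP4 weights: each uint8 holds two 4-bit values, and different exporters disagree on which nibble comes first. A minimal sketch of what the swap does (the function name and the NumPy-based layout are illustrative assumptions, not the PR's actual loader code):

```python
import numpy as np

def swap_nibbles(packed: np.ndarray) -> np.ndarray:
    """Swap the high and low 4-bit values in each packed FP4 byte."""
    assert packed.dtype == np.uint8
    # Shifts on uint8 arrays stay uint8, so the overflow bits of
    # (packed << 4) are discarded automatically.
    return ((packed << 4) | (packed >> 4)).astype(np.uint8)

# 0xAB packs two FP4 values (0xA in the high nibble, 0xB in the low);
# swapping yields 0xBA.
print(hex(swap_nibbles(np.array([0xAB], dtype=np.uint8))[0]))  # -> 0xba
```

Making this configurable lets the loader accept checkpoints exported with either nibble order without re-exporting.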

Validation

  • Remote RTX 5090 (4 GPUs), torch.compile disabled throughout benchmark/profile/correctness runs
  • pytest -q python/sglang/multimodal_gen/test/unit/test_transformer_quant.py in the remote diffusion container
  • BF16 benchmark denoise: 37.6940s
  • NVFP4 benchmark denoise: 29.0421s (22.95% faster)
  • BF16 end-to-end: 38.2545s
  • NVFP4 end-to-end: 29.4954s (22.90% faster)
  • Correctness check against BF16 at 512x512 / 8 steps: trajectory cosine 0.9933, final image PSNR 28.16 dB
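The speedup percentages above follow directly from the raw timings; a quick check:

```python
# Reproduce the reported speedup percentages from the raw timings above.
bf16_denoise, nvfp4_denoise = 37.6940, 29.0421
bf16_e2e, nvfp4_e2e = 38.2545, 29.4954

denoise_speedup = (bf16_denoise - nvfp4_denoise) / bf16_denoise * 100
e2e_speedup = (bf16_e2e - nvfp4_e2e) / bf16_e2e * 100

print(f"{denoise_speedup:.2f}%")  # 22.95%
print(f"{e2e_speedup:.2f}%")      # 22.90%
```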

bf16:

(screenshot: flux1_bf16_main_4gpu_1024_layeroffload)

nvfp4:

(screenshot: flux1_nvfp4_pr_4gpu_1024_layeroffload)

Notes

  • The validated FLUX.1-dev path uses --transformer-path for the mixed SGLang transformer override.
  • Profiling traces were captured on both main and this branch with identical 4-GPU settings and torch.compile disabled.

@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization blackwell SM100/SM120 diffusion SGLang Diffusion jit-kernel labels Apr 13, 2026
Collaborator Author

BBuf commented Apr 13, 2026

/tag-and-rerun-ci

@BBuf BBuf changed the title [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support reland [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support Apr 13, 2026
@BBuf BBuf marked this pull request as ready for review April 13, 2026 07:36
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request expands ModelOpt quantization support for diffusion models, introducing FP8 and NVFP4 compatibility for FLUX and LTX-2 families. It adds a new tool for building mixed-precision NVFP4 checkpoints, implements JIT module prewarming for torch.compile, and adds support for NVFP4 nibble swapping. Review feedback suggests broadening exception handling in the FSDP loader and improving the safety of directory management in the build scripts.

Comment on lines +413 to +422
try:
    weight_loader(temp_param, full_tensor)
except AssertionError as exc:
    raise AssertionError(
        "Failed to shard/load parameter "
        f"{target_param_name}: full_tensor.shape={tuple(full_tensor.shape)}, "
        f"meta_sharded_param.shape={tuple(meta_sharded_param.shape)}, "
        f"temp_param.shape={tuple(temp_param.shape)}, "
        f"param_cls={type(actual_param).__name__}"
    ) from exc
Contributor


medium

While catching AssertionError provides useful context for shape mismatches during weight loading, it might be safer to catch a broader Exception or specifically RuntimeError as well, as some weight loaders might raise different exception types depending on the underlying failure (e.g., device-side errors or memory allocation issues). If the intent is strictly to debug shape mismatches, this is fine, but consider if other loading failures should also be wrapped with this diagnostic information.
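One way the suggestion could look in code — a sketch only, reusing the names from the diff above; the exact exception tuple to catch is an assumption, not something the PR decides:

```python
def load_with_context(weight_loader, temp_param, full_tensor, target_param_name):
    """Wrap loader failures with shape diagnostics.

    Sketch of the reviewer's suggestion: broaden the catch beyond
    AssertionError so RuntimeErrors (e.g. device-side failures) also
    carry the same diagnostic context.
    """
    try:
        weight_loader(temp_param, full_tensor)
    except (AssertionError, RuntimeError) as exc:
        # Re-raise the same exception type so callers' handlers still match.
        raise type(exc)(
            f"Failed to shard/load parameter {target_param_name}: "
            f"full_tensor.shape={tuple(full_tensor.shape)}, "
            f"temp_param.shape={tuple(temp_param.shape)}"
        ) from exc
```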

Comment on lines +234 to +240
if output_path.exists():
if not overwrite:
raise FileExistsError(
f"Output directory already exists: {output_path}. "
"Use --overwrite to replace it."
)
shutil.rmtree(output_path)
Contributor


medium

The use of shutil.rmtree(output_path) when overwrite=True is dangerous if the user accidentally provides a path to a directory containing important data (like the base model directory). It would be safer to only delete specific files that the tool expects to write, or at least issue a warning before deletion.
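A sketch of the safer alternative the reviewer describes — delete only the artifacts the tool expects to rewrite, with a warning. The artifact names here are hypothetical; the real list depends on what the build script writes:

```python
import warnings
from pathlib import Path

# Files this tool is known to write; anything else is left untouched.
# (Hypothetical names -- the real artifact list depends on the build script.)
EXPECTED_ARTIFACTS = ("config.json", "diffusion_pytorch_model.safetensors")

def prepare_output_dir(output_path: Path, overwrite: bool) -> None:
    if output_path.exists():
        if not overwrite:
            raise FileExistsError(
                f"Output directory already exists: {output_path}. "
                "Use --overwrite to replace it."
            )
        # Remove only the expected artifacts instead of shutil.rmtree,
        # so pointing at a base-model directory cannot wipe it.
        for name in EXPECTED_ARTIFACTS:
            target = output_path / name
            if target.exists():
                warnings.warn(f"Overwriting {target}")
                target.unlink()
    output_path.mkdir(parents=True, exist_ok=True)
```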

@BBuf
Collaborator Author

BBuf commented Apr 13, 2026

/tag-and-rerun-ci

Collaborator Author

BBuf commented Apr 13, 2026

I dug into the default sglang JIT/CUTLASS NVFP4 failure we saw on B200 without the FlashInfer override.

The failure is not in initialize() or run(): cutlass_scaled_fp4_mm_sm100 already rejects the problem at can_implement(). The first failing Wan2.2 case comes from the to_q projection and resolves to:

  • KernelConfigLargeM
  • m=37800, n=5120, k=5120
  • packed inputs A=37800x2560, B=5120x2560
  • scales A_sf=37888x320, B_sf=5120x320
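The packed and scale shapes above are self-consistent under the usual NVFP4 layout: two FP4 values per byte along k, one scale per 16-element block, and the A-scale rows padded to a multiple of 128. Those block/padding sizes are inferred from the reported numbers, not read from the kernel source:

```python
import math

def nvfp4_gemm_shapes(m, n, k, scale_block=16, pad=128):
    """Reproduce the reported packed-operand and scale shapes.

    scale_block=16 and pad=128 are inferred from the numbers above
    (37800 rounds up to 37888 = 296 * 128; 5120 / 16 = 320), not
    taken from the CUTLASS kernel itself.
    """
    packed_a = (m, k // 2)            # two FP4 values per byte along k
    packed_b = (n, k // 2)
    m_pad = math.ceil(m / pad) * pad  # 37800 -> 37888
    a_sf = (m_pad, k // scale_block)  # one scale per 16-element block
    b_sf = (n, k // scale_block)      # n=5120 is already a multiple of 128
    return packed_a, packed_b, a_sf, b_sf

print(nvfp4_gemm_shapes(37800, 5120, 5120))
# ((37800, 2560), (5120, 2560), (37888, 320), (5120, 320))
```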

I also ran two controls:

  • synthetic FP4 GEMM with the same shape
  • model-level Wan2.2 generation with the same shape path

Those checks succeeded once I forced the m > 1024 dispatch away from KernelConfigLargeM to the default config, so this looks like a current sm100 LargeM dispatch/cluster-selection issue rather than "Wan2.2 NVFP4 JIT is unsupported" in general.

I pushed a small follow-up on this branch (5f4462f9f) to document the current Blackwell workaround and leave a TODO in code:

  • if you need the validated path today, set SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
  • long-term, the right fix is to add a validated CUTLASS fallback for these large-M shapes instead of relying on the env override
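For reference, the workaround is a single environment variable set before launching (the variable name is from this thread; anything beyond it in your launch command is up to your setup):

```shell
# Blackwell (sm100) workaround until a validated CUTLASS large-M
# fallback lands: route FP4 GEMMs through the cuDNN backend instead
# of the JIT/CUTLASS path.
export SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
echo "$SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND"
```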
