reland [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support #22672

Open

BBuf wants to merge 5 commits into main from codex/flux1-modelopt-nvfp4-resubmit

Conversation

@BBuf
Collaborator

@BBuf BBuf commented Apr 13, 2026

Summary

  • add a FLUX.1-dev ModelOpt NVFP4 mixed-transformer builder for SGLang diffusion
  • make NVFP4 loading configurable for nibble swapping and preserve validated FLUX.1-dev export layout
  • fix FLUX attention/single-block quant prefixes so FLUX.1 fallback excludes match the intended modules
  • add unit coverage for the new NVFP4 config and FLUX prefix behavior
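The nibble-swapping option mentioned above refers to the byte layout of packed FP4 weights: each uint8 holds two 4-bit values, and different exporters disagree on which nibble comes first. A minimal sketch of what the swap does (the function name and the NumPy-based layout are illustrative assumptions, not the PR's actual loader code):

```python
import numpy as np

def swap_nibbles(packed: np.ndarray) -> np.ndarray:
    """Swap the high and low 4-bit values in each packed FP4 byte."""
    assert packed.dtype == np.uint8
    # Shifts on uint8 arrays stay uint8, so the overflow bits of
    # (packed << 4) are discarded automatically.
    return ((packed << 4) | (packed >> 4)).astype(np.uint8)

# 0xAB packs two FP4 values (0xA in the high nibble, 0xB in the low);
# swapping yields 0xBA.
print(hex(swap_nibbles(np.array([0xAB], dtype=np.uint8))[0]))  # -> 0xba
```

Making this configurable lets the loader accept checkpoints exported with either nibble order without re-exporting.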

Validation

  • Remote RTX 5090 (4 GPUs), torch.compile disabled throughout benchmark/profile/correctness runs
  • pytest -q python/sglang/multimodal_gen/test/unit/test_transformer_quant.py in the remote diffusion container
  • BF16 benchmark denoise: 37.6940s
  • NVFP4 benchmark denoise: 29.0421s (22.95% faster)
  • BF16 end-to-end: 38.2545s
  • NVFP4 end-to-end: 29.4954s (22.90% faster)
  • Correctness check against BF16 at 512x512 / 8 steps: trajectory cosine 0.9933, final image PSNR 28.16 dB
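The speedup percentages above follow directly from the raw timings; a quick check:

```python
# Reproduce the reported speedup percentages from the raw timings above.
bf16_denoise, nvfp4_denoise = 37.6940, 29.0421
bf16_e2e, nvfp4_e2e = 38.2545, 29.4954

denoise_speedup = (bf16_denoise - nvfp4_denoise) / bf16_denoise * 100
e2e_speedup = (bf16_e2e - nvfp4_e2e) / bf16_e2e * 100

print(f"{denoise_speedup:.2f}%")  # 22.95%
print(f"{e2e_speedup:.2f}%")      # 22.90%
```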

bf16:

(screenshot: flux1_bf16_main_4gpu_1024_layeroffload)

nvfp4:

(screenshot: flux1_nvfp4_pr_4gpu_1024_layeroffload)

Notes

  • The validated FLUX.1-dev path uses --transformer-path for the mixed SGLang transformer override.
  • Profiling traces were captured on both main and this branch with identical 4-GPU settings and torch.compile disabled.

@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization blackwell SM100/SM120 diffusion SGLang Diffusion jit-kernel labels Apr 13, 2026
Collaborator Author

BBuf commented Apr 13, 2026

/tag-and-rerun-ci

@BBuf BBuf changed the title [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support reland [Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support Apr 13, 2026
@BBuf BBuf marked this pull request as ready for review April 13, 2026 07:36
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request expands ModelOpt quantization support for diffusion models, introducing FP8 and NVFP4 compatibility for FLUX and LTX-2 families. It adds a new tool for building mixed-precision NVFP4 checkpoints, implements JIT module prewarming for torch.compile, and adds support for NVFP4 nibble swapping. Review feedback suggests broadening exception handling in the FSDP loader and improving the safety of directory management in the build scripts.

Comment on lines +413 to +422
try:
    weight_loader(temp_param, full_tensor)
except AssertionError as exc:
    raise AssertionError(
        "Failed to shard/load parameter "
        f"{target_param_name}: full_tensor.shape={tuple(full_tensor.shape)}, "
        f"meta_sharded_param.shape={tuple(meta_sharded_param.shape)}, "
        f"temp_param.shape={tuple(temp_param.shape)}, "
        f"param_cls={type(actual_param).__name__}"
    ) from exc
Contributor


medium

While catching AssertionError provides useful context for shape mismatches during weight loading, it might be safer to catch a broader Exception or specifically RuntimeError as well, as some weight loaders might raise different exception types depending on the underlying failure (e.g., device-side errors or memory allocation issues). If the intent is strictly to debug shape mismatches, this is fine, but consider if other loading failures should also be wrapped with this diagnostic information.
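One way the suggestion could look in code — a sketch only, reusing the names from the diff above; the exact exception tuple to catch is an assumption, not something the PR decides:

```python
def load_with_context(weight_loader, temp_param, full_tensor, target_param_name):
    """Wrap loader failures with shape diagnostics.

    Sketch of the reviewer's suggestion: broaden the catch beyond
    AssertionError so RuntimeErrors (e.g. device-side failures) also
    carry the same diagnostic context.
    """
    try:
        weight_loader(temp_param, full_tensor)
    except (AssertionError, RuntimeError) as exc:
        # Re-raise the same exception type so callers' handlers still match.
        raise type(exc)(
            f"Failed to shard/load parameter {target_param_name}: "
            f"full_tensor.shape={tuple(full_tensor.shape)}, "
            f"temp_param.shape={tuple(temp_param.shape)}"
        ) from exc
```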

Comment on lines +234 to +240
if output_path.exists():
if not overwrite:
raise FileExistsError(
f"Output directory already exists: {output_path}. "
"Use --overwrite to replace it."
)
shutil.rmtree(output_path)
Contributor


medium

The use of shutil.rmtree(output_path) when overwrite=True is dangerous if the user accidentally provides a path to a directory containing important data (like the base model directory). It would be safer to only delete specific files that the tool expects to write, or at least issue a warning before deletion.
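A sketch of the safer alternative the reviewer describes — delete only the artifacts the tool expects to rewrite, with a warning. The artifact names here are hypothetical; the real list depends on what the build script writes:

```python
import warnings
from pathlib import Path

# Files this tool is known to write; anything else is left untouched.
# (Hypothetical names -- the real artifact list depends on the build script.)
EXPECTED_ARTIFACTS = ("config.json", "diffusion_pytorch_model.safetensors")

def prepare_output_dir(output_path: Path, overwrite: bool) -> None:
    if output_path.exists():
        if not overwrite:
            raise FileExistsError(
                f"Output directory already exists: {output_path}. "
                "Use --overwrite to replace it."
            )
        # Remove only the expected artifacts instead of shutil.rmtree,
        # so pointing at a base-model directory cannot wipe it.
        for name in EXPECTED_ARTIFACTS:
            target = output_path / name
            if target.exists():
                warnings.warn(f"Overwriting {target}")
                target.unlink()
    output_path.mkdir(parents=True, exist_ok=True)
```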

@BBuf
Collaborator Author

BBuf commented Apr 13, 2026

/tag-and-rerun-ci

Collaborator Author

BBuf commented Apr 13, 2026

I dug into the default sglang JIT/CUTLASS NVFP4 failure we saw on B200 without the FlashInfer override.

The failure is not in initialize() or run(): cutlass_scaled_fp4_mm_sm100 already rejects the problem at can_implement(). The first failing Wan2.2 case comes from the to_q projection and resolves to:

  • KernelConfigLargeM
  • m=37800, n=5120, k=5120
  • packed inputs A=37800x2560, B=5120x2560
  • scales A_sf=37888x320, B_sf=5120x320
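The packed and scale shapes above are self-consistent under the usual NVFP4 layout: two FP4 values per byte along k, one scale per 16-element block, and the A-scale rows padded to a multiple of 128. Those block/padding sizes are inferred from the reported numbers, not read from the kernel source:

```python
import math

def nvfp4_gemm_shapes(m, n, k, scale_block=16, pad=128):
    """Reproduce the reported packed-operand and scale shapes.

    scale_block=16 and pad=128 are inferred from the numbers above
    (37800 rounds up to 37888 = 296 * 128; 5120 / 16 = 320), not
    taken from the CUTLASS kernel itself.
    """
    packed_a = (m, k // 2)            # two FP4 values per byte along k
    packed_b = (n, k // 2)
    m_pad = math.ceil(m / pad) * pad  # 37800 -> 37888
    a_sf = (m_pad, k // scale_block)  # one scale per 16-element block
    b_sf = (n, k // scale_block)      # n=5120 is already a multiple of 128
    return packed_a, packed_b, a_sf, b_sf

print(nvfp4_gemm_shapes(37800, 5120, 5120))
# ((37800, 2560), (5120, 2560), (37888, 320), (5120, 320))
```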

I also ran two controls:

  • synthetic FP4 GEMM with the same shape
  • model-level Wan2.2 generation with the same shape path

Those checks succeeded once I forced the m > 1024 dispatch away from KernelConfigLargeM to the default config, so this looks like a current sm100 LargeM dispatch/cluster-selection issue rather than "Wan2.2 NVFP4 JIT is unsupported" in general.

I pushed a small follow-up on this branch (5f4462f9f) to document the current Blackwell workaround and leave a TODO in code:

  • if you need the validated path today, set SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
  • long-term, the right fix is to add a validated CUTLASS fallback for these large-M shapes instead of relying on the env override
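For reference, the workaround is a single environment variable set before launching (the variable name is from this thread; anything beyond it in your launch command is up to your setup):

```shell
# Blackwell (sm100) workaround until a validated CUTLASS large-M
# fallback lands: route FP4 GEMMs through the cuDNN backend instead
# of the JIT/CUTLASS path.
export SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn
echo "$SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND"
```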
