[MoE] Raise clear error for DeepEP normal dispatch in flashinfer_cutedsl FP4 #29523
Open
JustinTong0323 wants to merge 1 commit into
…dsl FP4

The deepep->flashinfer_cutedsl FP4 fused func only implements the low-latency (masked, 6-field) dispatch. In deepep `auto` mode prefill uses the normal (5-field) dispatch, which fell into the same 6-tuple unpack and crashed with an opaque `ValueError: not enough values to unpack (expected 6, got 5)`.

CuteDSL FP4 has only a masked grouped-GEMM kernel, so normal dispatch is not supported here. Branch on `dispatch_output.format` and raise an actionable `NotImplementedError` pointing at `--deepep-mode low_latency` instead of the cryptic unpack error.

Fixes sgl-project#29521
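The failure mode and the guard described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: the enum and the two tuple classes are simplified stand-ins, the field names `f1`..`f6` are placeholders, and `unpack_unguarded` / `unpack_guarded` are hypothetical helper names. Only `DispatchOutputFormat`, `DEEPEP_NORMAL`, `DEEPEP_LL`, `is_deepep_ll()`, `dispatch_output.format`, and the two output class names are taken from the PR text.

```python
from enum import Enum, auto
from typing import NamedTuple


class DispatchOutputFormat(Enum):
    # Stand-in enum: only the two members mentioned in this PR.
    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()

    def is_deepep_ll(self) -> bool:
        return self is DispatchOutputFormat.DEEPEP_LL


class DeepEPNormalDispatchOutput(NamedTuple):
    # Stand-in for the 5-field normal layout (placeholder field names).
    f1: object
    f2: object
    f3: object
    f4: object
    f5: object

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_NORMAL


class DeepEPLLDispatchOutput(NamedTuple):
    # Stand-in for the 6-field low-latency (masked) layout.
    f1: object
    f2: object
    f3: object
    f4: object
    f5: object
    f6: object

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_LL


def unpack_unguarded(dispatch_output):
    # Unconditional 6-tuple unpack: fails on a 5-field tuple with
    # "not enough values to unpack (expected 6, got 5)".
    f1, f2, f3, f4, f5, f6 = dispatch_output
    return f1, f6


def unpack_guarded(dispatch_output):
    # Check the format first and fail with an actionable message.
    fmt = dispatch_output.format
    if not fmt.is_deepep_ll():
        raise NotImplementedError(
            f"Only DeepEP low-latency dispatch is supported here, got {fmt!s}; "
            "try --deepep-mode low_latency."
        )
    f1, f2, f3, f4, f5, f6 = dispatch_output
    return f1, f6
```

With these stand-ins, passing a 5-field value to `unpack_unguarded` raises the opaque `ValueError`, while `unpack_guarded` raises `NotImplementedError` naming the received format; a 6-field low-latency value passes through both unchanged.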
Code Review
This pull request adds a check in fused_experts_deepep_to_flashinfer_cutedsl_fp4 to verify that the dispatch output format is DeepEP low-latency (is_deepep_ll()). If it is not, a clear NotImplementedError is raised with actionable advice, preventing an opaque tuple-unpacking error when a 5-field normal layout is received. There are no review comments, so I have no feedback to provide.
Motivation
Fixes #29521.
Serving an NVFP4 / `modelopt_fp4` MoE model with the default high-throughput DeepEP `auto` mode crashes during the first prefill forward inside the `flashinfer_cutedsl` MoE runner with `ValueError: not enough values to unpack (expected 6, got 5)`.

`@register_fused_func("deepep", "flashinfer_cutedsl")` only implements the low-latency (masked, 6-field `DeepEPLLDispatchOutput`) dispatch. In deepep `auto` mode, prefill uses the normal (5-field `DeepEPNormalDispatchOutput`) dispatch, which falls into the same unconditional 6-tuple unpack. This path was never wired for cutedsl-FP4; #25525 migrated only the DeepEP low-latency cutedsl path to `MoeRunner`.

Modifications
CuteDSL FP4 only has a masked grouped-GEMM kernel (`flashinfer_cutedsl_moe_masked` / `grouped_gemm_nt_masked`); there is no contiguous/normal CuteDSL FP4 kernel, so normal-dispatch support would require a new kernel (a feature, out of scope for this fix). Instead, branch on `dispatch_output.format` and raise an actionable `NotImplementedError` for the unsupported normal/prefill case, pointing at `--deepep-mode low_latency`, rather than surfacing the opaque tuple-unpack error.

The low-latency path is unchanged: for `DEEPEP_LL`, `is_deepep_ll()` is `True`, the guard is skipped, and the existing unpack runs as before.

Test / E2E
4×B200, `nvidia/Qwen3-30B-A3B-NVFP4` (`modelopt_fp4`), `--moe-a2a-backend deepep --moe-runner-backend flashinfer_cutedsl`:

- `--deepep-mode auto`, before this change: `flashinfer_cutedsl.py:476` with `ValueError: not enough values to unpack (expected 6, got 5)`
- `--deepep-mode auto`, with this change: `ValueError` gone; replaced by the clear `NotImplementedError` (received `DispatchOutputFormat.DEEPEP_NORMAL`, hint `--deepep-mode low_latency`)
- `--deepep-mode low_latency`

(The `deep_ep.cpp:1105` `num_max_dispatch_tokens_per_rank` capacity assertion hit while exercising the LL path is unrelated to this change; it comes from the DeepEP per-rank capacity default and was worked around with `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024`.)

Checklist
CI States
Latest PR Test (Base): ❌ Run #28296529819
Latest PR Test (Extra): ❌ Run #28296529741