[MoE] Raise clear error for DeepEP normal dispatch in flashinfer_cutedsl FP4#29523

Open
JustinTong0323 wants to merge 1 commit into
sgl-project:mainfrom
JustinTong0323:xinyuan/cutedsl-deepep-normal-guard
@JustinTong0323 JustinTong0323 commented Jun 27, 2026

Collaborator

Motivation

Fixes #29521.

Serving an NVFP4 / modelopt_fp4 MoE model with the default high-throughput DeepEP auto mode crashes during the first prefill forward inside the flashinfer_cutedsl MoE runner:

File ".../layers/moe/moe_runner/flashinfer_cutedsl.py", line 476,
  in fused_experts_deepep_to_flashinfer_cutedsl_fp4
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
ValueError: not enough values to unpack (expected 6, got 5)

@register_fused_func("deepep", "flashinfer_cutedsl") only implements the low-latency (masked, 6-field DeepEPLLDispatchOutput) dispatch. In deepep auto mode, prefill uses the normal (5-field DeepEPNormalDispatchOutput) dispatch, which falls into the same unconditional 6-tuple unpack. This path was never wired for cutedsl-FP4: #25525 migrated only the DeepEP low-latency cutedsl path to MoeRunner.
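The failure mode can be reproduced outside SGLang with plain tuples. The sketch below is illustrative only: `NormalOutput` and its field names are hypothetical stand-ins, not the real DeepEPNormalDispatchOutput definition; the only property it borrows from the description above is that the normal layout carries 5 fields while the unpack expects 6.

```python
from typing import NamedTuple


class NormalOutput(NamedTuple):
    """Hypothetical 5-field stand-in for a normal-dispatch output."""

    hidden_states: object
    hidden_states_scale: object
    topk_ids: object
    topk_weights: object
    num_recv_tokens_per_expert: object


def unpack_as_low_latency(dispatch_output):
    # Same shape as the unpack quoted in the traceback: six targets.
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    return masked_m


try:
    unpack_as_low_latency(NormalOutput(1, 2, 3, 4, 5))
except ValueError as exc:
    # The message names only the field counts, not the dispatch mode.
    message = str(exc)
    print(message)
```

The error text carries no hint that the dispatch mode is the cause, which is what makes the original crash hard to diagnose.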

Modifications

CuteDSL FP4 only has a masked grouped-GEMM kernel (flashinfer_cutedsl_moe_masked / grouped_gemm_nt_masked); there is no contiguous/normal CuteDSL FP4 kernel, so normal-dispatch support would require a new kernel (a feature, out of scope for this fix). This change branches on dispatch_output.format and, for the unsupported normal/prefill case, raises an actionable NotImplementedError that points at --deepep-mode low_latency, replacing the opaque tuple-unpack error.

if not dispatch_output.format.is_deepep_ll():
    raise NotImplementedError(
        "flashinfer_cutedsl FP4 MoE only supports DeepEP low_latency dispatch "
        f"(masked layout), but received {dispatch_output.format}. DeepEP "
        "normal/prefill dispatch has no CuteDSL FP4 handler. Pass "
        "--deepep-mode low_latency, or use a MoE runner backend that supports "
        "DeepEP normal dispatch."
    )

The low-latency path is unchanged: for DEEPEP_LL, is_deepep_ll() is True, the guard is skipped, and the existing unpack runs as before.
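A self-contained sketch of how the guard behaves for both layouts, assuming simplified stand-ins: the `DispatchOutputFormat` enum here is a toy version that only mirrors the names mentioned in this PR, `LLOutput` / `NormalOutput` and their field names are hypothetical, and `run_masked_path` is not the real fused function.

```python
from enum import Enum, auto
from typing import NamedTuple


class DispatchOutputFormat(Enum):
    """Toy enum; only the member names follow the PR text."""

    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()

    def is_deepep_ll(self) -> bool:
        return self is DispatchOutputFormat.DEEPEP_LL


class LLOutput(NamedTuple):
    """Hypothetical 6-field low-latency (masked) layout."""

    hidden_states: object
    hidden_states_scale: object
    topk_ids: object
    topk_weights: object
    masked_m: object
    expected_m: object

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_LL


class NormalOutput(NamedTuple):
    """Hypothetical 5-field normal (prefill) layout."""

    hidden_states: object
    hidden_states_scale: object
    topk_ids: object
    topk_weights: object
    num_recv_tokens_per_expert: object

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_NORMAL


def run_masked_path(dispatch_output):
    # Check the layout first so the failure is explicit and actionable.
    if not dispatch_output.format.is_deepep_ll():
        raise NotImplementedError(
            "only DeepEP low_latency dispatch is supported, "
            f"but received {dispatch_output.format}"
        )
    # Safe: the low-latency layout is known to have six fields.
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    return masked_m


ll_result = run_masked_path(LLOutput("h", "s", "ids", "w", "m", 8))

try:
    run_masked_path(NormalOutput("h", "s", "ids", "w", "n"))
    normal_error = None
except NotImplementedError as exc:
    normal_error = str(exc)
```

Checking the format before unpacking keeps the low-latency path byte-for-byte identical while turning the normal-dispatch case into an error that names the received format.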

Test / E2E

4×B200, nvidia/Qwen3-30B-A3B-NVFP4 (modelopt_fp4), --moe-a2a-backend deepep --moe-runner-backend flashinfer_cutedsl:

| Scenario | Config | Result |
| --- | --- | --- |
| Reproduce (before) | --deepep-mode auto | All ranks crash at flashinfer_cutedsl.py:476 with ValueError: not enough values to unpack (expected 6, got 5) |
| Fixed | --deepep-mode auto | Opaque ValueError gone; replaced by the clear NotImplementedError (received DispatchOutputFormat.DEEPEP_NORMAL, hint --deepep-mode low_latency) |
| No regression | --deepep-mode low_latency | Server serves; GSM8K accuracy 0.94, stop-rate 0.97 (no runaway) |

(The deep_ep.cpp:1105 num_max_dispatch_tokens_per_rank capacity assertion hit while exercising the LL path is unrelated to this change; it comes from the DeepEP per-rank capacity default and was worked around with SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024.)

Checklist


CI States

Latest PR Test (Base): ❌ Run #28296529819
Latest PR Test (Extra): ❌ Run #28296529741

…dsl FP4

The deepep->flashinfer_cutedsl FP4 fused func only implements the
low-latency (masked, 6-field) dispatch. In deepep `auto` mode prefill
uses the normal (5-field) dispatch, which fell into the same 6-tuple
unpack and crashed with an opaque
`ValueError: not enough values to unpack (expected 6, got 5)`.

CuteDSL FP4 has only a masked grouped-GEMM kernel, so normal dispatch
is not supported here. Branch on dispatch_output.format and raise an
actionable NotImplementedError pointing at --deepep-mode low_latency
instead of the cryptic unpack error.

Fixes sgl-project#29521

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds a check in fused_experts_deepep_to_flashinfer_cutedsl_fp4 to verify that the dispatch output format is DeepEP low-latency (is_deepep_ll()). If it is not, a clear NotImplementedError is raised with actionable advice, preventing an opaque tuple-unpacking error when a 5-field normal layout is received. There are no review comments, so I have no feedback to provide.

Development

Successfully merging this pull request may close these issues.

[Bug] DeepEP normal (prefill) dispatch crashes flashinfer_cutedsl FP4 MoE: "not enough values to unpack (expected 6, got 5)"