[MoE] Raise clear error for DeepEP normal dispatch in flashinfer_cutedsl FP4 by JustinTong0323 · Pull Request #29523 · sgl-project/sglang

JustinTong0323 · 2026-06-27T17:29:18Z

Motivation

Serving an NVFP4 / modelopt_fp4 MoE model with the default high-throughput DeepEP auto mode crashes during the first prefill forward inside the flashinfer_cutedsl MoE runner:

File ".../layers/moe/moe_runner/flashinfer_cutedsl.py", line 476,
  in fused_experts_deepep_to_flashinfer_cutedsl_fp4
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
ValueError: not enough values to unpack (expected 6, got 5)

@register_fused_func("deepep", "flashinfer_cutedsl") only implements the low-latency (masked, 6-field DeepEPLLDispatchOutput) dispatch. In deepep auto mode, prefill uses the normal (5-field DeepEPNormalDispatchOutput) dispatch, which reaches the same unconditional 6-tuple unpack and fails. This path was never wired for cutedsl-FP4: #25525 migrated only the DeepEP low-latency cutedsl path to MoeRunner.
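The unpack failure can be reproduced without a GPU. The sketch below is illustrative only: the two NamedTuple classes are hypothetical stand-ins for DeepEPNormalDispatchOutput and DeepEPLLDispatchOutput, and their field names are assumptions rather than the actual sglang definitions. Only the field counts (5 and 6) and the unpack statement come from the traceback above.

```python
from typing import Any, NamedTuple


class NormalDispatchOutput(NamedTuple):
    """Hypothetical 5-field stand-in for DeepEPNormalDispatchOutput."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    num_recv_tokens_per_expert: Any


class LLDispatchOutput(NamedTuple):
    """Hypothetical 6-field stand-in for DeepEPLLDispatchOutput."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    masked_m: Any
    expected_m: Any


def unpack_masked(dispatch_output):
    # Same unconditional 6-way unpack as the line shown in the traceback.
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    return hidden_states, hidden_states_scale, masked_m


def try_unpack(dispatch_output):
    # Return "ok" on success, otherwise the ValueError message.
    try:
        unpack_masked(dispatch_output)
        return "ok"
    except ValueError as exc:
        return str(exc)


ll_result = try_unpack(LLDispatchOutput("h", "s", None, None, "m", None))
normal_result = try_unpack(NormalDispatchOutput("h", "s", None, None, None))
```

The 6-field tuple unpacks cleanly, while the 5-field tuple raises a ValueError that says nothing about DeepEP modes, which is why the crash is hard to diagnose from the message alone.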

Modifications

CuteDSL FP4 only has a masked grouped-GEMM kernel (flashinfer_cutedsl_moe_masked / grouped_gemm_nt_masked); there is no contiguous/normal CuteDSL FP4 kernel, so normal-dispatch support would require a new kernel (a feature, out of scope for this fix). This change therefore branches on dispatch_output.format and raises an actionable NotImplementedError for the unsupported normal/prefill case, pointing at --deepep-mode low_latency, in place of the opaque tuple-unpack error.

if not dispatch_output.format.is_deepep_ll():
    raise NotImplementedError(
        "flashinfer_cutedsl FP4 MoE only supports DeepEP low_latency dispatch "
        f"(masked layout), but received {dispatch_output.format}. DeepEP "
        "normal/prefill dispatch has no CuteDSL FP4 handler. Pass "
        "--deepep-mode low_latency, or use a MoE runner backend that supports "
        "DeepEP normal dispatch."
    )

The low-latency path is unchanged: for DEEPEP_LL, is_deepep_ll() is True, the guard is skipped, and the existing unpack runs as before.
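As a rough, self-contained illustration of how the guard behaves on both paths, the sketch below uses stand-in types. The enum, the two NamedTuple classes, their field names, and the function name fused_experts_sketch are all hypothetical; only the guard condition, the error text, and the 6-way unpack mirror the snippet above.

```python
from enum import Enum, auto
from typing import Any, NamedTuple


class DispatchOutputFormat(Enum):
    """Stand-in enum with only the two formats discussed here."""
    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()

    def is_deepep_ll(self) -> bool:
        return self is DispatchOutputFormat.DEEPEP_LL


class NormalOut(NamedTuple):
    """Hypothetical 5-field normal-dispatch output."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    num_recv_tokens_per_expert: Any

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_NORMAL


class LLOut(NamedTuple):
    """Hypothetical 6-field low-latency (masked) dispatch output."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    masked_m: Any
    expected_m: Any

    @property
    def format(self) -> DispatchOutputFormat:
        return DispatchOutputFormat.DEEPEP_LL


def fused_experts_sketch(dispatch_output):
    # Guard first: reject anything that is not the masked low-latency layout.
    if not dispatch_output.format.is_deepep_ll():
        raise NotImplementedError(
            "flashinfer_cutedsl FP4 MoE only supports DeepEP low_latency dispatch "
            f"(masked layout), but received {dispatch_output.format}. "
            "Pass --deepep-mode low_latency."
        )
    # Only reached for the 6-field layout, so the unpack is safe.
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    return hidden_states, hidden_states_scale, masked_m


ll_result = fused_experts_sketch(LLOut("h", "s", None, None, "m", None))

try:
    fused_experts_sketch(NormalOut("h", "s", None, None, None))
    guard_message = None
except NotImplementedError as exc:
    guard_message = str(exc)
```

Placing the check before the unpack means the normal-dispatch case fails with a message that names the received format and the flag to change, while the low-latency case runs the same statements as before.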

Test / E2E

4×B200, nvidia/Qwen3-30B-A3B-NVFP4 (modelopt_fp4), --moe-a2a-backend deepep --moe-runner-backend flashinfer_cutedsl:

| Scenario | Config | Result |
| --- | --- | --- |
| Reproduce (before) | `--deepep-mode auto` | All ranks crash at `flashinfer_cutedsl.py:476` with `ValueError: not enough values to unpack (expected 6, got 5)` |
| Fixed | `--deepep-mode auto` | Opaque `ValueError` gone; replaced by the clear `NotImplementedError` (received `DispatchOutputFormat.DEEPEP_NORMAL`, hint `--deepep-mode low_latency`) |
| No regression | `--deepep-mode low_latency` | Server serves; GSM8K accuracy 0.94, stop-rate 0.97 (no runaway) |

(The deep_ep.cpp:1105 num_max_dispatch_tokens_per_rank capacity assertion hit while exercising the LL path is unrelated to this change — it is the DeepEP per-rank capacity default; worked around with SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024.)

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests — N/A: error-path guard for an unsupported config; validated by the e2e above.

CI States

Latest PR Test (Base): ❌ Run #28296529819
Latest PR Test (Extra): ❌ Run #28296529741

…dsl FP4

The deepep->flashinfer_cutedsl FP4 fused func only implements the low-latency (masked, 6-field) dispatch. In deepep `auto` mode prefill uses the normal (5-field) dispatch, which fell into the same 6-tuple unpack and crashed with an opaque `ValueError: not enough values to unpack (expected 6, got 5)`.

CuteDSL FP4 has only a masked grouped-GEMM kernel, so normal dispatch is not supported here. Branch on dispatch_output.format and raise an actionable NotImplementedError pointing at --deepep-mode low_latency instead of the cryptic unpack error.

Fixes sgl-project#29521

gemini-code-assist

Code Review

This pull request adds a check in fused_experts_deepep_to_flashinfer_cutedsl_fp4 to verify that the dispatch output format is DeepEP low-latency (is_deepep_ll()). If it is not, a clear NotImplementedError is raised with actionable advice, preventing an opaque tuple-unpacking error when a 5-field normal layout is received. There are no review comments, so I have no feedback to provide.


JustinTong0323 requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners June 27, 2026 17:29

gemini-code-assist Bot reviewed Jun 27, 2026

JustinTong0323 wants to merge 1 commit into sgl-project:main from JustinTong0323:xinyuan/cutedsl-deepep-normal-guard
