Checklist
Describe the bug
Serving an NVFP4 / modelopt_fp4 MoE model with the default high-throughput DeepEP auto mode crashes during the first prefill forward (server-init warmup) inside the flashinfer_cutedsl MoE runner:
    File ".../layers/moe/moe_runner/flashinfer_cutedsl.py", line 476, in fused_experts_deepep_to_flashinfer_cutedsl_fp4
        hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    ValueError: not enough values to unpack (expected 6, got 5)
Full crashing stack:
_execute_extend → deepseek_v2.py forward → self.experts(...) → ep_moe/layer.py forward_impl → fused_moe_triton/layer.py run_moe_core → modelopt_quant.py apply → moe_runner/runner.py run → flashinfer_cutedsl.py:476.
The only registered DeepEP handler for the flashinfer_cutedsl runner implements only the low-latency (decode) dispatch format. In deepep auto mode, prefill uses the normal dispatch format, which has a different arity, so prefill can never reach a working code path.
Reproduction
    python3 -m sglang.launch_server \
      --model-path nvidia/GLM-5.2-NVFP4 --trust-remote-code \
      --tp 4 --enable-dp-attention --dp 4 \
      --moe-a2a-backend deepep --deepep-mode auto \
      --moe-runner-backend flashinfer_cutedsl
nvidia/GLM-5.2-NVFP4 is a public NVFP4 checkpoint; any modelopt_fp4 MoE model takes the same path. The server crashes during init warmup with the unpack ValueError above.
Root cause
In deepep auto mode, prefill uses NORMAL dispatch and decode uses LOW_LATENCY dispatch. The two dispatch outputs have different arity:
| NamedTuple | Location | Fields |
|---|---|---|
| DeepEPNormalDispatchOutput | token_dispatcher/deepep.py | 5: hidden_states, hidden_states_scale, topk_ids, topk_weights, num_recv_tokens_per_expert |
| DeepEPLLDispatchOutput | token_dispatcher/deepep.py | 6: hidden_states, hidden_states_scale, topk_ids, topk_weights, masked_m, expected_m |
The cutedsl runner registers exactly one DeepEP handler, typed and unpacked for the 6-field LL layout only:
    # moe_runner/flashinfer_cutedsl.py
    @register_fused_func("deepep", "flashinfer_cutedsl")
    def fused_experts_deepep_to_flashinfer_cutedsl_fp4(
        dispatch_output: DeepEPLLDispatchOutput,  # <- LL only
        ...
    ) -> DeepEPLLCombineInput:
        ...
        hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output  # unconditional 6-unpack
During prefill the 5-field DeepEPNormalDispatchOutput is handed to this LL-only func → expected 6, got 5. There is no normal-dispatch handler for the deepep → flashinfer_cutedsl FP4 path.
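The arity mismatch can be reproduced in isolation, without sglang or a GPU. The following is a minimal sketch, assuming stand-in NamedTuples that mirror only the field names listed above; they are not the real sglang classes, and `ll_only_handler` and `try_handler` are hypothetical helpers written for this illustration.

```python
from typing import Any, NamedTuple


class NormalDispatchOutput(NamedTuple):
    """Stand-in for DeepEPNormalDispatchOutput (5 fields)."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    num_recv_tokens_per_expert: Any


class LLDispatchOutput(NamedTuple):
    """Stand-in for DeepEPLLDispatchOutput (6 fields)."""
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    masked_m: Any
    expected_m: Any


def ll_only_handler(dispatch_output):
    # Same unconditional 6-way unpack as the registered cutedsl handler.
    hidden_states, hidden_states_scale, _, _, masked_m, _ = dispatch_output
    return hidden_states, masked_m


def try_handler(dispatch_output):
    """Return the error message, or None if the unpack succeeds."""
    try:
        ll_only_handler(dispatch_output)
    except ValueError as exc:
        return str(exc)
    return None


# Decode-style (LL) output unpacks cleanly; prefill-style (normal) output does not.
decode_error = try_handler(LLDispatchOutput("h", "s", "ids", "w", "m", 1))
prefill_error = try_handler(NormalDispatchOutput("h", "s", "ids", "w", [3, 1]))
print(decode_error, "|", prefill_error)
```

The placeholder strings stand in for tensors; only the tuple length matters for this failure.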
This traces back to #25525, whose description states it migrated only "CuteDSL v1 (DeepEP low-latency + NVFP4)" to MoeRunner — the new @register_fused_func("deepep", "flashinfer_cutedsl") was added for the LL path only. The deepep auto mode's NORMAL (prefill) dispatch was never wired for cutedsl-FP4, and nothing rejects the combination early, so it surfaces as this cryptic unpack instead.
(Note: SGLANG_MOE_NVFP4_DISPATCH does not change this — both LL paths still build the 6-field tuple, and the Try SGLANG_MOE_NVFP4_DISPATCH=0 hint in the file is for a different downstream stride assertion, not this unpack.)
Workaround
Force --deepep-mode low_latency so prefill also uses the LL (6-field) dispatch, which matches the only registered cutedsl DeepEP func. This avoids the unpack crash: prefill MoE no longer fails. On B200 the server then hits a separate cuda-graph-capture failure for NVFP4, which is out of scope for this report. Because the unpack error is confirmed gone in that mode, this issue is specifically about the missing normal-dispatch handler.
Suggested fix
Either:
- (a) Add a normal-path handler fused_experts_deepep_normal_to_flashinfer_cutedsl_fp4 that unpacks the 5-field DeepEPNormalDispatchOutput and runs the contiguous (non-masked) cutedsl path; or
- (b) Make fused_experts_deepep_to_flashinfer_cutedsl_fp4 branch on dispatch_output.format (DEEPEP_NORMAL vs DEEPEP_LL) and unpack accordingly.
The W4AFp8 MoE path already demonstrates this normal-vs-LL split: EPMoE.forward_cutlass_w4afp8 (NORMAL → apply_deepep_normal) vs forward_cutlass_w4afp8_masked (LL → apply_deepep_ll) in ep_moe/layer.py. The cutedsl-FP4 path is missing the NORMAL half.
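Option (b) could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual sglang code: `Fmt`, `NormalOut`, `LLOut`, `run_masked`, `run_contiguous`, and `fused_experts_sketch` are hypothetical stand-ins, and the two `run_*` functions are placeholders for the existing masked cutedsl kernel call and the not-yet-written contiguous path.

```python
from enum import Enum, auto
from typing import Any, NamedTuple


class Fmt(Enum):
    """Hypothetical stand-in for the dispatch-format enum."""
    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()


class NormalOut(NamedTuple):
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    num_recv_tokens_per_expert: Any

    @property
    def format(self):
        return Fmt.DEEPEP_NORMAL


class LLOut(NamedTuple):
    hidden_states: Any
    hidden_states_scale: Any
    topk_ids: Any
    topk_weights: Any
    masked_m: Any
    expected_m: Any

    @property
    def format(self):
        return Fmt.DEEPEP_LL


def run_masked(hidden_states, hidden_states_scale, masked_m):
    # Placeholder for the existing masked (LL) cutedsl path.
    return ("masked", masked_m)


def run_contiguous(hidden_states, hidden_states_scale, topk_ids,
                   topk_weights, num_recv_tokens_per_expert):
    # Placeholder for the missing contiguous (normal) cutedsl path.
    return ("contiguous", num_recv_tokens_per_expert)


def fused_experts_sketch(dispatch_output):
    """Pick the unpack shape from the dispatch format instead of assuming LL."""
    if dispatch_output.format is Fmt.DEEPEP_LL:
        hs, scale, _, _, masked_m, _ = dispatch_output  # 6 fields
        return run_masked(hs, scale, masked_m)
    if dispatch_output.format is Fmt.DEEPEP_NORMAL:
        hs, scale, topk_ids, topk_weights, num_recv = dispatch_output  # 5 fields
        return run_contiguous(hs, scale, topk_ids, topk_weights, num_recv)
    raise NotImplementedError(f"unsupported dispatch format: {dispatch_output.format}")


ll_result = fused_experts_sketch(LLOut("h", "s", "ids", "w", "m", 1))
normal_result = fused_experts_sketch(NormalOut("h", "s", "ids", "w", [3, 1]))
```

Branching on the format rather than on tuple length keeps the check explicit and lets any future dispatch format fail with a clear NotImplementedError instead of a mis-unpack.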
If supporting normal dispatch for cutedsl-FP4 is out of scope, the minimum fix is to fail early with a clear message (e.g. require --deepep-mode low_latency when moe_runner_backend=flashinfer_cutedsl + modelopt_fp4), instead of crashing mid-warmup with the unpack error.
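A fail-early guard might be sketched as follows. The function name, its parameters, and the error text are assumptions made for illustration; the parameter names only mirror the CLI flags used in the reproduction, and where such a check would live in sglang's argument validation is not specified here.

```python
def check_cutedsl_deepep_mode(moe_a2a_backend, moe_runner_backend,
                              quantization, deepep_mode):
    """Hypothetical startup check: reject the combination that crashes in warmup."""
    unsupported = (
        moe_a2a_backend == "deepep"
        and moe_runner_backend == "flashinfer_cutedsl"
        and quantization == "modelopt_fp4"
        and deepep_mode != "low_latency"
    )
    if unsupported:
        raise ValueError(
            "flashinfer_cutedsl with modelopt_fp4 only supports DeepEP "
            "low-latency dispatch; pass --deepep-mode low_latency "
            f"(got --deepep-mode {deepep_mode})."
        )


def is_rejected(deepep_mode):
    """Return True if the guard rejects the given mode for the repro config."""
    try:
        check_cutedsl_deepep_mode(
            "deepep", "flashinfer_cutedsl", "modelopt_fp4", deepep_mode
        )
    except ValueError:
        return True
    return False


auto_rejected = is_rejected("auto")
low_latency_rejected = is_rejected("low_latency")
```

With a guard like this, the configuration from the reproduction would stop at startup with an actionable message rather than reaching the unpack inside the MoE runner.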
Related
Same underlying gap as #28412 (deepep NORMAL/prefill dispatch has no fused-func handler for a runner backend), but a different backend and failure mode: #28412 is ("deepep", "marlin") with WNA16 + --enable-prefill-cp, raising a clean NotImplementedError (registration entirely missing); this issue is ("deepep", "flashinfer_cutedsl") with modelopt_fp4, where the registration exists but is LL-only and mis-unpacks the 5-field normal tuple. Both point to the broader pattern that DeepEP auto's NORMAL (prefill) dispatch is under-supported across non-deepgemm MoE runner backends.
Origin of the LL-only handler: #25525 ([MoE Refactor] Migrate flashinfer_cutedsl + DeepEP to MoeRunner). The older DeepEP+NVFP4 tracking issue #12293 (closed/completed, pre-refactor) covered a different crash and code path.
Environment
- 4×B200 (183 GB each).
- flashinfer 0.6.12, sglang editable install (latest main).
- Not covered by CI: the DeepEP CI suite runs on H100 with an FP8 model (lmsys/sglang-ci-dsv3-test), which takes the deepgemm path and never exercises the cutedsl-FP4 DeepEP path. There is no NVFP4-DeepEP test anywhere in CI.