MegaMOE fallback to Triton crashes for DeepSeek-V4 FP4 when token count exceeds cap #27416

Description

@xutizhou

Summary

When running DeepSeek-V4-Flash FP4 with --moe-a2a-backend megamoe on B200, MegaMOE correctly falls back once the per-rank token count exceeds SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK. However, the fallback path then crashes in the Triton MoE runner on an FP8 scale shape assertion.

This reproduces on sglang main without DeepEP Waterfill enabled, so it does not appear to be caused by Waterfill.

Fallback trigger

python/sglang/srt/layers/moe/mega_moe.py::should_use_mega_moe() uses:

cap = envs.SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK.get()
return max_tokens_per_rank <= cap

The env var defaults to 1024 in python/sglang/srt/environ.py. If the env var is set to 4096, requests below that cap use the MegaMOE fast path, while larger prefill requests fall back. If the env var is unset, the default 1024 cap makes even moderate prefill requests hit fallback.
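
The cap comparison can be replayed against the numbers in this report. This is a minimal sketch, not the sglang implementation: `resolve_cap` and `takes_fast_path` are hypothetical helpers that only mirror the `<=` comparison quoted above, and it assumes the per-rank token count equals the token counts logged in this issue (3328 new tokens for the small request; for the large request only the prompt token count, 5207, is available).

```python
# Sketch only: mirrors the comparison quoted from should_use_mega_moe().
ENV_NAME = "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK"
DEFAULT_CAP = 1024  # default reported in this issue


def resolve_cap(env: dict) -> int:
    """Hypothetical helper: read the cap from a mapping, else use the default."""
    return int(env.get(ENV_NAME, DEFAULT_CAP))


def takes_fast_path(max_tokens_per_rank: int, cap: int) -> bool:
    """True means the MegaMOE fast path; False means fallback."""
    return max_tokens_per_rank <= cap


# Assumption: per-rank token count equals the counts logged in this issue.
observations = [
    ("cap=4096, small", {ENV_NAME: "4096"}, 3328),
    ("cap=4096, large", {ENV_NAME: "4096"}, 5207),
    ("cap unset, small", {}, 3328),
]

results = {
    label: takes_fast_path(tokens, resolve_cap(env))
    for label, env, tokens in observations
}

for label, fast in results.items():
    print(f"{label}: {'MegaMOE' if fast else 'fallback'}")
```

Under those assumptions only the first case stays on the fast path, which matches the one request that succeeds in the observed results.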

Repro environment

  • Hardware: B200 x2
  • Model: /home/scratch.xutingz_wwfo_2/model/DeepSeek-V4-Flash FP4
  • Commit tested: 4248695b079aa218bf8720d178dfad08790cb648 (origin/main)
  • Waterfill: disabled
  • Radix cache: disabled to make the actual prefill token count deterministic

Launch args:

python -m sglang.launch_server \
  --model-path /host_scratch/model/DeepSeek-V4-Flash \
  --trust-remote-code \
  --tp 2 \
  --dp 1 \
  --moe-dense-tp-size 1 \
  --moe-a2a-backend megamoe \
  --disable-cuda-graph \
  --skip-server-warmup \
  --disable-radix-cache \
  --max-running-requests 8 \
  --host 127.0.0.1 \
  --port 30000

Observed results

Case 1: SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096

A small request stays under the cap and succeeds:

SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_OK
Prefill batch, #new-token: 3328, #cached-token: 0

A larger uncached prefill exceeds the cap and crashes after fallback:

LARGE_PROMPT_TOKENS 5207
LARGE_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response

Case 2: env var unset

Since the default cap is 1024, the same small request already exceeds the cap and crashes:

SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response

Error

The fallback path enters the Triton MoE runner and fails here:

Using default MoE kernel config. Performance might be sub-optimal! Config file not found at .../triton_3_6_0/E=128,N=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 999, in forward_normal
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton.py", line 93, in run
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py", line 484, in _fused_moe_kernel_sequence
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe_triton_kernels.py", line 762, in invoke_fused_moe_kernel
    assert triton.cdiv(B.shape[-2], block_n) == B_scale.shape[-2]
AssertionError
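
The failing assertion requires one row of scales per `block_n` rows of the weight tensor. The sketch below restates that check in plain Python so the arithmetic is easy to follow. The shapes are hypothetical: the issue does not report the actual `B` or `B_scale` shapes, and `block_n = 128` is assumed from the `block_shape=[128, 128]` in the log line above.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive ints, the quantity triton.cdiv computes."""
    return (a + b - 1) // b


def scale_rows_ok(weight_rows: int, scale_rows: int, block_n: int) -> bool:
    """Restates: triton.cdiv(B.shape[-2], block_n) == B_scale.shape[-2]."""
    return cdiv(weight_rows, block_n) == scale_rows


block_n = 128        # assumed from block_shape=[128, 128] in the log
weight_rows = 1024   # hypothetical B.shape[-2]

# A block-quantized FP8 layout would carry this many scale rows.
expected_scale_rows = cdiv(weight_rows, block_n)

# Hypothetical mismatch: a scale tensor with one row per weight row,
# as a different quantization layout might provide.
mismatched_scale_rows = weight_rows

print(expected_scale_rows)
print(scale_rows_ok(weight_rows, expected_scale_rows, block_n))
print(scale_rows_ok(weight_rows, mismatched_scale_rows, block_n))
```

If the FP4 checkpoint's scale tensors follow a different grouping than the 128x128 block layout the Triton FP8 path expects, the check fails in this way. Whether that is the actual cause here has not been confirmed.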

Expected behavior

When the request exceeds the MegaMOE token cap, the fallback path should still support DeepSeek-V4 FP4 MoE weights and complete the request, or the server should route to a compatible fallback backend instead of crashing.

Notes

The assertion is the same in both cases; the cap only changes which request triggers the fallback. With SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096, the under-cap request succeeds and the over-cap request fails. With the env var unset, the lower default cap (1024) triggers the fallback earlier, so the smaller request fails as well.
