Summary
When running DeepSeek-V4-Flash FP4 with --moe-a2a-backend megamoe on B200, MegaMOE correctly falls back once the per-rank token count exceeds SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK. However, the fallback path currently crashes in the Triton MoE runner with an FP8 scale shape assertion.
This reproduces on sglang main without DeepEP Waterfill enabled, so it does not appear to be caused by Waterfill.
Fallback trigger
python/sglang/srt/layers/moe/mega_moe.py::should_use_mega_moe() uses:
cap = envs.SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK.get()
return max_tokens_per_rank <= cap
The env var defaults to 1024 in python/sglang/srt/environ.py. If the env var is set to 4096, requests at or below that cap use the MegaMOE fast path, while larger prefill requests fall back. If the env var is unset, the default cap of 1024 makes even moderate prefill requests hit the fallback.
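As a minimal sketch of the check above, the snippet below replays the comparison for the prompt sizes used in this report. The function name uses_mega_moe and the constant DEFAULT_CAP are hypothetical stand-ins, not sglang identifiers, and it assumes the per-rank token count equals the prompt token count of a single uncached prefill request.

```python
# Hypothetical stand-in for the cap check quoted above; not sglang code.
DEFAULT_CAP = 1024  # default value this report attributes to environ.py


def uses_mega_moe(max_tokens_per_rank: int, cap: int = DEFAULT_CAP) -> bool:
    """Return True when the request stays on the MegaMOE fast path."""
    return max_tokens_per_rank <= cap


if __name__ == "__main__":
    # Prompt sizes taken from the observed results in this report.
    # Assumption: per-rank token count == prompt token count.
    for cap in (4096, DEFAULT_CAP):
        for tokens in (3307, 5207):
            path = "megamoe" if uses_mega_moe(tokens, cap) else "fallback"
            print(f"cap={cap} tokens={tokens} -> {path}")
```

Under these assumptions, only the 3307-token request with the cap set to 4096 stays on the fast path; every other combination takes the fallback.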
Repro environment
- Hardware: B200 x2
- Model: /home/scratch.xutingz_wwfo_2/model/DeepSeek-V4-Flash (FP4)
- Commit tested: 4248695b079aa218bf8720d178dfad08790cb648 (origin/main)
- Waterfill: disabled
- Radix cache: disabled to make the actual prefill token count deterministic
Launch args:
python -m sglang.launch_server \
--model-path /host_scratch/model/DeepSeek-V4-Flash \
--trust-remote-code \
--tp 2 \
--dp 1 \
--moe-dense-tp-size 1 \
--moe-a2a-backend megamoe \
--disable-cuda-graph \
--skip-server-warmup \
--disable-radix-cache \
--max-running-requests 8 \
--host 127.0.0.1 \
--port 30000
Observed results
Case 1: SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
A small request stays under the cap and succeeds:
SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_OK
Prefill batch, #new-token: 3328, #cached-token: 0
A larger uncached prefill exceeds the cap and crashes after fallback:
LARGE_PROMPT_TOKENS 5207
LARGE_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response
Case 2: env var unset
Since the default cap is 1024, the same small request already exceeds the cap and crashes:
SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response
Error
The fallback path enters the Triton MoE runner and fails here:
Using default MoE kernel config. Performance might be sub-optimal! Config file not found at .../triton_3_6_0/E=128,N=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 999, in forward_normal
File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton.py", line 93, in run
File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py", line 484, in _fused_moe_kernel_sequence
File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe_triton_kernels.py", line 762, in invoke_fused_moe_kernel
assert triton.cdiv(B.shape[-2], block_n) == B_scale.shape[-2]
AssertionError
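To make the failing condition concrete, here is a small sketch of the same shape check in plain Python. The helper names and all tensor shapes are hypothetical: the actual shapes of B and B_scale were not captured in this report. Only block_n = 128 is taken from the block_shape=[128, 128] in the log line above.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, equivalent to triton.cdiv for positive integers."""
    return (a + b - 1) // b


def scale_rows_match(weight_shape, scale_shape, block_n: int) -> bool:
    """Replay the assertion: one row of scales per block_n weight rows."""
    return cdiv(weight_shape[-2], block_n) == scale_shape[-2]


if __name__ == "__main__":
    block_n = 128  # from block_shape=[128, 128] in the log

    # Hypothetical shapes, for illustration only.
    weight_shape = (128, 2048, 4096)      # (experts, rows, cols)
    blockwise_scale = (128, 16, 32)       # 2048 / 128 = 16 scale rows
    other_layout_scale = (128, 2048, 128) # one scale row per weight row

    print(scale_rows_match(weight_shape, blockwise_scale, block_n))
    print(scale_rows_match(weight_shape, other_layout_scale, block_n))
```

The first check passes and the second fails. If the FP4 checkpoint stores its scales in a layout other than one row per 128 weight rows, the block-wise FP8 path would reject it in this way, although that is an inference from the assertion rather than something confirmed here.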
Expected behavior
When the request exceeds the MegaMOE token cap, the fallback path should still support DeepSeek-V4 FP4 MoE weights and complete the request, or the server should route to a compatible fallback backend instead of crashing.
Notes
This was also checked with SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096: the under-cap request succeeds, and the over-cap request fails. With the env var unset, the default cap is lower (1024), so fallback is triggered earlier and the same assertion appears on the smaller request.