MegaMOE fallback to Triton crashes for DeepSeek-V4 FP4 when token count exceeds cap #27416

## Summary

When running DeepSeek-V4-Flash FP4 with `--moe-a2a-backend megamoe` on B200, MegaMOE correctly falls back once the per-rank token count exceeds `SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK`. However, the fallback path then crashes in the Triton MoE runner on an FP8 scale-shape assertion.

This reproduces on `sglang` main without DeepEP Waterfill enabled, so it does not appear to be caused by Waterfill.

## Fallback trigger

`python/sglang/srt/layers/moe/mega_moe.py::should_use_mega_moe()` uses:

```python
cap = envs.SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK.get()
return max_tokens_per_rank <= cap
```

The env var defaults to `1024` in `python/sglang/srt/environ.py`. Because the comparison is `<=`, setting the env var to `4096` keeps requests at or below that cap on the MegaMOE fast path, while larger prefill requests fall back. If the env var is unset, the default `1024` cap makes even moderate prefill requests hit fallback.
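To make the routing decision concrete, here is a minimal standalone sketch of the cap check. It is not the sglang implementation: `uses_mega_moe` is a hypothetical stand-in for `should_use_mega_moe()`, it reads the env var directly instead of going through `envs`, and the token counts are borrowed from the logs in this issue purely as illustrative inputs. How the real `max_tokens_per_rank` is derived from a batch is not shown here.

```python
import os

ENV_NAME = "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK"
DEFAULT_CAP = 1024  # default reported in this issue for environ.py


def uses_mega_moe(max_tokens_per_rank, env=None):
    """Hypothetical stand-in for should_use_mega_moe().

    Returns True for the MegaMOE fast path, False for fallback.
    """
    env = os.environ if env is None else env
    cap = int(env.get(ENV_NAME, DEFAULT_CAP))
    # Same comparison as the snippet above: the cap itself is still fast path.
    return max_tokens_per_rank <= cap


if __name__ == "__main__":
    for label, env in (("cap=4096", {ENV_NAME: "4096"}), ("unset", {})):
        for tokens in (1024, 3328, 5207):
            path = "megamoe" if uses_mega_moe(tokens, env) else "fallback"
            print(f"{label:9s} tokens={tokens:5d} -> {path}")
```

Under these assumptions, only the second outcome (fallback) leads to the crash described in this issue; the fast path itself behaves as expected.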

## Repro environment

- Hardware: B200 x2
- Model: `/home/scratch.xutingz_wwfo_2/model/DeepSeek-V4-Flash` FP4
- Commit tested: `4248695b079aa218bf8720d178dfad08790cb648` (`origin/main`)
- Waterfill: disabled
- Radix cache: disabled to make the actual prefill token count deterministic

Launch args:

```bash
python -m sglang.launch_server \
  --model-path /host_scratch/model/DeepSeek-V4-Flash \
  --trust-remote-code \
  --tp 2 \
  --dp 1 \
  --moe-dense-tp-size 1 \
  --moe-a2a-backend megamoe \
  --disable-cuda-graph \
  --skip-server-warmup \
  --disable-radix-cache \
  --max-running-requests 8 \
  --host 127.0.0.1 \
  --port 30000
```

## Observed results

### Case 1: `SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096`

A small request stays under the cap and succeeds:

```text
SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_OK
Prefill batch, #new-token: 3328, #cached-token: 0
```

A larger uncached prefill exceeds the cap and crashes after fallback:

```text
LARGE_PROMPT_TOKENS 5207
LARGE_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response
```

### Case 2: env var unset

With the default cap of `1024`, the same small request (3307 prompt tokens) exceeds the cap, falls back, and crashes:

```text
SMALL_PROMPT_TOKENS 3307
SMALL_GENERATE_FAIL RemoteDisconnected: Remote end closed connection without response
```

## Error

The fallback path enters the Triton MoE runner and fails here:

```text
Using default MoE kernel config. Performance might be sub-optimal! Config file not found at .../triton_3_6_0/E=128,N=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 999, in forward_normal
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton.py", line 93, in run
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py", line 484, in _fused_moe_kernel_sequence
  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe_triton_kernels.py", line 762, in invoke_fused_moe_kernel
    assert triton.cdiv(B.shape[-2], block_n) == B_scale.shape[-2]
AssertionError
```
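The failing assertion requires one row of scales per `block_n` rows of the weight tensor `B`. The sketch below reproduces only that arithmetic in plain Python so the condition can be checked without Triton. The helper names and all shapes are hypothetical and chosen for illustration; in particular, the second scale shape is an assumed example of a layout that is not block-quantized along rows, not a statement about how DeepSeek-V4 FP4 scales are actually stored.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def scale_rows_match(weight_shape, scale_shape, block_n):
    """Mirror of the failing check:
    triton.cdiv(B.shape[-2], block_n) == B_scale.shape[-2]
    """
    return cdiv(weight_shape[-2], block_n) == scale_shape[-2]


if __name__ == "__main__":
    block_n = 128  # from block_shape=[128, 128] in the log line above

    # Illustrative weight shape: (experts, rows, cols).
    weight = (128, 2048, 4096)

    # Block-wise scales: one entry per 128x128 tile, so 2048 / 128 = 16 rows.
    blockwise_scale = (128, 16, 32)
    print("block-wise scale passes:", scale_rows_match(weight, blockwise_scale, block_n))

    # Hypothetical layout with one scale row per weight row: 2048 != 16.
    per_row_scale = (128, 2048, 128)
    print("per-row scale passes:   ", scale_rows_match(weight, per_row_scale, block_n))
```

Printing `B.shape` and `B_scale.shape` just before the assertion in `invoke_fused_moe_kernel` would show which side of this comparison is off for the FP4 checkpoint.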

## Expected behavior

When the request exceeds the MegaMOE token cap, the fallback path should still support DeepSeek-V4 FP4 MoE weights and complete the request, or the server should route to a compatible fallback backend instead of crashing.

## Notes

In short, the crash follows the cap rather than the request size: with `SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096`, the under-cap request succeeds and the over-cap request fails; with the env var unset, the lower default cap (`1024`) triggers fallback earlier, and the same assertion appears on the smaller request.

