Fix MegaMOE FP4 fallback runner#29534

Draft
ronhuafeng wants to merge 1 commit into
sgl-project:mainfrom
ronhuafeng:sglang-27416-megamoe-fp4-fallback

@ronhuafeng ronhuafeng commented Jun 28, 2026

Motivation

Fixes #27416.

DeepSeek-V4 FP4 with --moe-a2a-backend megamoe can crash when moe_runner_backend=auto and a request exceeds the MegaMOE token cap. In that case MegaMOE falls back to the normal MoE runner, but auto previously selected Triton. Triton does not consume the FP4 scale layout prepared for the MegaMOE/DeepGEMM path, so the fallback hit a shape assertion instead of serving the long prompt.

After switching the FP4 MegaMOE fallback runner to DeepGEMM, the issue-specific repro exposed a second DeepSeek-V4-only fallback mismatch: the TP attention A2A scatter optimization can shrink the MHC post-attention layout while the fallback hc_post path still expects the full-token layout.

Modifications

  • In Fp8MoEMethod.create_moe_runner, keep explicit backend choices unchanged, but make auto select MoeRunnerBackend.DEEP_GEMM for FP4 experts when the configured MoE A2A backend is MegaMOE.
  • Preserve the existing regular FP8 auto -> Triton behavior and the existing DeepGEMM detection logic.
  • In DeepSeek-V4, detect the FP4 MegaMOE fallback path and skip SGLANG_DSV4_FIX_TP_ATTN_A2A_SCATTER only for that fallback, so the surrounding MHC state keeps the full-token layout expected by hc_post.
  • Add focused unit tests for FP4 MegaMOE fallback backend selection and the DeepSeek-V4 scatter guard.
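
The selection rule described in the list above can be sketched roughly as follows. This is an illustrative stand-alone model, not the code in fp8.py: the function name `resolve_runner_backend`, its parameters, the string `"megamoe"` for the A2A backend, and the `deep_gemm_detected` flag (a stand-in for the existing DeepGEMM detection logic) are all assumptions made for the example.

```python
from enum import Enum


class MoeRunnerBackend(Enum):
    # Local stand-in for the real enum; only the members needed here.
    AUTO = "auto"
    TRITON = "triton"
    DEEP_GEMM = "deep_gemm"


def resolve_runner_backend(
    requested: MoeRunnerBackend,
    *,
    is_fp4_experts: bool,
    a2a_backend: str,
    deep_gemm_detected: bool = False,
) -> MoeRunnerBackend:
    """Hypothetical helper mirroring the rule described in this PR."""
    # Explicit backend choices are left unchanged.
    if requested is not MoeRunnerBackend.AUTO:
        return requested
    # New rule: FP4 experts behind MegaMOE must fall back to DeepGEMM,
    # because Triton does not consume the prepared FP4 scale layout.
    if is_fp4_experts and a2a_backend == "megamoe":
        return MoeRunnerBackend.DEEP_GEMM
    # Assumed placeholder for the pre-existing DeepGEMM detection logic.
    if deep_gemm_detected:
        return MoeRunnerBackend.DEEP_GEMM
    # Regular FP8 keeps the previous auto -> Triton behavior.
    return MoeRunnerBackend.TRITON


if __name__ == "__main__":
    print(resolve_runner_backend(
        MoeRunnerBackend.AUTO, is_fp4_experts=True, a2a_backend="megamoe"
    ))
```

Keeping the explicit-choice check first means a user who passes a specific runner backend never has it silently overridden; only the `auto` case changes.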

Accuracy Tests

This is a crash fix for a fallback path; no output change is intended.

  • B200 smoke validation with deepseek-ai/DeepSeek-V4-Flash, TP=2, --moe-a2a-backend megamoe, SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096:
    • 3318-token prompt: generation succeeded.
    • 5209-token prompt: generation succeeded.
  • The original baseline failed on the 5209-token prompt with a Triton FP4 scale shape assertion. The patched run had no scheduler exception, Triton assertion, mhc_post_tilelang mismatch, or client disconnect in the checked logs.
  • No full accuracy benchmark was run because unaffected paths should preserve their previous backend behavior and this PR targets a crash-only fallback.

Speed Tests and Profiling

No speed benchmark was run. The hot-path change is limited to a backend selection branch during runner creation and a fallback-only guard around the DeepSeek-V4 scatter optimization.

The native MegaMOE path still uses the existing scatter optimization. The guard only disables that optimization when FP4 MegaMOE has already exceeded its token cap and is using the fallback runner, where correctness requires the full-token MHC layout.
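
As a rough sketch of the guard described above, the decision can be modeled as two small predicates. The names `is_fp4_megamoe_fallback` and `should_apply_attn_a2a_scatter`, their parameters, and the strict `>` comparison against the cap are assumptions for illustration; the actual check in deepseek_v4.py may be structured differently, and how a prompt length maps to a per-rank token count is not specified here.

```python
def is_fp4_megamoe_fallback(
    *,
    is_fp4_experts: bool,
    a2a_backend: str,
    num_tokens: int,
    max_tokens_per_rank: int,
) -> bool:
    """Hypothetical predicate: True when FP4 MegaMOE has exceeded its
    token cap and the fallback runner is therefore in use."""
    return (
        is_fp4_experts
        and a2a_backend == "megamoe"
        and num_tokens > max_tokens_per_rank  # assumed strict comparison
    )


def should_apply_attn_a2a_scatter(
    *,
    scatter_enabled: bool,
    is_fp4_experts: bool,
    a2a_backend: str,
    num_tokens: int,
    max_tokens_per_rank: int,
) -> bool:
    """Hypothetical guard: keep the scatter optimization on the native
    path, skip it only on the FP4 MegaMOE fallback path so the MHC state
    keeps the full-token layout that hc_post expects."""
    if not scatter_enabled:
        return False
    return not is_fp4_megamoe_fallback(
        is_fp4_experts=is_fp4_experts,
        a2a_backend=a2a_backend,
        num_tokens=num_tokens,
        max_tokens_per_rank=max_tokens_per_rank,
    )


if __name__ == "__main__":
    common = dict(
        scatter_enabled=True,
        is_fp4_experts=True,
        a2a_backend="megamoe",
        max_tokens_per_rank=4096,
    )
    # Token counts borrowed from the smoke test prompts, treated here
    # as per-rank counts purely for illustration.
    for n in (3318, 5209):
        print(n, should_apply_attn_a2a_scatter(num_tokens=n, **common))
```

Under these assumptions the guard is a no-op for every configuration except the over-cap FP4 MegaMOE case, which matches the intent that unaffected paths keep their previous behavior.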

Validation

  • PATH="$HOME/.cargo/bin:$PATH" /tmp/sglang27416-precommit-venv/bin/pre-commit run --all-files
  • PYTHONPATH=/home/bef0rewind/Projects/hobby/sglang-27416/python /tmp/sglang27416-venv/bin/python -m pytest /home/bef0rewind/Projects/hobby/sglang-27416/test/registered/unit/layers/quantization/test_fp8_megamoe_fp4_fallback.py /home/bef0rewind/Projects/hobby/sglang-27416/test/registered/unit/models/test_deepseek_v4_megamoe_scatter_guard.py -q
    • 8 passed, 20 warnings
  • /tmp/sglang27416-venv/bin/python -m py_compile python/sglang/srt/layers/quantization/fp8.py python/sglang/srt/models/deepseek_v4.py
  • git diff --check HEAD~1..HEAD

CI States

Latest PR Test (Base): ❌ Run #28307017959
Latest PR Test (Extra): ❌ Run #28307017892

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Development

Successfully merging this pull request may close these issues.

MegaMOE fallback to Triton crashes for DeepSeek-V4 FP4 when token count exceeds cap
