Fix MegaMOE FP4 fallback runner #29534
Draft
ronhuafeng wants to merge 1 commit into
Motivation
Fixes #27416.
DeepSeek-V4 FP4 with `--moe-a2a-backend megamoe` can crash when `moe_runner_backend=auto` and a request exceeds the MegaMOE token cap. In that case MegaMOE falls back to the normal MoE runner, but `auto` previously selected Triton. Triton does not consume the FP4 scale layout prepared for the MegaMOE/DeepGEMM path, so the fallback hit a shape assertion instead of serving the long prompt.
After switching the FP4 MegaMOE fallback runner to DeepGEMM, the issue-specific repro exposed a second DeepSeek-V4-only fallback mismatch: the TP attention A2A scatter optimization can shrink the MHC post-attention layout while the fallback `hc_post` path still expects the full-token layout.
Modifications
- In `Fp8MoEMethod.create_moe_runner`, keep explicit backend choices unchanged, but make `auto` select `MoeRunnerBackend.DEEP_GEMM` for FP4 experts when the configured MoE A2A backend is MegaMOE.
- Otherwise, preserve the `auto -> Triton` behavior and the existing DeepGEMM detection logic.
- Disable `SGLANG_DSV4_FIX_TP_ATTN_A2A_SCATTER` only for that fallback, so the surrounding MHC state keeps the full-token layout expected by `hc_post`.
Accuracy Tests
This is a crash fix for a fallback path, not an intended output change.
`deepseek-ai/DeepSeek-V4-Flash`, TP=2, `--moe-a2a-backend megamoe`, `SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096`:
[...] `mhc_post_tilelang` mismatch, or client disconnect in the checked logs.
Speed Tests and Profiling
No speed benchmark was run. The hot-path change is limited to a backend selection branch during runner creation and a fallback-only guard around the DeepSeek-V4 scatter optimization.
The native MegaMOE path still uses the existing scatter optimization. The guard only disables that optimization when FP4 MegaMOE has already exceeded its token cap and is using the fallback runner, where correctness requires the full-token MHC layout.
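The two changes described above can be illustrated with a minimal, self-contained sketch. It does not import SGLang and is not the PR's code: `RunnerBackend`, `resolve_runner_backend`, `uses_fallback`, `scatter_disabled_if`, and the `flags` dictionary are hypothetical stand-ins, and the existing DeepGEMM detection logic is deliberately left out. Only the flag name and the decision rules are taken from the description.

```python
from contextlib import contextmanager
from enum import Enum

# Flag name taken from the PR description; how SGLang stores it is not shown here.
SCATTER_FLAG = "SGLANG_DSV4_FIX_TP_ATTN_A2A_SCATTER"
_MISSING = object()


class RunnerBackend(Enum):
    """Hypothetical stand-in for MoeRunnerBackend."""
    AUTO = "auto"
    TRITON = "triton"
    DEEP_GEMM = "deep_gemm"


def resolve_runner_backend(requested, *, experts_are_fp4, a2a_backend):
    """Resolve AUTO only; explicit backend choices pass through unchanged."""
    if requested is not RunnerBackend.AUTO:
        return requested
    if experts_are_fp4 and a2a_backend == "megamoe":
        # FP4 scales are laid out for the MegaMOE/DeepGEMM path,
        # so the fallback runner must be able to consume that layout.
        return RunnerBackend.DEEP_GEMM
    # Simplification: the real code also keeps its DeepGEMM detection here.
    return RunnerBackend.TRITON


def uses_fallback(num_tokens, max_tokens_per_rank):
    """The fallback runner is used once a request exceeds the token cap."""
    return num_tokens > max_tokens_per_rank


@contextmanager
def scatter_disabled_if(fallback_active, flags):
    """Turn the scatter optimization off for the fallback only, then restore it."""
    if not fallback_active:
        yield  # native MegaMOE path: leave the optimization untouched
        return
    previous = flags.get(SCATTER_FLAG, _MISSING)
    flags[SCATTER_FLAG] = False
    try:
        yield
    finally:
        if previous is _MISSING:
            flags.pop(SCATTER_FLAG, None)
        else:
            flags[SCATTER_FLAG] = previous


if __name__ == "__main__":
    flags = {SCATTER_FLAG: True}
    backend = resolve_runner_backend(
        RunnerBackend.AUTO, experts_are_fp4=True, a2a_backend="megamoe"
    )
    with scatter_disabled_if(uses_fallback(5000, 4096), flags):
        print(backend, flags[SCATTER_FLAG])
```

Scoping the guard with a context manager is one way to keep the optimization enabled everywhere except the fallback call, which matches the intent stated above; the PR itself may implement the guard differently.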
Validation
- `PATH="$HOME/.cargo/bin:$PATH" /tmp/sglang27416-precommit-venv/bin/pre-commit run --all-files`
- `PYTHONPATH=/home/bef0rewind/Projects/hobby/sglang-27416/python /tmp/sglang27416-venv/bin/python -m pytest /home/bef0rewind/Projects/hobby/sglang-27416/test/registered/unit/layers/quantization/test_fp8_megamoe_fp4_fallback.py /home/bef0rewind/Projects/hobby/sglang-27416/test/registered/unit/models/test_deepseek_v4_megamoe_scatter_guard.py -q`: 8 passed, 20 warnings
- `/tmp/sglang27416-venv/bin/python -m py_compile python/sglang/srt/layers/quantization/fp8.py python/sglang/srt/models/deepseek_v4.py`
- `git diff --check HEAD~1..HEAD`
Checklist
CI States
Latest PR Test (Base): ❌ Run #28307017959
Latest PR Test (Extra): ❌ Run #28307017892