[ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support#38665
ChuanLi1101 wants to merge 3 commits into vllm-project:main
Conversation
…pport

Enable dual-stream shared expert overlap on ROCm by using `is_cuda_alike()` instead of `is_cuda()` in the MoE forward path. This allows shared experts and routed experts to execute concurrently on separate HIP streams, matching the optimization already available on CUDA.

Also add GLM-5 (`glm_moe_dsa`) to the Quark dynamic MXFP4 model types so that its attention projections use the same dynamic re-quantization path as DeepSeek-V3 family models.

Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
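The platform-gate change described in the commit message can be sketched as follows. This is a minimal illustration with stand-in classes, not the actual vLLM `current_platform` object:

```python
class Platform:
    """Simplified stand-in for vLLM's current_platform (illustrative)."""

    def __init__(self, device: str):
        self.device = device

    def is_cuda(self) -> bool:
        return self.device == "cuda"

    def is_rocm(self) -> bool:
        return self.device == "rocm"

    def is_cuda_alike(self) -> bool:
        # True on both CUDA and ROCm/HIP, mirroring vLLM's helper.
        return self.is_cuda() or self.is_rocm()


def use_dual_stream(platform: Platform) -> bool:
    # Before this PR the forward-path gate was platform.is_cuda(),
    # which excluded ROCm; gating on is_cuda_alike() lets HIP streams
    # take the dual-stream shared-expert path as well.
    return platform.is_cuda_alike()
```

With this gate, a ROCm platform qualifies for the dual-stream path while non-GPU platforms still take the single-stream fallback.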
Code Review
This pull request updates the MoE runner to use is_cuda_alike() for platform compatibility checks and extends Quark quantization support to include the glm_moe_dsa model type, which belongs to the DSA-MoE architecture family. I have no feedback to provide as there are no review comments to evaluate.
Benchmark Status Update

Benchmarking on MI355X (TP=8) is currently blocked by an upstream AITER bug:
What we verified
Theoretical performance impact
Will update with benchmark numbers once the upstream AITER fix lands.
AITER's `deepgemm_fp8_paged_mqa_logits_stage1` kernel computes `TileQCount` from `num_heads`; when `num_heads < 16` (e.g. GLM-5 with TP=8 gives 8 heads per GPU), `TileQCount` becomes 0, causing a `ZeroDivisionError`. Guard both `rocm_fp8_paged_mqa_logits` and `rocm_fp8_mqa_logits` to fall back to the PyTorch reference implementation when `num_heads < 16`, with a one-time warning log.

Tracked upstream: ROCm/aiter#2563

Co-authored-by: Claude
Made-with: Cursor
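The fallback guard described in that commit can be sketched like this. The kernel and reference functions are stand-ins for illustration, not the actual vLLM/AITER APIs:

```python
import logging

logger = logging.getLogger(__name__)

# AITER derives TileQCount from num_heads; below 16 heads it becomes 0.
MIN_AITER_HEADS = 16
_warned_small_heads = False


def aiter_kernel(q, kv):
    # Stand-in for the fast AITER kernel path.
    return "aiter"


def torch_reference(q, kv):
    # Stand-in for the PyTorch reference implementation.
    return "reference"


def fp8_mqa_logits(q, kv, num_heads: int):
    """Route to the reference path when the AITER kernel would fail
    (ZeroDivisionError for num_heads < 16, ROCm/aiter#2563)."""
    global _warned_small_heads
    if num_heads < MIN_AITER_HEADS:
        if not _warned_small_heads:
            # One-time warning, as described in the commit message.
            logger.warning(
                "num_heads=%d < %d: falling back to the PyTorch "
                "reference implementation (ROCm/aiter#2563)",
                num_heads, MIN_AITER_HEADS,
            )
            _warned_small_heads = True
        return torch_reference(q, kv)
    return aiter_kernel(q, kv)
```

The one-time warning avoids flooding the log when the fallback fires on every forward pass.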
Hi @ChuanLi1101, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Benchmark Update (follow-up)

After implementing a workaround for the `num_heads < 16` bug, the dense MLA backend runs. Both bugs are tracked in ROCm/aiter#2563.

What was verified

Blocking issue

AITER sparse MLA kernels do not support this configuration, so full GLM-5 benchmarking remains blocked.

Workaround committed

Added a third commit to the PR: it guards the affected AITER calls and falls back to the PyTorch reference implementation when `num_heads < 16`.
Summary
Two targeted changes to improve GLM-5 MXFP4 inference on ROCm (AMD MI355X):
1. **Enable dual-stream MoE shared expert overlap on ROCm**: The `forward_impl` gate in `DefaultMoERunner` used `current_platform.is_cuda()`, restricting dual-stream execution to NVIDIA only. Changed to `is_cuda_alike()` so ROCm/HIP streams are used as well. The constructor already calls `aux_stream()`, which works on ROCm, so only the forward-path guard needed updating.
2. **Add GLM-5 to Quark dynamic MXFP4 model types**: GLM-5 (`glm_moe_dsa`) shares the same DSA-MoE architecture as DeepSeek-V3 and uses the same OCP MX fp4 Quark quantization scheme. Added it to `_DEEPSEEK_V3_FAMILY_MODEL_TYPES` so its attention projections use dynamic MXFP4 re-quantization.

Context
Reference: amd/GLM-5-MXFP4
The ATOM project (ROCm/atom) achieves high performance on GLM-5 MXFP4 on MI355X partly through dual-stream shared expert execution. This PR ports that optimization to vLLM.
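A rough schematic of that dual-stream shared-expert overlap, assuming standard PyTorch stream semantics (the `torch.cuda` stream APIs also drive HIP streams on ROCm builds). The real vLLM runner is considerably more involved; this only shows the overlap pattern:

```python
import torch


def moe_forward_overlapped(hidden, routed_experts, shared_experts):
    """Schematic dual-stream MoE forward (illustrative, not vLLM's code).

    On CUDA/ROCm, shared experts run on a secondary stream so they can
    overlap with the routed-expert computation on the current stream;
    without a GPU the two branches simply run back to back.
    """
    if torch.cuda.is_available():
        aux_stream = torch.cuda.Stream()
        # Make the aux stream wait for work already queued on the
        # current stream before it reads `hidden`.
        aux_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(aux_stream):
            shared_out = shared_experts(hidden)
        routed_out = routed_experts(hidden)
        # Re-join: the current stream waits for the aux stream before
        # the two results are combined.
        torch.cuda.current_stream().wait_stream(aux_stream)
    else:
        shared_out = shared_experts(hidden)
        routed_out = routed_experts(hidden)
    return routed_out + shared_out
```

The `wait_stream` calls are what make the overlap safe: the shared-expert branch cannot start before its input is ready, and the combine cannot start before the shared-expert output is ready.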
AI assistance (Claude) was used. The submitting human has reviewed all changed lines.
Not duplicating existing PRs: PR #35968 (DeepSeek V3.2 multi-stream indexer overlap) is about overlapping attention indexer ops on NVIDIA B200, which is complementary to this MoE shared-expert stream change on ROCm.
Test plan
- Launch with `--enforce-eager` and verify the server starts
- Run `vllm bench serve` with baseline (`is_cuda()` only) vs this PR and compare output throughput