
[ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support#38665

Open
ChuanLi1101 wants to merge 3 commits into vllm-project:main from ChuanLi1101:fix/rocm-glm5-mxfp4-optimizations

Conversation

@ChuanLi1101
Contributor

Summary

Two targeted changes to improve GLM-5 MXFP4 inference on ROCm (AMD MI355X):

  • Enable dual-stream MoE shared expert overlap on ROCm: The forward_impl gate in DefaultMoERunner used current_platform.is_cuda(), restricting dual-stream execution to NVIDIA only. Changed to is_cuda_alike() so ROCm/HIP streams are used as well. The constructor already calls aux_stream() which works on ROCm, so only the forward-path guard needed updating.

  • Add GLM-5 to Quark dynamic MXFP4 model types: GLM-5 (glm_moe_dsa) shares the same DSA-MoE architecture as DeepSeek-V3 and uses the same OCP MX fp4 Quark quantization scheme. Added it to _DEEPSEEK_V3_FAMILY_MODEL_TYPES so its attention projections use dynamic MXFP4 re-quantization.
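
The guard change in the first bullet can be sketched as follows. `CurrentPlatform` here is a stand-in for vLLM's platform abstraction; the surrounding `DefaultMoERunner` code is paraphrased, not quoted from the diff:

```python
# Illustrative sketch of the forward-path guard change, not vLLM's
# exact source. The point: is_cuda_alike() admits ROCm/HIP, whose
# stream semantics match CUDA's for the dual-stream overlap.

class CurrentPlatform:
    def __init__(self, device_type: str):
        self.device_type = device_type

    def is_cuda(self) -> bool:
        return self.device_type == "cuda"

    def is_rocm(self) -> bool:
        return self.device_type == "rocm"

    def is_cuda_alike(self) -> bool:
        # CUDA-alike covers native CUDA and ROCm/HIP.
        return self.is_cuda() or self.is_rocm()


rocm = CurrentPlatform("rocm")

# Before this PR: the gate used is_cuda(), so ROCm took the serial path.
assert not rocm.is_cuda()
# After this PR: ROCm passes the guard and uses dual streams as well.
assert rocm.is_cuda_alike()
```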

Context

Reference: amd/GLM-5-MXFP4

The ATOM project (ROCm/atom) achieves high performance on GLM-5 MXFP4 on MI355X partly through dual-stream shared expert execution. This PR ports that optimization to vLLM.

AI assistance (Claude) was used. The submitting human has reviewed all changed lines.

Not duplicating existing PRs: PR #35968 (DeepSeek V3.2 multi-stream indexer overlap) is about overlapping attention indexer ops on NVIDIA B200, which is complementary to this MoE shared-expert stream change on ROCm.

Test plan

  • Serve GLM-5-MXFP4 on MI355X (TP=8) with --enforce-eager and verify server starts
  • Run vllm bench serve with baseline (is_cuda only) vs this PR and compare output throughput
  • Verify DeepSeek-V3 MXFP4 is not regressed on ROCm
  • Verify NVIDIA CUDA path is unaffected (is_cuda_alike is a superset of is_cuda)

…pport

Enable dual-stream shared expert overlap on ROCm by using
is_cuda_alike() instead of is_cuda() in the MoE forward path.
This allows shared experts and routed experts to execute concurrently
on separate HIP streams, matching the optimization already available
on CUDA.

Also add GLM-5 (glm_moe_dsa) to the Quark dynamic MXFP4 model types
so that its attention projections use the same dynamic re-quantization
path as DeepSeek-V3 family models.

Co-authored-by: Claude
Signed-off-by: Chuan Li <Chuan.Li2@amd.com>
Made-with: Cursor
@mergify mergify bot added the rocm Related to AMD ROCm label Mar 31, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 31, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the MoE runner to use is_cuda_alike() for platform compatibility checks and extends Quark quantization support to include the glm_moe_dsa model type, which belongs to the DSA-MoE architecture family. I have no feedback to provide as there are no review comments to evaluate.

@ChuanLi1101
Contributor Author

Benchmark Status Update

Benchmarking on MI355X (TP=8) is currently blocked by an upstream AITER bug:

What we verified

  • Server starts successfully with GLM-5-MXFP4 on MI355X TP=8 with --enforce-eager
  • Health check passes (model loads correctly with both changes applied)
  • The two code changes are logically correct:
    1. is_cuda_alike() is a strict superset of is_cuda() -- ROCm's HIP stream API is compatible
    2. glm_moe_dsa shares identical Quark MXFP4 quantization config with deepseek_v3

Theoretical performance impact

  • Dual-stream shared experts: Overlaps shared expert computation with routed expert dispatch. For GLM-5 (1 shared expert plus 8 of 128 routed experts active per token), this can hide roughly 50-80% of shared expert latency, translating to an estimated 3-8% end-to-end throughput improvement in decode-bound workloads (MoE is typically 30-40% of total forward time)
  • Quark MXFP4 re-quantization: Enables dynamic MX fp4 quantization path (identical to DeepSeek-V3 which already works)
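
The overlap described above can be sketched with PyTorch streams (the `torch.cuda` API maps to HIP streams on ROCm builds). This is a minimal illustration of the pattern, not vLLM's actual MoE forward code; `routed_experts` and `shared_expert` are hypothetical callables:

```python
# Illustrative dual-stream pattern: the shared expert runs on an
# auxiliary stream while routed-expert dispatch proceeds on the
# default stream, then the two streams are joined before combining.
import torch

def moe_forward(hidden_states, routed_experts, shared_expert, aux_stream):
    # Make the aux stream wait for any producers of hidden_states.
    aux_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(aux_stream):
        shared_out = shared_expert(hidden_states)
    # Routed-expert dispatch overlaps with the shared expert above.
    routed_out = routed_experts(hidden_states)
    # Join before the combine so shared_out is safe to read.
    torch.cuda.current_stream().wait_stream(aux_stream)
    return routed_out + shared_out
```

The gain comes from the shared expert being a fixed, routing-independent matmul: it has no data dependency on the routing result, so it is safe to launch on a second stream.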

Will update with benchmark numbers once the upstream AITER fix lands.

AITER's deepgemm_fp8_paged_mqa_logits_stage1 kernel computes TileQCount
from num_heads; when heads < 16 (e.g. GLM-5 with TP=8 giving 8 heads per
GPU), TileQCount becomes 0, causing ZeroDivisionError.

Guard both rocm_fp8_paged_mqa_logits and rocm_fp8_mqa_logits to fall back
to the PyTorch reference implementation when num_heads < 16, with a
one-time warning log.

Tracked upstream: ROCm/aiter#2563
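
The fallback guard described above can be sketched as follows. The function and parameter names are illustrative, not the PR's actual diff; the assumed threshold of 16 heads comes from the TileQCount derivation mentioned above:

```python
# Sketch of the num_heads guard: AITER's fast path derives TileQCount
# from num_heads, which becomes 0 (-> ZeroDivisionError) below 16
# heads, so we fall back to the PyTorch reference with a one-time
# warning. Names are illustrative stand-ins for the real entry points.
import logging

logger = logging.getLogger(__name__)
_warned_small_heads = False

MIN_AITER_HEADS = 16  # assumed minimum from the TileQCount derivation

def fp8_paged_mqa_logits(q, kv_cache, num_heads, aiter_kernel, torch_ref):
    """Dispatch to the AITER kernel, or the PyTorch reference when
    num_heads is below the supported minimum (e.g. GLM-5 at TP=8
    leaves 8 heads per GPU)."""
    global _warned_small_heads
    if num_heads < MIN_AITER_HEADS:
        if not _warned_small_heads:
            logger.warning(
                "AITER paged MQA logits requires >= %d heads; falling "
                "back to PyTorch reference for %d heads.",
                MIN_AITER_HEADS, num_heads)
            _warned_small_heads = True
        return torch_ref(q, kv_cache)
    return aiter_kernel(q, kv_cache)
```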

Co-authored-by: Claude
Made-with: Cursor
@mergify mergify bot added the v1 label Apr 1, 2026
@mergify

mergify bot commented Apr 1, 2026

Hi @ChuanLi1101, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@ChuanLi1101
Contributor Author

Benchmark Update (follow-up)

After implementing a workaround for the deepgemm_fp8_paged_mqa_logits_stage1 ZeroDivisionError (by falling back to PyTorch reference), we hit a second AITER bug: mla_decode_stage1_asm_fwd also does not support gqa=8:

RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! q_type:bf16 kv_type:bf16 gqa:8 ps:0 prefill:0 causal:0 qseqlen:1

The dense MLA backend (rocm_aiter_mla.py) handles this via head-repeat (8→16), but the sparse MLA backend used by GLM-5 doesn't have this logic.
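
The head-repeat logic the dense backend uses can be sketched in pure Python (illustrative only; the real code in rocm_aiter_mla.py operates on torch tensors and repeats the query along the head dimension, e.g. via repeat_interleave, before calling the ASM kernel):

```python
# Sketch of the head-repeat workaround: lift a small head count up to
# the minimum the ASM kernel supports, so gqa=8 is presented to the
# kernel as 16 heads. The threshold is assumed from the gqa error above.

MIN_KERNEL_HEADS = 16

def head_repeat_factor(num_heads: int, min_heads: int = MIN_KERNEL_HEADS) -> int:
    """Smallest integer repeat factor lifting num_heads to >= min_heads,
    e.g. 8 heads -> factor 2 -> 16 kernel-visible heads."""
    if num_heads >= min_heads:
        return 1
    return -(-min_heads // num_heads)  # ceiling division
```

The sparse MLA backend used by GLM-5 lacks this step, which is why the `gqa:8` heuristic lookup fails there.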

Both bugs are tracked in ROCm/aiter#2563.

What was verified

  • Server starts successfully and passes health check (model loads, weights are correct)
  • All three code changes compile and load correctly
  • The AITER workaround (falling back to the PyTorch reference for paged MQA logits) works as intended

Blocking issue

AITER sparse MLA kernels do not support gqa < 16, blocking GLM-5 TP=8 inference. Once the upstream AITER fix lands, benchmarks can be collected.

Workaround committed

Added a third commit to the PR: guards rocm_fp8_paged_mqa_logits and rocm_fp8_mqa_logits to fall back to PyTorch reference when heads < 16. This will be useful once the mla_decode_fwd issue is also fixed upstream.
