[ROCm][MoE] Modular MoE: alias fused_out with output to skip finalize copy by mgehre-amd · Pull Request #940 · ROCm/vllm

mgehre-amd · 2026-05-19T08:48:31Z

Summary

In the modular MoE pipeline the trailing TopKWeightAndReduceNoOP.apply() unconditionally copies fused_expert_output → output every layer. For experts that already write the post-reduce result to fused_out (e.g. HybridW4A16MoEExperts via moe_unpermute), this copy is pure overhead — one __amd_rocclr_copyBuffer per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

Adds an opt-in accepts_output_alias() hook on FusedMoEExpertsModular. When the expert opts in (and there are no shared experts), we pass the output tensor in as the fused_out buffer, so the expert writes its result there directly and the extra copy is skipped.
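A minimal sketch of the opt-in described above, in plain Python so it runs without vLLM or a GPU. Only accepts_output_alias(), fused_out and output come from this PR; Buffer, Experts, AliasingExperts and run_layer are hypothetical stand-ins, and the shared-experts guard follows the wording of this summary rather than the actual diff.

```python
class Buffer:
    """Hypothetical stand-in for a device tensor (not torch.Tensor)."""

    def __init__(self, n: int) -> None:
        self.data = [0.0] * n


class Experts:
    """Stand-in for an expert implementation; by default it does not opt in."""

    def accepts_output_alias(self) -> bool:
        return False

    def apply(self, x: list, fused_out: Buffer) -> None:
        # Pretend expert computation: write the final result into fused_out.
        for i, v in enumerate(x):
            fused_out.data[i] = 2.0 * v


class AliasingExperts(Experts):
    """Stand-in for an expert whose final write may target output directly."""

    def accepts_output_alias(self) -> bool:
        return True


def run_layer(experts: Experts, x: list, output: Buffer,
              has_shared_experts: bool = False) -> int:
    """Run one layer and return how many finalize copies were performed."""
    alias = experts.accepts_output_alias() and not has_shared_experts
    # When aliasing, the expert receives the final output buffer itself.
    fused_out = output if alias else Buffer(len(x))
    experts.apply(x, fused_out)
    if fused_out is output:
        return 0  # result is already in place, nothing to copy
    output.data[:] = fused_out.data  # the trailing finalize copy
    return 1
```

With these stand-ins, run_layer(Experts(), ...) performs one copy per layer and run_layer(AliasingExperts(), ...) performs none, while both leave the same values in output.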

Benchmark — cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4, max_concurrency=1.

Configuration	TPOT median	Δ vs gfx11
gfx11 (this PR reverted)	12.82 ms	—
with this PR	12.68 ms	−1.09%

copyBuffer per decode step: 48 → 0 (per profile capture).
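The figures above can be cross-checked with a few lines of arithmetic. The per-copy saving is only a rough estimate: it assumes the whole TPOT difference comes from the 48 removed copies, which the benchmark itself does not establish.

```python
baseline_ms = 12.82   # gfx11 with this PR reverted
with_pr_ms = 12.68    # with this PR
copies_per_step = 48  # copyBuffer calls removed per decode step

# Relative change in median TPOT.
delta_pct = (with_pr_ms - baseline_ms) / baseline_ms * 100

# Assumption: all of the saving is attributable to the removed copies.
saved_us_per_copy = (baseline_ms - with_pr_ms) * 1000 / copies_per_step

print(f"TPOT change: {delta_pct:.2f}%")
print(f"Implied saving per copy: {saved_us_per_copy:.1f} us")
```

Under that assumption, each removed copy accounts for roughly 3 µs of the per-token latency.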

Tests

No new test — the existing parametric tests/kernels/moe/test_hybrid_w4a16_moe.py exercises the modified FusedMoEKernelModularImpl end-to-end through HybridW4A16MoEExperts (the only expert that opts in). The aliasing branch is taken automatically when shared_experts is None. Existing tolerance vs torch reference holds; 38/38 pass locally.

Test plan

pytest tests/kernels/moe/test_hybrid_w4a16_moe.py -v
End-to-end Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit run; generated text matches a baseline run

Adds an opt-in mechanism for MoE expert kernels to write their result directly into the final output buffer, eliminating the trailing fused_out->output copy in TopKWeightAndReduceNoOP.apply().

Changes:

- modular_kernel.py: add accepts_output_alias() to FusedMoEExpertsModular (default False). When True, FusedMoEKernelModularImpl.apply() passes the pre-allocated output tensor as the fused_out buffer so the expert kernel writes to it directly. The shared_experts is None guard is omitted: in the non-async path _maybe_apply_shared_experts is never called during expert kernel execution, and shared expert I/O uses separate tensors from output. Fix two bugs introduced during development: self.shared_experts (non-existent attribute) -> shared_experts parameter; stale output_alias name in rocm_aiter block -> out.
- hybrid_w4a16_moe.py: HybridW4A16MoEExpertsModular.accepts_output_alias() returns True. The data-flow through the wvSplitK kernel (gemm1->act->gemm2->moe_unpermute) writes its final result to the output parameter after all reads of hidden_states are complete, making aliasing safe.
- topk_weight_and_reduce.py: TopKWeightAndReduceNoOP.apply() adds a data_ptr() equality check as a fallback for the identity check, covering cases where the aliased tensor is wrapped in a different Python object.

On Qwen3.5-35B-A3B-W4A16 decode (40 MoE layers, 128 steps, gfx1151):

  Memcpy DtoD calls: 7075 -> 1955 (-72%)
  DtoD GPU time: 15.1ms -> 4.2ms (-10.9ms)

Accuracy verified: arc_challenge 25-shot acc_norm 0.78 (no regression).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
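The identity-then-data_ptr() check described for topk_weight_and_reduce.py could look roughly like the sketch below. FakeTensor and finalize_noop are hypothetical stand-ins, not the actual vLLM code: FakeTensor only mimics the two torch.Tensor methods the check needs, and its data_ptr() returns the identity of a shared Python list rather than a device address.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: a wrapper over shared storage."""

    def __init__(self, storage: list) -> None:
        self.storage = storage

    def data_ptr(self) -> int:
        # Stand-in for the address of the first element.
        return id(self.storage)

    def copy_(self, src: "FakeTensor") -> None:
        self.storage[:] = src.storage


def finalize_noop(output: FakeTensor, fused_expert_output: FakeTensor) -> bool:
    """Return True if a copy was performed, False if it was skipped."""
    if fused_expert_output is output:
        return False  # same Python object
    if fused_expert_output.data_ptr() == output.data_ptr():
        return False  # different wrapper object, same underlying memory
    output.copy_(fused_expert_output)
    return True
```

Two wrappers built over the same storage take the second branch, which is the case the commit message says the identity check alone misses. This sketch compares only the start address; whether the real check also needs to consider shape or strides is not stated in the source.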

mgehre-amd · 2026-06-12T14:27:34Z

@roberteg16 , does this PR help Qwen3.6?

roberteg16 · 2026-06-12T14:54:45Z

> @roberteg16 , does this PR help Qwen3.6?

This seems to be impactful for decode, not prefill.

● A/B complete (single sequence, max_num_seqs=1, 128 output tokens):

  ┌────────┬──────────────────┬────────────────────┐
  │ metric │ with cherry-pick │ without (baseline) │
  ├────────┼──────────────────┼────────────────────┤
  │ Decode │ 79.4 tok/s       │ 79.4 tok/s         │
  ├────────┼──────────────────┼────────────────────┤
  │ TPOT   │ 12.59 ms         │ 12.60 ms           │
  ├────────┼──────────────────┼────────────────────┤
  │ TTFT   │ 646 ms           │ 653 ms             │
  └────────┴──────────────────┴────────────────────┘
   
  No measurable difference — decode is identical (79.4 vs 79.4 tok/s). Even decode doesn't benefit here.
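As a quick consistency check on the table, the sketch below converts the two TPOT values to throughput, assuming single-sequence decode throughput is simply the reciprocal of TPOT (an assumption, though it matches the reported 79.4 tok/s).

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode throughput for one sequence, assuming it equals 1 / TPOT."""
    return 1000.0 / tpot_ms


with_pick = tokens_per_second(12.59)
baseline = tokens_per_second(12.60)

# Relative TPOT difference between the two runs, in percent.
rel_diff_pct = (12.60 - 12.59) / 12.60 * 100

print(f"{with_pick:.1f} vs {baseline:.1f} tok/s, TPOT differs by {rel_diff_pct:.2f}%")
```

Both runs round to the same throughput, and the TPOT gap is under a tenth of a percent, so the two columns are not distinguishable at the precision reported.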

mgehre-amd left four review comments on vllm/model_executor/layers/fused_moe/modular_kernel.py on May 19, 2026; three of the threads are now marked Outdated.

mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from 3ec42bc to fe1ef08 Compare May 19, 2026 13:11

mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from fe1ef08 to d134c72 Compare May 28, 2026 11:14

mgehre-amd requested a review from amd-callumm May 28, 2026 11:16

mgehre-amd marked this pull request as ready for review May 28, 2026 11:16

mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch 2 times, most recently from 1da299c to a373afe Compare June 8, 2026 22:55

mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from a373afe to 6a565c1 Compare June 9, 2026 05:58

mgehre-amd wants to merge 1 commit into gfx11 from matthias.moe-modular-alias-fused-out
