
[ROCm][MoE] Modular MoE: alias fused_out with output to skip finalize copy #940

Open
mgehre-amd wants to merge 2 commits into gfx11 from matthias.moe-modular-alias-fused-out


@mgehre-amd mgehre-amd commented May 19, 2026

Summary

In the modular MoE pipeline the trailing TopKWeightAndReduceNoOP.apply() unconditionally copies fused_expert_output → output every layer. For experts that already write the post-reduce result to fused_out (e.g. HybridW4A16MoEExperts via moe_unpermute), this copy is pure overhead — one __amd_rocclr_copyBuffer per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

Adds an opt-in accepts_output_alias() hook on FusedMoEExpertsModular. When the expert opts in (and there are no shared experts), we pass the output tensor in as an argument, so the results are written there directly instead of going through an extra copy.
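A minimal sketch of the opt-in pattern described above, in plain Python with lists standing in for tensors. Only `accepts_output_alias()` is a name from this PR; `ExpertsBase`, `AliasingExperts`, `run_layer`, and the toy `apply` signature are hypothetical, and the identity check used to skip the copy is an assumption made for illustration, not the actual vLLM code.

```python
class ExpertsBase:
    """Hypothetical stand-in for FusedMoEExpertsModular."""

    def accepts_output_alias(self) -> bool:
        # Default is to opt out: the kernel allocates a separate fused_out.
        return False

    def apply(self, hidden, fused_out) -> None:
        # Toy "expert": writes its final result into fused_out in place.
        for i, x in enumerate(hidden):
            fused_out[i] = 2.0 * x


class AliasingExperts(ExpertsBase):
    """Expert that promises it no longer reads its input once it writes fused_out."""

    def accepts_output_alias(self) -> bool:
        return True


def run_layer(experts, hidden, output, shared_experts=None) -> int:
    """Run one toy layer; return the number of finalize copies (0 or 1)."""
    use_alias = experts.accepts_output_alias() and shared_experts is None
    # When aliasing, fused_out *is* the final output buffer.
    fused_out = output if use_alias else [0.0] * len(hidden)
    experts.apply(hidden, fused_out)
    if fused_out is output:
        return 0  # result is already in place, nothing to copy
    output[:] = fused_out  # stands in for the finalize copy
    return 1
```

With an opted-in expert and no shared experts, `run_layer` reports zero copies; in every other case it falls back to the copying path and produces the same values.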

Benchmark — cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4, max_concurrency=1.

| | TPOT median | Δ vs gfx11 |
|---|---|---|
| gfx11 (this PR reverted) | 12.82 ms | |
| with this PR | 12.68 ms | −1.09% |

copyBuffer per decode step: 48 → 0 (per profile capture).
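The Δ figure can be reproduced from the two TPOT medians as a relative change. The helper name `relative_change_pct` is hypothetical and only used for this check.

```python
def relative_change_pct(baseline_ms: float, new_ms: float) -> float:
    """Relative change of new_ms vs baseline_ms, in percent, rounded to 2 places."""
    return round((new_ms - baseline_ms) / baseline_ms * 100.0, 2)


# TPOT medians reported in the benchmark above.
delta = relative_change_pct(12.82, 12.68)
print(delta)  # -1.09
```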

Tests

No new test — the existing parametric tests/kernels/moe/test_hybrid_w4a16_moe.py exercises the modified FusedMoEKernelModularImpl end-to-end through HybridW4A16MoEExperts (the only expert that opts in). The aliasing branch is taken automatically when shared_experts is None. Existing tolerance vs torch reference holds; 38/38 pass locally.

Test plan

  • pytest tests/kernels/moe/test_hybrid_w4a16_moe.py -v
  • End-to-end Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit run; generated text matches a baseline run

Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
@mgehre-amd mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from 3ec42bc to fe1ef08 Compare May 19, 2026 13:11
… copy

In the modular MoE pipeline, the trailing TopKWeightAndReduceNoOP.apply()
unconditionally copies fused_expert_output -> output every layer.  For
experts that already write the post-reduce result to fused_out (e.g.
HybridW4A16MoEExperts via moe_unpermute), this copy is pure overhead --
one __amd_rocclr_copyBuffer per MoE layer (48 per decode step on
Qwen3-Omni-30B-A3B).

Adds an `accepts_output_alias()` opt-in on FusedMoEExpertsModular.
When the expert opts in (and there are no shared experts), the modular
kernel passes the final `output` tensor as `output_alias` into
_allocate_buffers, which returns it as `fused_out`.  moe_unpermute then
writes directly to the final destination.  TopKWeightAndReduceNoOP
detects the aliasing via data_ptr() and skips the copy_().

HybridW4A16MoEExperts opts in: by the time moe_unpermute writes to
output, hidden_states is no longer read (gemm2_out is a separate
workspace2 tensor).  The invariant is documented on the override; any
other expert kernel that adopts the opt-in must re-verify the same
property.

Measured on cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit,
Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4:
- TPOT 12.72 ms -> 12.62 ms (-0.8%)
- copyBuffer count per decode step: 48 -> 0
- All 4 generated_texts byte-identical to baseline.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Rename the caller-supplied output parameter to `out` in
  `_fused_experts` (was `output_alias`) — clearer signature.
- `_allocate_buffers` no longer takes the tensor; it takes a boolean
  `allocate_output` and returns `None` for `fused_out` when False
  (caller will supply it).
- Move the shape / dtype / contiguity / device validation out of
  `_allocate_buffers` into `_fused_experts` as asserts gated on
  `out is not None`.  The contract documentation now lives next to
  where callers can act on it.
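
The split described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the code in the commit: `Buf` is a hypothetical stand-in for tensor metadata, the function signatures are simplified, and it assumes the caller-supplied `out` must match the input's shape, dtype, and device, which may not be the exact contract in modular_kernel.py.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Buf:
    """Hypothetical stand-in for a tensor's metadata."""
    shape: Tuple[int, ...]
    dtype: str
    device: str
    contiguous: bool = True


def allocate_buffers(shape, dtype, device, allocate_output: bool) -> Optional[Buf]:
    # Returns None for fused_out when the caller will supply the buffer.
    return Buf(shape, dtype, device) if allocate_output else None


def fused_experts(hidden: Buf, out: Optional[Buf] = None) -> Buf:
    fused_out = allocate_buffers(
        hidden.shape, hidden.dtype, hidden.device, allocate_output=out is None
    )
    if out is not None:
        # Validation lives next to the parameter the caller controls.
        assert out.shape == hidden.shape, "out has the wrong shape"
        assert out.dtype == hidden.dtype, "out has the wrong dtype"
        assert out.device == hidden.device, "out is on the wrong device"
        assert out.contiguous, "out must be contiguous"
        fused_out = out
    return fused_out
```

Passing a valid `out` returns that same object, omitting it returns a freshly allocated buffer, and a non-contiguous `out` trips the assert before any kernel would run.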

No behavior change for the alias-output fast path; tests pass:
  pytest tests/kernels/moe/test_hybrid_w4a16_moe.py  → 38 passed.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from fe1ef08 to d134c72 Compare May 28, 2026 11:14
@mgehre-amd mgehre-amd requested a review from amd-callumm May 28, 2026 11:16
@mgehre-amd mgehre-amd marked this pull request as ready for review May 28, 2026 11:16