[ROCm][MoE] Modular MoE: alias fused_out with output to skip finalize copy #940
Open
mgehre-amd wants to merge 2 commits into
mgehre-amd
commented
May 19, 2026
3ec42bc to
fe1ef08
Compare
… copy

In the modular MoE pipeline, the trailing TopKWeightAndReduceNoOP.apply() unconditionally copies fused_expert_output -> output every layer. For experts that already write the post-reduce result to fused_out (e.g. HybridW4A16MoEExperts via moe_unpermute), this copy is pure overhead -- one __amd_rocclr_copyBuffer per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

Adds an `accepts_output_alias()` opt-in on FusedMoEExpertsModular. When the expert opts in (and there are no shared experts), the modular kernel passes the final `output` tensor as `output_alias` into _allocate_buffers, which returns it as `fused_out`. moe_unpermute then writes directly to the final destination. TopKWeightAndReduceNoOP detects the aliasing via data_ptr() and skips the copy_().

HybridW4A16MoEExperts opts in: by the time moe_unpermute writes to output, hidden_states is no longer read (gemm2_out is a separate workspace2 tensor). The invariant is documented on the override; any other expert kernel that adopts the opt-in must re-verify the same property.

Measured on cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit, Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4:
- TPOT 12.72 ms -> 12.62 ms (-0.8%)
- copyBuffer count per decode step: 48 -> 0
- All 4 generated_texts byte-identical to baseline.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
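The opt-in and the alias check described in this commit message can be sketched in pure Python. This is only an illustration under stated assumptions, not the code from the PR: `FakeTensor`, `Experts`, `AliasingExperts`, `finalize_noop`, and `run_layer` are hypothetical stand-ins, and `FakeTensor.data_ptr()` merely imitates the role of `torch.Tensor.data_ptr()` by returning the identity of the backing list.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor: a flat buffer plus a copy counter."""

    copies = 0  # counts copy_() calls across all instances

    def __init__(self, n: int):
        self.data = [0.0] * n

    def data_ptr(self) -> int:
        # Identity of the backing storage; plays the role of Tensor.data_ptr().
        return id(self.data)

    def copy_(self, src: "FakeTensor") -> "FakeTensor":
        FakeTensor.copies += 1
        self.data[:] = src.data
        return self


class Experts:
    """Default expert (hypothetical): does not opt in to output aliasing."""

    def accepts_output_alias(self) -> bool:
        return False

    def apply(self, hidden: FakeTensor, fused_out: FakeTensor) -> None:
        # Placeholder for the real expert computation.
        for i, x in enumerate(hidden.data):
            fused_out.data[i] = 2.0 * x


class AliasingExperts(Experts):
    """Expert that writes its final result straight into fused_out."""

    def accepts_output_alias(self) -> bool:
        return True


def finalize_noop(output: FakeTensor, fused_out: FakeTensor) -> FakeTensor:
    # Skip the copy when both names refer to the same storage.
    if fused_out.data_ptr() != output.data_ptr():
        output.copy_(fused_out)
    return output


def run_layer(experts, hidden, output, shared_experts=None) -> FakeTensor:
    # Alias only if the expert opted in and there are no shared experts.
    use_alias = experts.accepts_output_alias() and shared_experts is None
    fused_out = output if use_alias else FakeTensor(len(hidden.data))
    experts.apply(hidden, fused_out)
    return finalize_noop(output, fused_out)
```

In this sketch the non-aliasing path performs one `copy_()` per layer and the aliasing path performs none, while both produce the same output values. It does not model the invariant about `hidden_states` no longer being read; that still has to be verified per expert kernel, as the commit message states.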
- Rename the caller-supplied output parameter to `out` in `_fused_experts` (was `output_alias`) for a clearer signature.
- `_allocate_buffers` no longer takes the tensor; it takes a boolean `allocate_output` and returns `None` for `fused_out` when False (caller will supply it).
- Move the shape / dtype / contiguity / device validation out of `_allocate_buffers` into `_fused_experts` as asserts gated on `out is not None`. The contract documentation now lives next to where callers can act on it.

No behavior change for the alias-output fast path; tests pass: pytest tests/kernels/moe/test_hybrid_w4a16_moe.py → 38 passed.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
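A minimal sketch of the refactored contract, assuming a simplified shape for both functions: the allocator only receives a boolean, and the caller validates a supplied `out` before using it. The `Buf` record, the lowercase function names, and the two-element return value are hypothetical; the real `_allocate_buffers` and `_fused_experts` signatures are not shown in this PR text.

```python
from typing import NamedTuple, Optional, Tuple


class Buf(NamedTuple):
    """Hypothetical tensor descriptor: only the properties that get validated."""

    shape: Tuple[int, ...]
    dtype: str
    device: str
    contiguous: bool = True


def allocate_buffers(shape, dtype, device, allocate_output: bool):
    # The workspace is always allocated; fused_out only when requested.
    workspace = Buf(shape, dtype, device)
    fused_out = Buf(shape, dtype, device) if allocate_output else None
    return workspace, fused_out


def fused_experts(shape, dtype, device, out: Optional[Buf] = None) -> Buf:
    # Validation lives with the caller-facing function and is gated on
    # `out is not None`, mirroring the description above.
    if out is not None:
        assert out.shape == shape, "out has the wrong shape"
        assert out.dtype == dtype, "out has the wrong dtype"
        assert out.device == device, "out is on the wrong device"
        assert out.contiguous, "out must be contiguous"

    _workspace, fused_out = allocate_buffers(
        shape, dtype, device, allocate_output=out is None
    )
    if fused_out is None:
        fused_out = out  # caller-supplied destination
    # ... the expert kernels would write into fused_out here ...
    return fused_out
```

Keeping the asserts in the caller-facing function means a mismatched `out` fails before any buffer is allocated, which matches the stated goal of placing the contract where callers can act on it.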
fe1ef08 to
d134c72
Compare
Summary

In the modular MoE pipeline, the trailing `TopKWeightAndReduceNoOP.apply()` unconditionally copies `fused_expert_output → output` every layer. For experts that already write the post-reduce result to `fused_out` (e.g. `HybridW4A16MoEExperts` via `moe_unpermute`), this copy is pure overhead: one `__amd_rocclr_copyBuffer` per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

Adds an opt-in `accepts_output_alias()` hook on `FusedMoEExpertsModular`. When the expert opts in (and there are no shared experts), the modular kernel passes the output tensor in as an argument, so the results are written there directly instead of going through an extra copy.

Benchmark: cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4, max_concurrency=1.

`copyBuffer` per decode step: 48 → 0 (per profile capture).

Tests

No new test: the existing parametric `tests/kernels/moe/test_hybrid_w4a16_moe.py` exercises the modified `FusedMoEKernelModularImpl` end-to-end through `HybridW4A16MoEExperts` (the only expert that opts in). The aliasing branch is taken automatically when `shared_experts is None`. The existing tolerance against the torch reference holds; 38/38 pass locally.

Test plan

pytest tests/kernels/moe/test_hybrid_w4a16_moe.py -v