Skip redundant moe_sum_reduce for single-expert routing on XPU #22660
Open
rahulvijayaraghavan wants to merge 1 commit into sgl-project:main from
Conversation
When topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0, the second invoke_fused_moe_kernel call already writes its output directly into out_hidden_states, so the subsequent moe_sum_reduce is a no-op reduction over a single element. This adds an early-exit check on the XPU path to skip the unnecessary kernel launch, matching the optimization already present in the CUDA path.

This is particularly relevant for Llama-4-Scout models (e.g. Llama-4-Scout-17B-16E-Instruct), which set num_experts_per_tok = 1, so this fast path is hit on every MoE layer forward pass.
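A minimal sketch of why the skip is safe, assuming intermediate_cache3 has the usual [num_tokens, topk, hidden_size] layout (the tensor names and shapes below are illustrative, not the actual kernel code):

```python
import torch

# Hypothetical sizes; intermediate_cache3 holds the second GEMM's output.
num_tokens, topk, hidden_size = 4, 1, 8
intermediate_cache3 = torch.randn(num_tokens, topk, hidden_size)
routed_scaling_factor = 1.0

# Eager-mode stand-in for moe_sum_reduce: sum over the topk axis, then scale.
reduced = intermediate_cache3.sum(dim=1) * routed_scaling_factor

# With topk == 1 and a scaling factor of 1.0, the reduction returns exactly
# what the second GEMM already wrote, so the kernel launch can be skipped.
assert torch.equal(reduced, intermediate_cache3.squeeze(1))
```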
Contributor

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
polisettyvarma approved these changes Apr 13, 2026
mingfeima requested changes Apr 13, 2026
Comment on lines +681 to +688
```python
if topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0:
    pass  # we write directly into out_hidden_states
else:
    moe_sum_reduce(
        intermediate_cache3.view(*intermediate_cache3.shape),
        out_hidden_states[begin_chunk_idx:end_chunk_idx],
        routed_scaling_factor,
    )
```
Collaborator
Make sure we have test cases to cover:
- topk == 1, routed_scaling_factor == 1.0
- topk == 1, routed_scaling_factor != 1.0

Additionally, is it possible to move the topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0 check up and skip the intermediate_cache3 allocation in the first place?
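A self-contained sketch of the two requested cases, using an explicit eager-mode reduction as a stand-in for moe_sum_reduce (this is an illustrative test outline, not the existing SGLang test suite):

```python
import pytest
import torch


@pytest.mark.parametrize("routed_scaling_factor", [1.0, 0.5])
def test_single_expert_reduce(routed_scaling_factor):
    # cache3 mimics intermediate_cache3 with topk == 1.
    cache3 = torch.randn(8, 1, 64)
    expected = cache3.sum(dim=1) * routed_scaling_factor

    if cache3.shape[1] == 1 and routed_scaling_factor == 1.0:
        # Fast path under test: the second GEMM output is already final,
        # so no reduction kernel needs to be launched.
        out = cache3.squeeze(1).clone()
    else:
        # Slow path: the reduction must still run to apply the scaling.
        out = cache3.sum(dim=1) * routed_scaling_factor

    torch.testing.assert_close(out, expected)
```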