13 changes: 8 additions & 5 deletions python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
@@ -678,11 +678,14 @@ def fused_experts_impl(
                 routed_scaling_factor,
             )
         elif _is_xpu:
-            moe_sum_reduce(
-                intermediate_cache3.view(*intermediate_cache3.shape),
-                out_hidden_states[begin_chunk_idx:end_chunk_idx],
-                routed_scaling_factor,
-            )
+            if topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0:
+                pass  # we write directly into out_hidden_states
+            else:
+                moe_sum_reduce(
+                    intermediate_cache3.view(*intermediate_cache3.shape),
+                    out_hidden_states[begin_chunk_idx:end_chunk_idx],
+                    routed_scaling_factor,
+                )
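
For context on why the skip is safe: judging from the call site, `moe_sum_reduce` reduces `intermediate_cache3` over the top-k dimension and applies `routed_scaling_factor`; with `topk == 1` and a scaling factor of `1.0` that reduction degenerates into an identity copy. A minimal sketch of the equivalence under that assumed semantics — `moe_sum_reduce_ref` below is an illustrative stand-in, not the actual SGLang kernel:

```python
# Assumption: moe_sum_reduce(x, out, s) behaves like out = x.sum(dim=1) * s,
# where x has shape [tokens, topk, hidden]. This is a reference sketch only.
import torch

def moe_sum_reduce_ref(x: torch.Tensor, out: torch.Tensor, scale: float) -> None:
    # Reference semantics: reduce over the top-k dimension, then scale.
    out.copy_(x.sum(dim=1) * scale)

tokens, topk, hidden = 4, 1, 8
cache3 = torch.randn(tokens, topk, hidden)
out = torch.empty(tokens, hidden)

moe_sum_reduce_ref(cache3, out, 1.0)
# With topk == 1 and scale == 1.0 the reduction is an identity copy, which is
# why the patch can skip the kernel when the earlier stage already wrote its
# result into out_hidden_states.
assert torch.allclose(out, cache3.squeeze(1))
```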
Collaborator commented on lines +681 to +688:
Make sure we have test cases to cover:

  • topk == 1, routed_scaling_factor == 1.0
  • topk == 1, routed_scaling_factor != 1.0

Additionally, is it possible to move the `topk_ids.shape[1] == 1 and routed_scaling_factor == 1.0` check up and skip the `intermediate_cache3` allocation in the first place?

         else:
             if _has_vllm_ops:
                 vllm_ops.moe_sum(
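
One way the reviewer's hoisting suggestion could look — a hypothetical sketch, assuming the chunked layout shown in the diff; `alloc_cache3_or_alias` is an illustrative helper, not an SGLang API:

```python
# Hypothetical sketch: decide the topk == 1 fast path *before* allocating
# intermediate_cache3, so the no-op case never pays for the extra buffer.
# Names mirror the diff context; this is not the actual patch.
import torch

def alloc_cache3_or_alias(
    out_chunk: torch.Tensor,  # out_hidden_states[begin:end], [tokens, hidden]
    topk: int,
    routed_scaling_factor: float,
) -> torch.Tensor:
    if topk == 1 and routed_scaling_factor == 1.0:
        # Fast path: alias the output slice so the producing GEMM writes
        # straight into it and no moe_sum_reduce (or buffer) is needed.
        return out_chunk.view(out_chunk.shape[0], 1, out_chunk.shape[1])
    # Slow path: a real intermediate buffer that moe_sum_reduce later
    # reduces (and scales) into out_chunk.
    return torch.empty(
        (out_chunk.shape[0], topk, out_chunk.shape[1]),
        dtype=out_chunk.dtype,
        device=out_chunk.device,
    )

out_chunk = torch.zeros(4, 8)
cache3 = alloc_cache3_or_alias(out_chunk, topk=1, routed_scaling_factor=1.0)
cache3.fill_(3.0)  # stands in for the second GEMM's write
assert torch.equal(out_chunk, torch.full((4, 8), 3.0))  # aliased: no copy needed
```

A test parametrized over `routed_scaling_factor in (1.0, 2.5)` with `topk == 1` would exercise both branches of this helper, matching the two cases the reviewer lists above.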