
[XPU] Support apply_router_weight_on_input for Llama4 for fused_experts #22654

Open

rahulvijayaraghavan wants to merge 1 commit into sgl-project:main from rahulvijayaraghavan:llama4-fused-experts-apply-router-weight-on-input

Conversation

@rahulvijayaraghavan (Contributor) commented Apr 13, 2026

When apply_router_weight_on_input is True (as used by Llama4's MoE architecture), apply router weights directly to the input tensor before calling fused_experts, and replace topk_weights with ones. This is needed because fused_experts does not natively handle this flag.

Enables Llama4 model support on XPU fused_experts() where apply_router_weight_on_input was previously unhandled.
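The described workaround can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the helper name and signature are hypothetical, and the top-1 assertion reflects the usual constraint that router weights can only be folded into the input when each token is routed to a single expert.

```python
import torch


def fold_router_weight_into_input(
    hidden_states: torch.Tensor,   # (num_tokens, hidden_size)
    topk_weights: torch.Tensor,    # (num_tokens, top_k)
    apply_router_weight_on_input: bool,
):
    """Hypothetical sketch: pre-scale inputs by router weights so that
    fused_experts, which does not handle apply_router_weight_on_input,
    can be called with neutral (all-ones) topk_weights."""
    if apply_router_weight_on_input:
        # Folding the weight into the activation is only equivalent
        # when each token goes to exactly one expert (top-1 routing),
        # as in Llama4's MoE layers.
        assert topk_weights.shape[-1] == 1, (
            "apply_router_weight_on_input requires top-1 routing"
        )
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)
        topk_weights = torch.ones_like(topk_weights)
    return hidden_states, topk_weights
```

With this transformation, the subsequent weighted combination inside `fused_experts` multiplies by 1.0, so the router weight is applied exactly once, on the input.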

SGLANG_USE_SGL_XPU=1 python3 -m sglang.launch_server --model models--meta-llama--Llama-4-Scout-17B-16E-Instruct/snapshots/92f3b1597a195b523d8d9e5700e57e4fbb8f20d3/ --tp 8 --mem-fraction-static 0.7 --attention-backend triton --cpu-offload-gb 20 --context-length 8192

Before:

# python benchmark/gsm8k/bench_sglang.py --num-questions 200 --num-shots 5 --host http://127.0.0.1 --port 30000
Accuracy: 0.935
Invalid: 0.000
Latency: 3394.296 s
Output throughput: 5.874 token/s

After:

# python benchmark/gsm8k/bench_sglang.py --num-questions 200 --num-shots 5 --host http://127.0.0.1 --port 30000
Accuracy: 0.945
Invalid: 0.000
Latency: 2413.049 s
Output throughput: 8.180 token/s


@github-actions github-actions bot added the quant LLM Quantization label Apr 13, 2026