
[XPU] Support apply_router_weight_on_input for Llama4 for fused_experts #22654

Open

rahulvijayaraghavan wants to merge 1 commit into sgl-project:main from rahulvijayaraghavan:llama4-fused-experts-apply-router-weight-on-input

Conversation

@rahulvijayaraghavan (Contributor) commented Apr 13, 2026

When apply_router_weight_on_input is True (as used by Llama4's MoE architecture), apply router weights directly to the input tensor before calling fused_experts, and replace topk_weights with ones. This is needed because fused_experts does not natively handle this flag.

Enables Llama4 model support on XPU fused_experts() where apply_router_weight_on_input was previously unhandled.
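The described workaround can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the helper name and signature are hypothetical, and the top-1 assertion reflects the usual constraint that router weights can only be folded into the input when each token is routed to a single expert.

```python
import torch


def fold_router_weight_into_input(
    hidden_states: torch.Tensor,   # (num_tokens, hidden_size)
    topk_weights: torch.Tensor,    # (num_tokens, top_k)
    apply_router_weight_on_input: bool,
):
    """Hypothetical sketch: pre-scale inputs by router weights so that
    fused_experts, which does not handle apply_router_weight_on_input,
    can be called with neutral (all-ones) topk_weights."""
    if apply_router_weight_on_input:
        # Folding the weight into the activation is only equivalent
        # when each token goes to exactly one expert (top-1 routing),
        # as in Llama4's MoE layers.
        assert topk_weights.shape[-1] == 1, (
            "apply_router_weight_on_input requires top-1 routing"
        )
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)
        topk_weights = torch.ones_like(topk_weights)
    return hidden_states, topk_weights
```

With this transformation, the subsequent weighted combination inside `fused_experts` multiplies by 1.0, so the router weight is applied exactly once, on the input.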

SGLANG_USE_SGL_XPU=1 python3 -m sglang.launch_server --model models--meta-llama--Llama-4-Scout-17B-16E-Instruct/snapshots/92f3b1597a195b523d8d9e5700e57e4fbb8f20d3/ --tp 8 --mem-fraction-static 0.7 --attention-backend triton --cpu-offload-gb 20 --context-length 8192

Before:

# python benchmark/gsm8k/bench_sglang.py --num-questions 200 --num-shots 5 --host http://127.0.0.1 --port 30000
Accuracy: 0.935
Invalid: 0.000
Latency: 3394.296 s
Output throughput: 5.874 token/s

After:

# python benchmark/gsm8k/bench_sglang.py --num-questions 200 --num-shots 5 --host http://127.0.0.1 --port 30000
Accuracy: 0.945
Invalid: 0.000
Latency: 2413.049 s
Output throughput: 8.180 token/s


@github-actions github-actions bot added the quant LLM Quantization label Apr 13, 2026