[NPU]GLM-4.7-Flash optimize with fused kernels by Estrella-xx · Pull Request #29509 · sgl-project/sglang

Estrella-xx · 2026-06-27T10:48:55Z

Motivation

Introduce a fused Triton kernel to improve GLM-4.7-Flash performance on NPU.

Modifications

Replace the original split + RMSNorm pipeline with a fused Triton kernel (fused_split_qk_norm).
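
To make the intended equivalence concrete, here is a minimal pure-Python sketch of what "split + RMSNorm" versus a fused variant could compute for a single row. It is not the Triton kernel from this PR. The row layout (query latent, then key latent, then rotary part), the choice to normalize the first two parts, and the names rms_norm, split_then_norm and fused_split_norm are all assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def split_then_norm(row, q_rank, kv_rank, q_weight, kv_weight, eps=1e-6):
    """Unfused path: slice the row into three parts, then normalize two of them."""
    q = row[:q_rank]
    k_nope = row[q_rank:q_rank + kv_rank]
    k_pe = row[q_rank + kv_rank:]  # assumed rotary part, passed through unchanged
    return rms_norm(q, q_weight, eps), rms_norm(k_nope, kv_weight, eps), list(k_pe)


def fused_split_norm(row, q_rank, kv_rank, q_weight, kv_weight, eps=1e-6):
    """Fused path: one read pass for both sums of squares, then one write pass."""
    q_sq = 0.0
    kv_sq = 0.0
    for i in range(q_rank + kv_rank):
        if i < q_rank:
            q_sq += row[i] * row[i]
        else:
            kv_sq += row[i] * row[i]
    q_scale = 1.0 / math.sqrt(q_sq / q_rank + eps)
    kv_scale = 1.0 / math.sqrt(kv_sq / kv_rank + eps)
    q = [row[i] * q_scale * q_weight[i] for i in range(q_rank)]
    k_nope = [row[q_rank + i] * kv_scale * kv_weight[i] for i in range(kv_rank)]
    k_pe = list(row[q_rank + kv_rank:])
    return q, k_nope, k_pe
```

A check of this kind (both paths agreeing within floating-point tolerance on the same input) is one way to produce the before/after evidence that the Accuracy Tests section asks for.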

Accuracy Tests

Before:

After:

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28286980781
Latest PR Test (Extra): ❌ Run #28286980724

gemini-code-assist

Code Review

This pull request optimizes the DeepSeek-V2 MLA attention preparation on NPU by introducing a fused split and normalization kernel (fused_split_qk_norm) for smaller sequence lengths. However, the review identifies two critical issues: first, using the fused kernel when context parallel is enabled leads to a NameError because latent_cache is not defined; second, removing the definition of k_pe from the outer scope causes a NameError when m.q_lora_rank is None.


gemini-code-assist · 2026-06-27T10:50:12Z

+                k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)
+                k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1)
+            else:
+                if qkv_latent.shape[0] < 65536:


When context parallel is enabled (dsa_use_prefill_cp(forward_batch) is True), latent_cache is required later in m.rebuild_cp_kv_cache (line 249). However, if qkv_latent.shape[0] < 65536 is True, the fused kernel fused_split_qk_norm is called, which does not define latent_cache, leading to a NameError at runtime.

We should add a check to ensure we do not use the fused kernel when context parallel is enabled, similar to the check in forward_dsa_prepare_npu.

Suggested change

-                if qkv_latent.shape[0] < 65536:
+                if qkv_latent.shape[0] < 65536 and not dsa_use_prefill_cp(forward_batch):
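
The failure mode described in this comment can be reproduced with a toy function. The sketch below is a simplified stand-in, not the code from the PR: prepare, its parameters and the string placeholders are hypothetical, and only the 65536 threshold and the roles of latent_cache, fused_split_qk_norm and m.rebuild_cp_kv_cache come from the review text. In Python, reading a local name that was never bound on the taken branch raises UnboundLocalError, which is a subclass of NameError.

```python
FUSED_TOKEN_LIMIT = 65536  # threshold quoted in the review comment


def prepare(num_tokens, use_prefill_cp, guard_cp):
    """Toy model of the branch structure; strings stand in for tensors."""
    use_fused = num_tokens < FUSED_TOKEN_LIMIT
    if guard_cp:
        # Suggested fix: never take the fused path under context parallel.
        use_fused = use_fused and not use_prefill_cp

    if use_fused:
        # Stand-in for fused_split_qk_norm: it never binds latent_cache.
        k_nope, k_pe = "k_nope", "k_pe"
    else:
        latent_cache = "latent_cache"
        k_nope, k_pe = "k_nope", "k_pe"

    if use_prefill_cp:
        # Stand-in for m.rebuild_cp_kv_cache, which reads latent_cache.
        return k_nope, k_pe, latent_cache
    return k_nope, k_pe, None


if __name__ == "__main__":
    print(prepare(128, True, guard_cp=True))  # guarded: falls back to the unfused path
    try:
        prepare(128, True, guard_cp=False)  # unguarded: latent_cache was never bound
    except NameError as exc:
        print(type(exc).__name__)
```

With the guard in place, the context-parallel case always goes through the branch that binds latent_cache, at the cost of not using the fused kernel there.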

gemini-code-assist · 2026-06-27T10:50:12Z

@@ -217,7 +237,6 @@ def forward_mla_prepare_npu(
            k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)


By removing the line k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1) from the outer scope (previously line 220 on the LEFT side), k_pe is no longer defined when m.q_lora_rank is None (the else branch of the outer conditional). This will cause a NameError when attempting to use k_pe in m.rotary_emb (line 245).

We should define k_pe inside the else block to ensure it is available when m.q_lora_rank is None.

Suggested change

             k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)
+            k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1)
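
The same point can be checked mechanically by exercising both arms of the outer conditional. The following sketch is an illustration under assumptions, not the PR's code: prepare_keys, its list-based inputs and the define_in_else switch are hypothetical, and only the role of q_lora_rank, kv_lora_rank, latent_cache and k_pe is taken from the comment above.

```python
def prepare_keys(q_lora_rank, latent_cache, kv_lora_rank, define_in_else):
    """Toy two-branch function; a flat list stands in for the latent_cache tensor."""
    if q_lora_rank is not None:
        k_nope = latent_cache[:kv_lora_rank]
        k_pe = latent_cache[kv_lora_rank:]
    else:
        k_nope = latent_cache[:kv_lora_rank]
        if define_in_else:
            # Suggested fix: bind k_pe in this branch as well.
            k_pe = latent_cache[kv_lora_rank:]
    # Stand-in for the later m.rotary_emb call, which needs k_pe on every path.
    return k_nope, k_pe


def branch_binds_k_pe(q_lora_rank, define_in_else):
    """Return True if the chosen branch leaves k_pe defined."""
    try:
        prepare_keys(q_lora_rank, [1, 2, 3, 4], 3, define_in_else)
        return True
    except NameError:  # UnboundLocalError is a subclass of NameError
        return False
```

Here branch_binds_k_pe(None, False) reports the regression, while branch_binds_k_pe(None, True) and both q_lora_rank-set cases pass, which suggests that a unit test covering the q_lora_rank is None configuration would have caught the issue.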

[NPU]GLM-4.7-Flash optimize with fused kernels

01face6

Estrella-xx requested review from iforgetmyname, ping1jing2 and whybeyoung as code owners June 27, 2026 10:48

github-actions Bot added the deepseek and npu labels Jun 27, 2026

gemini-code-assist Bot reviewed Jun 27, 2026

Estrella-xx wants to merge 1 commit into sgl-project:main from Estrella-xx:glm4.7flash_fusion_op
