Skip to content

[NPU]GLM-4.7-Flash optimize with fused kernels#29509

Open
Estrella-xx wants to merge 1 commit into
sgl-project:mainfrom
Estrella-xx:glm4.7flash_fusion_op
Open

[NPU]GLM-4.7-Flash optimize with fused kernels#29509
Estrella-xx wants to merge 1 commit into
sgl-project:mainfrom
Estrella-xx:glm4.7flash_fusion_op

Conversation

@Estrella-xx

@Estrella-xx Estrella-xx commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Motivation

Introduce a fused Triton kernel to improve model performance.

Modifications

Replace the original split + RMSNorm pipeline with a fused Triton kernel.

Accuracy Tests

Before:
image
After:
image

Speed Tests and Profiling

屏幕截图 2026-06-27 184820

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28286980781
Latest PR Test (Extra): ❌ Run #28286980724

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the DeepSeek-V2 MLA attention preparation on NPU by introducing a fused split and normalization kernel (fused_split_qk_norm) for smaller sequence lengths. However, the review identifies two critical issues: first, using the fused kernel when context parallel is enabled leads to a NameError because latent_cache is not defined; second, removing the definition of k_pe from the outer scope causes a NameError when m.q_lora_rank is None.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)
k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1)
else:
if qkv_latent.shape[0] < 65536:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When context parallel is enabled (dsa_use_prefill_cp(forward_batch) is True), latent_cache is required later in m.rebuild_cp_kv_cache (line 249). However, if qkv_latent.shape[0] < 65536 is True, the fused kernel fused_split_qk_norm is called, which does not define latent_cache, leading to a NameError at runtime.

We should add a check to ensure we do not use the fused kernel when context parallel is enabled, similar to the check in forward_dsa_prepare_npu.

Suggested change
if qkv_latent.shape[0] < 65536:
if qkv_latent.shape[0] < 65536 and not dsa_use_prefill_cp(forward_batch):

@@ -217,7 +237,6 @@ def forward_mla_prepare_npu(
k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

By removing the line k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1) from the outer scope (previously line 220 on the LEFT side), k_pe is no longer defined when m.q_lora_rank is None (the else branch of the outer conditional). This will cause a NameError when attempting to use k_pe in m.rotary_emb (line 245).

We should define k_pe inside the else block to ensure it is available when m.q_lora_rank is None.

Suggested change
k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)
k_nope = m.kv_a_layernorm(k_nope).unsqueeze(1)
k_pe = latent_cache[..., m.kv_lora_rank :].unsqueeze(1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant