Skip to content

[NPU] Qwen3-VL-30B use split_qkv_rmsnorm_rope for extend#29505

Open
silencejade wants to merge 1 commit into
sgl-project:mainfrom
silencejade:br_skip_mrope
Open

[NPU] Qwen3-VL-30B use split_qkv_rmsnorm_rope for extend#29505
silencejade wants to merge 1 commit into
sgl-project:mainfrom
silencejade:br_skip_mrope

Conversation

@silencejade

@silencejade silencejade commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Motivation

When using the torch_npu.npu_mrope interface with Qwen3-VL-30B, certain precision issues occur. We need to adjust the invocation path: during the extend phase, utilize the fused operator split_qkv_rmsnorm_rope to bypass this operator.

Modifications

Adjust the branch judgment so that forward_prepare of Qwen3MoeAttention uses forward_prepare_npu by default in NPU scenarios.

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28285454272
Latest PR Test (Extra): 🚫 Run #28285454197

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request simplifies the NPU condition in forward_prepare for the Qwen3 MoE model, routing all NPU forward passes to forward_prepare_npu. However, the reviewer identified a critical issue where this change will cause runtime errors or incorrect embeddings for models like Qwen3-VL-30B because forward_prepare_npu does not handle multi-dimensional positions required by MRotaryEmbedding (MRoPE). A code suggestion was provided to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

not _is_npu
or forward_batch.forward_mode.is_extend_or_draft_extend_or_mixed()
):
if not _is_npu:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

By changing this condition to if not _is_npu:, all NPU forward passes (both extend and decode) will now invoke forward_prepare_npu.\n\nHowever, in forward_prepare_npu (lines 561-585), self.rotary_emb.get_cos_sin_with_position(positions) is called with the 1D positions tensor. For Qwen3-VL-30B, self.rotary_emb is an instance of MRotaryEmbedding (MRoPE), which expects multi-dimensional positions (forward_batch.mrope_positions of shape [3, seq_len]) rather than 1D positions. Passing 1D positions to MRotaryEmbedding will result in incorrect rotary embedding application or runtime errors.\n\nTo prevent this, forward_prepare_npu should be updated to check if self.rotary_emb is an instance of MRotaryEmbedding and pass forward_batch.mrope_positions accordingly:\n\npython\n def forward_prepare_npu(\n self,\n positions: torch.Tensor,\n hidden_states: torch.Tensor,\n forward_batch: ForwardBatch,\n ):\n qkv, _ = self.qkv_proj(hidden_states)\n if self.attn.layer_id == self.start_layer:\n if isinstance(self.rotary_emb, MRotaryEmbedding):\n self.rotary_emb.get_cos_sin_with_position(forward_batch.mrope_positions)\n else:\n self.rotary_emb.get_cos_sin_with_position(positions)\n ...\n

@sglang-npu-bot

Copy link
Copy Markdown
Collaborator

/tag-and-rerun-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants