[NPU] Qwen3-VL-30B use split_qkv_rmsnorm_rope for extend#29505
[NPU] Qwen3-VL-30B use split_qkv_rmsnorm_rope for extend#29505silencejade wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request simplifies the NPU condition in forward_prepare for the Qwen3 MoE model, routing all NPU forward passes to forward_prepare_npu. However, the reviewer identified a critical issue where this change will cause runtime errors or incorrect embeddings for models like Qwen3-VL-30B because forward_prepare_npu does not handle multi-dimensional positions required by MRotaryEmbedding (MRoPE). A code suggestion was provided to resolve this issue.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| not _is_npu | ||
| or forward_batch.forward_mode.is_extend_or_draft_extend_or_mixed() | ||
| ): | ||
| if not _is_npu: |
There was a problem hiding this comment.
By changing this condition to if not _is_npu:, all NPU forward passes (both extend and decode) will now invoke forward_prepare_npu.\n\nHowever, in forward_prepare_npu (lines 561-585), self.rotary_emb.get_cos_sin_with_position(positions) is called with the 1D positions tensor. For Qwen3-VL-30B, self.rotary_emb is an instance of MRotaryEmbedding (MRoPE), which expects multi-dimensional positions (forward_batch.mrope_positions of shape [3, seq_len]) rather than 1D positions. Passing 1D positions to MRotaryEmbedding will result in incorrect rotary embedding application or runtime errors.\n\nTo prevent this, forward_prepare_npu should be updated to check if self.rotary_emb is an instance of MRotaryEmbedding and pass forward_batch.mrope_positions accordingly:\n\npython\n def forward_prepare_npu(\n self,\n positions: torch.Tensor,\n hidden_states: torch.Tensor,\n forward_batch: ForwardBatch,\n ):\n qkv, _ = self.qkv_proj(hidden_states)\n if self.attn.layer_id == self.start_layer:\n if isinstance(self.rotary_emb, MRotaryEmbedding):\n self.rotary_emb.get_cos_sin_with_position(forward_batch.mrope_positions)\n else:\n self.rotary_emb.get_cos_sin_with_position(positions)\n ...\n
|
/tag-and-rerun-ci |
Motivation
When using the torch_npu.npu_mrope interface with Qwen3-VL-30B, certain precision issues occur. We need to adjust the invocation path: during the extend phase, utilize the fused operator split_qkv_rmsnorm_rope to bypass this operator.
Modifications
Adjust the branch judgment so that forward_prepare of Qwen3MoeAttention uses forward_prepare_npu by default in NPU scenarios.
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ciCI States
Latest PR Test (Base): ❌ Run #28285454272
Latest PR Test (Extra): 🚫 Run #28285454197