Conversation

@ceciliapeng2011 (Contributor) commented on Jan 4, 2026

Improve KVCache quantization, XAttention flexibility, and sparse attention performance.

Details:

  • Use float as the internal precision for KVCache quantization in the kvcache_update CM kernel, fixing an accuracy issue with the QWen3-32B int8 model (see the sketch after this list).
  • Remove the restriction in the PA 2nd-token CM kernel that limited heads_num / kv_heads_num to <= 8, resolving the MiniCPM4 failure.
  • Fix the phi-3-mini-128k-instruct issue caused by head_size=96 not being divisible by 64 in the xattention_gemm_qk kernel.
  • Optimize sparse attention with fp16 KVCache when sparsity is small.

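The core of the QWen3-32B fix is keeping the scale/zero-point arithmetic in float rather than half. Below is a minimal standalone sketch of the idea, not the actual kvcache_update CM kernel; the function name quantize_block_u8 and the asymmetric uint8 scheme are illustrative assumptions.

```cpp
// Sketch only: per-block asymmetric uint8 quantization of KV cache values,
// with the scale/zero-point math done in float. With half-precision
// intermediates, -min_val / scale_val can exceed the fp16 range for blocks
// with a large offset and a narrow value range, overflowing the zero point
// (the "overflow zp" the commit message refers to).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_block_u8(const std::vector<float>& block,
                       std::vector<uint8_t>& out,
                       float& scale_val, float& zp_val) {
    float min_val = *std::min_element(block.begin(), block.end());
    float max_val = *std::max_element(block.begin(), block.end());

    // Keep scale_val and zp_val in float, not half.
    scale_val = (max_val - min_val) / 255.0f;
    if (scale_val == 0.0f) scale_val = 1.0f;   // constant block: avoid div-by-zero
    zp_val = std::round(-min_val / scale_val);

    out.resize(block.size());
    for (size_t i = 0; i < block.size(); ++i) {
        float q = std::round(block[i] / scale_val + zp_val);
        out[i] = static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
    }
}
```
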
Tickets:

Commit: fix QWen3-32B int8 model accuracy issue: scale_val should be calculated with float precision to avoid an overflowed zp
@ceciliapeng2011 requested review from a team as code owners on January 4, 2026 03:17
@github-actions bot added the "category: GPU" (OpenVINO GPU plugin) label on Jan 4, 2026
@ceciliapeng2011 marked this pull request as draft on January 4, 2026 03:18
@ceciliapeng2011 changed the title from "fix QWen3-32B int8 model accuracy issue: scale_val should be calculat…" to "[GPU] some fixes and optimizations to CM PA and XAttention kernels" on Jan 4, 2026