l-bat (Contributor) commented on Oct 7, 2025

Changes

This PR integrates the Sparse Attention algorithm into the GenAI Optimizations module in OpenVINO Contrib.
Sparse Attention accelerates the prefill stage of LLMs and multimodal LLMs with long prompts, high-resolution images, or videos by attending only to the most relevant query-key blocks. This block-wise attention mechanism reduces memory usage and FLOPs while preserving model accuracy.
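The block-wise mechanism described above can be sketched in a few lines. This is a minimal single-head NumPy illustration of the idea, not the contrib implementation; the function name and signature are invented for this sketch, and the block mask is assumed to always enable each query block's own diagonal block:

```python
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block_size):
    # Illustrative sketch: compute causal attention only over the key blocks
    # enabled in block_mask. q, k, v: (seq_len, head_dim);
    # block_mask: (n_blocks, n_blocks) bool, assumed to include the diagonal
    # block so every query row has at least one valid key.
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(v)
    for qi in range(n_blocks):
        rows = slice(qi * block_size, (qi + 1) * block_size)
        kept = np.nonzero(block_mask[qi])[0]
        # gather positions of all enabled key blocks for this query block
        cols = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in kept]
        )
        scores = (q[rows] @ k[cols].T) * scale
        # apply the causal mask inside the gathered keys
        valid = cols[None, :] <= np.arange(rows.start, rows.stop)[:, None]
        scores = np.where(valid, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[rows] = weights @ v[cols]
    return out
```

With a fully enabled mask this reduces to dense causal attention; the savings come from skipping the score and value products for disabled blocks.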
Supported modes:

  • Tri-Shape Mode – A static block-sparse attention pattern that preserves the initial tokens, local windows, and the final segment of the query. The resulting triangular structure captures critical tokens while maintaining instruction-following performance in both turn-0 and multi-request scenarios. Paper: https://arxiv.org/pdf/2412.10319
  • XAttention Mode – A dynamic block-sparse attention mechanism that focuses computation on the most important regions of the attention matrix, selected via antidiagonal block scoring. This reduces FLOPs and memory usage without significant loss of accuracy. Paper: https://arxiv.org/pdf/2503.16428
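As a rough illustration of the two modes, the sketch below builds a Tri-Shape block mask and a simplified antidiagonal importance score. All function and parameter names here are invented for illustration (the actual module API and the papers' exact formulations differ, e.g. XAttention scores softmaxed strided antidiagonal sums):

```python
import numpy as np

def tri_shape_mask(n_blocks, sink_blocks=1, local_blocks=2, last_blocks=1):
    # Static Tri-Shape pattern (sketch): keep the initial "sink" blocks,
    # a local window around the diagonal, and make the final query blocks
    # dense, which together form the triangular structure.
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    mask[:, :sink_blocks] = True                    # initial tokens
    for qi in range(n_blocks):
        lo = max(0, qi - local_blocks + 1)
        mask[qi, lo:qi + 1] = True                  # local window
    mask[n_blocks - last_blocks:, :] = True         # final query segment
    causal = np.tril(np.ones((n_blocks, n_blocks), dtype=bool))
    return mask & causal

def antidiagonal_block_score(q_block, k_block, stride=2):
    # XAttention-style importance estimate (heavily simplified): instead of
    # the full B x B block product, sample query/key pairs along a strided
    # antidiagonal and sum their dot products as a cheap proxy score.
    B = q_block.shape[0]
    idx = np.arange(0, B, stride)
    return float(np.sum(q_block[idx] * k_block[B - 1 - idx]))
```

In the dynamic mode, such per-block scores would be thresholded to pick which (query block, key block) pairs enter the sparse attention computation.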

Related tickets

169957

l-bat requested a review from a team as a code owner on October 7, 2025 at 13:39
github-actions bot added the label "dependencies" (Pull requests that update a dependency file) on Oct 7, 2025
apaniukov enabled auto-merge (squash) on October 7, 2025 at 14:29
apaniukov merged commit 7462a45 into openvinotoolkit:master on Oct 7, 2025
10 checks passed