
Conversation

@ceciliapeng2011
Contributor

@ceciliapeng2011 ceciliapeng2011 commented Jan 7, 2026

Details:

  • [GPU] Extend XAttention to support block sizes 128 and 256.

Tickets:

@ceciliapeng2011 ceciliapeng2011 requested review from a team as code owners January 7, 2026 06:25
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jan 7, 2026
@peterchen-intel peterchen-intel added this to the 2026.0 milestone Jan 8, 2026
Contributor

@riverlijunjie riverlijunjie left a comment

Some minor comments; overall LGTM.

svmptr_t sparse_mask_base [[type("svmptr_t")]],
svmptr_t wg_sparse_mask_base [[type("svmptr_t")]],
bool validate,
int SPARSE_BLOCK_SIZE,
Contributor
Do we really need this parameter if it is a macro?

Contributor Author

Yes, we do. SPARSE_BLOCK_SIZE is now a runtime parameter of the PA kernel, instead of a compile-time JIT const.

res_event = {execute_stage(res_event, instance, xattn_estimate_find_block)};
res_event = {execute_stage(res_event, instance, xattn_estimate_post_proc)};
if (!bypass_xattn(params)) {
if (rt_params->xattn_block_size == 128) {
Contributor

If xattn_block_size is a fixed value, we don't need add_stage for both 128 and 256.

Contributor Author

Unfortunately xattn_block_size is a compile-time JIT const for the xattention kernels, while it is also a runtime parameter of a model with a PA node. This means users can switch it dynamically from time to time during inference. So this PR has to create two stages (one for 128, the other for 256) and switch between them on the fly accordingly.


Labels

category: GPU OpenVINO GPU plugin Code Freeze
