[GPU] Add the capability for KV cache to update past KV #33114

Merged
Kotomi-Du merged 43 commits into openvinotoolkit:master from Kotomi-Du:update_kvcache_node
Jan 21, 2026

@Kotomi-Du
Contributor

Kotomi-Du commented Dec 3, 2025

Details:

This PR recognizes the ScatterElementsUpdate + Slice pattern (the blue nodes in the picture below) and fuses it into a multi-stage KVCache node. In addition, past_seq_len from ONNX GQA, which is used to correct the length of the KV cache, was missing from the decomposition of the ONNX operator; this PR adds it so that GQA benefits from the new KVCache capability.

After fusion, the two fused operations are handled as follows:

  1. ScatterElementsUpdate is handled by adding a reorder_stage that executes the ScatterElementsUpdate kernel.
  2. Slice is handled as an in-place crop that updates the data padding of variableState.
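
As a rough illustration of what the two stages compute, here is a minimal pure-Python sketch. It is not the GPU plugin implementation: `PaddedCache`, `pad_end`, and `crop_to` are hypothetical names, the cache is simplified to a 2D list of rows, and only axis 0 is handled. The scatter step overwrites rows in place, and the crop step only changes how much of the buffer is considered valid, without moving or reallocating data.

```python
def scatter_elements_update(data, indices, updates):
    """Axis-0 scatter on a 2D list, in place: data[indices[i][j]][j] = updates[i][j]."""
    for i, idx_row in enumerate(indices):
        for j, idx in enumerate(idx_row):
            data[idx][j] = updates[i][j]


class PaddedCache:
    """Hypothetical stand-in for a cache buffer with trailing padding."""

    def __init__(self, rows):
        self.buffer = rows   # physical storage, never reallocated here
        self.pad_end = 0     # number of trailing rows hidden from readers

    def crop_to(self, new_len):
        # "In-place crop": only the padding bookkeeping changes.
        self.pad_end = len(self.buffer) - new_len

    def view(self):
        # Copy of the logically valid rows, for inspection only.
        return self.buffer[:len(self.buffer) - self.pad_end]


cache = PaddedCache([[10, 11], [20, 21], [30, 31], [40, 41]])
# Reorder: write the values of row 3 into row 1.
scatter_elements_update(cache.buffer, indices=[[1, 1]], updates=[[40, 41]])
# Trim: keep only the first two rows as valid.
cache.crop_to(2)
print(cache.view())       # [[10, 11], [40, 41]]
print(len(cache.buffer))  # 4, the physical buffer is untouched
```

The point of the sketch is the split of responsibilities: the reorder needs a kernel because it moves data, while the trim can be expressed purely as metadata.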

The picture below shows the graph changes before and after fusion.
image

Motivation and Context

The target application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering), because only a single branch of the speculative draft tree is accepted after verification.
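
To make the trimming and reordering concrete, here is a small sketch under stated assumptions. The PR does not describe how the application builds its indices, so `compaction_plan`, `apply_plan`, and `accepted_draft_positions` are hypothetical, and the cache is modeled as a flat list with one entry per token. The idea is that the accepted branch's entries are moved so they sit contiguously after the past tokens, and everything beyond them is dropped.

```python
def compaction_plan(past_len, accepted_draft_positions):
    """Return (src_rows, dst_rows, new_len) that keep one accepted branch.

    accepted_draft_positions are offsets into the draft region, which is
    assumed to start at row past_len of the cache.
    """
    src = [past_len + p for p in accepted_draft_positions]
    dst = [past_len + k for k in range(len(accepted_draft_positions))]
    return src, dst, past_len + len(accepted_draft_positions)


def apply_plan(cache, src, dst, new_len):
    # Read all sources first so overlapping moves cannot clobber each other.
    staged = [cache[s] for s in src]
    for d, row in zip(dst, staged):
        cache[d] = row          # reorder
    del cache[new_len:]         # trim rejected draft entries


cache = ["p0", "p1", "d0", "d1", "d2", "d3"]   # 2 past tokens, 4 draft tokens
src, dst, new_len = compaction_plan(past_len=2, accepted_draft_positions=[0, 2, 3])
apply_plan(cache, src, dst, new_len)
print(cache)  # ['p0', 'p1', 'd0', 'd2', 'd3']
```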

The KV cache API currently available in OV is too slow to meet customer requirements; details are in CVS-174809. As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph. This PR recognizes the pattern of the added nodes and fuses them into a multi-stage KVCache node for better performance.

Tickets:

CVS-176367

Related PR

#32708

@Kotomi-Du Kotomi-Du requested review from a team as code owners December 3, 2025 19:18
@Kotomi-Du Kotomi-Du marked this pull request as draft December 3, 2025 19:19
@github-actions github-actions bot added category: IE Tests OpenVINO Test: plugins and common category: GPU OpenVINO GPU plugin labels Dec 3, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Dec 3, 2025
@p-durandin
Contributor

build_jenkins

@ZackyLake
Contributor

build_jenkins

@songbell
Contributor

songbell commented Jan 9, 2026

is your pipeline only able to run on GPU not CPU?

@Kotomi-Du
Contributor Author

> is your pipeline only able to run on GPU not CPU?

Yes, our pipeline is required to run on GPU.

@p-durandin
Contributor

@Kotomi-Du please fix CI errors

Contributor

@isanghao isanghao left a comment

LGTM, minor comments are left

// readvalue --> any
//     |          |
//     |          v
//     ------> kvcache
Contributor

could you elaborate more why/how it can be optimized?

Contributor

If read_value is not optimized, we get incorrect results from ScatterElementsUpdate, so some change here is needed.

The original code simply checks whether readvalue is used by a single user. To be honest, I don't know if that proves anything; that user could actually be a no-op with multiple further users.

From the comment in its caller, it looks like it is actually trying to ensure that assign will not impact any later user of readvalue, so the original logic already looks questionable.

Anyway, in our case, readvalue's users eventually need to pass through kvcache before assign, which makes the kvcache node a dominator of the assign node, so it can safely be treated as if readvalue were directly connected to kvcache, and can be optimized.
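
The dominance argument can be sketched as a small graph check. This is only an illustration of the reasoning, not the code in this PR: `passes_through` is a hypothetical helper, and the graph is a plain dict mapping each node to its users. It searches from readvalue while refusing to step onto kvcache; if assign is still reachable, some path bypasses kvcache and the optimization would not be safe.

```python
def passes_through(users, start, target, gate):
    """True if every path from start to target goes through gate.

    If target is unreachable from start, the result is vacuously True.
    """
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == target:
            return False  # reached target without crossing gate
        for user in users.get(node, []):
            if user != gate and user not in seen:
                seen.add(user)
                stack.append(user)
    return True


# The pattern from the code comment: readvalue feeds kvcache directly
# and through an intermediate node, and kvcache feeds assign.
fused = {
    "readvalue": ["any", "kvcache"],
    "any": ["kvcache"],
    "kvcache": ["assign"],
}
print(passes_through(fused, "readvalue", "assign", "kvcache"))  # True

# Counter-example: the intermediate node also reaches assign directly.
bypass = dict(fused, any=["kvcache", "assign"])
print(passes_through(bypass, "readvalue", "assign", "kvcache"))  # False
```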

Contributor

Actually, my ask here was to add a comment on the "why/how". As it is not blocking the code merge, could you follow up in a separate PR?

@Kotomi-Du
Contributor Author

Kotomi-Du commented Jan 16, 2026

Hi @mvafin @mryzhov, please review the frontend and transformation parts for GQA, thanks.

Contributor

@isanghao isanghao left a comment

LGTM, could you check this comment? #33114 (comment)

Contributor

@mryzhov mryzhov left a comment

Looks good from the Transformations perspective

Labels

category: GPU (OpenVINO GPU plugin)
category: IE Tests (OpenVINO Test: plugins and common)
category: transformations (OpenVINO Runtime library - Transformations)
Code Freeze
ExternalIntelPR (External contributor from Intel)

9 participants