[GPU] Add the capability for KV cache to update past KV #33114
Kotomi-Du merged 43 commits into openvinotoolkit:master
Conversation
1. Trigger the trim flag when the Slice pattern is matched.
2. Pass past_seq_len, which is input data, into the trim info.
3. Store the trim info in the KV cache operator and kernel parameters.
4. Update the input[0] and output[0] layouts for the trim.
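The trim step described above can be modeled outside OpenVINO with a minimal numpy sketch (the function name and shapes here are hypothetical illustrations, not plugin code): only the first past_seq_len tokens of the cached K/V are kept before the new tokens are appended.

```python
import numpy as np

def trim_and_append(past_kv: np.ndarray, new_kv: np.ndarray, past_seq_len: int) -> np.ndarray:
    # past_kv: [batch, heads, max_len, head_dim]; trim along the sequence axis,
    # then append the freshly computed K/V for the new tokens.
    trimmed = past_kv[:, :, :past_seq_len, :]
    return np.concatenate([trimmed, new_kv], axis=2)

past = np.zeros((1, 2, 8, 4))   # stale cache with 8 slots
new = np.ones((1, 2, 1, 4))     # K/V for one new token
out = trim_and_append(past, new, past_seq_len=3)
print(out.shape)  # (1, 2, 4, 4)
```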
src/plugins/intel_gpu/src/kernel_selector/cl_kernels/reorder_kv_cache_ref.cl
...ugins/intel_gpu/src/kernel_selector/kernels/reorder_kv_cache/reorder_kv_cache_kernel_ref.cpp
build_jenkins
Is your pipeline only able to run on GPU, not CPU?
Fix past_len checking during initialization
Yes, our pipeline is required to be runnable on GPU.
Fix accuracy when there is no beam_idx; add past_key_len handling
@Kotomi-Du please fix the CI errors.
isanghao
left a comment
LGTM, minor comments are left
// readvalue --> any
// |              |
// |              v
// -------> kvcache
Could you elaborate on why/how it can be optimized?
If read_value is not optimized, we will get incorrect results from ScatterElementsUpdate, so some change here is needed.
The original code simply checks whether readvalue has a single user; to be honest, I don't know that this proves anything --- that user could itself be a no-op with multiple further users.
Judging from the comment in its caller, the check is actually trying to ensure that assign will not impact any following user of readvalue, so the original logic already looks questionable.
In any case, for our pattern, readvalue's users must eventually pass through kvcache before reaching assign, which makes the kvcache node the dominator of the assign node. It is therefore safe to treat readvalue as if it were directly connected to kvcache, and the optimization can be applied.
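The dominance argument above can be sketched with a small, self-contained check (illustrative only; this is not the plugin's actual implementation, and the node names are hypothetical): kvcache dominates assign if every path from readvalue to assign passes through kvcache.

```python
def dominates(graph, src, mid, dst):
    """True if every src->dst path in `graph` (adjacency dict) goes through `mid`."""
    def reaches(a, b, banned):
        stack, seen = [a], set()
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n in seen or n == banned:
                continue
            seen.add(n)
            stack.extend(graph.get(n, []))
        return False
    # dst must be reachable at all, but unreachable once `mid` is removed
    return reaches(src, dst, None) and not reaches(src, dst, mid)

g = {"read_value": ["gather"], "gather": ["kv_cache"], "kv_cache": ["assign", "sdpa"]}
print(dominates(g, "read_value", "kv_cache", "assign"))  # True
```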
Actually, my ask here was to add a comment explaining the "why/how". As it is not blocking the code merge, could you follow up in a separate PR?
isanghao
left a comment
LGTM, could you check this comment? #33114 (comment)
mryzhov
left a comment
Looks good from the Transformations perspective
Details:
This PR recognizes the pattern of ScatterElementsUpdate + Slice nodes (the blue nodes in the picture below) and fuses them into a multi-stage KVCache node. In addition, past_seq_len from the ONNX GQA operator, which serves to correct the length of the KV cache, was missing from the decomposition of the ONNX operator; it is added in this PR so that it benefits from the new KVCache capability.
After the fusion, two related changes take effect; the picture below shows the graph before and after the fusion.

Motivation and Context
The target application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering). This is because only a single branch of the speculative draft tree is accepted after verification.
The KV Cache API currently available in OV is very slow and cannot meet customer requirements (details in CVS-174809). As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph. This PR recognizes the pattern of those added nodes and fuses them into a multi-stage KVCache node for better performance.
Tickets:
CVS-176367
Related PR
#32708