[GPU] Fix Gemma4-E4B SDPA model #35642

Open
Lyamin-Roman wants to merge 2 commits into openvinotoolkit:master from Lyamin-Roman:sdpa_igpu_fix

Conversation

@Lyamin-Roman
Contributor

Details:

This PR fixes two consecutive errors related to inference of the SDPA version of the Gemma4-E4B model on iGPU.

  1. In this model, a single KVCache is shared between multiple SDPA layers.
    As a result, problems arise in SyncInferRequest::allocate_states: a VariableStateIndirectKVCacheCompressed is created when one kv_cache_prim is compressed and another is not, which leads to output mismatches in the plugin.
    Currently this only affects iGPUs with supports_immad=false, so we can temporarily fix it by disabling KVCacheCompression for such a graph, which removes the potential mismatches.
    This will need to be reworked when PA models are enabled.

  2. Out-of-bounds SLM access in sdpa_opt finalization kernel.
    The finalization stage allocated tmp_slm[SUBGROUP_SIZE] elements for cross-subgroup reduction, but the actual number of subgroups per workgroup is SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR, SUBGROUP_SIZE).
    When V_HEAD_SIZE > SUBGROUP_SIZE^2 (e.g. head_size=512, SUBGROUP_SIZE=16 gives SUBGROUPS_PER_WG=32 > 16), tmp_slm[sgid] writes go out of bounds.
    So the tmp_slm allocation was changed to SUBGROUPS_PER_WG elements, and the single-pass lane-indexed reduction was replaced with a folded loop over CEIL_DIV(SUBGROUPS_PER_WG, SUBGROUP_SIZE) iterations, which correctly reduces across all subgroups regardless of head size.

    The issue is hard to reproduce in a standalone test; apparently more memory interactions are needed to trigger it, as happens during inference of the whole model.

AI Assistance:

  • AI assistance used: yes

@Lyamin-Roman Lyamin-Roman added this to the 2026.2 milestone May 1, 2026
@Lyamin-Roman Lyamin-Roman requested review from a team as code owners May 1, 2026 21:48
@Lyamin-Roman Lyamin-Roman added the category: GPU OpenVINO GPU plugin label May 1, 2026
manager.register_pass<ov::pass::GLUFusion>();
manager.register_pass<ov::intel_gpu::IndirectKVCache>();

if (!has_shared_kv_cache_vars(func)) {
Contributor

As this only affects iGPU with supports_immad = false, you may skip the check if supports_immad = true.

Contributor Author


This transformation is enabled only in the following cases:

  1. is_paged_attention_model
  2. info.supports_immad == false
  3. auxiliary_kv_update_model

And when a model with PA appears, I think the same problem will occur, so it is better to temporarily disable this transformation entirely for the unsupported graph.

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp
Comment thread src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa_opt.cl

Labels

category: GPU OpenVINO GPU plugin

3 participants