[GPU] Fix Gemma4-E4B SDPA model #35642

Open
Lyamin-Roman wants to merge 2 commits into openvinotoolkit:master from Lyamin-Roman:sdpa_igpu_fix

Conversation

@Lyamin-Roman
Contributor

Details:

This PR fixes two consecutive errors related to inference of the SDPA version of the Gemma4-E4B model on iGPU.

  1. In this model, a single KVCache is shared between multiple SDPA layers.
    As a result, problems arise in SyncInferRequest::allocate_states: a VariableStateIndirectKVCacheCompressed is created when one kv_cache_prim is compressed and another is not, which leads to output mismatches in the plugin.
    Currently this only affects iGPUs with supports_immad=false, so we can temporarily fix it by disabling KVCacheCompression for such a graph, which removes the potential mismatches.
    This will need to be reworked when PA models are enabled.

  2. Out-of-bounds SLM access in sdpa_opt finalization kernel.
    The finalization stage allocated tmp_slm[SUBGROUP_SIZE] elements for cross-subgroup reduction, but the actual number of subgroups per workgroup is SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR, SUBGROUP_SIZE).
    When V_HEAD_SIZE > SUBGROUP_SIZE^2 (e.g. head_size=512, SUBGROUP_SIZE=16 gives SUBGROUPS_PER_WG=32 > 16), tmp_slm[sgid] writes go out of bounds.
    So the tmp_slm allocation was changed to SUBGROUPS_PER_WG elements, and the single-pass lane-indexed reduction was replaced with a folded loop over CEIL_DIV(SUBGROUPS_PER_WG, SUBGROUP_SIZE) iterations, which correctly reduces across all subgroups regardless of head size.

    The issue is hard to reproduce in a standalone test; apparently more memory interactions are needed to trigger it, as happens during inference of the whole model.

AI Assistance:

  • AI assistance used: yes

@Lyamin-Roman Lyamin-Roman added this to the 2026.2 milestone May 1, 2026
@Lyamin-Roman Lyamin-Roman requested review from a team as code owners May 1, 2026 21:48
@Lyamin-Roman Lyamin-Roman added the category: GPU OpenVINO GPU plugin label May 1, 2026
manager.register_pass<ov::pass::GLUFusion>();
manager.register_pass<ov::intel_gpu::IndirectKVCache>();

if (!has_shared_kv_cache_vars(func)) {
Contributor

As this only affects iGPU with supports_immad = false, you may skip the check if supports_immad = true.

Contributor Author


This transformation is enabled only in the following cases:

  1. is_paged_attention_model
  2. info.supports_immad == false
  3. auxiliary_kv_update_model

And when a model with PA appears, I think the same problem will occur, so it is better to temporarily disable this transformation entirely for the unsupported graph.

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp
Comment thread src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa_opt.cl

Labels

category: GPU OpenVINO GPU plugin

3 participants