[GPU] Fix Gemma4-E4B SDPA model #35642
Open
Lyamin-Roman wants to merge 2 commits into openvinotoolkit:master from
Conversation
clee30 reviewed May 4, 2026
```cpp
manager.register_pass<ov::pass::GLUFusion>();
manager.register_pass<ov::intel_gpu::IndirectKVCache>();

if (!has_shared_kv_cache_vars(func)) {
```
Contributor
As this only affects iGPU with `supports_immad = false`, you may skip the check if `supports_immad = true`.
Contributor
Author
This transformation is enabled only in the following cases:
- is_paged_attention_model
- info.supports_immad == false
- auxiliary_kv_update_model
And when a model with PA appears, I think the same problem will show up there as well, so it is better to temporarily disable this transformation altogether for the unsupported graph.
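For context, here is a minimal sketch of what the `has_shared_kv_cache_vars` check in the diff above could look like. The helper name comes from the diff; the traversal below is an assumption built on the public OpenVINO graph API rather than the actual plugin implementation (by this stage the plugin may already have rewritten `ReadValue` into its internal `KVCache` ops). The idea is simply to count, per state variable, how many consumers read its value and report sharing when any variable has more than one reader:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"
#include "openvino/op/util/read_value_base.hpp"

// Hypothetical sketch: a KV cache variable is "shared" when the value read
// from it feeds more than one consumer (e.g. several SDPA layers).
bool has_shared_kv_cache_vars(const std::shared_ptr<const ov::Model>& func) {
    std::unordered_map<std::string, size_t> readers_per_var;
    for (const auto& op : func->get_ops()) {
        if (auto read_value = ov::as_type_ptr<ov::op::util::ReadValueBase>(op)) {
            // Every target input attached to the ReadValue output is one consumer.
            readers_per_var[read_value->get_variable_id()] +=
                read_value->get_output_target_inputs(0).size();
        }
    }
    for (const auto& [var_id, readers] : readers_per_var) {
        if (readers > 1)
            return true;  // one variable consumed by multiple layers
    }
    return false;
}
```

When this returns true, the pass manager in the diff above simply skips the compression-related registration, which matches the PR's approach of disabling `KVCacheCompression` for such graphs.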
e-ddykim reviewed May 4, 2026
Details:
This PR fixes two consecutive errors related to inference of the Gemma4-E4B SDPA-version model on iGPU.

1. In this model, a single `KVCache` is shared between multiple `SDPA` layers. Problems therefore arise in `SyncInferRequest::allocate_states`: a `VariableStateIndirectKVCacheCompressed` is created when one `kv_cache_prim` is compressed and the other is not, which leads to output mismatches in the plugin. Currently this only affects iGPUs with `supports_immad = false`, so we can temporarily fix it by disabling `KVCacheCompression` for such a graph, which removes the potential mismatches. This will need to be redone when PA models are enabled.

2. Out-of-bounds SLM access in the `sdpa_opt` finalization kernel. The finalization stage allocated `tmp_slm` with `SUBGROUP_SIZE` elements for the cross-subgroup reduction, but the actual number of subgroups per workgroup is `SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR, SUBGROUP_SIZE)`. When `V_HEAD_SIZE > SUBGROUP_SIZE^2` (e.g. `head_size = 512` with `SUBGROUP_SIZE = 16` gives `SUBGROUPS_PER_WG = 32 > 16`), the `tmp_slm[sgid]` writes go out of bounds. The `tmp_slm` allocation is therefore changed to `SUBGROUPS_PER_WG` elements, and the single-pass lane-indexed reduction is replaced with a folded loop over `CEIL_DIV(SUBGROUPS_PER_WG, SUBGROUP_SIZE)` iterations, correctly reducing across all subgroups regardless of head size; a sketch of the indexing follows below. Reproducing the issue with a unit test is problematic: apparently more memory traffic is needed to trigger it, as happens during inference of the whole model.
AI Assistance: