[PagedAttention] Add bidirectional attention mask within image groups #34111
mlukasze merged 31 commits into openvinotoolkit:master from
Conversation
Signed-off-by: p-wysocki <przemyslaw.wysocki@intel.com>
    ov::element::i32,
    getInputShapeAtPort(PagedAttentionExecutor::ID_ADAPTIVE_RKV_DIVERSITY_BLOCK_SET_INDICES_BEGINS)));

// token_type_ids, i32, [B_token | 0] or [1, B_token]
As a common interface of PA, could you encode the ids input and decode it during execution?
For example, using [3, 9] to replace [0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,...]
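A minimal sketch of this encode/decode idea (the function names are hypothetical, not part of the PA interface): each contiguous run of image tokens is stored as an inclusive [begin, end] index pair, so {3, 9} stands for the dense vector with ones at indices 3..9.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Encode each contiguous run of image tokens (token_type_ids == 1)
// as an inclusive [begin, end] index pair.
std::vector<std::pair<int32_t, int32_t>> encode_image_segments(const std::vector<int32_t>& ids) {
    std::vector<std::pair<int32_t, int32_t>> segments;
    const int32_t n = static_cast<int32_t>(ids.size());
    for (int32_t i = 0; i < n;) {
        if (ids[i] == 1) {
            int32_t begin = i;
            while (i < n && ids[i] == 1) ++i;
            segments.emplace_back(begin, i - 1);  // inclusive end index
        } else {
            ++i;
        }
    }
    return segments;
}

// Decode back to the dense per-token representation during execution.
std::vector<int32_t> decode_image_segments(const std::vector<std::pair<int32_t, int32_t>>& segments,
                                           int32_t total_tokens) {
    std::vector<int32_t> ids(total_tokens, 0);
    for (const auto& seg : segments)
        for (int32_t i = seg.first; i <= seg.second; ++i)
            ids[i] = 1;
    return ids;
}
```

The trade-off discussed below is that the dense token_type_ids tensor already exists in the model, so a compact encoding would add a conversion step on both sides.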
It has been discussed that the token_type_ids input is already present in the model, so connecting it directly to the PA (as proposed in this PR) simplifies the whole logic.
mitruska left a comment
Great work on enabling token types functionality.
Let's minimize impact on existing PA logic and ensure that supported models are not affected.
...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp
size_t get_ncausal(size_t q_global_idx, size_t default_ncausal, size_t cur_kv_len) const {
    if (!_token_type || q_global_idx >= _image_group_end.size()) {
        return default_ncausal;
    }
This initial check is performed on every call of get_ncausal; with the proposed common updates it will be executed for every index calculation, even if the "token_type_ids" input is not provided.
Consider optimizing it so it does not affect existing logic.
Perf validation is needed to ensure supported models are not affected by these PA changes.
I made the job easier by providing a false check which should be quickly evaluated by the CPU's branch prediction, but one way or another, performance testing needs to be done. Is there a standard procedure for testing PA performance? cc @maxnick
As no perf regression was detected (shared in the ticket), I don't consider this a blocker.
But the same check is still repeated at each loop step, while the state does not change: it depends only on the provided inputs and remains the same for the same compiled model.
A function-pointer approach could be investigated as one possible option, to make the decision once per whole kernel invocation rather than at each element.
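A sketch of that function-pointer option (the struct, fields, and function names here are illustrative, not the actual plugin code): the policy is selected once per kernel invocation, so the hot loop calls through the pointer without re-testing whether token_type_ids was provided.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative state: whether token_type_ids was provided, and the
// precomputed end (exclusive) of each token's image group (0 for text tokens).
struct MaskState {
    bool has_token_type = false;
    std::vector<size_t> image_group_end;
};

using NCausalFn = size_t (*)(const MaskState&, size_t q_global_idx, size_t default_ncausal);

// Fast path: no token_type_ids input, plain causal mask.
static size_t ncausal_causal(const MaskState&, size_t, size_t default_ncausal) {
    return default_ncausal;
}

// Slow path: extend visibility to the end of the token's image group, if any.
static size_t ncausal_image_groups(const MaskState& s, size_t q_global_idx, size_t default_ncausal) {
    if (q_global_idx >= s.image_group_end.size())
        return default_ncausal;
    return std::max(default_ncausal, s.image_group_end[q_global_idx]);
}

// Decide once per kernel invocation, not at each element.
inline NCausalFn select_ncausal_fn(const MaskState& s) {
    return s.has_token_type ? ncausal_image_groups : ncausal_causal;
}
```

The same effect could also be achieved with a templated kernel specialized on a compile-time flag; the function pointer keeps it to a single compiled body.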
}

memory::ptr get_token_type_ids_memory() {
    std::vector<int> token_type_ids = { 0 };
token_type_ids is created as a 1-element tensor ({0}), but B_token in this test is often >1 (e.g., subsequence {10,0}), so this input is shape-inconsistent with the per-token semantics. To keep existing tests behavior unchanged, consider passing an empty tensor (shape {0}) to represent “not provided”, or otherwise generate token_type_ids with length equal to total query tokens (sum of subsequence_desc.num_tokens) and make its layout dynamic like other per-token inputs.
-    std::vector<int> token_type_ids = { 0 };
+    // Represent token_type_ids as "not provided" by using an empty tensor (shape {0})
+    std::vector<int> token_type_ids;
auto adaptive_rkv_evictable_sizes_layout = layout{ov::PartialShape{1}, data_types::f32, format::bfyx};
auto adaptive_rkv_diversity_block_set_indices_layout = layout{ov::PartialShape{1}, data_types::f32, format::bfyx};
auto adaptive_rkv_diversity_block_set_indices_begins_layout = layout{ov::PartialShape{1}, data_types::f32, format::bfyx};
auto token_type_ids_layout = layout{ov::PartialShape{1}, data_types::i32, format::bfyx};
token_type_ids_layout is hard-coded as PartialShape{1}. If this input is intended to be optional (empty shape) or to match the per-token dimension, using a fixed {1} shape can mask shape issues and differs from other dynamic per-token inputs. Consider using {0} for “not provided” or {-1} (and/or {1, -1}) if the test intends to model real token_type_ids.
-    auto token_type_ids_layout = layout{ov::PartialShape{1}, data_types::i32, format::bfyx};
+    auto token_type_ids_layout = layout{ov::PartialShape{-1}, data_types::i32, format::bfyx};
validate_inputs_count(op, {26});
auto inputs = p.GetInputInfo(op);
auto prim = cldnn::paged_attention(layer_type_name_ID(op), inputs);
This change updates the op to require 26 inputs, but the Intel GPU implementation does not appear to consume token_type_ids anywhere (no references to TOKEN_TYPE_IDS / token_type_ids under src/plugins/intel_gpu/src). If the PR’s goal is to enable bidirectional attention on GPU, the kernels/impl need to incorporate this new input; otherwise GPU behavior will remain unchanged despite the new required input.
-    get_input_size() == 25,
-    "PagedAttensionExtension expects 25 inputs, but it has ",
+    get_input_size() == 26,
+    "PagedAttensionExtension expects 26 inputs, but it has ",
Typo in the validation error message: PagedAttensionExtension → PagedAttentionExtension. Since this string was touched in this change, it’s a good opportunity to fix it for clearer diagnostics.
-    "PagedAttensionExtension expects 26 inputs, but it has ",
+    "PagedAttentionExtension expects 26 inputs, but it has ",
ov::Tensor token_type_tensor(ov::element::i32, {seq_len});
std::memcpy(token_type_tensor.data<int32_t>(), token_types.data(), seq_len * sizeof(int32_t));
std::memcpy is used here but the file doesn’t include <cstring>. Please add the missing standard header to avoid relying on transitive includes (which can break with different toolchains / build flags).
...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp
Conditional approval, as the change was confirmed to resolve the reported issue on the target model without detected regressions.
Refactoring PA and its growing number of inputs is a topic to be continued, but shouldn't block feature development.
@p-wysocki Please contribute to the PagedAttention specification PR, describing the token_type_ids input, as a suggested change:
...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp
        const PlainTensor& subsequence_begins,
        const PlainTensor& past_lens) {
    _has_image_tokens = true;
    _token_type = token_type;
    auto total_tokens = static_cast<int32_t>(token_type.m_dims[0]);
    _image_group_end.resize(total_tokens);

    auto seq_count = static_cast<int32_t>(past_lens.m_dims[0]);
    for (int32_t seq = 0; seq < seq_count; seq++) {
        auto seq_begin = subsequence_begins.ptr<int32_t>()[seq];
        auto seq_end = subsequence_begins.ptr<int32_t>()[seq + 1];
Is the subsequence_begins size ensured at this point?
It is accessed at [seq] and [seq + 1] here, so the final loop iteration reads subsequence_begins[seq_count], where seq_count = past_lens.m_dims[0].
It means subsequence_begins.size() >= (seq_count + 1) must be ensured.
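The invariant in question can be written down as a cheap precondition check (hypothetical helper name, shown over plain vectors rather than PlainTensor):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The loop reads subsequence_begins[seq] and subsequence_begins[seq + 1]
// for seq in [0, seq_count), so the last access is subsequence_begins[seq_count].
// Hence the input must contain at least seq_count + 1 elements.
bool subsequence_begins_in_bounds(const std::vector<int32_t>& subsequence_begins,
                                  size_t seq_count) {
    return subsequence_begins.size() >= seq_count + 1;
}
```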
The shape is validated here:
    for (size_t m = q_start; m < q_end; m++) {
        // apply attention mask & softmax
-        auto ncausal = (cur_kv_len - q_cnt + (m - q_start) + 1);
+        auto ncausal = get_ncausal(q_token_start + m, cur_kv_len - q_cnt + (m - q_start) + 1, cur_kv_len);
Related to my previous comment.
As no perf regression was detected at this point (shared in the ticket), I don't consider this a blocker.
But the first argument of get_ncausal is used only when the new token_type_ids input is provided, while q_token_start + m is calculated at each loop step even if unused.
Let's discuss offline
const auto adaptive_rkv_diversity_block_set_indices_begins =
    std::make_shared<op::v0::Parameter>(element::i32, PartialShape{5});

const auto token_type_ids = std::make_shared<op::v0::Parameter>(ov::element::i32, ov::Shape{0});
Only the existing type prop test cases have been updated, presenting a dummy token_type_ids input with empty shape. It would be good to also add tests with examples of valid inputs and shapes.
Details:
- token_type_ids classifies tokens (image vs text): https://github.com/huggingface/transformers/blob/a6ef2a6f3549dd3267e8f5bafe4976a3217784bb/src/transformers/models/gemma3/modular_gemma3.py#L783-L796
- The PA kernel reads token_type_ids and calculates the bidirectional image attention, modifying the causal mask behavior. The behavior is unchanged when token_type_ids is not given.

Tests:

Questions/TODO:
- Is Gemma3 using token_type_ids in a different way? Gemma3 docs don't seem exactly aligned with what the tokenizer implements.

Tickets:
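As a standalone illustration of the behavior this PR enables (a sketch, not the actual kernel code): tokens with token_type_ids == 1 attend bidirectionally within their contiguous image group, while text tokens keep the plain causal mask. For each query index we compute ncausal, the number of visible key positions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<size_t> compute_ncausal(const std::vector<int32_t>& token_type_ids) {
    const size_t n = token_type_ids.size();
    // For each token, find the end (exclusive) of its image group; 0 for text tokens.
    std::vector<size_t> group_end(n, 0);
    size_t i = 0;
    while (i < n) {
        if (token_type_ids[i] == 1) {
            size_t j = i;
            while (j < n && token_type_ids[j] == 1) ++j;
            for (size_t k = i; k < j; ++k)
                group_end[k] = j;
            i = j;
        } else {
            ++i;
        }
    }
    std::vector<size_t> ncausal(n);
    for (size_t q = 0; q < n; ++q) {
        size_t causal = q + 1;                        // standard causal visibility
        ncausal[q] = std::max(causal, group_end[q]);  // extend within the image group
    }
    return ncausal;
}
```

For {0, 1, 1, 1, 0} this yields {1, 4, 4, 4, 5}: the three image tokens all see positions 0..3, so each image token also attends to image tokens that come after it, while both text tokens stay strictly causal.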