Commit b556f14

Support Gemma4 model (openvinotoolkit#3644)
## Description

Depends on: huggingface/optimum-intel#1688. That optimum-intel PR depends on transformers v5 (**update**: transformers v5 support has been merged to optimum-intel).

### WWB Accuracy

- genai vs optimum-intel: 0.9682357
- genai vs transformers: 0.94821364
- optimum-intel vs transformers: 0.9387633

Fixes: openvinotoolkit#3653

The current implementation supports image and text inputs only. Ticket for video support implementation: 185850.

## Checklist:

- [x] This PR follows [GenAI Contributing guidelines](https://github.com/openvinotoolkit/openvino.genai?tab=contributing-ov-file#contributing).
- [x] Tests have been updated or added to cover the new code.
- [x] This PR fully addresses the ticket.
- [x] I have made corresponding changes to the documentation.
1 parent aeb8f62 commit b556f14

21 files changed

Lines changed: 601 additions & 80 deletions


.github/workflows/linux.yml

Lines changed: 8 additions & 0 deletions
```diff
@@ -652,6 +652,14 @@ jobs:
             python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "qwen3-vl"
           run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
           timeout: 60
+        - name: 'VLM (gemma4)'
+          cmd: |
+            python -m pip install --no-deps git+https://github.com/huggingface/optimum-intel.git@ff99d6e13774841bdd17ac0d4c8bd2d181cf7c27 # PR 1688
+            python -m pip install transformers==5.5.0
+            pip show transformers optimum-intel openvino_tokenizers openvino_genai
+            python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "gemma4"
+          run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
+          timeout: 60
     defaults:
       run:
         shell: bash
```

.github/workflows/manylinux_2_28.yml

Lines changed: 9 additions & 0 deletions
```diff
@@ -575,6 +575,15 @@ jobs:
             python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "qwen3-vl"
           run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
           timeout: 60
+        - name: 'VLM (gemma4)'
+          cmd: |
+            python -m pip install --no-deps git+https://github.com/huggingface/optimum-intel.git@ff99d6e13774841bdd17ac0d4c8bd2d181cf7c27 # PR 1688
+            python -m pip install transformers==5.5.0
+            pip show transformers optimum-intel openvino_tokenizers openvino_genai
+            python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "gemma4"
+          run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
+          timeout: 60
+
     defaults:
       run:
         shell: bash
```

.github/workflows/windows.yml

Lines changed: 8 additions & 0 deletions
```diff
@@ -740,6 +740,14 @@ jobs:
             python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "qwen3-vl"
           run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
           timeout: 60
+        - name: 'VLM (gemma4)'
+          cmd: |
+            python -m pip install --no-deps git+https://github.com/huggingface/optimum-intel.git@ff99d6e13774841bdd17ac0d4c8bd2d181cf7c27 # PR 1688
+            python -m pip install transformers==5.5.0
+            pip show transformers optimum-intel openvino_tokenizers openvino_genai
+            python -m pytest -s -v tests/python_tests/test_vlm_pipeline.py --override-ini cache_dir=/mount/caches/pytest/ -k "gemma4"
+          run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).visual_language.test }}
+          timeout: 60
     defaults:
       run:
         shell: pwsh
```

site/docs/supported-models/_components/vlm-models-table/models.ts

Lines changed: 12 additions & 0 deletions
```diff
@@ -193,4 +193,16 @@ export const VLM_MODELS: VLMModelType[] = [
       },
     ],
   },
+  {
+    architecture: 'Gemma4ForConditionalGeneration',
+    models: [
+      {
+        name: 'gemma4',
+        links: [
+          'https://huggingface.co/google/gemma-4-E2B-it',
+          'https://huggingface.co/google/gemma-4-E4B-it',
+        ],
+      },
+    ],
+  },
 ];
```

site/docs/supported-models/index.mdx

Lines changed: 6 additions & 0 deletions
```diff
@@ -87,6 +87,12 @@ Apply https://huggingface.co/microsoft/Phi-4-multimodal-instruct/discussions/78/
 2. Visual history is not preserved across rounds, so multi-turn interactions have limited visual context.
 3. If the number of input frames is not divisible by `mm_local_num_frames` (as defined in `config.json`), additional frames will be automatically padded by duplicating the last frame. For example, if there are 10 frames and `mm_local_num_frames = 4`, it will be padded to 12 frames.
 
+#### Gemma4 {#gemma4-notes}
+
+Gemma4 implementation supports text and image inputs only. Video input is not supported at the moment.
+
+The model requires `transformers==5.5.0` for the export with `optimum-cli`.
+
 #### Qwen3-VL {#qwen3_vl-notes}
 
 The model requires `transformers>=4.57` for the export with `optimum-cli`.
```
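As an aside on the padding rule restated in the context lines of this diff (the frame count is padded up to a multiple of `mm_local_num_frames` by duplicating the last frame), a minimal sketch of that arithmetic; `pad_frames` is a hypothetical helper, not part of the GenAI API:

```python
def pad_frames(frames, mm_local_num_frames):
    """Pad a frame list up to a multiple of mm_local_num_frames by
    duplicating the last frame, as the documented rule describes."""
    remainder = len(frames) % mm_local_num_frames
    if remainder == 0:
        return list(frames)
    return list(frames) + [frames[-1]] * (mm_local_num_frames - remainder)

# The documented example: 10 frames, mm_local_num_frames = 4 -> 12 frames.
frames = [f"frame{i}" for i in range(10)]
print(len(pad_frames(frames, 4)))  # 12
```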

src/cpp/src/lm_encoding.cpp

Lines changed: 4 additions & 1 deletion
```diff
@@ -86,7 +86,8 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
     std::optional<int64_t> rope_delta,
     const size_t max_kv_cache_size,
     const bool use_intermediate_remote_tensor,
-    const std::unordered_map<std::string, ov::Tensor>& lm_extra_inputs
+    const std::unordered_map<std::string, ov::Tensor>& lm_extra_inputs,
+    std::function<ov::Tensor(const ov::Tensor& new_input_ids)> per_layer_embeddings_callback
 ) {
     std::vector<GenerationHandle> generations;
     for (SequenceGroup::Ptr sequence_group : sequence_groups) {
@@ -261,6 +262,8 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
                 ov::Tensor new_visual_pos_masks{tensor.get_element_type(), {batch_size, 1}};
                 std::fill_n(new_visual_pos_masks.data<bool>(), new_visual_pos_masks.get_size(), false);
                 m_llm.set_tensor(name, new_visual_pos_masks);
+            } else if (name == "per_layer_inputs" && per_layer_embeddings_callback) {
+                m_llm.set_tensor(name, per_layer_embeddings_callback(new_input_ids));
             }
         }
     } else {
```
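The diff above threads an optional callback into the decode loop: when the LM declares a `per_layer_inputs` input and a callback was supplied, the tensor for that input is produced by the callback from the freshly sampled token ids. A rough Python sketch of the same dispatch pattern (the `decode_step` function and the embedding table are hypothetical, with no OpenVINO dependency):

```python
from typing import Callable, Optional

def decode_step(model_inputs: dict,
                extra_input_names: list,
                new_input_ids: list,
                per_layer_embeddings_callback: Optional[Callable] = None) -> dict:
    """Mimics the dispatch in the loop above: for each extra input the
    model declares, 'per_layer_inputs' is filled by the optional callback
    from the newly sampled token ids; other names are left untouched."""
    for name in extra_input_names:
        if name == "per_layer_inputs" and per_layer_embeddings_callback:
            model_inputs[name] = per_layer_embeddings_callback(new_input_ids)
    return model_inputs

# Hypothetical callback: look up one per-layer embedding row per token id.
table = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.4]}
inputs = decode_step({}, ["per_layer_inputs"], [1, 2],
                     per_layer_embeddings_callback=lambda ids: [table[i] for i in ids])
print(inputs["per_layer_inputs"])  # [[0.1, 0.2], [0.3, 0.4]]
```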

src/cpp/src/lm_encoding.hpp

Lines changed: 7 additions & 7 deletions
```diff
@@ -3,15 +3,16 @@
 
 #pragma once
 
+#include <functional>
 #include <optional>
+
 #include "openvino/genai/llm_pipeline.hpp"
-#include "visual_language/embedding_model.hpp"
 #include "sampling/sampler.hpp"
+#include "visual_language/embedding_model.hpp"
 
 namespace ov {
 namespace genai {
 
-
 ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
     ov::InferRequest& m_llm,
     const ov::Tensor& input_ids,
@@ -26,13 +27,12 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
     std::optional<int64_t> rope_delta = std::nullopt,
     const size_t max_kv_cache_size = std::numeric_limits<size_t>::max(),
     const bool use_intermediate_remote_tensor = true,
-    const std::unordered_map<std::string, ov::Tensor>& lm_extra_inputs = {});
-
+    const std::unordered_map<std::string, ov::Tensor>& lm_extra_inputs = {},
+    std::function<ov::Tensor(const ov::Tensor& new_input_ids)> per_layer_embeddings_callback = nullptr);
 
 void align_cache_and_history(const ov::Tensor& new_chat_tokens, utils::CacheState& cache_state);
 
-
 TokenizedInputs get_chat_encoded_input(const ov::Tensor& new_chat_tokens, utils::CacheState& cache_state);
 
-}
-}
+}  // namespace genai
+}  // namespace ov
```

src/cpp/src/utils.cpp

Lines changed: 2 additions & 0 deletions
```diff
@@ -239,6 +239,8 @@ ProcessorConfig from_any_map(
     read_anymap_param(config_map, "max_slice_nums", extracted_config.max_slice_nums);
     read_anymap_param(config_map, "norm_mean", extracted_config.norm_mean);
     read_anymap_param(config_map, "norm_std", extracted_config.norm_std);
+    read_anymap_param(config_map, "pooling_kernel_size", extracted_config.pooling_kernel_size);
+    read_anymap_param(config_map, "max_soft_tokens", extracted_config.max_soft_tokens);
     return extracted_config;
 }
```
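`read_anymap_param` only overwrites the target field when the key is present in the map, leaving the field's existing default otherwise. A rough Python equivalent of that behavior (the default values below are hypothetical, not the real `ProcessorConfig` defaults):

```python
def read_anymap_param(config_map: dict, name: str, current):
    """Return config_map[name] if the key exists, else keep the current value,
    mirroring the keep-default-when-absent semantics above."""
    return config_map.get(name, current)

class ProcessorConfig:
    pooling_kernel_size = 2   # hypothetical default
    max_soft_tokens = 256     # hypothetical default

cfg = ProcessorConfig()
config_map = {"max_soft_tokens": 512}  # only one key supplied
cfg.pooling_kernel_size = read_anymap_param(config_map, "pooling_kernel_size", cfg.pooling_kernel_size)
cfg.max_soft_tokens = read_anymap_param(config_map, "max_soft_tokens", cfg.max_soft_tokens)
print(cfg.pooling_kernel_size, cfg.max_soft_tokens)  # 2 512
```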

src/cpp/src/visual_language/gemma3/classes.cpp

Lines changed: 2 additions & 0 deletions
```diff
@@ -106,6 +106,8 @@ NormalizedPrompt InputsEmbedderGemma3::normalize_prompt(const std::string& promp
         }
         expanded_tag += end_of_image + "\n\n";
 
+        // fixme: there seems to be an issue with how image_token is replaced. unified_prompt.find needs search_offset.
+        // refer to gemma4 implementation.
         unified_prompt.replace(unified_prompt.find(start_of_image), start_of_image.length(), expanded_tag);
     }
     return {std::move(unified_prompt), std::move(images_sequence), {}};
```
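The `fixme` comment above flags a real pitfall: `find` without a search offset always scans from the start of the string, so with several image tags in one prompt, or an expansion that itself contains the tag text, the wrong occurrence can be matched and replaced. A small Python sketch of the offset-tracking approach the comment refers to (tag and expansion strings here are hypothetical):

```python
def expand_tags(prompt: str, tag: str, expansions: list) -> str:
    """Replace successive occurrences of `tag` with per-image expansions,
    advancing a search offset so already-replaced text is never rescanned."""
    search_offset = 0
    for expanded in expansions:
        pos = prompt.find(tag, search_offset)
        if pos == -1:
            break
        prompt = prompt[:pos] + expanded + prompt[pos + len(tag):]
        # Continue the next search after the text we just inserted.
        search_offset = pos + len(expanded)
    return prompt

print(expand_tags("<img> and <img>", "<img>", ["[IMG0]", "[IMG1]"]))
# [IMG0] and [IMG1]
```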
