Closed

135 commits
6c49dc8
Avoid to do resize for same width and height images.
xipingyan Jul 30, 2025
c7d9932
Enable video process for qwen*-vl
xipingyan Jul 30, 2025
2ee043f
Add python interface: generate config: is_video, default false.
xipingyan Jul 31, 2025
29c74fd
fallback video_encode to image encode in base class.
xipingyan Aug 5, 2025
78dac29
Update calc target image size.
xipingyan Aug 5, 2025
7b2c115
Reduce shared codes, fallback to image process via return empty vector;
xipingyan Aug 5, 2025
10d8e8d
1: remove is_video,
xipingyan Aug 9, 2025
a3000d4
Update src/cpp/src/visual_language/llava/classes.cpp
xipingyan Sep 11, 2025
062fc40
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 12, 2025
4d8375d
Update src/cpp/src/visual_language/pipeline.cpp
xipingyan Sep 12, 2025
ef9f868
rename according to copilot suggestion
xipingyan Sep 12, 2025
ad95828
Merge branch 'xp/enable_qwen_vl_video_preprocess' of https://github.c…
xipingyan Sep 12, 2025
f92b19b
rename rgbs to images
xipingyan Sep 12, 2025
66cdf38
enable if node to unify image and video preprocess.
xipingyan Sep 15, 2025
3eda036
cpp preprocess: enable video preprecess.
xipingyan Sep 15, 2025
3df267f
Pass same_images
xipingyan Sep 15, 2025
bf3169b
add commments for same image
xipingyan Sep 15, 2025
e1250aa
Update loop condition, and rename variables.
xipingyan Sep 16, 2025
fe0ab92
Update src/cpp/src/visual_language/pipeline_base.hpp
xipingyan Sep 16, 2025
dec67b2
video should be frames.
xipingyan Sep 16, 2025
caee3fd
Add pytest for video input.
xipingyan Sep 16, 2025
6a49a48
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 16, 2025
800638e
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
peterchen-intel Sep 17, 2025
1502b28
Remove is_video python attribute.
xipingyan Sep 17, 2025
4d8e867
rename video to videos
xipingyan Sep 17, 2025
ea7fc94
Update docs, and add video for add_request.
xipingyan Sep 17, 2025
60364bf
Fix docs format.
xipingyan Sep 17, 2025
4ea5b3d
Fix test error: can't catch exception.
xipingyan Sep 18, 2025
8a0ab2e
Fix: cannot be narrowed from type 'int' to 'float' in initializer list
xipingyan Sep 18, 2025
28337ea
Support no image or video input;
xipingyan Sep 18, 2025
f3fd7d4
Add checking input for python api.
xipingyan Sep 18, 2025
a80d28e
cpp interface: generate, remove video. add is_video, default false
xipingyan Sep 18, 2025
6ab0a35
update get_inputs_embeds_with_token_type_ids and get_inputs_embeds, i…
xipingyan Sep 18, 2025
c531982
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Sep 18, 2025
dc30ec1
update pyi interface of generate.
xipingyan Sep 19, 2025
5edf0a5
Remove "const bool& is_video" in add_request and generate.
xipingyan Sep 24, 2025
2215f8a
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Sep 25, 2025
14352a7
Update src/python/openvino_genai/py_openvino_genai.pyi
xipingyan Sep 25, 2025
89afa54
copilot give a wrong suggestion. add images and video param for add_r…
xipingyan Sep 25, 2025
3b5c6cd
Merge remote-tracking branch 'origin/master' into xp/enable_qwen_vl_v…
xipingyan Sep 25, 2025
8768795
Add examples to .md
xipingyan Sep 25, 2025
be57bf2
Fix test video error, and input multiple images.
xipingyan Sep 25, 2025
d96c5dd
Update test based on 4D video.
xipingyan Sep 26, 2025
aaf20b0
Add vlm test dependency: opencv-python
xipingyan Sep 27, 2025
a2ad61b
Merge remote-tracking branch 'origin/master' into xp/enable_qwen_vl_v…
xipingyan Sep 27, 2025
6f5189b
Enable mix video and image input.
xipingyan Sep 27, 2025
c0829a3
split encode_images into encode_images and encode_video
xipingyan Sep 28, 2025
f25770b
Remove:
xipingyan Sep 28, 2025
72c621b
1: Add <video_pad> placeholder,
xipingyan Sep 28, 2025
132b228
Update position_ids after enable video.
xipingyan Sep 29, 2025
8c0e13d
add video histry id.
xipingyan Sep 30, 2025
64ba684
Update src/cpp/include/openvino/genai/visual_language/pipeline.hpp
xipingyan Sep 30, 2025
bbbef65
Merge branch 'xp/enable_qwen_vl_video_preprocess' of https://github.c…
xipingyan Sep 30, 2025
6e33dcf
Rename video to videos, reducing confusion.
xipingyan Sep 30, 2025
6bf63de
Remove useless header.
xipingyan Sep 30, 2025
eb4faea
Update video-> videos in Readme
xipingyan Sep 30, 2025
123221b
all video -> videos
xipingyan Sep 30, 2025
515c911
Call images when the models not implement video process.
xipingyan Sep 30, 2025
cf58265
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 7, 2025
7c9a220
Update test for video input.
xipingyan Oct 7, 2025
28242fe
Add test: CB+Add_request.
xipingyan Oct 7, 2025
5e637df
Add test: comparing with optimum. but result is different.
xipingyan Oct 8, 2025
ef752e2
cb+add_request test pass.
xipingyan Oct 8, 2025
b6a87e5
vlm pipeline vs optimum.
xipingyan Oct 8, 2025
dfbd850
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
peterchen-intel Oct 9, 2025
1d810b6
Apply suggestion from @Copilot
xipingyan Oct 9, 2025
01bbd49
Revert useless update.
xipingyan Oct 9, 2025
274108e
merge master submodule
xipingyan Oct 9, 2025
dff97c1
clarify frames data layout.
xipingyan Oct 9, 2025
5cac72e
Fix bug: only pass images trigger crash.
xipingyan Oct 9, 2025
0277392
1: Add macro to disable "if" Node.
xipingyan Oct 9, 2025
fe5e709
pass video for sdpa backend.
xipingyan Oct 10, 2025
1ad75e6
Comparing with Optimum, test pass.
xipingyan Oct 10, 2025
0cd42e4
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 10, 2025
1cdee9d
update genai.pyi, rename video to vidoes.
xipingyan Oct 10, 2025
06c029e
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 10, 2025
584c546
Fix ci issues
xipingyan Oct 10, 2025
a77ce48
pass video for add request.
xipingyan Oct 10, 2025
890bc03
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 10, 2025
29e8b27
Add docstring and some comments based on copilot's suggestion.
xipingyan Oct 10, 2025
4187890
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 11, 2025
02feed7
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 11, 2025
9e25869
encode token separately, based on copilot's suggestion.
xipingyan Oct 11, 2025
a06d7f9
Update src/cpp/src/visual_language/vision_encoder.hpp
xipingyan Oct 11, 2025
654233f
Update src/cpp/src/visual_language/vision_encoder.hpp
xipingyan Oct 11, 2025
c8b4b2d
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 11, 2025
0596f57
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Oct 11, 2025
52e4971
Update src/cpp/src/visual_language/inputs_embedder.cpp
xipingyan Oct 11, 2025
1ff0b7c
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Oct 11, 2025
243c4f8
Rename NormlizedPrompt to NormalizedPrompt
xipingyan Oct 11, 2025
4b98644
Update src/python/py_continuous_batching_pipeline.cpp
xipingyan Oct 11, 2025
0885a63
1: if condition node "same_image" is confuse, just rename to cond_img…
xipingyan Oct 11, 2025
6fe7290
Merge remote-tracking branch 'origin/master' into xp/enable_qwen_vl_v…
xipingyan Oct 11, 2025
95c208b
Fix bugs after merging master.
xipingyan Oct 11, 2025
08e0967
Remove duplicated add_request.
xipingyan Oct 12, 2025
42bfef9
Remove test: test_vlm_pipeline_match_optimum_video_input
xipingyan Oct 12, 2025
d61dd0b
Fix crash issue when input empty images and video.
xipingyan Oct 13, 2025
87d0312
Because Qwen-VL patch all processing to model, so only add a limitati…
xipingyan Oct 13, 2025
8fcb856
simplify codes based on comments.
xipingyan Oct 14, 2025
b08b53b
Move encode vision placeholder to construct function.
xipingyan Oct 14, 2025
28c8806
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 14, 2025
adc0a8b
Reuse: llm_grid_sz, reduce compute.
xipingyan Oct 14, 2025
09103a8
Fix position_ids calc bug.
xipingyan Oct 14, 2025
e4ac053
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 14, 2025
3a5af9a
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 14, 2025
ee96e29
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
peterchen-intel Oct 15, 2025
7ea1fc3
1: Check images and video separately.
xipingyan Oct 15, 2025
301a86a
Local test pass. test_vlm_pipeline_match_optimum_preresized for model…
xipingyan Oct 15, 2025
efa3f36
next position idx should be considered based on each dim.(3D)
xipingyan Oct 16, 2025
4c839aa
Load tokens_per_second from config.json
xipingyan Oct 16, 2025
1524c56
1:Remove python depends
xipingyan Oct 16, 2025
3b5ecf9
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 16, 2025
c877d9a
video and image order of QWen2.5-VL's implatmenation:
xipingyan Oct 16, 2025
2994330
1: Remove video_frames_features, reuse video_feautures;
xipingyan Oct 17, 2025
5e00877
Update get_window_index after split video and image processing.
xipingyan Oct 17, 2025
2be9349
Merge branch 'master' into xp/enable_qwen_vl_video_preprocess
xipingyan Oct 18, 2025
00a32ba
Fix ci issue after merging master.
xipingyan Oct 18, 2025
d73a110
Fix cmake error: arithmetic on a pointer to void
xipingyan Oct 18, 2025
ce13b59
Keep align naming for video_embed_idx and image_embed_idx
xipingyan Oct 18, 2025
3220562
Video should be put to ahead of image in GenAI implementation.
xipingyan Oct 18, 2025
c476670
Update tests/python_tests/test_vlm_pipeline.py
xipingyan Oct 18, 2025
db74b11
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 18, 2025
1cc533c
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 18, 2025
1acd4e6
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 18, 2025
946834c
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 18, 2025
c9bdc89
Fix spelling error.
xipingyan Oct 18, 2025
9ac9fa1
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 18, 2025
44c8193
Fix void* cast issue.
xipingyan Oct 18, 2025
e660021
Fix calc product bug when input is empty.
xipingyan Oct 18, 2025
27aa591
reserve image_pad_token number for better performance.
xipingyan Oct 19, 2025
b400a64
Update src/cpp/src/visual_language/qwen2vl/classes.cpp
xipingyan Oct 20, 2025
03e4b9e
Update src/cpp/src/visual_language/qwen2_5_vl/classes.cpp
xipingyan Oct 20, 2025
2091bb9
Removed the duplicated parts.
xipingyan Oct 21, 2025
4f2bf4b
kcz/add_mp4_disabling_into_frames
krzyczar Oct 8, 2025
dbf9d69
after Sofya's review
krzyczar Nov 6, 2025
13 changes: 13 additions & 0 deletions README.md
@@ -157,6 +157,13 @@ image_data = ov.Tensor(image_data)

prompt = "Can you describe the image?"
result = pipe.generate(prompt, image=image_data, max_new_tokens=100)

+# To input multiple images, use 'images='
+# result = pipe.generate(prompt, images=[image_data], max_new_tokens=100)
+
+# To input video frames, use 'videos='; frames_data layout = [frame num, H, W, C]
+# result = pipe.generate(prompt, videos=[frames_data], max_new_tokens=100)

print(result.texts[0])
```

@@ -178,6 +185,12 @@ int main(int argc, char* argv[]) {
ov::genai::image(rgb),
ov::genai::max_new_tokens(100)
) << '\n';

+// To input multiple images, use 'images'
+// pipe.generate(prompt, ov::genai::images(std::vector<ov::Tensor>{rgb}), ov::genai::max_new_tokens(100));
+
+// To input video frames, use 'videos'
+// pipe.generate(prompt, ov::genai::videos(std::vector<ov::Tensor>{frames}), ov::genai::max_new_tokens(100));
}
```

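The `videos=` usage added to the README above can be sketched with a runnable snippet. Only the NumPy frame stacking is executable here; the pipeline call is shown commented out because it needs a converted model, and the model path is hypothetical.

```python
import numpy as np

# A video is passed as one 4D tensor laid out as [frame num, H, W, C],
# e.g. eight 224x224 RGB frames stacked along the first axis.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(8)]
frames_data = np.stack(frames, axis=0)

# With a real model the call would look like (hypothetical model path):
# import openvino_genai as ov_genai
# pipe = ov_genai.VLMPipeline("./qwen2.5-vl-ov", "CPU")
# result = pipe.generate("Describe the clip.", videos=[frames_data], max_new_tokens=100)
```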
6 changes: 3 additions & 3 deletions src/cpp/include/openvino/genai/visual_language/pipeline.hpp
@@ -127,7 +127,7 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/// If the prompt doesn't contain image or video tags, but images or videos are
/// provided, the tags are prepended to the prompt.
/// @param images Image to be prepended to a prompt.
-/// @param videos Videos to be prepended to a prompt.
+/// @param videos Multiple videos, each providing multiple frames, to be prepended to a prompt.
/// @param generation_config A config to follow for text generation.
/// @param streamer A streamer to acquire intermediate result.
/// @return A string generated by a model.
@@ -291,8 +291,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/*
* utils that allow to use generate() in the following way:
* pipe.generate(prompt, ov::genai::image(image_tensor)).
-* pipe.generate(prompt, ov::genai::images(video_tensor)).
-* pipe.generate(prompt, ov::genai::videos(video_tensor)).
+* pipe.generate(prompt, ov::genai::images(image_tensors)).
+* pipe.generate(prompt, ov::genai::videos(videos_tensors)).
*/
static constexpr ov::Property<ov::Tensor> image{"image"};
static constexpr ov::Property<std::vector<ov::Tensor>> images{"images"};
5 changes: 2 additions & 3 deletions src/cpp/src/continuous_batching/pipeline.cpp
@@ -276,13 +276,12 @@ std::vector<VLMDecodedResults> ContinuousBatchingPipeline::generate(
std::vector<VLMDecodedResults> ContinuousBatchingPipeline::generate(
const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& images,
-const std::vector<std::vector<ov::Tensor>>& video,
+const std::vector<std::vector<ov::Tensor>>& videos,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer) {
-return m_impl->generate(prompts, images, video, sampling_params, streamer);
+return m_impl->generate(prompts, images, videos, sampling_params, streamer);
}


void ContinuousBatchingPipeline::start_chat(const std::string& system_message) {
m_impl->finish_chat();
m_impl->start_chat(system_message);
64 changes: 45 additions & 19 deletions src/cpp/src/continuous_batching/pipeline_base.cpp
@@ -38,6 +38,7 @@ void ContinuousBatchingPipeline::IContinuousBatchingPipeline::finish_chat() {
m_history_videos.clear();
m_history_image_ids.clear();
m_history_video_ids.clear();
+m_history_vision_count.clear();
if (m_inputs_embedder) {
m_inputs_embedder->finish_chat();
}
@@ -164,52 +165,65 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
const std::vector<std::vector<ov::Tensor>>& images_vector,
const std::vector<std::vector<ov::Tensor>>& videos_vector,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer) {
auto generate_start_time = std::chrono::steady_clock::now();
OPENVINO_ASSERT(m_model_input_type == ModelInputType::EMBEDDINGS);

OPENVINO_ASSERT(prompts.size() == sampling_params.size(), "Number of prompts should be equal to the number of generation configs.");
-OPENVINO_ASSERT(prompts.size() == images_vector.size() && prompts.size() == videos_vector.size(), "Number of prompts should be equal to the number of images or video vectors.");
+if (images_vector.size() > 0)
+OPENVINO_ASSERT(prompts.size() == images_vector.size(), "Number of prompts should be equal to the number of images vectors.");
+if (videos_vector.size() > 0)
+OPENVINO_ASSERT(prompts.size() == videos_vector.size(), "Number of prompts should be equal to the number of videos vectors.");

std::vector<ov::Tensor> input_embeds_list;
std::vector<ov::Tensor> token_type_ids_list;

std::vector<VLMPerfMetrics> vlm_perf_metrics(prompts.size());
std::vector<EncodedImage> encoded_images = {};
std::vector<EncodedVideo> encoded_videos = {};
bool recalculate_merged_embeddings = images_vector.size() > 0 || videos_vector.size() > 0;

if (m_is_chat_conversation) {
OPENVINO_ASSERT(1 == prompts.size(), "Can't chat with multiple prompts");
const auto& prompt = prompts[0];
auto start_get_inputs_embeds = std::chrono::steady_clock::now();

-encoded_images = m_inputs_embedder->encode_images(images_vector[0]);
+encoded_images = m_inputs_embedder->encode_images(images_vector.size() > 0 ? images_vector[0] : std::vector<ov::Tensor>{});
m_history_images.insert(m_history_images.end(), encoded_images.begin(), encoded_images.end());

-encoded_videos = m_inputs_embedder->encode_videos(videos_vector[0]);
+encoded_videos = m_inputs_embedder->encode_videos(videos_vector.size() > 0 ? videos_vector[0] : std::vector<ov::Tensor>{});
m_history_videos.insert(m_history_videos.end(), encoded_videos.begin(), encoded_videos.end());

auto [unified_prompt, image_sequence, video_sequence] = m_inputs_embedder->normalize_prompt(prompt, m_image_id, m_video_id, encoded_images, encoded_videos);

m_history.push_back({{"role", "user"}, {"content", unified_prompt}});
m_history_image_ids.insert(m_history_image_ids.end(), image_sequence.begin(), image_sequence.end());
m_history_video_ids.insert(m_history_video_ids.end(), video_sequence.begin(), video_sequence.end());
+m_history_vision_count.emplace_back(std::make_pair(video_sequence.size(), image_sequence.size()));

std::string templated_history = m_tokenizer.apply_chat_template(m_history, true);

m_inputs_embedder->set_apply_chat_template_status(false);
if (m_inputs_embedder->has_token_type_ids()) {
-auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(templated_history, m_history_images, vlm_perf_metrics[0], images_vector.size() > 0, m_history_image_ids);
+auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(templated_history,
+m_history_images,
+m_history_videos,
+vlm_perf_metrics[0],
+recalculate_merged_embeddings,
+m_history_image_ids,
+m_history_video_ids,
+m_history_vision_count);
input_embeds_list.push_back(std::move(embeds));
token_type_ids_list.push_back(std::move(tt_ids));
} else {
input_embeds_list.emplace_back(m_inputs_embedder->get_inputs_embeds(templated_history,
-m_history_images,
-m_history_videos,
-vlm_perf_metrics[0],
-true,
-m_history_image_ids,
-m_history_video_ids));
+m_history_images,
+m_history_videos,
+vlm_perf_metrics[0],
+recalculate_merged_embeddings,
+m_history_image_ids,
+m_history_video_ids,
+m_history_vision_count));
}

auto end_get_inputs_embeds = std::chrono::steady_clock::now();
@@ -230,11 +244,17 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
m_inputs_embedder->set_apply_chat_template_status(sampling_params[i].apply_chat_template);

if (m_inputs_embedder->has_token_type_ids()) {
-auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(unified_prompt, encoded_images, vlm_perf_metrics[i], true, image_sequence);
+auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(unified_prompt,
+encoded_images,
+encoded_videos,
+vlm_perf_metrics[i],
+recalculate_merged_embeddings,
+image_sequence,
+video_sequence);
input_embeds_list.push_back(std::move(embeds));
token_type_ids_list.push_back(std::move(tt_ids));
} else {
input_embeds_list.emplace_back(m_inputs_embedder->get_inputs_embeds(unified_prompt, encoded_images, encoded_videos, vlm_perf_metrics[i], true, image_sequence, video_sequence));
input_embeds_list.emplace_back(m_inputs_embedder->get_inputs_embeds(unified_prompt, encoded_images, encoded_videos, vlm_perf_metrics[i], recalculate_merged_embeddings, image_sequence, video_sequence));
}

auto end_get_inputs_embeds = std::chrono::steady_clock::now();
@@ -278,6 +298,11 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
m_history_image_ids.pop_back();
m_history_images.pop_back();
}
+for (size_t idx = 0; idx < encoded_videos.size(); idx++) {
+m_history_video_ids.pop_back();
+m_history_videos.pop_back();
+}
+m_history_vision_count.pop_back();
}
}
return results;
@@ -307,12 +332,13 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(uint64_t re
return add_request(request_id, inputs, sampling_params, token_type_ids);
}

-GenerationHandle
-ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(uint64_t request_id,
-const std::string& prompt,
-const std::vector<ov::Tensor>& images,
-const std::vector<ov::Tensor>& videos,
-GenerationConfig sampling_params) {
+GenerationHandle
+ContinuousBatchingPipeline::IContinuousBatchingPipeline::add_request(
+uint64_t request_id,
+const std::string& prompt,
+const std::vector<ov::Tensor>& images,
+const std::vector<ov::Tensor>& videos,
+GenerationConfig sampling_params) {
OPENVINO_ASSERT(m_model_input_type == ModelInputType::EMBEDDINGS, "Model doesn't support embeddings.");
ov::genai::VLMPerfMetrics metrics;
ov::Tensor inputs;
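The `m_history_vision_count` bookkeeping introduced in the hunks above can be sketched in Python: each chat turn records a `(video count, image count)` pair so that a cancelled turn can be rolled back exactly, popping only its own contributions. Names mirror the C++ members, but this is an illustrative sketch, not the actual GenAI implementation.

```python
history_images = []
history_videos = []
history_vision_count = []  # one (video_count, image_count) pair per turn

def push_turn(images, videos):
    # Record this turn's encoded vision inputs and how many it contributed.
    history_images.extend(images)
    history_videos.extend(videos)
    history_vision_count.append((len(videos), len(images)))

def rollback_last_turn():
    # Undo exactly the last turn, mirroring the pop_back loops in the diff.
    videos_n, images_n = history_vision_count.pop()
    for _ in range(images_n):
        history_images.pop()
    for _ in range(videos_n):
        history_videos.pop()

push_turn(["img0", "img1"], ["vid0"])
push_turn(["img2"], [])
rollback_last_turn()  # removes only the second turn's contributions
```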
3 changes: 2 additions & 1 deletion src/cpp/src/continuous_batching/pipeline_base.hpp
@@ -55,6 +55,7 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {
std::vector<size_t> m_history_image_ids;
std::vector<ov::genai::EncodedVideo> m_history_videos;
std::vector<size_t> m_history_video_ids;
+std::vector<std::pair<std::size_t, std::size_t>> m_history_vision_count; // pair<video count, image count>
size_t m_image_id = 0;
size_t m_video_id = 0;

@@ -144,7 +145,7 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {

virtual std::vector<VLMDecodedResults> generate(const std::vector<std::string>& prompts,
const std::vector<std::vector<ov::Tensor>>& images,
-const std::vector<std::vector<ov::Tensor>>& video,
+const std::vector<std::vector<ov::Tensor>>& videos,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer);

7 changes: 6 additions & 1 deletion src/cpp/src/visual_language/clip.cpp
@@ -76,7 +76,12 @@ void bicubic_resize(const clip_image_u8 &img, clip_image_u8 &dst, int target_wid

dst.nx = target_width;
dst.ny = target_height;
-dst.buf.resize(3 * target_width * target_height);
+const int target_size = 3 * target_width * target_height;
+dst.buf.resize(target_size);
+if (img.nx == target_width && img.ny == target_height) {
+std::memcpy(dst.buf.data(), img.buf.data(), target_size);
+return;
+}

float Cc;
float C[5];
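The clip.cpp change above adds a fast path: when the source image already has the target dimensions, the buffer is copied instead of resampled. A minimal sketch of the same idea, with the bicubic branch stubbed out as nearest-neighbour for brevity (`resize_rgb` is a hypothetical helper, not a GenAI function):

```python
import numpy as np

def resize_rgb(img, target_w, target_h):
    """Return img resized to (target_h, target_w, 3)."""
    h, w, _ = img.shape
    if (w, h) == (target_w, target_h):
        # Fast path: same dimensions, just copy (the std::memcpy early return).
        return img.copy()
    # Placeholder for the real bicubic resampling: nearest-neighbour indexing.
    ys = np.arange(target_h) * h // target_h
    xs = np.arange(target_w) * w // target_w
    return img[ys][:, xs]
```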
8 changes: 5 additions & 3 deletions src/cpp/src/visual_language/continuous_batching_adapter.hpp
@@ -53,14 +53,16 @@ class ov::genai::VLMPipeline::VLMContinuousBatchingAdapter : public ov::genai::V
VLMDecodedResults generate(
const std::string& prompt,
const std::vector<ov::Tensor>& images,
-const std::vector<ov::Tensor>& video,
+const std::vector<ov::Tensor>& videos,
GenerationConfig generation_config,
const StreamerVariant& streamer
) override {
auto start_time = std::chrono::steady_clock::now();
-auto result = m_impl.generate({prompt}, {images}, {video}, {generation_config}, streamer)[0];
+auto images_vec = images.size() == 0u ? std::vector<std::vector<ov::Tensor>>{} : std::vector<std::vector<ov::Tensor>>{images};
+auto video_vec = videos.size() == 0u ? std::vector<std::vector<ov::Tensor>>{} : std::vector<std::vector<ov::Tensor>>{videos};
+auto result = m_impl.generate({prompt}, images_vec, video_vec, {generation_config}, streamer)[0];
auto stop_time = std::chrono::steady_clock::now();

VLMDecodedResults decoded;
decoded.perf_metrics = result.perf_metrics;
decoded.perf_metrics.load_time = get_load_time();
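The adapter change above wraps a single request's image/video list into a one-element batch, but maps an empty list to an empty batch so the batched pipeline never receives a spurious empty entry. The rule in isolation, as an illustrative Python sketch:

```python
def to_batched(tensors):
    # Mirrors: images_vec = images.empty() ? {} : {images}
    # A non-empty per-request list becomes a one-element batch;
    # an empty list stays an empty batch.
    return [] if len(tensors) == 0 else [tensors]

# Example: one request with two video frames becomes a batch of one.
batch = to_batched(["frame0", "frame1"])
```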
2 changes: 1 addition & 1 deletion src/cpp/src/visual_language/gemma3/classes.cpp
@@ -85,7 +85,7 @@ std::vector<ov::genai::EncodedImage> InputsEmbedderGemma3::encode_images(const s
return embeds;
}

-NormlizedPrompt InputsEmbedderGemma3::normalize_prompt(const std::string& prompt, size_t base_id, const std::vector<EncodedImage>& images) const {
+NormalizedPrompt InputsEmbedderGemma3::normalize_prompt(const std::string& prompt, size_t base_id, const std::vector<EncodedImage>& images) const {
std::string start_of_image = m_vlm_config.start_of_image;
std::string image_token = m_vlm_config.image_soft_token;
std::string end_of_image = m_vlm_config.end_of_image;
2 changes: 1 addition & 1 deletion src/cpp/src/visual_language/gemma3/classes.hpp
@@ -43,7 +43,7 @@ class InputsEmbedderGemma3 : public InputsEmbedder::IInputsEmbedder {

std::vector<ov::genai::EncodedImage> encode_images(const std::vector<ov::Tensor>& images) override;

-NormlizedPrompt normalize_prompt(const std::string& prompt, size_t base_id, const std::vector<EncodedImage>& images) const override;
+NormalizedPrompt normalize_prompt(const std::string& prompt, size_t base_id, const std::vector<EncodedImage>& images) const override;

std::pair<ov::Tensor, std::optional<int64_t>> get_position_ids(const size_t inputs_embeds_size, const size_t history_size) override;
