Merged
146 commits
db9da07
Draft enable VLM lookup.
xipingyan Sep 4, 2025
4a8901c
Remove global variable pass prompt ids.
xipingyan Sep 5, 2025
bbb9de3
Update some comments.
xipingyan Sep 5, 2025
eec1fa7
1: fix potential issue: max_ngram_size < input_length;
xipingyan Sep 11, 2025
e96d490
avoiding potential signed/unsigned comparison issues
xipingyan Sep 11, 2025
3a7b3a5
move to loop before.
xipingyan Sep 11, 2025
f2fc501
don't update param variable.
xipingyan Sep 11, 2025
5ee3df6
static_cast for type convert.
xipingyan Sep 11, 2025
432de9b
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Sep 12, 2025
04450fd
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Sep 17, 2025
6752836
1: Rename get_inputs_embeds_with_token_type_ids to get_inputs_embeds_…
xipingyan Sep 18, 2025
89b3422
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Sep 19, 2025
53c26aa
pass prompt ids for VLM+lookup
xipingyan Sep 19, 2025
3f4c58d
Fix continues batching test fail with enable lookup prompt.
xipingyan Sep 20, 2025
b402f64
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 14, 2025
9d618ef
revert get input_ids.
xipingyan Nov 14, 2025
9d2af10
Encode original prompt as lookup table in base class.
xipingyan Nov 15, 2025
d00f80c
Remove unecessay updated code.
xipingyan Nov 15, 2025
98b351c
update format
xipingyan Nov 15, 2025
de46c3c
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 18, 2025
10ba409
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 21, 2025
88162d4
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 24, 2025
05a7557
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 24, 2025
ab20000
rm useless comments
xipingyan Nov 24, 2025
097070a
update based on copilot suggestion
xipingyan Nov 24, 2025
300e236
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Nov 24, 2025
58c9458
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 24, 2025
6820fb8
Add prompt lookup decoding sample for VLM
xipingyan Nov 25, 2025
70f83e0
Add prompt lookup decoding python sample for VLM
xipingyan Nov 25, 2025
c577083
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Nov 25, 2025
2cf0d60
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 25, 2025
c4f85c1
Add python test.
xipingyan Nov 25, 2025
a7c2d9b
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 27, 2025
9ecec20
Revert: pass prompt_lookup flag to InputsEmbedder
xipingyan Nov 27, 2025
961f3f4
Call refer handle
xipingyan Nov 27, 2025
0b4721a
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Nov 27, 2025
623ded6
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Nov 27, 2025
0988a02
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 27, 2025
42bf6d0
I don't know how to test cpp in python, remove cpp test, just compare…
xipingyan Nov 27, 2025
4309d87
Fix accuracy issue.(wrong position id trigger accuracy drop)
xipingyan Dec 4, 2025
4d96def
keep align about remove position id.
xipingyan Dec 4, 2025
af046aa
add debug log to print condatate.
xipingyan Dec 4, 2025
c6550d9
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Dec 4, 2025
fbca036
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Dec 4, 2025
fccebd1
Update samples/cpp/visual_language_chat/CMakeLists.txt
xipingyan Dec 4, 2025
e31400b
replace -1 with const.
xipingyan Dec 4, 2025
fd31b9b
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
612478e
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
774248c
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
d5e0025
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Dec 4, 2025
33207dd
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
3439216
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
1d95cf7
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
0e2ddbf
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
7a5f4c9
model print token id to utils.
xipingyan Dec 4, 2025
95bb668
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
0943c86
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
bfa6964
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Dec 4, 2025
ff4902b
remove next check after padding token
xipingyan Dec 5, 2025
60aaf30
NO use cxxopts in my case. just remove.
xipingyan Dec 5, 2025
7fd1a81
Based on comment, update get candidate algorithm.
xipingyan Dec 16, 2025
e0348e3
fix py test lint error
xipingyan Dec 17, 2025
8db3eab
fix lint error again
xipingyan Dec 17, 2025
1306f1e
Update src/cpp/src/utils.cpp
xipingyan Dec 17, 2025
1b2b2c4
fix var name spelling error
xipingyan Dec 17, 2025
dce056a
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 17, 2025
db1efa5
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Dec 23, 2025
6d3ef16
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 5, 2026
aa3d7d2
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Jan 5, 2026
8ad719a
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 5, 2026
20caac3
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 5, 2026
001f2c2
add a comment, tip: generate_candidate only for prompt lookup
xipingyan Jan 7, 2026
8a8e4a8
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 8, 2026
96db59d
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 9, 2026
490d6e1
move print_token_id to debug_utils.hpp
xipingyan Jan 11, 2026
52c6526
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 11, 2026
e4afba8
algin comment for cpp/python samples
xipingyan Jan 13, 2026
798367b
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Jan 13, 2026
6fab469
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 13, 2026
327e1f7
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 14, 2026
16264af
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 15, 2026
e4af71f
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 15, 2026
1168413
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 18, 2026
27872c1
Update src/cpp/src/debug_utils.hpp
xipingyan Jan 18, 2026
4ffa17e
Revert "Update src/cpp/src/debug_utils.hpp"
xipingyan Jan 18, 2026
79d84c0
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 19, 2026
e2dd8e1
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 19, 2026
b485321
rebase submodule.
xipingyan Jan 19, 2026
a808c6a
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
7498964
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
34c464d
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
341e75e
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
272b0ad
Apply suggestion from @xipingyan
xipingyan Jan 21, 2026
3f6ea3f
remove debug logger.
xipingyan Jan 21, 2026
75ca555
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Jan 21, 2026
b1fccab
Remove explicit set ATTENTION_BACKEND="PA" in sample. keep align with…
xipingyan Jan 21, 2026
5ba3083
Reverted explicit set attention_backend, fix bug: don't pass prompt_l…
xipingyan Jan 21, 2026
e89dbcf
If set prompt_lookup=false, remove it, avoid to pass it to ov.
xipingyan Jan 21, 2026
2ed13af
Add: test_vlm_prompt_lookup_functionality
xipingyan Jan 21, 2026
d6f521d
Compare the result between enable pld and disable.
xipingyan Jan 21, 2026
4283371
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 21, 2026
0029288
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 21, 2026
6da42d4
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 22, 2026
b9ee7d4
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 22, 2026
c2b1e78
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 22, 2026
32a79d8
remove specific model name in comments.
xipingyan Jan 22, 2026
fe814bd
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 22, 2026
67dcdee
update code format
xipingyan Jan 22, 2026
5cc1ecb
Update readme after enable pld + vlm sample.
xipingyan Jan 22, 2026
d10014f
update comment to: Prompt lookup decoding in VLM pipeline enforses Co…
xipingyan Jan 26, 2026
25894e9
fix conflict
Jan 27, 2026
9d958c7
fix ChatHistory issue
Jan 27, 2026
31a734d
add prompt_lookup_decoding_vlm_chat.py
Jan 27, 2026
7abcc46
lint issue
Jan 27, 2026
9025af0
Merge branch 'master' into xp/enable_vlm_lookup
Jan 29, 2026
d3a85a7
Merge branch 'master' into xp/enable_vlm_lookup
Wovchena Jan 30, 2026
c237463
add test case
Feb 2, 2026
1d86405
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
Feb 2, 2026
de8ddd8
lint issue
Feb 2, 2026
b2ac629
add readme
Feb 2, 2026
b16ea2f
rename generate_candidates in ContinuousBatchingImpl
Feb 5, 2026
e7d60df
modify samples
sunxiaoxia2022 Feb 10, 2026
ae135d3
add prompt_lookup in test case
sunxiaoxia2022 Feb 11, 2026
30b6aa1
remove ENABLE_LOOKUP option in sample
sunxiaoxia2022 Feb 11, 2026
f300e46
fix conflict
sunxiaoxia2022 Feb 11, 2026
e3a3d12
move prompt_lookup to the front
sunxiaoxia2022 Feb 11, 2026
7352267
remove a sample test
sunxiaoxia2022 Feb 11, 2026
8479454
remove args.enable_lookup
sunxiaoxia2022 Feb 11, 2026
99b125b
simplify assignment
sunxiaoxia2022 Feb 12, 2026
59d5b57
rename parameter org_prompt in encode_prompt
sunxiaoxia2022 Feb 13, 2026
59ce498
clang
sunxiaoxia2022 Feb 13, 2026
2d25b5b
pass prompt_lookup into VLMPipeline() in test
sunxiaoxia2022 Feb 13, 2026
2872b9e
add limitation with PROMPT_LOOKUP in test case
sunxiaoxia2022 Feb 13, 2026
5796a27
clang
sunxiaoxia2022 Feb 13, 2026
33765a2
Merge branch 'master' into xp/enable_vlm_lookup
Wovchena Feb 20, 2026
e094876
change prompt_lookup to parameter in sample
sunxiaoxia2022 Feb 25, 2026
b871180
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
sunxiaoxia2022 Feb 25, 2026
3ae64d0
lint issue
sunxiaoxia2022 Feb 25, 2026
eb787a0
rm unused include
sunxiaoxia2022 Feb 26, 2026
24bc943
Merge branch 'master' into xp/enable_vlm_lookup
sunxiaoxia2022 Feb 26, 2026
5bd9a8a
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Feb 27, 2026
3778f5a
rm input parameter is_validation_mode_enabled from ContinuousBatching…
sunxiaoxia2022 Feb 27, 2026
0077763
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
sunxiaoxia2022 Feb 27, 2026
a694712
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Mar 2, 2026
0f55024
DEVICE and PROMPT_LOOKUP are optional
peterchen-intel Mar 3, 2026
dfdc600
change org_prompt_ids_list to original_prompt_ids_list
sunxiaoxia2022 Mar 3, 2026
4 changes: 2 additions & 2 deletions samples/cpp/visual_language_chat/README.md
@@ -3,8 +3,8 @@
This example showcases inference of Visual language models (VLMs). The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `ov::genai::VLMPipeline` and runs the simplest deterministic greedy sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/minicpm-v-multimodal-chatbot) which provides an example of Visual-language assistant.


There are three sample files:
- [`visual_language_chat.cpp`](./visual_language_chat.cpp) demonstrates basic usage of the VLM pipeline.
The following are sample files:
- [`visual_language_chat.cpp`](./visual_language_chat.cpp) demonstrates basic usage of the VLM pipeline which supports accelerated inference using prompt lookup decoding.
- [`video_to_text_chat.cpp`](./video_to_text_chat.cpp) demonstrates video to text usage of the VLM pipeline.
- [`benchmark_vlm.cpp`](./benchmark_vlm.cpp) shows how to benchmark a VLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text and calculating various performance metrics.

26 changes: 18 additions & 8 deletions samples/cpp/visual_language_chat/visual_language_chat.cpp
@@ -11,25 +11,35 @@ ov::genai::StreamingStatus print_subword(std::string&& subword) {
}

int main(int argc, char* argv[]) try {
if (argc < 3 || argc > 4) {
throw std::runtime_error(std::string{"Usage "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE OR DIR_WITH_IMAGES> <DEVICE>");
if (argc < 3 || argc > 5) {
throw std::runtime_error(std::string{"Usage "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE OR DIR_WITH_IMAGES> [DEVICE] [PROMPT_LOOKUP]");
}

std::vector<ov::Tensor> rgbs = utils::load_images(argv[2]);

// GPU and NPU can be used as well.
// Note: If NPU is selected, only language model will be run on NPU
std::string device = (argc == 4) ? argv[3] : "CPU";
ov::AnyMap enable_compile_cache;
std::string device = (argc >= 4) ? argv[3] : "CPU";
std::string lookup = (argc == 5) ? argv[4] : "false";
bool prompt_lookup = (lookup == "true");
// Prompt lookup decoding in VLM pipeline enforces ContinuousBatching backend
ov::AnyMap properties = {ov::genai::prompt_lookup(prompt_lookup)};
if (device == "GPU") {
// Cache compiled models on disk for GPU to save time on the
// next run. It's not beneficial for CPU.
enable_compile_cache.insert({ov::cache_dir("vlm_cache")});
properties.insert({ov::cache_dir("vlm_cache")});
}
ov::genai::VLMPipeline pipe(argv[1], device, enable_compile_cache);

ov::genai::VLMPipeline pipe(argv[1], device, properties);

ov::genai::GenerationConfig generation_config;
generation_config.max_new_tokens = 100;
if (prompt_lookup) {
// Define candidates number for candidate generation
generation_config.num_assistant_tokens = 5;
// Define max_ngram_size
generation_config.max_ngram_size = 3;
}

std::string prompt;

@@ -47,7 +57,7 @@ int main(int argc, char* argv[]) try {
);
history.push_back({{"role", "assistant"}, {"content", std::move(decoded_results.texts[0])}});
std::cout << "\n----------\n"
"question:\n";
"question:\n";
while (std::getline(std::cin, prompt)) {
history.push_back({{"role", "user"}, {"content", std::move(prompt)}});
// New images and videos can be passed at each turn
@@ -58,7 +68,7 @@
);
history.push_back({{"role", "assistant"}, {"content", std::move(decoded_results.texts[0])}});
std::cout << "\n----------\n"
"question:\n";
"question:\n";
}
} catch (const std::exception& error) {
try {
4 changes: 2 additions & 2 deletions samples/python/visual_language_chat/README.md
@@ -2,8 +2,8 @@

This example showcases inference of text-generation Vision Language Models (VLMs): `miniCPM-V-2_6` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `openvino_genai.VLMPipeline` and configures it for the chat scenario. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/minicpm-v-multimodal-chatbot) which provides an example of Visual-language assistant.

There are three sample files:
- [`visual_language_chat.py`](./visual_language_chat.py) demonstrates basic usage of the VLM pipeline.
The following are sample files:
- [`visual_language_chat.py`](./visual_language_chat.py) demonstrates basic usage of the VLM pipeline which supports accelerated inference using prompt lookup decoding.
- [`video_to_text_chat.py`](./video_to_text_chat.py) demonstrates video to text usage of the VLM pipeline.
- [`benchmark_vlm.py`](./benchmark_vlm.py) shows how to benchmark a VLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text and calculating various performance metrics.
- [`milebench_eval_vlm.py`](./milebench_eval_vlm.py) provides MileBench validation for VLMs, enabling evaluation of image–text reasoning and visual QA tasks across multiple subsets designed to assess the MultImodal Long-contExt capabilities of MLLMs.
22 changes: 16 additions & 6 deletions samples/python/visual_language_chat/visual_language_chat.py
@@ -49,25 +49,35 @@ def read_images(path: str) -> list[Tensor]:

def main():
parser = argparse.ArgumentParser()
parser.add_argument('model_dir', help="Path to the model directory")
parser.add_argument('image_dir', help="Image file or dir with images")
parser.add_argument('device', nargs='?', default='CPU', help="Device to run the model on (default: CPU)")
parser.add_argument("model_dir", help="Path to the model directory")
parser.add_argument("image_dir", help="Image file or dir with images")
parser.add_argument("device", nargs="?", default="CPU", help="Device to run the model on (default: CPU)")
parser.add_argument(
"prompt_lookup", nargs="?", default="false", help="Enable prompt lookup decoding (default: false)"
)
args = parser.parse_args()

rgbs = read_images(args.image_dir)

# GPU and NPU can be used as well.
# Note: If NPU is selected, only the language model will be run on the NPU.
enable_compile_cache = dict()
# Prompt lookup decoding in VLM pipeline enforces ContinuousBatching backend
prompt_lookup = args.prompt_lookup == "true"
properties = {"prompt_lookup": prompt_lookup}
if args.device == "GPU":
# Cache compiled models on disk for GPU to save time on the next run.
# It's not beneficial for CPU.
enable_compile_cache["CACHE_DIR"] = "vlm_cache"
properties["CACHE_DIR"] = "vlm_cache"

pipe = openvino_genai.VLMPipeline(args.model_dir, args.device, **enable_compile_cache)
pipe = openvino_genai.VLMPipeline(args.model_dir, args.device, **properties)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
if prompt_lookup:
# add parameter to enable prompt lookup decoding to generate `num_assistant_tokens` candidates per iteration
config.num_assistant_tokens = 5
# Define max_ngram_size
config.max_ngram_size = 3

history = openvino_genai.ChatHistory()
prompt = input('question:\n')
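The `num_assistant_tokens` and `max_ngram_size` settings configured in the samples above control a lookup step that can be sketched in plain Python. This is an illustrative toy over lists of token ids, not the actual C++ implementation under `src/cpp/src/prompt_lookup/`: take the trailing n-gram of the running sequence, find it in the prompt, and propose the tokens that followed it as draft candidates.

```python
def generate_candidates(prompt_ids, generated_ids, max_ngram_size=3, num_assistant_tokens=5):
    """Toy prompt-lookup candidate search (illustrative only).

    Take the last n tokens of the running sequence, look for that n-gram
    inside the prompt, and propose up to `num_assistant_tokens` tokens that
    followed it as draft candidates. Longer n-grams are tried first.
    """
    seq = list(prompt_ids) + list(generated_ids)
    for n in range(min(max_ngram_size, len(seq)), 0, -1):
        ngram = seq[-n:]
        # scan the prompt left-to-right for this n-gram
        for start in range(len(prompt_ids) - n + 1):
            if prompt_ids[start:start + n] == ngram:
                candidates = prompt_ids[start + n:start + n + num_assistant_tokens]
                if candidates:
                    return candidates
    return []
```

The candidates are then validated by the main model in a single forward pass, which is where the speedup comes from when the hit rate is high.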
7 changes: 5 additions & 2 deletions src/cpp/src/continuous_batching/pipeline.cpp
@@ -66,8 +66,11 @@ ContinuousBatchingPipeline::ContinuousBatchingPipeline( const std::filesystem::p

if (is_prompt_lookup_enabled) {
OPENVINO_ASSERT(draft_model_desr.model == nullptr, "Speculative decoding and prompt lookup decoding are mutually exclusive");
OPENVINO_ASSERT(embedder == nullptr, "Prompt lookup decoding is not supported for models with embeddings");
m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
if (embedder) {
m_impl = std::make_shared<PromptLookupImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
} else {
m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
}
} else if (draft_model_desr.model != nullptr && eagle_rt_info.eagle3_mode) {
OPENVINO_ASSERT(embedder == nullptr, "Eagle speculative decoding is not supported for models with embeddings");
auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model_without_gguf, scheduler_config, generation_config);
19 changes: 18 additions & 1 deletion src/cpp/src/continuous_batching/pipeline_base.cpp
@@ -268,6 +268,7 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
std::vector<ov::Tensor> input_embeds_list;
std::vector<ov::Tensor> token_type_ids_list;
std::vector<std::pair<ov::Tensor, std::optional<int64_t>>> position_ids_list;
std::vector<ov::Tensor> original_prompt_ids_list;

std::vector<VLMPerfMetrics> vlm_perf_metrics(prompts.size());
std::vector<EncodedImage> encoded_images = {};
@@ -300,6 +301,12 @@
std::string templated_history = m_tokenizer.apply_chat_template(m_history, true);

m_inputs_embedder->set_apply_chat_template_status(false);

if (sampling_params[0].is_prompt_lookup()) {
auto prompt_ids = m_inputs_embedder->encode_prompt(prompt);
original_prompt_ids_list.push_back(prompt_ids);
}

if (m_inputs_embedder->has_token_type_ids()) {
auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(templated_history,
m_history_images,
@@ -340,6 +347,11 @@

m_inputs_embedder->set_apply_chat_template_status(sampling_params[i].apply_chat_template);

if (sampling_params[i].is_prompt_lookup()) {
auto prompt_ids = m_inputs_embedder->encode_prompt(prompt);
original_prompt_ids_list.push_back(prompt_ids);
}
Comment on lines +350 to +353
Contributor: Should prompt lookup also be applied to the other generate() method with a ChatHistory argument, or will it be addressed in a separate PR?

xipingyan (Contributor Author) Jan 22, 2026: This is a good question, @yatarkan. prompt_lookup was introduced for LLMs first; I extended it to VLM because we found it very useful for some special cases. For example, when answering a question in a specific format and the prompt contains similar examples, the hit rate is very high. Currently I don't plan to apply it to the other generate() methods, because there is no customer requirement, so I'd like to handle those cases in a separate PR. What do you think?

Collaborator: I think it's OK for now, but start_chat() is going to be deprecated in the next release. Will you submit a patch by then?

xipingyan (Contributor Author) Jan 26, 2026: Hi @Wovchena @yatarkan, we found some problems with the generate() method that takes a ChatHistory argument when prompt_lookup is enabled. Since this PR moved to the next release, we will continue profiling and fix it in this PR. There is no issue with start_chat and finish_chat. BTW, @sunxiaoxia2022 will help me fix the ChatHistory issue.

Contributor: @yatarkan @Wovchena @xipingyan Hi, the ChatHistory issue has been resolved, and a test case with ChatHistory was added in prompt_lookup_decoding_vlm.cpp. Please take a look, thank you.

Contributor: Hi @yatarkan @Wovchena @xipingyan, Python samples and an accuracy test have been added. Please take a look, thank you!
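The "high hit rate" case described in this thread (answers that echo a format already present in the prompt) can be illustrated with a toy measurement. This is a rough proxy using character tokens, not how the pipeline measures anything:

```python
def ngram_in_prompt(prompt_tokens, answer_tokens, n=3):
    """Fraction of answer positions whose preceding n-gram occurs in the
    prompt -- a toy proxy for the prompt-lookup hit rate."""
    hits = 0
    checks = 0
    for i in range(n, len(answer_tokens)):
        ngram = answer_tokens[i - n:i]
        checks += 1
        if any(prompt_tokens[j:j + n] == ngram
               for j in range(len(prompt_tokens) - n + 1)):
            hits += 1
    return hits / checks if checks else 0.0

prompt = list("Answer: yes\nAnswer: no\n")
# an answer that reuses the "Answer: ..." structure scores much higher
# than unrelated text, which is when prompt lookup pays off
structured = ngram_in_prompt(prompt, list("Answer: maybe"), n=3)
unrelated = ngram_in_prompt(prompt, list("zzzzqqqq"), n=3)
```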


if (m_inputs_embedder->has_token_type_ids()) {
auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(unified_prompt,
encoded_images,
@@ -360,7 +372,12 @@
}
}
std::vector<VLMDecodedResults> results;
std::vector<EncodedGenerationResult> encoded_results = generate(input_embeds_list, sampling_params, streamer, token_type_ids_list, position_ids_list);
std::vector<EncodedGenerationResult> encoded_results = generate(input_embeds_list,
sampling_params,
streamer,
token_type_ids_list,
position_ids_list,
original_prompt_ids_list);
for (size_t i = 0; i < prompts.size(); i++) {
auto result = encoded_results[i];
VLMDecodedResults gen_result;
6 changes: 4 additions & 2 deletions src/cpp/src/continuous_batching/pipeline_base.hpp
@@ -83,7 +83,8 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {
virtual GenerationHandle add_request(uint64_t request_id,
const ov::Tensor& input_ids,
const GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids = std::nullopt) = 0;
std::optional<ov::Tensor> token_type_ids = std::nullopt,
std::optional<ov::Tensor> prompt_ids = std::nullopt) = 0;

/**
* Adds request to running queue based on string input
@@ -130,7 +131,8 @@
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids = std::nullopt,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids = std::nullopt) = 0;
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids = std::nullopt,
const std::optional<std::vector<ov::Tensor>>& prompt_ids = std::nullopt) = 0;

/**
* Performs monolitic generation based on text prompts
20 changes: 16 additions & 4 deletions src/cpp/src/continuous_batching/pipeline_impl.cpp
@@ -113,6 +113,8 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() {
}
}

void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates_for_prompt_lookup() {}
Copilot AI (Feb 5, 2026): This empty virtual method should have a comment explaining that it's a no-op base implementation that's overridden by PromptLookupImpl for candidate generation.

Suggested change:
- void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates_for_prompt_lookup() {}
+ void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates() {
+     // Intentionally left as a no-op in the base implementation.
+     // PromptLookupImpl overrides this method to perform candidate generation.
+ }

void ContinuousBatchingPipeline::ContinuousBatchingImpl::_pull_awaiting_requests() {
std::lock_guard<std::mutex> lock{m_awaiting_requests_mutex};
m_requests.insert(m_requests.end(), m_awaiting_requests.begin(), m_awaiting_requests.end());
@@ -263,7 +265,8 @@
uint64_t request_id,
const ov::Tensor& input_ids,
const ov::genai::GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids) {
std::optional<ov::Tensor> token_type_ids,
std::optional<ov::Tensor> prompt_ids) {
auto sampling_params_copy = sampling_params;
// If stop_token_ids were not provided, take value from default m_generation_config
if (sampling_params_copy.stop_token_ids.empty())
@@ -289,7 +292,8 @@
m_block_size,
token_type_ids,
position_ids,
rope_delta);
rope_delta,
prompt_ids);
}
else {
sequence_group = std::make_shared<SequenceGroup>(request_id, input_ids, sampling_params_copy, m_block_size, token_type_ids);
@@ -434,7 +438,14 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::step() {

free_fork_timer.end();
}


{
static ManualTimer candidates_timer("generate_candidates_for_prompt_lookup()");
candidates_timer.start();
generate_candidates_for_prompt_lookup();
candidates_timer.end();
}

// append embeddings for generated tokens
if (m_model_input_type == ModelInputType::EMBEDDINGS)
m_model_runner->append_embeddings(m_requests, scheduler_output);
@@ -470,7 +481,8 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<o
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list) {
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list,
const std::optional<std::vector<ov::Tensor>>& prompt_ids) {

_reset_cache_usage_statistics();
ManualTimer generate_timer("generate()");
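The hook introduced in step() above follows the template-method pattern: the base class exposes a no-op `generate_candidates_for_prompt_lookup()` that `PromptLookupImpl` overrides. A minimal Python sketch of that structure (the real scheduling and n-gram search are omitted; the five-token slice below is a stand-in):

```python
class ContinuousBatchingImpl:
    """Base continuous-batching pipeline: the candidate hook is a no-op."""

    def generate_candidates_for_prompt_lookup(self):
        # Intentionally empty; PromptLookupImpl overrides this.
        pass

    def step(self):
        # Simplified step(): the real method schedules requests, runs the
        # model, and samples; this only shows where the hook is invoked.
        self.generate_candidates_for_prompt_lookup()


class PromptLookupImpl(ContinuousBatchingImpl):
    """Derived pipeline that fills draft candidates from the prompt ids."""

    def __init__(self, prompt_ids):
        self.prompt_ids = prompt_ids
        self.candidates = []

    def generate_candidates_for_prompt_lookup(self):
        # Stand-in for the real n-gram search over the original prompt ids.
        self.candidates = self.prompt_ids[:5]
```

Keeping the hook virtual lets the base step() stay unchanged while the prompt-lookup variant injects candidates on every scheduling iteration.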
13 changes: 11 additions & 2 deletions src/cpp/src/continuous_batching/pipeline_impl.hpp
@@ -123,22 +123,31 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc
GenerationHandle add_request(uint64_t request_id,
const ov::Tensor& input_ids,
const ov::genai::GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids = std::nullopt) override;
std::optional<ov::Tensor> token_type_ids = std::nullopt,
std::optional<ov::Tensor> prompt_ids = std::nullopt) override;

GenerationHandle add_request(uint64_t request_id,
const std::string& prompt,
const ov::genai::GenerationConfig& sampling_params) override;

bool has_non_finished_requests() override;

virtual void generate_candidates_for_prompt_lookup();

void step() override;

/**
* input_ids is a batch of input ids for generation, which can be either raw prompts or already encoded token ids,
* depending on the pipeline configuration. prompt_ids is an optional batch of prompt ids, which represents the
* token IDs of the prompt portion for each sequence in the batch.
*/
std::vector<EncodedGenerationResult>
generate(const std::vector<ov::Tensor>& input_ids,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids = std::nullopt,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list = std::nullopt) override;
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list = std::nullopt,
const std::optional<std::vector<ov::Tensor>>& prompt_ids = std::nullopt) override;

/**
* Updates LoRA adapters for current generation call
15 changes: 15 additions & 0 deletions src/cpp/src/debug_utils.hpp
@@ -14,6 +14,8 @@
#include <openvino/runtime/tensor.hpp>
#include <string>

#include "openvino/genai/tokenizer.hpp"

template <typename T>
void print_array(T* array, size_t size) {
std::cout << " => [ ";
@@ -206,6 +208,19 @@
return tensor;
}

inline std::string print_token_id(const std::vector<int64_t>& print_ids,
const std::string& prefix,
const size_t& last_num,
ov::genai::Tokenizer& tokenizer) {
std::stringstream ss;
ss << prefix << " = ";
size_t start_id = (print_ids.size() > last_num) ? (print_ids.size() - last_num) : 0;
for (size_t id = start_id; id < print_ids.size(); id++) {
ss << print_ids[id] << "[" << tokenizer.decode(std::vector<int64_t>{print_ids[id]}) << "],";
}
return ss.str();
}

inline float max_diff(const ov::Tensor& lhs, const ov::Tensor& rhs) {
OPENVINO_ASSERT(lhs.get_shape() == rhs.get_shape());
float max_diff = 0.0f;
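The `print_token_id` debug helper added above formats the last N token ids as `id[decoded]` pairs. The same logic in Python, with a stub `decode` callable standing in for `ov::genai::Tokenizer::decode` (the vocabulary below is made up for illustration):

```python
def print_token_id(print_ids, prefix, last_num, decode):
    """Python mirror of the C++ debug helper: format the last `last_num`
    token ids as "id[decoded]," pairs after the given prefix."""
    parts = [f"{prefix} = "]
    start = len(print_ids) - last_num if len(print_ids) > last_num else 0
    for tid in print_ids[start:]:
        parts.append(f"{tid}[{decode([tid])}],")
    return "".join(parts)

# hypothetical id-to-text mapping standing in for a real tokenizer
vocab = {101: "<bos>", 7592: "hello", 2088: "world"}
line = print_token_id([101, 7592, 2088], "candidate", 2, lambda ids: vocab[ids[0]])
# → "candidate = 7592[hello],2088[world],"
```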