Merged
146 commits
db9da07
Draft enable VLM lookup.
xipingyan Sep 4, 2025
4a8901c
Remove global variable pass prompt ids.
xipingyan Sep 5, 2025
bbb9de3
Update some comments.
xipingyan Sep 5, 2025
eec1fa7
1: fix potential issue: max_ngram_size < input_length;
xipingyan Sep 11, 2025
e96d490
avoiding potential signed/unsigned comparison issues
xipingyan Sep 11, 2025
3a7b3a5
move to loop before.
xipingyan Sep 11, 2025
f2fc501
don't update param variable.
xipingyan Sep 11, 2025
5ee3df6
static_cast for type convert.
xipingyan Sep 11, 2025
432de9b
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Sep 12, 2025
04450fd
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Sep 17, 2025
6752836
1: Rename get_inputs_embeds_with_token_type_ids to get_inputs_embeds_…
xipingyan Sep 18, 2025
89b3422
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Sep 19, 2025
53c26aa
pass prompt ids for VLM+lookup
xipingyan Sep 19, 2025
3f4c58d
Fix continues batching test fail with enable lookup prompt.
xipingyan Sep 20, 2025
b402f64
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 14, 2025
9d618ef
revert get input_ids.
xipingyan Nov 14, 2025
9d2af10
Encode original prompt as lookup table in base class.
xipingyan Nov 15, 2025
d00f80c
Remove unecessay updated code.
xipingyan Nov 15, 2025
98b351c
update format
xipingyan Nov 15, 2025
de46c3c
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 18, 2025
10ba409
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 21, 2025
88162d4
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 24, 2025
05a7557
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 24, 2025
ab20000
rm useless comments
xipingyan Nov 24, 2025
097070a
update based on copilot suggestion
xipingyan Nov 24, 2025
300e236
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Nov 24, 2025
58c9458
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 24, 2025
6820fb8
Add prompt lookup decoding sample for VLM
xipingyan Nov 25, 2025
70f83e0
Add prompt lookup decoding python sample for VLM
xipingyan Nov 25, 2025
c577083
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Nov 25, 2025
2cf0d60
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Nov 25, 2025
c4f85c1
Add python test.
xipingyan Nov 25, 2025
a7c2d9b
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Nov 27, 2025
9ecec20
Revert: pass prompt_lookup flag to InputsEmbedder
xipingyan Nov 27, 2025
961f3f4
Call refer handle
xipingyan Nov 27, 2025
0b4721a
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Nov 27, 2025
623ded6
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Nov 27, 2025
0988a02
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Nov 27, 2025
42bf6d0
I don't know how to test cpp in python, remove cpp test, just compare…
xipingyan Nov 27, 2025
4309d87
Fix accuracy issue.(wrong position id trigger accuracy drop)
xipingyan Dec 4, 2025
4d96def
keep align about remove position id.
xipingyan Dec 4, 2025
af046aa
add debug log to print condatate.
xipingyan Dec 4, 2025
c6550d9
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Dec 4, 2025
fbca036
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Dec 4, 2025
fccebd1
Update samples/cpp/visual_language_chat/CMakeLists.txt
xipingyan Dec 4, 2025
e31400b
replace -1 with const.
xipingyan Dec 4, 2025
fd31b9b
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
612478e
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
774248c
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
d5e0025
Update src/cpp/src/visual_language/inputs_embedder.hpp
xipingyan Dec 4, 2025
33207dd
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
3439216
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
1d95cf7
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
0e2ddbf
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
7a5f4c9
model print token id to utils.
xipingyan Dec 4, 2025
95bb668
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Dec 4, 2025
0943c86
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 4, 2025
bfa6964
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Dec 4, 2025
ff4902b
remove next check after padding token
xipingyan Dec 5, 2025
60aaf30
NO use cxxopts in my case. just remove.
xipingyan Dec 5, 2025
7fd1a81
Based on comment, update get candidate algorithm.
xipingyan Dec 16, 2025
e0348e3
fix py test lint error
xipingyan Dec 17, 2025
8db3eab
fix lint error again
xipingyan Dec 17, 2025
1306f1e
Update src/cpp/src/utils.cpp
xipingyan Dec 17, 2025
1b2b2c4
fix var name spelling error
xipingyan Dec 17, 2025
dce056a
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Dec 17, 2025
db1efa5
Merge remote-tracking branch 'origin/master' into xp/enable_vlm_lookup
xipingyan Dec 23, 2025
6d3ef16
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 5, 2026
aa3d7d2
Update tests/python_tests/samples/test_prompt_lookup_decoding_vlm.py
xipingyan Jan 5, 2026
8ad719a
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 5, 2026
20caac3
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 5, 2026
001f2c2
add a comment, tip: generate_candidate only for prompt lookup
xipingyan Jan 7, 2026
8a8e4a8
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 8, 2026
96db59d
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 9, 2026
490d6e1
move print_token_id to debug_utils.hpp
xipingyan Jan 11, 2026
52c6526
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 11, 2026
e4afba8
algin comment for cpp/python samples
xipingyan Jan 13, 2026
798367b
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
xipingyan Jan 13, 2026
6fab469
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 13, 2026
327e1f7
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 14, 2026
16264af
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 15, 2026
e4af71f
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Jan 15, 2026
1168413
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 18, 2026
27872c1
Update src/cpp/src/debug_utils.hpp
xipingyan Jan 18, 2026
4ffa17e
Revert "Update src/cpp/src/debug_utils.hpp"
xipingyan Jan 18, 2026
79d84c0
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 19, 2026
e2dd8e1
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 19, 2026
b485321
rebase submodule.
xipingyan Jan 19, 2026
a808c6a
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
7498964
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
34c464d
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
341e75e
Update src/cpp/src/logger.hpp
xipingyan Jan 21, 2026
272b0ad
Apply suggestion from @xipingyan
xipingyan Jan 21, 2026
3f6ea3f
remove debug logger.
xipingyan Jan 21, 2026
75ca555
Update src/cpp/src/prompt_lookup/continuous_batching_for_prompt_looku…
xipingyan Jan 21, 2026
b1fccab
Remove explicit set ATTENTION_BACKEND="PA" in sample. keep align with…
xipingyan Jan 21, 2026
5ba3083
Reverted explicit set attention_backend, fix bug: don't pass prompt_l…
xipingyan Jan 21, 2026
e89dbcf
If set prompt_lookup=false, remove it, avoid to pass it to ov.
xipingyan Jan 21, 2026
2ed13af
Add: test_vlm_prompt_lookup_functionality
xipingyan Jan 21, 2026
d6f521d
Compare the result between enable pld and disable.
xipingyan Jan 21, 2026
4283371
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 21, 2026
0029288
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 21, 2026
6da42d4
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 22, 2026
b9ee7d4
Update samples/python/visual_language_chat/prompt_lookup_decoding_vlm.py
xipingyan Jan 22, 2026
c2b1e78
Update samples/cpp/visual_language_chat/prompt_lookup_decoding_vlm.cpp
xipingyan Jan 22, 2026
32a79d8
remove specific model name in comments.
xipingyan Jan 22, 2026
fe814bd
Merge branch 'master' into xp/enable_vlm_lookup
xipingyan Jan 22, 2026
67dcdee
update code format
xipingyan Jan 22, 2026
5cc1ecb
Update readme after enable pld + vlm sample.
xipingyan Jan 22, 2026
d10014f
update comment to: Prompt lookup decoding in VLM pipeline enforses Co…
xipingyan Jan 26, 2026
25894e9
fix conflict
Jan 27, 2026
9d958c7
fix ChatHistory issue
Jan 27, 2026
31a734d
add prompt_lookup_decoding_vlm_chat.py
Jan 27, 2026
7abcc46
lint issue
Jan 27, 2026
9025af0
Merge branch 'master' into xp/enable_vlm_lookup
Jan 29, 2026
d3a85a7
Merge branch 'master' into xp/enable_vlm_lookup
Wovchena Jan 30, 2026
c237463
add test case
Feb 2, 2026
1d86405
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
Feb 2, 2026
de8ddd8
lint issue
Feb 2, 2026
b2ac629
add readme
Feb 2, 2026
b16ea2f
rename generate_candidates in ContinuousBatchingImpl
Feb 5, 2026
e7d60df
modify samples
sunxiaoxia2022 Feb 10, 2026
ae135d3
add prompt_lookup in test case
sunxiaoxia2022 Feb 11, 2026
30b6aa1
remove ENABLE_LOOKUP option in sample
sunxiaoxia2022 Feb 11, 2026
f300e46
fix conflict
sunxiaoxia2022 Feb 11, 2026
e3a3d12
move prompt_lookup to the front
sunxiaoxia2022 Feb 11, 2026
7352267
remove a sample test
sunxiaoxia2022 Feb 11, 2026
8479454
remove args.enable_lookup
sunxiaoxia2022 Feb 11, 2026
99b125b
simplify assignment
sunxiaoxia2022 Feb 12, 2026
59d5b57
rename parameter org_prompt in encode_prompt
sunxiaoxia2022 Feb 13, 2026
59ce498
clang
sunxiaoxia2022 Feb 13, 2026
2d25b5b
pass prompt_lookup into VLMPipeline() in test
sunxiaoxia2022 Feb 13, 2026
2872b9e
add limitation with PROMPT_LOOKUP in test case
sunxiaoxia2022 Feb 13, 2026
5796a27
clang
sunxiaoxia2022 Feb 13, 2026
33765a2
Merge branch 'master' into xp/enable_vlm_lookup
Wovchena Feb 20, 2026
e094876
change prompt_lookup to parameter in sample
sunxiaoxia2022 Feb 25, 2026
b871180
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
sunxiaoxia2022 Feb 25, 2026
3ae64d0
lint issue
sunxiaoxia2022 Feb 25, 2026
eb787a0
rm unused include
sunxiaoxia2022 Feb 26, 2026
24bc943
Merge branch 'master' into xp/enable_vlm_lookup
sunxiaoxia2022 Feb 26, 2026
5bd9a8a
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Feb 27, 2026
3778f5a
rm input parameter is_validation_mode_enabled from ContinuousBatching…
sunxiaoxia2022 Feb 27, 2026
0077763
Merge branch 'xp/enable_vlm_lookup' of https://github.com/xipingyan/o…
sunxiaoxia2022 Feb 27, 2026
a694712
Merge branch 'master' into xp/enable_vlm_lookup
peterchen-intel Mar 2, 2026
0f55024
DEVICE and PROMPT_LOOKUP are optional
peterchen-intel Mar 3, 2026
dfdc600
change org_prompt_ids_list to original_prompt_ids_list
sunxiaoxia2022 Mar 3, 2026
4 changes: 2 additions & 2 deletions samples/cpp/visual_language_chat/README.md
@@ -3,8 +3,8 @@
This example showcases inference of Visual language models (VLMs). The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `ov::genai::VLMPipeline` and runs the simplest deterministic greedy sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/minicpm-v-multimodal-chatbot) which provides an example of Visual-language assistant.


There are three sample files:
- [`visual_language_chat.cpp`](./visual_language_chat.cpp) demonstrates basic usage of the VLM pipeline.
The following are sample files:
- [`visual_language_chat.cpp`](./visual_language_chat.cpp) demonstrates basic usage of the VLM pipeline which supports accelerated inference using prompt lookup decoding.
- [`video_to_text_chat.cpp`](./video_to_text_chat.cpp) demonstrates video to text usage of the VLM pipeline.
- [`benchmark_vlm.cpp`](./benchmark_vlm.cpp) shows how to benchmark a VLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text and calculating various performance metrics.

26 changes: 18 additions & 8 deletions samples/cpp/visual_language_chat/visual_language_chat.cpp
@@ -11,25 +11,35 @@ ov::genai::StreamingStatus print_subword(std::string&& subword) {
}

int main(int argc, char* argv[]) try {
if (argc < 3 || argc > 4) {
throw std::runtime_error(std::string{"Usage "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE OR DIR_WITH_IMAGES> <DEVICE>");
if (argc < 3 || argc > 5) {
throw std::runtime_error(std::string{"Usage "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE OR DIR_WITH_IMAGES> [DEVICE] [PROMPT_LOOKUP]");
}

std::vector<ov::Tensor> rgbs = utils::load_images(argv[2]);

// GPU and NPU can be used as well.
// Note: If NPU is selected, only language model will be run on NPU
std::string device = (argc == 4) ? argv[3] : "CPU";
ov::AnyMap enable_compile_cache;
std::string device = (argc >= 4) ? argv[3] : "CPU";
std::string lookup = (argc == 5) ? argv[4] : "false";
bool prompt_lookup = (lookup == "true");
// Prompt lookup decoding in VLM pipeline enforces ContinuousBatching backend
ov::AnyMap properties = {ov::genai::prompt_lookup(prompt_lookup)};
if (device == "GPU") {
// Cache compiled models on disk for GPU to save time on the
// next run. It's not beneficial for CPU.
enable_compile_cache.insert({ov::cache_dir("vlm_cache")});
properties.insert({ov::cache_dir("vlm_cache")});
}
ov::genai::VLMPipeline pipe(argv[1], device, enable_compile_cache);

ov::genai::VLMPipeline pipe(argv[1], device, properties);

ov::genai::GenerationConfig generation_config;
generation_config.max_new_tokens = 100;
if (prompt_lookup) {
// Define candidates number for candidate generation
generation_config.num_assistant_tokens = 5;
// Define max_ngram_size
generation_config.max_ngram_size = 3;
}

std::string prompt;

@@ -47,7 +57,7 @@ int main(int argc, char* argv[]) try {
);
history.push_back({{"role", "assistant"}, {"content", std::move(decoded_results.texts[0])}});
std::cout << "\n----------\n"
"question:\n";
"question:\n";
while (std::getline(std::cin, prompt)) {
history.push_back({{"role", "user"}, {"content", std::move(prompt)}});
// New images and videos can be passed at each turn
@@ -58,7 +68,7 @@
);
history.push_back({{"role", "assistant"}, {"content", std::move(decoded_results.texts[0])}});
std::cout << "\n----------\n"
"question:\n";
"question:\n";
}
} catch (const std::exception& error) {
try {
4 changes: 2 additions & 2 deletions samples/python/visual_language_chat/README.md
@@ -2,8 +2,8 @@

This example showcases inference of text-generation Vision Language Models (VLMs): `miniCPM-V-2_6` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `openvino_genai.VLMPipeline` and configures it for the chat scenario. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/minicpm-v-multimodal-chatbot) which provides an example of Visual-language assistant.

There are three sample files:
- [`visual_language_chat.py`](./visual_language_chat.py) demonstrates basic usage of the VLM pipeline.
The following are sample files:
- [`visual_language_chat.py`](./visual_language_chat.py) demonstrates basic usage of the VLM pipeline which supports accelerated inference using prompt lookup decoding.
- [`video_to_text_chat.py`](./video_to_text_chat.py) demonstrates video to text usage of the VLM pipeline.
- [`benchmark_vlm.py`](./benchmark_vlm.py) shows how to benchmark a VLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text and calculating various performance metrics.
- [`milebench_eval_vlm.py`](./milebench_eval_vlm.py) provides MileBench validation for VLMs, enabling evaluation of image–text reasoning and visual QA tasks across multiple subsets designed to assess the MultImodal Long-contExt capabilities of MLLMs.
22 changes: 16 additions & 6 deletions samples/python/visual_language_chat/visual_language_chat.py
@@ -49,25 +49,35 @@ def read_images(path: str) -> list[Tensor]:

def main():
parser = argparse.ArgumentParser()
parser.add_argument('model_dir', help="Path to the model directory")
parser.add_argument('image_dir', help="Image file or dir with images")
parser.add_argument('device', nargs='?', default='CPU', help="Device to run the model on (default: CPU)")
parser.add_argument("model_dir", help="Path to the model directory")
parser.add_argument("image_dir", help="Image file or dir with images")
parser.add_argument("device", nargs="?", default="CPU", help="Device to run the model on (default: CPU)")
parser.add_argument(
"prompt_lookup", nargs="?", default="false", help="Enable prompt lookup decoding (default: false)"
)
args = parser.parse_args()

rgbs = read_images(args.image_dir)

# GPU and NPU can be used as well.
# Note: If NPU is selected, only the language model will be run on the NPU.
enable_compile_cache = dict()
# Prompt lookup decoding in VLM pipeline enforces ContinuousBatching backend
prompt_lookup = args.prompt_lookup == "true"
properties = {"prompt_lookup": prompt_lookup}
if args.device == "GPU":
# Cache compiled models on disk for GPU to save time on the next run.
# It's not beneficial for CPU.
enable_compile_cache["CACHE_DIR"] = "vlm_cache"
properties["CACHE_DIR"] = "vlm_cache"

pipe = openvino_genai.VLMPipeline(args.model_dir, args.device, **enable_compile_cache)
pipe = openvino_genai.VLMPipeline(args.model_dir, args.device, **properties)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
if prompt_lookup:
# add parameter to enable prompt lookup decoding to generate `num_assistant_tokens` candidates per iteration
config.num_assistant_tokens = 5
# Define max_ngram_size
config.max_ngram_size = 3

history = openvino_genai.ChatHistory()
prompt = input('question:\n')
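The `num_assistant_tokens` and `max_ngram_size` settings configured in the samples above control a lookup step that can be sketched in plain Python. This is an illustrative toy over lists of token ids, not the actual C++ implementation under `src/cpp/src/prompt_lookup/`: take the trailing n-gram of the running sequence, find it in the prompt, and propose the tokens that followed it as draft candidates.

```python
def generate_candidates(prompt_ids, generated_ids, max_ngram_size=3, num_assistant_tokens=5):
    """Toy prompt-lookup candidate search (illustrative only).

    Take the last n tokens of the running sequence, look for that n-gram
    inside the prompt, and propose up to `num_assistant_tokens` tokens that
    followed it as draft candidates. Longer n-grams are tried first.
    """
    seq = list(prompt_ids) + list(generated_ids)
    for n in range(min(max_ngram_size, len(seq)), 0, -1):
        ngram = seq[-n:]
        # scan the prompt left-to-right for this n-gram
        for start in range(len(prompt_ids) - n + 1):
            if prompt_ids[start:start + n] == ngram:
                candidates = prompt_ids[start + n:start + n + num_assistant_tokens]
                if candidates:
                    return candidates
    return []
```

The candidates are then validated by the main model in a single forward pass, which is where the speedup comes from when the hit rate is high.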
7 changes: 5 additions & 2 deletions src/cpp/src/continuous_batching/pipeline.cpp
@@ -66,8 +66,11 @@ ContinuousBatchingPipeline::ContinuousBatchingPipeline( const std::filesystem::p

if (is_prompt_lookup_enabled) {
OPENVINO_ASSERT(draft_model_desr.model == nullptr, "Speculative decoding and prompt lookup decoding are mutually exclusive");
OPENVINO_ASSERT(embedder == nullptr, "Prompt lookup decoding is not supported for models with embeddings");
m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
if (embedder) {
m_impl = std::make_shared<PromptLookupImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
} else {
m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
}
} else if (draft_model_desr.model != nullptr && eagle_rt_info.eagle3_mode) {
OPENVINO_ASSERT(embedder == nullptr, "Eagle speculative decoding is not supported for models with embeddings");
auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model_without_gguf, scheduler_config, generation_config);
19 changes: 18 additions & 1 deletion src/cpp/src/continuous_batching/pipeline_base.cpp
@@ -268,6 +268,7 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
std::vector<ov::Tensor> input_embeds_list;
std::vector<ov::Tensor> token_type_ids_list;
std::vector<std::pair<ov::Tensor, std::optional<int64_t>>> position_ids_list;
std::vector<ov::Tensor> original_prompt_ids_list;

std::vector<VLMPerfMetrics> vlm_perf_metrics(prompts.size());
std::vector<EncodedImage> encoded_images = {};
@@ -300,6 +301,12 @@
std::string templated_history = m_tokenizer.apply_chat_template(m_history, true);

m_inputs_embedder->set_apply_chat_template_status(false);

if (sampling_params[0].is_prompt_lookup()) {
auto prompt_ids = m_inputs_embedder->encode_prompt(prompt);
original_prompt_ids_list.push_back(prompt_ids);
}

if (m_inputs_embedder->has_token_type_ids()) {
auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(templated_history,
m_history_images,
@@ -340,6 +347,11 @@

m_inputs_embedder->set_apply_chat_template_status(sampling_params[i].apply_chat_template);

if (sampling_params[i].is_prompt_lookup()) {
auto prompt_ids = m_inputs_embedder->encode_prompt(prompt);
original_prompt_ids_list.push_back(prompt_ids);
}
Comment on lines +350 to +353
Contributor: Should prompt lookup also be applied to the other generate() method with a ChatHistory argument, or will it be addressed in a separate PR?

xipingyan (Contributor Author) Jan 22, 2026: This is a good question, @yatarkan. prompt_lookup was introduced for LLMs first; I extended it to VLM because we found it very useful for some special cases. For example, when answering a question in a specific format and the prompt contains similar examples, the hit rate is very high. Currently I don't plan to apply it to the other generate() methods, because there is no customer requirement, so I'd like to handle those cases in a separate PR. What do you think?

Collaborator: I think it's OK for now, but start_chat() is going to be deprecated in the next release. Will you submit a patch by then?

xipingyan (Contributor Author) Jan 26, 2026: Hi @Wovchena @yatarkan, we found some problems with the generate() method that takes a ChatHistory argument when prompt_lookup is enabled. Since this PR moved to the next release, we will continue profiling and fix it in this PR. There is no issue with start_chat and finish_chat. BTW, @sunxiaoxia2022 will help me fix the ChatHistory issue.

Contributor: @yatarkan @Wovchena @xipingyan Hi, the ChatHistory issue has been resolved, and a test case with ChatHistory was added in prompt_lookup_decoding_vlm.cpp. Please take a look, thank you.

Contributor: Hi @yatarkan @Wovchena @xipingyan, Python samples and an accuracy test have been added. Please take a look, thank you!
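The "high hit rate" case described in this thread (answers that echo a format already present in the prompt) can be illustrated with a toy measurement. This is a rough proxy using character tokens, not how the pipeline measures anything:

```python
def ngram_in_prompt(prompt_tokens, answer_tokens, n=3):
    """Fraction of answer positions whose preceding n-gram occurs in the
    prompt -- a toy proxy for the prompt-lookup hit rate."""
    hits = 0
    checks = 0
    for i in range(n, len(answer_tokens)):
        ngram = answer_tokens[i - n:i]
        checks += 1
        if any(prompt_tokens[j:j + n] == ngram
               for j in range(len(prompt_tokens) - n + 1)):
            hits += 1
    return hits / checks if checks else 0.0

prompt = list("Answer: yes\nAnswer: no\n")
# an answer that reuses the "Answer: ..." structure scores much higher
# than unrelated text, which is when prompt lookup pays off
structured = ngram_in_prompt(prompt, list("Answer: maybe"), n=3)
unrelated = ngram_in_prompt(prompt, list("zzzzqqqq"), n=3)
```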


if (m_inputs_embedder->has_token_type_ids()) {
auto [embeds, tt_ids] = m_inputs_embedder->get_inputs_embeds_with_token_type_ids(unified_prompt,
encoded_images,
@@ -360,7 +372,12 @@
}
}
std::vector<VLMDecodedResults> results;
std::vector<EncodedGenerationResult> encoded_results = generate(input_embeds_list, sampling_params, streamer, token_type_ids_list, position_ids_list);
std::vector<EncodedGenerationResult> encoded_results = generate(input_embeds_list,
sampling_params,
streamer,
token_type_ids_list,
position_ids_list,
original_prompt_ids_list);
for (size_t i = 0; i < prompts.size(); i++) {
auto result = encoded_results[i];
VLMDecodedResults gen_result;
6 changes: 4 additions & 2 deletions src/cpp/src/continuous_batching/pipeline_base.hpp
@@ -83,7 +83,8 @@ class ContinuousBatchingPipeline::IContinuousBatchingPipeline {
virtual GenerationHandle add_request(uint64_t request_id,
const ov::Tensor& input_ids,
const GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids = std::nullopt) = 0;
std::optional<ov::Tensor> token_type_ids = std::nullopt,
std::optional<ov::Tensor> prompt_ids = std::nullopt) = 0;

/**
* Adds request to running queue based on string input
@@ -130,7 +131,8 @@
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids = std::nullopt,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids = std::nullopt) = 0;
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids = std::nullopt,
const std::optional<std::vector<ov::Tensor>>& prompt_ids = std::nullopt) = 0;

/**
* Performs monolitic generation based on text prompts
20 changes: 16 additions & 4 deletions src/cpp/src/continuous_batching/pipeline_impl.cpp
@@ -113,6 +113,8 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::~ContinuousBatchingImpl() {
}
}

void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates_for_prompt_lookup() {}
Copilot AI (Feb 5, 2026): This empty virtual method should have a comment explaining that it's a no-op base implementation that's overridden by PromptLookupImpl for candidate generation.

Suggested change:
- void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates_for_prompt_lookup() {}
+ void ContinuousBatchingPipeline::ContinuousBatchingImpl::generate_candidates() {
+     // Intentionally left as a no-op in the base implementation.
+     // PromptLookupImpl overrides this method to perform candidate generation.
+ }

void ContinuousBatchingPipeline::ContinuousBatchingImpl::_pull_awaiting_requests() {
std::lock_guard<std::mutex> lock{m_awaiting_requests_mutex};
m_requests.insert(m_requests.end(), m_awaiting_requests.begin(), m_awaiting_requests.end());
@@ -263,7 +265,8 @@
uint64_t request_id,
const ov::Tensor& input_ids,
const ov::genai::GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids) {
std::optional<ov::Tensor> token_type_ids,
std::optional<ov::Tensor> prompt_ids) {
auto sampling_params_copy = sampling_params;
// If stop_token_ids were not provided, take value from default m_generation_config
if (sampling_params_copy.stop_token_ids.empty())
@@ -289,7 +292,8 @@
m_block_size,
token_type_ids,
position_ids,
rope_delta);
rope_delta,
prompt_ids);
}
else {
sequence_group = std::make_shared<SequenceGroup>(request_id, input_ids, sampling_params_copy, m_block_size, token_type_ids);
@@ -434,7 +438,14 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::step() {

free_fork_timer.end();
}


{
static ManualTimer candidates_timer("generate_candidates_for_prompt_lookup()");
candidates_timer.start();
generate_candidates_for_prompt_lookup();
candidates_timer.end();
}

// append embeddings for generated tokens
if (m_model_input_type == ModelInputType::EMBEDDINGS)
m_model_runner->append_embeddings(m_requests, scheduler_output);
@@ -470,7 +481,8 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<o
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list) {
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list,
const std::optional<std::vector<ov::Tensor>>& prompt_ids) {

_reset_cache_usage_statistics();
ManualTimer generate_timer("generate()");
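The hook introduced in step() above follows the template-method pattern: the base class exposes a no-op `generate_candidates_for_prompt_lookup()` that `PromptLookupImpl` overrides. A minimal Python sketch of that structure (the real scheduling and n-gram search are omitted; the five-token slice below is a stand-in):

```python
class ContinuousBatchingImpl:
    """Base continuous-batching pipeline: the candidate hook is a no-op."""

    def generate_candidates_for_prompt_lookup(self):
        # Intentionally empty; PromptLookupImpl overrides this.
        pass

    def step(self):
        # Simplified step(): the real method schedules requests, runs the
        # model, and samples; this only shows where the hook is invoked.
        self.generate_candidates_for_prompt_lookup()


class PromptLookupImpl(ContinuousBatchingImpl):
    """Derived pipeline that fills draft candidates from the prompt ids."""

    def __init__(self, prompt_ids):
        self.prompt_ids = prompt_ids
        self.candidates = []

    def generate_candidates_for_prompt_lookup(self):
        # Stand-in for the real n-gram search over the original prompt ids.
        self.candidates = self.prompt_ids[:5]
```

Keeping the hook virtual lets the base step() stay unchanged while the prompt-lookup variant injects candidates on every scheduling iteration.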
13 changes: 11 additions & 2 deletions src/cpp/src/continuous_batching/pipeline_impl.hpp
@@ -123,22 +123,31 @@ class ContinuousBatchingPipeline::ContinuousBatchingImpl : public ContinuousBatc
GenerationHandle add_request(uint64_t request_id,
const ov::Tensor& input_ids,
const ov::genai::GenerationConfig& sampling_params,
std::optional<ov::Tensor> token_type_ids = std::nullopt) override;
std::optional<ov::Tensor> token_type_ids = std::nullopt,
std::optional<ov::Tensor> prompt_ids = std::nullopt) override;

GenerationHandle add_request(uint64_t request_id,
const std::string& prompt,
const ov::genai::GenerationConfig& sampling_params) override;

bool has_non_finished_requests() override;

virtual void generate_candidates_for_prompt_lookup();

void step() override;

/**
* input_ids is a batch of input ids for generation, which can be either raw prompts or already encoded token ids,
* depending on the pipeline configuration. prompt_ids is an optional batch of prompt ids, which represents the
* token IDs of the prompt portion for each sequence in the batch.
*/
std::vector<EncodedGenerationResult>
generate(const std::vector<ov::Tensor>& input_ids,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
const std::optional<std::vector<ov::Tensor>>& token_type_ids = std::nullopt,
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list = std::nullopt) override;
const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list = std::nullopt,
const std::optional<std::vector<ov::Tensor>>& prompt_ids = std::nullopt) override;

/**
* Updates LoRA adapters for current generation call
15 changes: 15 additions & 0 deletions src/cpp/src/debug_utils.hpp
@@ -14,6 +14,8 @@
#include <openvino/runtime/tensor.hpp>
#include <string>

#include "openvino/genai/tokenizer.hpp"

template <typename T>
void print_array(T* array, size_t size) {
std::cout << " => [ ";
@@ -206,6 +208,19 @@
return tensor;
}

inline std::string print_token_id(const std::vector<int64_t>& print_ids,
const std::string& prefix,
const size_t& last_num,
ov::genai::Tokenizer& tokenizer) {
std::stringstream ss;
ss << prefix << " = ";
size_t start_id = (print_ids.size() > last_num) ? (print_ids.size() - last_num) : 0;
for (size_t id = start_id; id < print_ids.size(); id++) {
ss << print_ids[id] << "[" << tokenizer.decode(std::vector<int64_t>{print_ids[id]}) << "],";
}
return ss.str();
}

inline float max_diff(const ov::Tensor& lhs, const ov::Tensor& rhs) {
OPENVINO_ASSERT(lhs.get_shape() == rhs.get_shape());
float max_diff = 0.0f;
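The `print_token_id` debug helper added above formats the last N token ids as `id[decoded]` pairs. The same logic in Python, with a stub `decode` callable standing in for `ov::genai::Tokenizer::decode` (the vocabulary below is made up for illustration):

```python
def print_token_id(print_ids, prefix, last_num, decode):
    """Python mirror of the C++ debug helper: format the last `last_num`
    token ids as "id[decoded]," pairs after the given prefix."""
    parts = [f"{prefix} = "]
    start = len(print_ids) - last_num if len(print_ids) > last_num else 0
    for tid in print_ids[start:]:
        parts.append(f"{tid}[{decode([tid])}],")
    return "".join(parts)

# hypothetical id-to-text mapping standing in for a real tokenizer
vocab = {101: "<bos>", 7592: "hello", 2088: "world"}
line = print_token_id([101, 7592, 2088], "candidate", 2, lambda ids: vocab[ids[0]])
# → "candidate = 7592[hello],2088[world],"
```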