
Enable VLM lookup.#2707

Merged
Wovchena merged 146 commits into openvinotoolkit:master from xipingyan:xp/enable_vlm_lookup
Mar 3, 2026

Conversation

@xipingyan
Contributor

@xipingyan xipingyan commented Sep 5, 2025

1: Move m_pipeline->generate_candidates() from ContinuousBatchingPipeline::PromptLookupImpl::step() to ContinuousBatchingPipeline::ContinuousBatchingImpl::step().
2: Encode the original prompt into org_prompt_ids to serve as the lookup table.

Tickets: CVS-172889, CVS-176586

@github-actions github-actions bot added labels: category: visual language (Visual language pipeline), category: continuous batching (Continuous batching), no-match-files, category: prompt lookup (Prompt look-up decoding) — Sep 5, 2025
@xipingyan xipingyan marked this pull request as ready for review September 5, 2025 13:29
Copilot AI review requested due to automatic review settings September 5, 2025 13:29
Contributor

Copilot AI left a comment


Pull Request Overview

This PR enables VLM (Vision Language Model) lookup functionality by enhancing the continuous batching pipeline to support prompt lookup for embedding-based models. The changes introduce a new interface for providing prompt token IDs when using embedding inputs, enabling better integration between visual language models and prompt lookup decoding.

  • Move candidate generation from PromptLookupImpl::step() to ContinuousBatchingImpl::step()
  • Add new interface to pass prompt token IDs for embedding input models through get_inputs_embeds_with_token_type_ids method
  • Update all VLM model implementations to support the new token_type_ids interface

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.

Summary per file:

src/cpp/src/visual_language/qwen2vl/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/qwen2vl/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/phi4mm/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/phi4mm/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/phi3_vision/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/phi3_vision/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/minicpm/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/minicpm/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/llava_next/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/llava_next/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/llava/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/llava/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/internvl_chat/classes.hpp — Add get_inputs_embeds_with_token_type_ids method declaration
src/cpp/src/visual_language/internvl_chat/classes.cpp — Implement new method and refactor existing get_inputs_embeds to use it
src/cpp/src/visual_language/inputs_embedder.hpp — Update constructors to accept prompt_lookup parameter and add prompt_lookup support
src/cpp/src/visual_language/inputs_embedder.cpp — Update has_token_type_ids method and constructors for prompt lookup support
src/cpp/src/sequence_group.hpp — Add handling for embeddings in remove_last_tokens method
src/cpp/src/prompt_lookup/prompt_lookup_impl.hpp — Add constructor for embedding-based models
src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp — Support token_type_ids and remove candidate generation from step method
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.hpp — Add constructor for embedding models and make generate_candidates virtual
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp — Fix loop variable type and add candidate padding logic
src/cpp/src/continuous_batching/pipeline_impl.hpp — Add virtual generate_candidates method declaration
src/cpp/src/continuous_batching/pipeline_impl.cpp — Move candidate generation to main pipeline step and add empty default implementation
src/cpp/src/continuous_batching/pipeline.cpp — Pass prompt_lookup flag to InputsEmbedder constructor
src/cpp/src/continuous_batching/model_runner.hpp — Add proper token_type_ids tensor existence check


@xipingyan
Contributor Author

@yangsu2022, could you please review first?

1: pass prompts_ids via a global variable.

Signed-off-by: xipingya <xiping.yan@intel.com>
No need to add a new interface; just reuse "token_type_ids".

Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xipingya <xiping.yan@intel.com>
2: Fix a match bug. For example:
input_ids={2, 3, 1, 1, 2, 3, 4, 5, 6, 9, 2, 3, 1, 2, 3}
num_pred_tokens=3
max_ngram_size=3
return candidate: 2,3,1

Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xipingya <xiping.yan@intel.com>
@xipingyan xipingyan requested a review from Copilot September 11, 2025 05:38
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.



Signed-off-by: xipingya <xiping.yan@intel.com>
@xipingyan xipingyan requested a review from Copilot September 11, 2025 06:43
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.



Signed-off-by: xipingya <xiping.yan@intel.com>
@xipingyan xipingyan requested a review from Copilot September 11, 2025 06:56
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.



Signed-off-by: xipingya <xiping.yan@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.

@Wovchena Wovchena mentioned this pull request Feb 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50

  • After moving candidate generation into ContinuousBatchingImpl::step(), PromptLookupImpl::step() no longer updates m_sd_metrics.draft_duration, so draft vs main timing breakdown in SpeculativeDecodingMetrics becomes misleading (draft stays 0 and candidate-generation time is effectively attributed to main). Consider reintroducing draft timing around generate_candidates_for_prompt_lookup() for the prompt-lookup pipeline, or updating the metrics naming/logic accordingly.
    ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
    main_timer.start();
    m_pipeline->step();
    main_timer.end();
    m_sd_metrics.main_duration += main_timer.get_duration();
    m_pipeline_metrics = m_pipeline->get_metrics();

Comment on lines +86 to 92
        for (const auto& candidate : candidates) {
            running_sequence->append_token(candidate, 0);
        }
        max_validation_len = std::max(max_validation_len, candidates.size());
    }
    request->set_num_validated_tokens(max_validation_len);
}

Copilot AI Feb 13, 2026


max_validation_len is computed after padding candidates up to sampling_params.num_assistant_tokens, and then stored via set_num_validated_tokens(). This makes the pipeline try to validate padded tokens as if they were real candidates (and can also exceed min_num_assistant_tokens near max_new_tokens). Validation length should reflect only the number of real candidate tokens generated, not the padded shape length.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:46

  • After moving candidate generation out of PromptLookupImpl::step(), the draft/candidate generation duration is no longer accounted for in prompt-lookup-specific metrics (previously accumulated into m_sd_metrics.draft_duration). If these metrics are consumed externally, this is a regression. Consider timing generate_candidates_for_prompt_lookup() in the prompt-lookup path and adding it back into the same metrics field (e.g., via an override that measures and accumulates duration, or by returning duration from the hook).
    ManualTimer step_timer("prompt_lookup_decoding: step()");
    step_timer.start();

    auto generated_len_before = m_pipeline->get_generated_request_len();

    ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
    main_timer.start();

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated no new comments.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

src/cpp/src/continuous_batching/pipeline_impl.cpp:529

  • ContinuousBatchingImpl::generate(...) now accepts prompt_ids, but the parameter is never used (and is not forwarded into add_request(...)). This can cause -Wunused-parameter build failures and prevents prompt IDs from reaching SequenceGroup when using this implementation. Either propagate prompt_ids per request (with the same bounds checks used for token_type_ids) or explicitly mark the parameter unused if it is intentionally unsupported here.
std::vector<EncodedGenerationResult>
ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<ov::Tensor>& input_ids,
                                                             const std::vector<GenerationConfig>& sampling_params,
                                                             const StreamerVariant& streamer,
                                                             const std::optional<std::vector<ov::Tensor>>& token_type_ids,
                                                             const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list,
                                                             const std::optional<std::vector<ov::Tensor>>& prompt_ids) {

    _reset_cache_usage_statistics();
    ManualTimer generate_timer("generate()");
    generate_timer.start();

    OPENVINO_ASSERT(!has_non_finished_requests(), "Generate cannot be called while ContinuousBatchingPipeline is already in running state. Use ContinuousBatchingPipeline::add_request");
    OPENVINO_ASSERT(input_ids.size() == sampling_params.size());

    if (position_ids_list.has_value()) {
        OPENVINO_ASSERT((*position_ids_list).size() == input_ids.size());
    }

    auto start_time =  std::chrono::steady_clock::now();
    PerfMetrics perf_metrics;
    auto& raw_perf_counters = perf_metrics.raw_metrics;
    raw_perf_counters.m_inference_durations =  {{ MicroSeconds(0.0f) }};

    // checks that all requests has the same LoRA adapters property value
    for (size_t i = 1; i < sampling_params.size(); ++i) {
        OPENVINO_ASSERT(sampling_params[i - 1].adapters == sampling_params[i].adapters,
            "LoRA adapters value must be the same for all requests");
    }
    set_adapters(sampling_params[0].adapters);

    const auto streamer_ptr = std::make_shared<ThreadedStreamerWrapper>(streamer, m_tokenizer);

    OPENVINO_ASSERT(!streamer_ptr->has_callback() || input_ids.size() == 1 && sampling_params[0].num_return_sequences == 1 &&
        (sampling_params[0].is_greedy_decoding() || sampling_params[0].is_multinomial()),
        "Currently streaming is possible only with batch size=1 and only for greedy or multinomial decoding");

    std::vector<GenerationHandle> generations;
    for (size_t request_id = 0; request_id < input_ids.size(); ++request_id) {
        OPENVINO_ASSERT(1 == input_ids[request_id].get_shape().at(0), "Use multiple tensors to pass a batch.");
        if (position_ids_list.has_value()) {
            const auto [position_ids, rope_delta] = (*position_ids_list)[request_id];
            m_inputs_embedder->set_position_ids(position_ids);
            if (rope_delta.has_value()) {
                m_inputs_embedder->set_rope_delta(*rope_delta);
            }
        }
        bool has_valid_token = token_type_ids.has_value() && request_id < token_type_ids->size();
        generations.push_back(
            add_request(request_id, input_ids[request_id], sampling_params[request_id], has_valid_token ? std::make_optional((*token_type_ids)[request_id]) : std::nullopt)
        );

src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50

  • PromptLookupImpl::step() no longer measures/updates m_sd_metrics.draft_duration, so prompt-lookup candidate generation time is now effectively counted as main_duration (or not tracked separately at all). If consumers rely on SpeculativeDecodingMetrics fields (e.g., debug printing/percentages), this will misreport performance. Consider timing generate_candidates_for_prompt_lookup() separately again (at its new location) and updating draft_duration accordingly.
void ContinuousBatchingPipeline::PromptLookupImpl::step() {
    auto& raw_perf_counters = m_perf_metrics.raw_metrics;

    ManualTimer step_timer("prompt_lookup_decoding: step()");
    step_timer.start();

    auto generated_len_before = m_pipeline->get_generated_request_len();

    ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
    main_timer.start();
    m_pipeline->step();
    main_timer.end();
    m_sd_metrics.main_duration += main_timer.get_duration();
    m_pipeline_metrics = m_pipeline->get_metrics();

src/cpp/src/utils.cpp:139

  • is_paged_attention_available() now returns true on ARM64 only when HAVE_SVE is defined, but the exception messages in explicitly_requires_paged_attention() still state that support is available on “ARM64 platforms” in general. To avoid misleading users, please update the associated error messages (or add a brief comment here) to reflect the new ARM64+SVE requirement.
inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

src/cpp/src/utils.cpp:139

  • is_paged_attention_available() now returns true on ARM64 only when HAVE_SVE is defined. However, the exception messages in explicitly_requires_paged_attention() still say PagedAttention is available on "x86_64 or ARM64". That becomes misleading on ARM64 builds without SVE.

Suggestion: update the related error messages to reflect the new requirement (e.g., ARM64 with SVE), so users get actionable guidance.

inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif

src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50

  • After moving candidate generation into ContinuousBatchingImpl::step(), PromptLookupImpl::step() no longer accounts for candidate generation time in m_sd_metrics.draft_duration. As a result, SpeculativeDecodingMetrics for prompt-lookup will report draft_duration == 0 and shift all time into main_duration, making the printed/queried metrics incorrect.

Suggestion: plumb candidate-generation duration back (e.g., store it in m_pipeline_metrics or return it from generate_candidates_for_prompt_lookup()), and update m_sd_metrics.draft_duration in PromptLookupImpl::step() similarly to the previous implementation.

    ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
    main_timer.start();
    m_pipeline->step();
    main_timer.end();
    m_sd_metrics.main_duration += main_timer.get_duration();
    m_pipeline_metrics = m_pipeline->get_metrics();

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/cpp/src/utils.cpp:139

  • is_paged_attention_available() now gates ARM64 support on HAVE_SVE, but this macro is not defined anywhere in this repo and is not a standard compiler macro (the common one is __ARM_FEATURE_SVE). If HAVE_SVE isn’t injected by the build, this change will silently disable ContinuousBatching/PromptLookup on all ARM64 builds and start throwing at runtime. Consider using a standard feature macro (or a repo-defined CMake option) and/or add a build-time definition to guarantee consistent behavior.
inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif


Labels

category: continuous batching (Continuous batching), category: GGUF (GGUF file reader), category: prompt lookup (Prompt look-up decoding), category: speculative decoding (Speculative decoding), category: visual language (Visual language pipeline), category: VLM samples (GenAI VLM samples), no-match-files
