Conversation
Pull Request Overview
This PR enables VLM (Vision Language Model) lookup functionality by enhancing the continuous batching pipeline to support prompt lookup for embedding-based models. The changes introduce a new interface for providing prompt token IDs when using embedding inputs, enabling better integration between visual language models and prompt lookup decoding.
- Move candidate generation from PromptLookupImpl::step() to ContinuousBatchingImpl::step()
- Add new interface to pass prompt token IDs for embedding input models through get_inputs_embeds_with_token_type_ids method
- Update all VLM model implementations to support the new token_type_ids interface
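A minimal sketch of what the new interface might look like, based only on the summary above. The real signatures use `ov::Tensor` and the repo's own types; everything here other than the method names is a simplified stand-in:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch only: simplified stand-ins for the real ov::Tensor-based signatures.
struct InputsEmbedderSketch {
    // New interface: returns {inputs_embeds, token_type_ids} so callers that
    // need per-token type ids (e.g. prompt lookup with embedding inputs)
    // can get both in one call.
    std::pair<std::vector<float>, std::vector<int64_t>>
    get_inputs_embeds_with_token_type_ids(const std::vector<int64_t>& prompt_ids) {
        std::vector<float> embeds(prompt_ids.size(), 0.0f);        // dummy embeddings
        std::vector<int64_t> token_type_ids(prompt_ids.size(), 0); // 0 = text token
        return {embeds, token_type_ids};
    }

    // Existing entry point, refactored to delegate to the new method and
    // discard the token type ids, as the per-model refactors in this PR do.
    std::vector<float> get_inputs_embeds(const std::vector<int64_t>& prompt_ids) {
        return get_inputs_embeds_with_token_type_ids(prompt_ids).first;
    }
};
```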
Reviewed Changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/qwen2vl/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/qwen2vl/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/phi4mm/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/phi4mm/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/phi3_vision/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/phi3_vision/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/minicpm/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/minicpm/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/llava_next/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/llava_next/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/llava/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/llava/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/internvl_chat/classes.hpp | Add get_inputs_embeds_with_token_type_ids method declaration |
| src/cpp/src/visual_language/internvl_chat/classes.cpp | Implement new method and refactor existing get_inputs_embeds to use it |
| src/cpp/src/visual_language/inputs_embedder.hpp | Update constructors to accept prompt_lookup parameter and add prompt_lookup support |
| src/cpp/src/visual_language/inputs_embedder.cpp | Update has_token_type_ids method and constructors for prompt lookup support |
| src/cpp/src/sequence_group.hpp | Add handling for embeddings in remove_last_tokens method |
| src/cpp/src/prompt_lookup/prompt_lookup_impl.hpp | Add constructor for embedding-based models |
| src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp | Support token_type_ids and remove candidate generation from step method |
| src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.hpp | Add constructor for embedding models and make generate_candidates virtual |
| src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp | Fix loop variable type and add candidate padding logic |
| src/cpp/src/continuous_batching/pipeline_impl.hpp | Add virtual generate_candidates method declaration |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Move candidate generation to main pipeline step and add empty default implementation |
| src/cpp/src/continuous_batching/pipeline.cpp | Pass prompt_lookup flag to InputsEmbedder constructor |
| src/cpp/src/continuous_batching/model_runner.hpp | Add proper token_type_ids tensor existence check |
@yangsu2022, could you please help review first?
1: pass prompts_ids via a global variable. Signed-off-by: xipingya <xiping.yan@intel.com>
No need to add a new interface; just reuse "token_type_ids". Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xipingya <xiping.yan@intel.com>
(force-pushed 4c2fe5b to bbb9de3)
2: fix a match bug; for example:
input_ids = {2, 3, 1, 1, 2, 3, 4, 5, 6, 9, 2, 3, 1, 2, 3}
num_pred_tokens = 3
max_ngram_size = 3
returns candidate: 2, 3, 1
Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xipingya <xiping.yan@intel.com>
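For reference, standard prompt-lookup matching works roughly as below. This is a generic sketch, not the code from this PR, and the PR's exact matching rules (and hence the candidate it returns for the example above) may differ:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic prompt-lookup candidate generation (sketch, not this PR's code):
// take the last `ngram` tokens as the pattern, scan earlier windows for a
// match, and propose the tokens that followed the matched window. Falls
// back to smaller n-grams when no match is found.
std::vector<int64_t> generate_candidates(const std::vector<int64_t>& input_ids,
                                         size_t num_pred_tokens,
                                         size_t max_ngram_size) {
    const size_t n = input_ids.size();
    for (size_t ngram = std::min(max_ngram_size, n); ngram > 0; --ngram) {
        const int64_t* pattern = input_ids.data() + (n - ngram);
        // Only windows that end strictly before the end of the sequence can
        // contribute candidate tokens (this also skips the trivial
        // self-match of the suffix with itself).
        for (size_t start = 0; start + ngram < n; ++start) {
            if (!std::equal(pattern, pattern + ngram, input_ids.data() + start))
                continue;
            const size_t cand_begin = start + ngram;
            const size_t cand_len = std::min(num_pred_tokens, n - cand_begin);
            return std::vector<int64_t>(input_ids.begin() + cand_begin,
                                        input_ids.begin() + cand_begin + cand_len);
        }
    }
    return {};  // no n-gram matched anywhere earlier in the sequence
}
```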
Pull Request Overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp
Signed-off-by: xipingya <xiping.yan@intel.com>
Pull Request Overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp
Signed-off-by: xipingya <xiping.yan@intel.com>
Pull Request Overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp
Signed-off-by: xipingya <xiping.yan@intel.com>
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50
- After moving candidate generation into `ContinuousBatchingImpl::step()`, `PromptLookupImpl::step()` no longer updates `m_sd_metrics.draft_duration`, so the draft vs. main timing breakdown in `SpeculativeDecodingMetrics` becomes misleading (draft stays 0 and candidate-generation time is effectively attributed to main). Consider reintroducing draft timing around `generate_candidates_for_prompt_lookup()` for the prompt-lookup pipeline, or updating the metrics naming/logic accordingly.
```cpp
ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
main_timer.start();
m_pipeline->step();
main_timer.end();
m_sd_metrics.main_duration += main_timer.get_duration();
m_pipeline_metrics = m_pipeline->get_metrics();
```
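One way to address this, sketched with stand-in types: the repo's `ManualTimer` is replaced by a minimal placeholder, and the two callbacks stand in for `generate_candidates_for_prompt_lookup()` and `m_pipeline->step()`. The point is only the structure, timing the draft phase separately again:

```cpp
#include <chrono>
#include <thread>

// Stand-in for the repo's ManualTimer, so the sketch is self-contained.
struct SimpleTimer {
    std::chrono::steady_clock::time_point t0;
    double total_ms = 0.0;
    void start() { t0 = std::chrono::steady_clock::now(); }
    void end() {
        total_ms += std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }
};

// Hypothetical metrics holder mirroring m_sd_metrics.{draft,main}_duration.
struct SdMetrics {
    double draft_duration = 0.0;
    double main_duration = 0.0;
};

// Sketch of the reviewer's suggestion: time candidate generation (the
// "draft" phase) separately from the main pipeline step, as the old
// PromptLookupImpl::step() did before the refactor.
template <typename DraftFn, typename MainFn>
void timed_step(SdMetrics& metrics, DraftFn&& generate_candidates, MainFn&& pipeline_step) {
    SimpleTimer draft_timer;
    draft_timer.start();
    generate_candidates();   // draft phase: produce prompt-lookup candidates
    draft_timer.end();
    metrics.draft_duration += draft_timer.total_ms;

    SimpleTimer main_timer;
    main_timer.start();
    pipeline_step();         // main phase: run the pipeline / validate candidates
    main_timer.end();
    metrics.main_duration += main_timer.total_ms;
}
```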
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.hpp
```cpp
        for (const auto& candidate : candidates) {
            running_sequence->append_token(candidate, 0);
        }
        max_validation_len = std::max(max_validation_len, candidates.size());
    }
    request->set_num_validated_tokens(max_validation_len);
}
```
max_validation_len is computed after padding candidates up to sampling_params.num_assistant_tokens, and then stored via set_num_validated_tokens(). This makes the pipeline try to validate padded tokens as if they were real candidates (and can also exceed min_num_assistant_tokens near max_new_tokens). Validation length should reflect only the number of real candidate tokens generated, not the padded shape length.
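A sketch of the separation the comment asks for, with illustrative types (`pad_token`, the flat sequence vector, and the helper name are not from the PR): count the real candidates before padding, and report only that count to `set_num_validated_tokens()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: append real candidates, pad separately to the batched shape, and
// return the validation length based on real candidates only.
size_t append_candidates(std::vector<int64_t>& sequence,
                         const std::vector<int64_t>& candidates,
                         size_t num_assistant_tokens,
                         int64_t pad_token) {
    const size_t real_len = candidates.size();  // count before padding
    sequence.insert(sequence.end(), candidates.begin(), candidates.end());
    // Pad only to satisfy the tensor shape; padded tokens are never counted
    // toward the validation length.
    for (size_t i = real_len; i < num_assistant_tokens; ++i)
        sequence.push_back(pad_token);
    return real_len;  // use this value for set_num_validated_tokens()
}
```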
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:46
- After moving candidate generation out of `PromptLookupImpl::step()`, the draft/candidate generation duration is no longer accounted for in prompt-lookup-specific metrics (previously accumulated into `m_sd_metrics.draft_duration`). If these metrics are consumed externally, this is a regression. Consider timing `generate_candidates_for_prompt_lookup()` in the prompt-lookup path and adding it back into the same metrics field (e.g., via an override that measures and accumulates duration, or by returning duration from the hook).
```cpp
ManualTimer step_timer("prompt_lookup_decoding: step()");
step_timer.start();
auto generated_len_before = m_pipeline->get_generated_request_len();
ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
main_timer.start();
```
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
src/cpp/src/continuous_batching/pipeline_impl.cpp:529
`ContinuousBatchingImpl::generate(...)` now accepts `prompt_ids`, but the parameter is never used (and is not forwarded into `add_request(...)`). This can cause `-Wunused-parameter` build failures and prevents prompt IDs from reaching `SequenceGroup` when using this implementation. Either propagate `prompt_ids` per request (with the same bounds checks used for `token_type_ids`) or explicitly mark the parameter unused if it is intentionally unsupported here.
```cpp
std::vector<EncodedGenerationResult>
ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<ov::Tensor>& input_ids,
                                                             const std::vector<GenerationConfig>& sampling_params,
                                                             const StreamerVariant& streamer,
                                                             const std::optional<std::vector<ov::Tensor>>& token_type_ids,
                                                             const std::optional<std::vector<std::pair<ov::Tensor, std::optional<int64_t>>>>& position_ids_list,
                                                             const std::optional<std::vector<ov::Tensor>>& prompt_ids) {
    _reset_cache_usage_statistics();
    ManualTimer generate_timer("generate()");
    generate_timer.start();
    OPENVINO_ASSERT(!has_non_finished_requests(), "Generate cannot be called while ContinuousBatchingPipeline is already in running state. Use ContinuousBatchingPipeline::add_request");
    OPENVINO_ASSERT(input_ids.size() == sampling_params.size());
    if (position_ids_list.has_value()) {
        OPENVINO_ASSERT((*position_ids_list).size() == input_ids.size());
    }
    auto start_time = std::chrono::steady_clock::now();
    PerfMetrics perf_metrics;
    auto& raw_perf_counters = perf_metrics.raw_metrics;
    raw_perf_counters.m_inference_durations = {{ MicroSeconds(0.0f) }};
    // checks that all requests have the same LoRA adapters property value
    for (size_t i = 1; i < sampling_params.size(); ++i) {
        OPENVINO_ASSERT(sampling_params[i - 1].adapters == sampling_params[i].adapters,
                        "LoRA adapters value must be the same for all requests");
    }
    set_adapters(sampling_params[0].adapters);
    const auto streamer_ptr = std::make_shared<ThreadedStreamerWrapper>(streamer, m_tokenizer);
    OPENVINO_ASSERT(!streamer_ptr->has_callback() || input_ids.size() == 1 && sampling_params[0].num_return_sequences == 1 &&
                    (sampling_params[0].is_greedy_decoding() || sampling_params[0].is_multinomial()),
                    "Currently streaming is possible only with batch size=1 and only for greedy or multinomial decoding");
    std::vector<GenerationHandle> generations;
    for (size_t request_id = 0; request_id < input_ids.size(); ++request_id) {
        OPENVINO_ASSERT(1 == input_ids[request_id].get_shape().at(0), "Use multiple tensors to pass a batch.");
        if (position_ids_list.has_value()) {
            const auto [position_ids, rope_delta] = (*position_ids_list)[request_id];
            m_inputs_embedder->set_position_ids(position_ids);
            if (rope_delta.has_value()) {
                m_inputs_embedder->set_rope_delta(*rope_delta);
            }
        }
        bool has_valid_token = token_type_ids.has_value() && request_id < token_type_ids->size();
        generations.push_back(
            add_request(request_id, input_ids[request_id], sampling_params[request_id], has_valid_token ? std::make_optional((*token_type_ids)[request_id]) : std::nullopt)
        );
```
src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50
`PromptLookupImpl::step()` no longer measures/updates `m_sd_metrics.draft_duration`, so prompt-lookup candidate generation time is now effectively counted as `main_duration` (or not tracked separately at all). If consumers rely on `SpeculativeDecodingMetrics` fields (e.g., debug printing/percentages), this will misreport performance. Consider timing `generate_candidates_for_prompt_lookup()` separately again (at its new location) and updating `draft_duration` accordingly.
```cpp
void ContinuousBatchingPipeline::PromptLookupImpl::step() {
    auto& raw_perf_counters = m_perf_metrics.raw_metrics;
    ManualTimer step_timer("prompt_lookup_decoding: step()");
    step_timer.start();
    auto generated_len_before = m_pipeline->get_generated_request_len();
    ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
    main_timer.start();
    m_pipeline->step();
    main_timer.end();
    m_sd_metrics.main_duration += main_timer.get_duration();
    m_pipeline_metrics = m_pipeline->get_metrics();
```
src/cpp/src/utils.cpp:139
`is_paged_attention_available()` now returns true on ARM64 only when `HAVE_SVE` is defined, but the exception messages in `explicitly_requires_paged_attention()` still state that support is available on “ARM64 platforms” in general. To avoid misleading users, please update the associated error messages (or add a brief comment here) to reflect the new ARM64+SVE requirement.
```cpp
inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif
```
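The consistency the comment asks for can be sketched with the platform conditions lifted into plain booleans. The function names and the message wording below are illustrative, not the repo's actual `explicitly_requires_paged_attention()`:

```cpp
#include <stdexcept>

// Sketch: keep the availability check and the error text in sync, so the
// message mentions the new ARM64 + SVE requirement.
bool is_paged_attention_available_sketch(bool x86_64, bool arm64, bool have_sve) {
    return x86_64 || (arm64 && have_sve);
}

void require_paged_attention_sketch(bool x86_64, bool arm64, bool have_sve) {
    if (!is_paged_attention_available_sketch(x86_64, arm64, have_sve)) {
        // The message reflects the same condition the check uses.
        throw std::runtime_error(
            "ContinuousBatching requires PagedAttention, which is available "
            "on x86_64, or on ARM64 with SVE only.");
    }
}
```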
src/cpp/src/prompt_lookup/continuous_batching_for_prompt_lookup.cpp
…ForPromptLookupImpl()
…penvino.genai into xp/enable_vlm_lookup
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/utils.cpp:139
`is_paged_attention_available()` now returns true on ARM64 only when `HAVE_SVE` is defined. However, the exception messages in `explicitly_requires_paged_attention()` still say PagedAttention is available on "x86_64 or ARM64". That becomes misleading on ARM64 builds without SVE.
Suggestion: update the related error messages to reflect the new requirement (e.g., ARM64 with SVE), so users get actionable guidance.
```cpp
inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif
```
src/cpp/src/prompt_lookup/prompt_lookup_impl.cpp:50
- After moving candidate generation into `ContinuousBatchingImpl::step()`, `PromptLookupImpl::step()` no longer accounts for candidate generation time in `m_sd_metrics.draft_duration`. As a result, `SpeculativeDecodingMetrics` for prompt-lookup will report `draft_duration == 0` and shift all time into `main_duration`, making the printed/queried metrics incorrect.
Suggestion: plumb candidate-generation duration back (e.g., store it in m_pipeline_metrics or return it from generate_candidates_for_prompt_lookup()), and update m_sd_metrics.draft_duration in PromptLookupImpl::step() similarly to the previous implementation.
```cpp
ManualTimer main_timer("prompt_lookup_decoding: pipeline: step()");
main_timer.start();
m_pipeline->step();
main_timer.end();
m_sd_metrics.main_duration += main_timer.get_duration();
m_pipeline_metrics = m_pipeline->get_metrics();
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/cpp/src/utils.cpp:139
`is_paged_attention_available()` now gates ARM64 support on `HAVE_SVE`, but this macro is not defined anywhere in this repo and is not a standard compiler macro (the common one is `__ARM_FEATURE_SVE`). If `HAVE_SVE` isn't injected by the build, this change will silently disable ContinuousBatching/PromptLookup on all ARM64 builds and start throwing at runtime. Consider using a standard feature macro (or a repo-defined CMake option) and/or add a build-time definition to guarantee consistent behavior.
```cpp
inline bool is_paged_attention_available() {
#if defined(OPENVINO_ARCH_X86_64) || (defined(OPENVINO_ARCH_ARM64) && defined(HAVE_SVE))
    return true;
#else
    return false;
#endif
```
1: Move `m_pipeline->generate_candidates();` from `ContinuousBatchingPipeline::PromptLookupImpl::step()` to `ContinuousBatchingPipeline::ContinuousBatchingImpl::step()`
2: Encode the original prompt to org_prompt_ids as a lookup table.
Tickets: CVS-172889, CVS-176586
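The second change ("encode original prompt to org_prompt_ids as a lookup table") can be sketched as follows, with illustrative names: for embedding-input (VLM) requests the pipeline never sees prompt token ids, so the prompt is encoded once and concatenated with the generated ids to rebuild the sequence that prompt lookup matches against.

```cpp
#include <cstdint>
#include <vector>

// Sketch (names illustrative): with embedding inputs, the sequence holds no
// prompt token ids, so the original prompt is encoded once (org_prompt_ids)
// and prepended to the generated ids to form the lookup sequence.
std::vector<int64_t> lookup_sequence(const std::vector<int64_t>& org_prompt_ids,
                                     const std::vector<int64_t>& generated_ids) {
    std::vector<int64_t> seq(org_prompt_ids);
    seq.insert(seq.end(), generated_ids.begin(), generated_ids.end());
    return seq;
}
```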