Skip to content

Conversation

@yangwang201911
Copy link
Contributor

@yangwang201911 yangwang201911 commented Sep 9, 2025

  • Implement conditional visual token pruning for QWen-VL models.
    -- Paper: CDPruner (arXiv)
    -- Code: GitHub Repository
  • Add configurations to benchmark.py and WWB tools

Tickets: CVS-173220
Related PRs:

liangali and others added 30 commits August 1, 2025 10:06
…e configuration.

Refactor CDPruner to use visual tokens percentage instead of count for pruning configuration
…ode and remove unused visual token pruning methods
…e arguments and update GenerationConfig structure
…ross codebase for consistency in CDPruner configuration
…in_percentage"

- Updated Python scripts to reflect the corrected parameter name in argument parsing and configuration settings.
2.  Added unit tests for the FastGreedyDPP class to ensure proper functionality and selection behavior based on the visual tokens retention percentage.
Copilot AI review requested due to automatic review settings November 17, 2025 05:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings November 21, 2025 02:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/cdpruner/kernel_builder.cpp:1

  • Adding numerical_threshold to the numerator in min-max normalization is incorrect. The epsilon should be added to the denominator (range) to prevent division by zero, not to the numerator. This will shift all normalized values incorrectly. Change to: result_data[idx] = (input_data[idx] - min_val) / (range + m_config.numerical_threshold);
// Copyright (C) 2023-2025 Intel Corporation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +986 to +987
auto encoded_vision_tokens = m_tokenizer.encode(m_vlm_config.vision_start_token + m_vlm_config.vision_end_token +
m_vlm_config.image_pad_token + m_vlm_config.video_pad_token,
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The token encoding is now done with multiple tokens concatenated as a single string, which assumes specific tokenizer behavior. Consider adding a validation check to ensure the tokenizer produces exactly 4 separate token IDs as expected by the subsequent array indexing (lines 989-992), or document this assumption clearly.

Copilot uses AI. Check for mistakes.
Comment on lines +1110 to +1120
// Acquire request, run inference, then copy the result to safeguard against later reuse
CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
EmbeddingsRequest& req = embeddings_request_guard.get();
ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
// Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
} // Request released here
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep copying text_embeds introduces unnecessary memory allocation and copy overhead. Consider using a different approach such as maintaining separate inference request queues for different purposes, or ensuring the lifetime of tmp_embeds extends beyond the second inference to avoid this defensive copy.

Copilot uses AI. Check for mistakes.

// Step 2: Convert visual features to CDPruner format
// May split by frames/chunks for video or multi-image scenarios
size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using images.size() for chunk_count when enable_frame_chunking is true may not correctly represent the actual frame count in multi-frame scenarios. The images parameter represents encoded images, not individual frames. Consider verifying this logic aligns with the actual multi-frame processing requirements.

Suggested change
size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
size_t chunk_count = 1;
if (current_pruning_config->enable_frame_chunking) {
// Sum up the frame count for each encoded image
chunk_count = 0;
for (const auto& img : images) {
// TODO: Replace with actual frame count extraction for each image
// If img has a method get_frame_count(), use it; otherwise, assume 1 for now
// chunk_count += img.get_frame_count();
chunk_count += 1; // Placeholder: assumes each image is single-frame
}
// If multi-frame images are possible, implement frame count extraction above
}

Copilot uses AI. Check for mistakes.
bool Config::operator==(const Config& other) const {
return pruning_ratio == other.pruning_ratio && std::abs(relevance_weight - other.relevance_weight) < 1e-6f &&
device == other.device && std::abs(numerical_threshold - other.numerical_threshold) < 1e-9f &&
use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equality operator is missing a comparison for use_cl_kernel field defined in the Config struct (line 47 in cdpruner_config.hpp). This could lead to two Config objects being considered equal even when they have different OpenCL settings.

Suggested change
use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
use_cl_kernel == other.use_cl_kernel &&

Copilot uses AI. Check for mistakes.
size_t raw_tokens_to_keep = static_cast<size_t>(std::round(total_tokens * (1.0 - m_config.pruning_ratio / 100.0)));
// Round up to the next even number of tokens to keep
// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forcing an even number of tokens by rounding up could exceed the original total_tokens count when pruning_ratio is very small. Add a validation check: num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens); after line 86.

Suggested change
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens);

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@xipingyan xipingyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;

validate_input_tensors(visual_features, text_features);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move validate_input_tensors to the beginning of this function.

Comment on lines 201 to 209
for (size_t t = 0; t < batch_selected.size(); ++t) {
size_t src_token_idx = batch_selected[t];

for (size_t f = 0; f < feature_dim; ++f) {
size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
output_data[dst_idx] = input_data[src_idx];
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (size_t t = 0; t < batch_selected.size(); ++t) {
size_t src_token_idx = batch_selected[t];
for (size_t f = 0; f < feature_dim; ++f) {
size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
output_data[dst_idx] = input_data[src_idx];
}
}
for (size_t t = 0; t < batch_selected.size(); ++t) {
size_t src_token_idx = batch_selected[t];
size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim;
size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim;
for (size_t f = 0; f < feature_dim; ++f) {
output_data[dst_idx+f] = input_data[src_idx+f];
}
}

Just move complex computation to outside of {}

}

// Handle single feature case by calling existing method
if (visual_features_list.size() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (visual_features_list.size() == 1) {
if (visual_features_list.size() == 1u) {

Comment on lines +253 to +255
if (aggregated_selected.empty()) {
aggregated_selected.resize(frame_selected.size());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_last_selected_tokens is known variable before the for loop, please move this resize to before the for loop

Comment on lines 24 to 25
* - CDPRUNER_PRUNING_RATIO: Percentage of visual tokens to prune (integer, 0-100).
* - CDPRUNER_DEBUG_MODE: Enable debug output (boolean, "0" or "1").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reference in other place, please remove.

Comment on lines 658 to 659
std::string device_name;
m_state->device.getInfo(CL_DEVICE_NAME, &device_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You get device name, but seems not to use it. please remove.

const float* input_data = input.data<const float>();
float* result_data = result.data<float>();

if (shape.size() == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (shape.size() == 2) {
if (shape.size() == 2u) {

// Compute L2 norm for token i
float norm = 0.0f;
for (size_t j = 0; j < feature_dim; ++j) {
size_t idx = i * feature_dim + j;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can move idx calc to outside of for {}.

const float* input_data = input.data<const float>();
float* result_data = result.data<float>();

if (shape.size() == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (shape.size() == 2) {
if (shape.size() == 2u) {

bool InputsEmbedder::IInputsEmbedder::has_token_type_ids() const { return false; }

// Default implementations for CDPruner-related methods
ov::Tensor InputsEmbedder::IInputsEmbedder::extract_text_features_for_pruning(const ov::Tensor& text_embeds,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is not implementation, please print warning when input this branch.

Copilot AI review requested due to automatic review settings November 24, 2025 02:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Normalize batch b
for (size_t i = 0; i < num_tokens; ++i) {
size_t idx = b * num_tokens + i;
result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding numerical_threshold to the numerator in min-max normalization is incorrect. The threshold should only be added to the denominator to prevent division by zero. This will shift all normalized values upward, producing incorrect results. Change to: (input_data[idx] - min_val) / range

Suggested change
result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
result_data[idx] = (input_data[idx] - min_val) / range;

Copilot uses AI. Check for mistakes.
results.push_back(id);
if (batch_size == 1) {
std::sort(results.begin(), results.end());
results.erase(std::unique(results.begin(), results.end()), results.end());
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling std::unique without first sorting the vector only removes consecutive duplicates, not all duplicates. The vector should be sorted before calling std::unique, or use a different approach to remove duplicates. Add std::sort(results.begin(), results.end()); before this line.

Copilot uses AI. Check for mistakes.
size_t num_patches = original_shape[0];
size_t embedding_dim = original_shape[1];
size_t new_patches = num_patches / chunk_count;
OPENVINO_ASSERT(original_shape[0] == new_patches * chunk_count, "Inconsistent number of patches per image");
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message 'Inconsistent number of patches per image' is unclear about what values are inconsistent. Include the actual values in the error message: \"Inconsistent patches: expected \" + std::to_string(new_patches * chunk_count) + \", got \" + std::to_string(original_shape[0])

Copilot uses AI. Check for mistakes.
// This should either return empty tensor or handle gracefully
ov::Tensor result;
EXPECT_NO_THROW(result = cdpruner.apply_pruning(empty_frames, text_features));
EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression !result tests if the tensor is invalid/empty, but the correct method to check if an ov::Tensor is empty is result.get_shape().empty() or checking byte size. The current check may not work as intended. Change to: EXPECT_TRUE(result.get_shape().empty()) << \"Empty frames should result in empty output tensor\";

Suggested change
EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
EXPECT_TRUE(result.get_shape().empty()) << "Empty frames should result in empty output tensor";

Copilot uses AI. Check for mistakes.
Comment on lines +990 to +992
m_vision_token_ids["vision_end"] = encoded_vision_tokens.input_ids.data<int64_t>()[1];
m_vision_token_ids["image_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[2];
m_vision_token_ids["video_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[3];
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing array indices 1, 2, and 3 without verifying that encoded_vision_tokens.input_ids has at least 4 elements could cause out-of-bounds access. Add validation: OPENVINO_ASSERT(encoded_vision_tokens.input_ids.get_size() >= 4, \"Expected at least 4 vision tokens\"); before line 990.

Copilot uses AI. Check for mistakes.
@github-actions github-actions bot added the category: GGUF GGUF file reader label Nov 26, 2025
Copilot AI review requested due to automatic review settings November 27, 2025 03:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

tests/cpp/test_cdpruner_dpp.cpp:1

  • SIMD capability logging appears in test file but the actual SIMD code is in fast_dpp.cpp. This logging should be in the production code (fast_dpp.cpp line 439-445) where it already exists, not duplicated in tests.
// Copyright (C) 2024 Intel Corporation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name in comment should be 'extract_text_features_for_pruning' not 'extract_text_features_for_cdpruner'

Suggested change
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
// extract_text_features_for_pruning() that acquires a request from the same queue.

Copilot uses AI. Check for mistakes.
// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());

batch_results.push_back(std::move(merged_selection));
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index adjustment logic for split selection should validate that adjusted indices don't exceed original token count to prevent out-of-bounds access in downstream processing.

Suggested change
batch_results.push_back(std::move(merged_selection));
size_t adjusted_idx = idx + split_point;
if (adjusted_idx < tokens_first_half + tokens_second_half) {
merged_selection.push_back(adjusted_idx);
} else {
OV_GENAI_LOG_WARNING("Adjusted index out of bounds in split selection: {} (idx={}, split_point={}, total_tokens={})",
adjusted_idx, idx, split_point, tokens_first_half + tokens_second_half);
}

Copilot uses AI. Check for mistakes.
// CDPruner Overview
GENAI_INFO("CDPruner configuration:");
GENAI_INFO("\tPruning Ratio: %d%%", current_pruning_config->pruning_ratio);
GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Label 'Use Relevance Weight' is misleading - should be just 'Relevance Weight' since the value represents the weight itself, not a boolean indicating usage.

Suggested change
GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
GENAI_INFO("\tRelevance Weight: %.1f", current_pruning_config->relevance_weight);

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings November 27, 2025 09:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp:1

  • Document why AVX512 is explicitly disabled in favor of AVX2. The comment should explain the rationale (e.g., compatibility, performance characteristics, or testing constraints).
// Copyright (C) 2023-2025 Intel Corporation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


@parametrize_cdpruner_models
def test_cdpruner_basic_functionality(
ov_pipe_model: VlmModelInfo,
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing whitespace from function parameter.

Suggested change
ov_pipe_model: VlmModelInfo,
ov_pipe_model: VlmModelInfo,

Copilot uses AI. Check for mistakes.
Comment on lines +1699 to +1703
return kept_indices_per_image;

// Handle single aggregated vector case
OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
"Kept token indices layout does not match vision regions");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The normalization logic assumes either per-region indices or a single aggregated vector. Document this assumption in the function header or add a comment explaining when each format is expected.

Copilot uses AI. Check for mistakes.
int64_t vision_end_token_id) const {
// Default implementation: return empty tensor
// Models that support CDPruner should override this method
GENAI_WARN("extract_text_features_for_pruning not implemented for this model");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot uses AI. Check for mistakes.
const ov::Tensor& vision_embeds,
size_t chunk_count) const {
// Default implementation: return single tensor in vector
GENAI_WARN("convert_visual_features_for_pruning not implemented for this model");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot uses AI. Check for mistakes.
const std::vector<std::vector<bool>>& keep_flags_per_region,
const std::vector<std::array<size_t, 3>>& grid_thw_per_region) const {
// Default implementation: return as-is
GENAI_WARN("apply_visual_token_pruning not implemented for this model");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot uses AI. Check for mistakes.
const std::vector<size_t>& images_sequence,
std::vector<std::vector<bool>>& keep_flags_per_region_out) const {
// Default implementation: do nothing (position_ids remain unchanged)
GENAI_WARN("adjust_position_ids_after_pruning not implemented for this model");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot uses AI. Check for mistakes.
int64_t vision_start_token_id,
int64_t vision_end_token_id) const {
// Default implementation: return as-is
GENAI_WARN("generate_pruned_input_ids not implemented for this model");
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +133
if (total_tokens < 16) {
return select_cpu_internal(kernel, num_tokens);
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold value of 16 for GPU vs CPU selection should be extracted to a configurable parameter (e.g., in Config) or at minimum documented as a constant with an explanatory comment about the empirical determination.

Copilot uses AI. Check for mistakes.

// Create deep copy to avoid data corruption when InferRequest is reused
ov::Tensor output_tensor(output_tensor_ref.get_element_type(), output_tensor_ref.get_shape());
std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Similar to comment 3, consider if OpenVINO provides an optimized tensor copy method instead of manual std::memcpy.

Suggested change
std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
output_tensor.copy_from(output_tensor_ref);

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +90
// Round up to the next even number of tokens to keep
// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify why the OpenCL DPP implementation requires an even number of tokens. This constraint should be documented in the DPP implementation or Config as well.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings November 28, 2025 02:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +10 to +14
# ifdef OV_GPU_USE_OPENCL_HPP
# include <CL/opencl.hpp>
# else
# include <CL/cl2.hpp>
# endif
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The header selection logic for OpenCL is duplicated in multiple locations. The same pattern appears in the main CMakeLists.txt (lines 52-62). Consider creating a single OpenCL compatibility header that centralizes this logic.

Copilot uses AI. Check for mistakes.
}

// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merged selection is sorted here, but the selections from both halves are already sorted (lines 217-224). Consider using std::merge instead of std::sort for better performance when merging two already-sorted sequences.

Copilot uses AI. Check for mistakes.
Comment on lines +1115 to +1127
ov::Tensor text_embeds;
{
// Acquire request, run inference, then copy the result to safeguard against later reuse
CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
EmbeddingsRequest& req = embeddings_request_guard.get();
ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
// Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
} // Request released here
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deep copy of text_embeds is performed for every inference call, even when CDPruner is disabled (pruning_ratio=0). Consider deferring this copy until after checking if CDPruner is active (line 1153) to avoid unnecessary memory allocation and copying in the common case where pruning is disabled.

Copilot uses AI. Check for mistakes.
Comment on lines +1728 to +1731
// Sort and deduplicate each region's indices
for (auto& indices : normalized_indices) {
std::sort(indices.begin(), indices.end());
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The sort and unique operations are performed on each region's indices separately. If the indices are expected to be mostly sorted or unique already (which is likely from DPP selection), consider using a set or checking for duplicates before sorting to optimize for the common case.

Suggested change
// Sort and deduplicate each region's indices
for (auto& indices : normalized_indices) {
std::sort(indices.begin(), indices.end());
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
// Helper to check if a vector is sorted and unique
auto is_sorted_and_unique = [](const std::vector<size_t>& v) -> bool {
if (v.empty()) return true;
for (size_t i = 1; i < v.size(); ++i) {
if (v[i-1] >= v[i]) return false;
}
return true;
};
// Sort and deduplicate each region's indices only if needed
for (auto& indices : normalized_indices) {
if (!is_sorted_and_unique(indices)) {
std::sort(indices.begin(), indices.end());
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
}

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +90
// Round up to the next even number of tokens to keep
// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The requirement for even token counts is specific to OpenCL DPP implementation, but this rounding is applied unconditionally even when OpenCL is not being used. Consider making this adjustment conditional on the actual backend being used (m_config.use_cl_kernel) to maintain more accurate pruning ratios when using CPU-only mode.

Suggested change
// Round up to the next even number of tokens to keep
// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
// Round up to the next even number of tokens to keep only if using OpenCL DPP implementation
// This is required for DPP OpenCL implementation
size_t num_tokens_to_keep = m_config.use_cl_kernel
? ((raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1)
: raw_tokens_to_keep;

Copilot uses AI. Check for mistakes.
Comment on lines +346 to +350
// CRITICAL: Create a deep copy to avoid data corruption when InferRequest is reused
// The InferRequest may reuse or modify its output tensor on the next inference call,
// which would invalidate any references to this tensor
ov::Tensor conditional_kernel(conditional_kernel_ref.get_element_type(), conditional_kernel_ref.get_shape());
std::memcpy(conditional_kernel.data(), conditional_kernel_ref.data(), conditional_kernel_ref.get_byte_size());
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Deep copies of InferRequest output tensors are performed in multiple locations (here and line 129). Consider implementing a tensor pool or reusing pre-allocated tensors across calls to reduce memory allocation overhead in performance-critical paths.

Copilot uses AI. Check for mistakes.
@yangwang201911
Copy link
Contributor Author

Replaced by PR#3084 with clean review history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

category: cmake / build Cmake scripts category: continuous batching Continuous batching category: CPP API Changes in GenAI C++ public headers category: GGUF GGUF file reader category: Python API Python API for GenAI category: sampling Sampling / Decoding algorithms category: visual language Visual language pipeline no-match-files under_perf_acc_test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants