Conditional visual token pruning for QWen-VL models. #2714

yangwang201911 · 2025-09-09T05:35:49Z

Implement conditional visual token pruning for QWen-VL models.
-- Paper: CDPruner (arXiv)
-- Code: GitHub Repository
Add configurations to benchmark.py and WWB tools

Tickets: CVS-173220
Related PRs:

…in_percentage" - Updated Python scripts to reflect the corrected parameter name in argument parsing and configuration settings. 2. Added unit tests for the FastGreedyDPP class to ensure proper functionality and selection behavior based on the visual tokens retention percentage.

Copilot

Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.

src/cpp/CMakeLists.txt

src/cpp/src/visual_language/pipeline.cpp

src/cpp/src/visual_language/qwen2vl/classes.cpp

src/cpp/src/visual_language/cdpruner/conditional_kernel.cpp

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

src/cpp/src/visual_language/qwen2vl/classes.cpp

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

src/cpp/src/visual_language/vision_encoder.cpp

Copilot

Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/cdpruner/kernel_builder.cpp:1

Adding numerical_threshold to the numerator in min-max normalization is incorrect. The epsilon should be added to the denominator (range) to prevent division by zero, not to the numerator. This will shift all normalized values incorrectly. Change to: result_data[idx] = (input_data[idx] - min_val) / (range + m_config.numerical_threshold);

// Copyright (C) 2023-2025 Intel Corporation

Copilot · 2025-11-21T02:15:55Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+    auto encoded_vision_tokens = m_tokenizer.encode(m_vlm_config.vision_start_token + m_vlm_config.vision_end_token +
+                                                        m_vlm_config.image_pad_token + m_vlm_config.video_pad_token,


The token encoding is now done with multiple tokens concatenated as a single string, which assumes specific tokenizer behavior. Consider adding a validation check to ensure the tokenizer produces exactly 4 separate token IDs as expected by the subsequent array indexing (lines 989-992), or document this assumption clearly.
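
One way to make that assumption explicit is to check the number of encoded IDs before indexing into them. The following is a minimal sketch, not code from the PR: it uses a plain `std::vector<std::int64_t>` in place of the tokenizer output, and `VisionTokenIds` and `parse_vision_token_ids` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the four special-token IDs.
struct VisionTokenIds {
    std::int64_t vision_start;
    std::int64_t vision_end;
    std::int64_t image_pad;
    std::int64_t video_pad;
};

// Assumption: `ids` is the tokenizer output for the four special tokens
// concatenated into one string, in the order start, end, image pad, video pad.
inline VisionTokenIds parse_vision_token_ids(const std::vector<std::int64_t>& ids) {
    // Fail loudly if the tokenizer split or merged any of the special tokens.
    if (ids.size() != 4) {
        throw std::runtime_error("Expected exactly 4 vision token IDs, got " +
                                 std::to_string(ids.size()));
    }
    return VisionTokenIds{ids[0], ids[1], ids[2], ids[3]};
}
```

In the pipeline itself the same check would more likely be an `OPENVINO_ASSERT` placed just before the indexing.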

Copilot · 2025-11-21T02:15:55Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+        // Acquire request, run inference, then copy the result to safeguard against later reuse
+        CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
+        EmbeddingsRequest& req = embeddings_request_guard.get();
+        ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);
+
+        // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
+        // extract_text_features_for_cdpruner() that acquires a request from the same queue.
+        // Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
+        text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
+        std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
+    } // Request released here


Deep copying text_embeds introduces unnecessary memory allocation and copy overhead. Consider using a different approach such as maintaining separate inference request queues for different purposes, or ensuring the lifetime of tmp_embeds extends beyond the second inference to avoid this defensive copy.
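
The hazard the code comment describes can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyRequest` is a hypothetical stand-in for a request object that reuses one output buffer, not the OpenVINO `InferRequest` API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an inference request that owns a single output
// buffer and overwrites it on every call (not the OpenVINO API).
class ToyRequest {
public:
    const std::vector<float>& infer(float fill, std::size_t size) {
        m_output.assign(size, fill);
        return m_output;  // a view into internal storage, not a copy
    }

private:
    std::vector<float> m_output;
};

// Runs inference and returns an owning copy that later calls cannot overwrite.
inline std::vector<float> infer_and_copy(ToyRequest& request, float fill, std::size_t size) {
    return std::vector<float>(request.infer(fill, size));
}
```

A caller that keeps only the reference returned by `infer` sees the second call's data after the request is reused, while the result of `infer_and_copy` is unaffected. Whether the copy can be avoided depends on whether both inferences really share one request, which this sketch does not settle.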

Copilot · 2025-11-21T02:15:55Z

src/cpp/src/visual_language/inputs_embedder.cpp

+
+    // Step 2: Convert visual features to CDPruner format
+    // May split by frames/chunks for video or multi-image scenarios
+    size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;


Using images.size() for chunk_count when enable_frame_chunking is true may not correctly represent the actual frame count in multi-frame scenarios. The images parameter represents encoded images, not individual frames. Consider verifying this logic aligns with the actual multi-frame processing requirements.

Suggested change

-    size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
+    size_t chunk_count = 1;
+    if (current_pruning_config->enable_frame_chunking) {
+        // Sum up the frame count for each encoded image
+        chunk_count = 0;
+        for (const auto& img : images) {
+            // TODO: Replace with actual frame count extraction for each image
+            // If img has a method get_frame_count(), use it; otherwise, assume 1 for now
+            // chunk_count += img.get_frame_count();
+            chunk_count += 1; // Placeholder: assumes each image is single-frame
+        }
+        // If multi-frame images are possible, implement frame count extraction above
+    }

Copilot · 2025-11-21T02:15:56Z

src/cpp/src/visual_language/cdpruner/cdpruner_config.cpp

+bool Config::operator==(const Config& other) const {
+    return pruning_ratio == other.pruning_ratio && std::abs(relevance_weight - other.relevance_weight) < 1e-6f &&
+           device == other.device && std::abs(numerical_threshold - other.numerical_threshold) < 1e-9f &&
+           use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&


The equality operator is missing a comparison for use_cl_kernel field defined in the Config struct (line 47 in cdpruner_config.hpp). This could lead to two Config objects being considered equal even when they have different OpenCL settings.

Suggested change

-           use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
+           use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
+           use_cl_kernel == other.use_cl_kernel &&
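
As an illustration of the point, here is a minimal sketch of an equality operator that covers every field. `PrunerConfig` is a hypothetical subset of the fields quoted above; the real `Config` struct may have more members and different types.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical subset of the CDPruner Config fields quoted in the hunk above.
struct PrunerConfig {
    std::size_t pruning_ratio = 0;
    float relevance_weight = 0.5f;
    std::string device = "CPU";
    bool use_negative_relevance = false;
    bool use_cl_kernel = false;

    bool operator==(const PrunerConfig& other) const {
        return pruning_ratio == other.pruning_ratio &&
               // Floating-point members are compared with a tolerance.
               std::abs(relevance_weight - other.relevance_weight) < 1e-6f &&
               device == other.device &&
               use_negative_relevance == other.use_negative_relevance &&
               // The field the review reports as missing from the comparison.
               use_cl_kernel == other.use_cl_kernel;
    }
};
```

In C++20 a defaulted `operator==` compares every member automatically, but it cannot apply the tolerances used here for the floating-point fields.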

Copilot · 2025-11-21T02:15:56Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+    size_t raw_tokens_to_keep = static_cast<size_t>(std::round(total_tokens * (1.0 - m_config.pruning_ratio / 100.0)));
+    // Round up to the next even number of tokens to keep
+    // This is required for DPP OpenCL implementation
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;


Forcing an even number of tokens by rounding up could exceed the original total_tokens count when pruning_ratio is very small. Add a validation check: num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens); after line 86.

Suggested change

-    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+    num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens);
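
The arithmetic can be checked in isolation. The sketch below assumes `pruning_ratio` is a percentage in [0, 100], as in the quoted hunk; the function name `tokens_to_keep` and the `require_even` flag are hypothetical and only stand in for the backend-specific constraint.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption: pruning_ratio is a percentage in [0, 100].
inline std::size_t tokens_to_keep(std::size_t total_tokens,
                                  std::size_t pruning_ratio,
                                  bool require_even) {
    const double keep_fraction = 1.0 - static_cast<double>(pruning_ratio) / 100.0;
    std::size_t keep = static_cast<std::size_t>(
        std::round(static_cast<double>(total_tokens) * keep_fraction));
    // Round up to the next even count only when the backend needs it.
    if (require_even && keep % 2 != 0) {
        ++keep;
    }
    // Rounding up must never ask for more tokens than exist.
    return std::min(keep, total_tokens);
}
```

Note that when `total_tokens` is odd and nothing is pruned, the clamp returns an odd count, so a backend that strictly needs an even count would still need separate handling for that case.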

xipingyan

I think we also need a Python test for the pipeline.
For example:
https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/test_vlm_pipeline.py

xipingyan · 2025-11-21T02:26:17Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+    // This is required for DPP OpenCL implementation
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+
+    validate_input_tensors(visual_features, text_features);


Please move validate_input_tensors to the beginning of this function.

xipingyan · 2025-11-21T02:32:29Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+        for (size_t t = 0; t < batch_selected.size(); ++t) {
+            size_t src_token_idx = batch_selected[t];
+
+            for (size_t f = 0; f < feature_dim; ++f) {
+                size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
+                size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
+                output_data[dst_idx] = input_data[src_idx];
+            }
+        }


Suggested change

-        for (size_t t = 0; t < batch_selected.size(); ++t) {
-            size_t src_token_idx = batch_selected[t];
-
-            for (size_t f = 0; f < feature_dim; ++f) {
-                size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
-                size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
-                output_data[dst_idx] = input_data[src_idx];
-            }
-        }
+        for (size_t t = 0; t < batch_selected.size(); ++t) {
+            size_t src_token_idx = batch_selected[t];
+            size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim;
+            size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim;
+            for (size_t f = 0; f < feature_dim; ++f) {
+                output_data[dst_idx+f] = input_data[src_idx+f];
+            }
+        }

Just hoist the index computation out of the inner loop.
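
Once the row offsets are computed per token, the inner loop is a contiguous copy. A minimal sketch for a single batch, assuming a row-major `[total_tokens, feature_dim]` float buffer; `gather_tokens` is a hypothetical helper, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies the selected token rows from a row-major [total_tokens, feature_dim]
// buffer into a dense [selected.size(), feature_dim] buffer.
// Assumption: every index in `selected` is a valid row of `input`.
inline std::vector<float> gather_tokens(const std::vector<float>& input,
                                        std::size_t feature_dim,
                                        const std::vector<std::size_t>& selected) {
    std::vector<float> output(selected.size() * feature_dim);
    for (std::size_t t = 0; t < selected.size(); ++t) {
        // Row offsets are computed once per token, not once per feature.
        const float* src = input.data() + selected[t] * feature_dim;
        float* dst = output.data() + t * feature_dim;
        std::copy_n(src, feature_dim, dst);
    }
    return output;
}
```

A batched version would add the per-batch base offset to `src` and `dst` in the same way.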

xipingyan · 2025-11-21T02:33:09Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+    }
+
+    // Handle single feature case by calling existing method
+    if (visual_features_list.size() == 1) {


Suggested change

-    if (visual_features_list.size() == 1) {
+    if (visual_features_list.size() == 1u) {

xipingyan · 2025-11-21T02:38:14Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+        if (aggregated_selected.empty()) {
+            aggregated_selected.resize(frame_selected.size());
+        }


m_last_selected_tokens is already known before the for loop, so please move this resize to before the loop.

xipingyan · 2025-11-21T02:43:50Z

src/cpp/src/visual_language/cdpruner/cdpruner_config.hpp

+     *   - CDPRUNER_PRUNING_RATIO: Percentage of visual tokens to prune (integer, 0-100).
+     *   - CDPRUNER_DEBUG_MODE: Enable debug output (boolean, "0" or "1").


These are not referenced anywhere else; please remove them.

xipingyan · 2025-11-21T02:56:08Z

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

+        std::string device_name;
+        m_state->device.getInfo(CL_DEVICE_NAME, &device_name);


You get the device name but don't seem to use it. Please remove it.

xipingyan · 2025-11-21T03:07:36Z

src/cpp/src/visual_language/cdpruner/relevance_calculator.cpp

+    const float* input_data = input.data<const float>();
+    float* result_data = result.data<float>();
+
+    if (shape.size() == 2) {


Suggested change

-    if (shape.size() == 2) {
+    if (shape.size() == 2u) {

xipingyan · 2025-11-21T03:10:25Z

src/cpp/src/visual_language/cdpruner/relevance_calculator.cpp

+            // Compute L2 norm for token i
+            float norm = 0.0f;
+            for (size_t j = 0; j < feature_dim; ++j) {
+                size_t idx = i * feature_dim + j;


Maybe you can move the idx calculation outside of the for loop.

xipingyan · 2025-11-21T03:10:42Z

src/cpp/src/visual_language/cdpruner/relevance_calculator.cpp

+    const float* input_data = input.data<const float>();
+    float* result_data = result.data<float>();
+
+    if (shape.size() == 2) {


Suggested change

-    if (shape.size() == 2) {
+    if (shape.size() == 2u) {

xipingyan · 2025-11-21T03:19:22Z

src/cpp/src/visual_language/inputs_embedder.cpp

 bool InputsEmbedder::IInputsEmbedder::has_token_type_ids() const { return false; }

+// Default implementations for CDPruner-related methods
+ov::Tensor InputsEmbedder::IInputsEmbedder::extract_text_features_for_pruning(const ov::Tensor& text_embeds,


If there is no implementation, please print a warning when this branch is entered.

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Copilot · 2025-11-24T02:50:03Z

src/cpp/src/visual_language/cdpruner/relevance_calculator.cpp

+            // Normalize batch b
+            for (size_t i = 0; i < num_tokens; ++i) {
+                size_t idx = b * num_tokens + i;
+                result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;


Adding numerical_threshold to the numerator in min-max normalization is incorrect. The threshold should only be added to the denominator to prevent division by zero. This will shift all normalized values upward, producing incorrect results. Change to: (input_data[idx] - min_val) / range

Suggested change

-                result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
+                result_data[idx] = (input_data[idx] - min_val) / range;
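
For reference, min-max normalization with the guard term in the denominator only can be sketched as follows. This is an illustration, not the PR's implementation: `min_max_normalize` and its `epsilon` parameter are hypothetical, with `epsilon` playing the role of `numerical_threshold`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Maps scores to roughly [0, 1]. epsilon only guards the division, so the
// minimum still maps to exactly 0 and constant input maps to all zeros.
inline std::vector<float> min_max_normalize(const std::vector<float>& scores, float epsilon) {
    std::vector<float> result(scores.size());
    if (scores.empty()) {
        return result;
    }
    const auto min_max = std::minmax_element(scores.begin(), scores.end());
    const float min_val = *min_max.first;
    const float range = *min_max.second - min_val;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        result[i] = (scores[i] - min_val) / (range + epsilon);
    }
    return result;
}
```

With the epsilon in the numerator instead, every output, including the minimum, is shifted upward by `epsilon / range`, which is the effect the comment describes.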

Copilot · 2025-11-24T02:50:04Z

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

+        results.push_back(id);
+    if (batch_size == 1) {
+        std::sort(results.begin(), results.end());
+        results.erase(std::unique(results.begin(), results.end()), results.end());


Calling std::unique without first sorting the vector only removes consecutive duplicates, not all duplicates. The vector should be sorted before calling std::unique, or use a different approach to remove duplicates. Add std::sort(results.begin(), results.end()); before this line.
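
Note that the hunk quoted above already calls `std::sort` before `std::unique`, so the comment may refer to a different revision. For reference, a minimal sketch of the sort-then-unique idiom; `sort_and_dedup` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// std::unique only collapses adjacent equal elements, so the range is sorted
// first to bring all duplicates next to each other.
inline void sort_and_dedup(std::vector<std::size_t>& ids) {
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
}
```

Without the sort, an input such as `{3, 1, 3}` keeps both copies of `3` because they are not adjacent.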

Copilot · 2025-11-24T02:50:04Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+    size_t num_patches = original_shape[0];
+    size_t embedding_dim = original_shape[1];
+    size_t new_patches = num_patches / chunk_count;
+    OPENVINO_ASSERT(original_shape[0] == new_patches * chunk_count, "Inconsistent number of patches per image");


The error message 'Inconsistent number of patches per image' is unclear about what values are inconsistent. Include the actual values in the error message: "Inconsistent patches: expected " + std::to_string(new_patches * chunk_count) + ", got " + std::to_string(original_shape[0])
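
A minimal sketch of such a check, written as a standalone function that throws a standard exception in place of `OPENVINO_ASSERT`; `patches_per_chunk` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Splits num_patches evenly across chunk_count chunks and reports both values
// when the split is not exact.
inline std::size_t patches_per_chunk(std::size_t num_patches, std::size_t chunk_count) {
    if (chunk_count == 0) {
        throw std::invalid_argument("chunk_count must be non-zero");
    }
    const std::size_t per_chunk = num_patches / chunk_count;
    if (per_chunk * chunk_count != num_patches) {
        throw std::invalid_argument("Inconsistent patches: expected " +
                                    std::to_string(per_chunk * chunk_count) + ", got " +
                                    std::to_string(num_patches));
    }
    return per_chunk;
}
```

For example, 10 patches over 3 chunks fails with a message that names both 9 and 10, which makes the mismatch visible in the log.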

Copilot · 2025-11-24T02:50:04Z

tests/cpp/test_cdpruner_dpp.cpp

+        // This should either return empty tensor or handle gracefully
+        ov::Tensor result;
+        EXPECT_NO_THROW(result = cdpruner.apply_pruning(empty_frames, text_features));
+        EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";


The expression !result tests if the tensor is invalid/empty, but the correct method to check if an ov::Tensor is empty is result.get_shape().empty() or checking byte size. The current check may not work as intended. Change to: EXPECT_TRUE(result.get_shape().empty()) << "Empty frames should result in empty output tensor";

Suggested change

-        EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
+        EXPECT_TRUE(result.get_shape().empty()) << "Empty frames should result in empty output tensor";

Copilot · 2025-11-24T02:50:04Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+    m_vision_token_ids["vision_end"] = encoded_vision_tokens.input_ids.data<int64_t>()[1];
+    m_vision_token_ids["image_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[2];
+    m_vision_token_ids["video_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[3];


Accessing array indices 1, 2, and 3 without verifying that encoded_vision_tokens.input_ids has at least 4 elements could cause out-of-bounds access. Add validation: OPENVINO_ASSERT(encoded_vision_tokens.input_ids.get_size() >= 4, "Expected at least 4 vision tokens"); before line 990.

…integration

Copilot

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

tests/cpp/test_cdpruner_dpp.cpp:1

SIMD capability logging appears in test file but the actual SIMD code is in fast_dpp.cpp. This logging should be in the production code (fast_dpp.cpp line 439-445) where it already exists, not duplicated in tests.

// Copyright (C) 2024 Intel Corporation

Copilot · 2025-11-27T03:23:33Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+        ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);
+
+        // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
+        // extract_text_features_for_cdpruner() that acquires a request from the same queue.


Function name in comment should be 'extract_text_features_for_pruning' not 'extract_text_features_for_cdpruner'

Suggested change

-        // extract_text_features_for_cdpruner() that acquires a request from the same queue.
+        // extract_text_features_for_pruning() that acquires a request from the same queue.

Copilot · 2025-11-27T03:23:33Z

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

+        // Sort final result to maintain order
+        std::sort(merged_selection.begin(), merged_selection.end());
+
+        batch_results.push_back(std::move(merged_selection));


Index adjustment logic for split selection should validate that adjusted indices don't exceed original token count to prevent out-of-bounds access in downstream processing.

Suggested change

-        batch_results.push_back(std::move(merged_selection));
+        size_t adjusted_idx = idx + split_point;
+        if (adjusted_idx < tokens_first_half + tokens_second_half) {
+            merged_selection.push_back(adjusted_idx);
+        } else {
+            OV_GENAI_LOG_WARNING("Adjusted index out of bounds in split selection: {} (idx={}, split_point={}, total_tokens={})",
+                                 adjusted_idx, idx, split_point, tokens_first_half + tokens_second_half);
+        }

Copilot · 2025-11-27T03:23:34Z

src/cpp/src/visual_language/inputs_embedder.cpp

+    // CDPruner Overview
+    GENAI_INFO("CDPruner configuration:");
+    GENAI_INFO("\tPruning Ratio: %d%%", current_pruning_config->pruning_ratio);
+    GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);


Label 'Use Relevance Weight' is misleading - should be just 'Relevance Weight' since the value represents the weight itself, not a boolean indicating usage.

Suggested change

-    GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
+    GENAI_INFO("\tRelevance Weight: %.1f", current_pruning_config->relevance_weight);

Copilot

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp:1

Document why AVX512 is explicitly disabled in favor of AVX2. The comment should explain the rationale (e.g., compatibility, performance characteristics, or testing constraints).

// Copyright (C) 2023-2025 Intel Corporation

Copilot · 2025-11-27T09:22:24Z

tests/python_tests/test_vlm_pipeline.py

+
+@parametrize_cdpruner_models
+def test_cdpruner_basic_functionality(
+    ov_pipe_model: VlmModelInfo, 


Remove trailing whitespace from function parameter.

Suggested change

ov_pipe_model: VlmModelInfo,

ov_pipe_model: VlmModelInfo,

Copilot · 2025-11-27T09:22:25Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+            return kept_indices_per_image;
+
+        // Handle single aggregated vector case
+        OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
+                        "Kept token indices layout does not match vision regions");


[nitpick] The normalization logic assumes either per-region indices or a single aggregated vector. Document this assumption in the function header or add a comment explaining when each format is expected.

Copilot · 2025-11-27T09:22:25Z

src/cpp/src/visual_language/inputs_embedder.cpp

+                                                                              int64_t vision_end_token_id) const {
+    // Default implementation: return empty tensor
+    // Models that support CDPruner should override this method
+    GENAI_WARN("extract_text_features_for_pruning not implemented for this model");


[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot · 2025-11-27T09:22:25Z

src/cpp/src/visual_language/inputs_embedder.cpp

+    const ov::Tensor& vision_embeds,
+    size_t chunk_count) const {
+    // Default implementation: return single tensor in vector
+    GENAI_WARN("convert_visual_features_for_pruning not implemented for this model");


[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot · 2025-11-27T09:22:26Z

src/cpp/src/visual_language/inputs_embedder.cpp

+    const std::vector<std::vector<bool>>& keep_flags_per_region,
+    const std::vector<std::array<size_t, 3>>& grid_thw_per_region) const {
+    // Default implementation: return as-is
+    GENAI_WARN("apply_visual_token_pruning not implemented for this model");


[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot · 2025-11-27T09:22:26Z

src/cpp/src/visual_language/inputs_embedder.cpp

+    const std::vector<size_t>& images_sequence,
+    std::vector<std::vector<bool>>& keep_flags_per_region_out) const {
+    // Default implementation: do nothing (position_ids remain unchanged)
+    GENAI_WARN("adjust_position_ids_after_pruning not implemented for this model");


[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot · 2025-11-27T09:22:26Z

src/cpp/src/visual_language/inputs_embedder.cpp

+    int64_t vision_start_token_id,
+    int64_t vision_end_token_id) const {
+    // Default implementation: return as-is
+    GENAI_WARN("generate_pruned_input_ids not implemented for this model");


[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.

Copilot · 2025-11-27T09:22:26Z

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

+    if (total_tokens < 16) {
+        return select_cpu_internal(kernel, num_tokens);
+    }


The threshold value of 16 for GPU vs CPU selection should be extracted to a configurable parameter (e.g., in Config) or at minimum documented as a constant with an explanatory comment about the empirical determination.

Copilot · 2025-11-27T09:22:27Z

src/cpp/src/visual_language/cdpruner/conditional_kernel.cpp

+
+        // Create deep copy to avoid data corruption when InferRequest is reused
+        ov::Tensor output_tensor(output_tensor_ref.get_element_type(), output_tensor_ref.get_shape());
+        std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());


[nitpick] Similar to comment 3, consider if OpenVINO provides an optimized tensor copy method instead of manual std::memcpy.

Suggested change

-        std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
+        output_tensor.copy_from(output_tensor_ref);

Copilot · 2025-11-27T09:22:27Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+    // Round up to the next even number of tokens to keep
+    // This is required for DPP OpenCL implementation
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;


Clarify why the OpenCL DPP implementation requires an even number of tokens. This constraint should be documented in the DPP implementation or Config as well.

…mance optimizations, and default warning messages for unimplemented methods

Copilot

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.

Copilot · 2025-11-28T02:48:59Z

src/cpp/src/visual_language/cdpruner/fast_dpp_cl.hpp

+#    ifdef OV_GPU_USE_OPENCL_HPP
+#        include <CL/opencl.hpp>
+#    else
+#        include <CL/cl2.hpp>
+#    endif


[nitpick] The header selection logic for OpenCL is duplicated in multiple locations. The same pattern appears in the main CMakeLists.txt (lines 52-62). Consider creating a single OpenCL compatibility header that centralizes this logic.

Copilot · 2025-11-28T02:49:00Z

src/cpp/src/visual_language/cdpruner/fast_dpp.cpp

+        }
+
+        // Sort final result to maintain order
+        std::sort(merged_selection.begin(), merged_selection.end());


The merged selection is sorted here, but the selections from both halves are already sorted (lines 217-224). Consider using std::merge instead of std::sort for better performance when merging two already-sorted sequences.
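
A sketch of what the `std::merge` variant could look like, assuming both halves are sorted ascending and the second half holds indices local to that half. `merge_split_selection` and its parameters are hypothetical names, not code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Assumption: both inputs are sorted ascending; second_half holds indices local
// to the second half, so they are shifted by split_point before merging.
inline std::vector<std::size_t> merge_split_selection(const std::vector<std::size_t>& first_half,
                                                      std::vector<std::size_t> second_half,
                                                      std::size_t split_point) {
    for (std::size_t& idx : second_half) {
        idx += split_point;
    }
    std::vector<std::size_t> merged;
    merged.reserve(first_half.size() + second_half.size());
    // Linear-time merge of two sorted ranges; no full sort needed.
    std::merge(first_half.begin(), first_half.end(),
               second_half.begin(), second_half.end(),
               std::back_inserter(merged));
    return merged;
}
```

If every first-half index is below `split_point`, the shifted second half is entirely greater than the first, and a plain append would give the same result.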

Copilot · 2025-11-28T02:49:00Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+    ov::Tensor text_embeds;
+    {
+        // Acquire request, run inference, then copy the result to safeguard against later reuse
+        CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
+        EmbeddingsRequest& req = embeddings_request_guard.get();
+        ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);
+
+        // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
+        // extract_text_features_for_cdpruner() that acquires a request from the same queue.
+        // Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
+        text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
+        std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
+    } // Request released here


The deep copy of text_embeds is performed for every inference call, even when CDPruner is disabled (pruning_ratio=0). Consider deferring this copy until after checking if CDPruner is active (line 1153) to avoid unnecessary memory allocation and copying in the common case where pruning is disabled.

Copilot · 2025-11-28T02:49:01Z

src/cpp/src/visual_language/qwen2vl/classes.cpp

+    // Sort and deduplicate each region's indices
+    for (auto& indices : normalized_indices) {
+        std::sort(indices.begin(), indices.end());
+        indices.erase(std::unique(indices.begin(), indices.end()), indices.end());


[nitpick] The sort and unique operations are performed on each region's indices separately. If the indices are expected to be mostly sorted or unique already (which is likely from DPP selection), consider using a set or checking for duplicates before sorting to optimize for the common case.

Suggested change

-    // Sort and deduplicate each region's indices
-    for (auto& indices : normalized_indices) {
-        std::sort(indices.begin(), indices.end());
-        indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+    // Helper to check if a vector is sorted and unique
+    auto is_sorted_and_unique = [](const std::vector<size_t>& v) -> bool {
+        if (v.empty()) return true;
+        for (size_t i = 1; i < v.size(); ++i) {
+            if (v[i-1] >= v[i]) return false;
+        }
+        return true;
+    };
+    // Sort and deduplicate each region's indices only if needed
+    for (auto& indices : normalized_indices) {
+        if (!is_sorted_and_unique(indices)) {
+            std::sort(indices.begin(), indices.end());
+            indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+        }
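
The "sorted and unique" check is the same as testing that the sequence is strictly increasing, which the standard library can express with `std::adjacent_find`. A minimal sketch with hypothetical helper names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// True when every element is strictly greater than the one before it,
// i.e. the vector is sorted and has no duplicates.
inline bool is_strictly_increasing(const std::vector<std::size_t>& v) {
    // adjacent_find returns the first pair (a, b) with a >= b, or end() if none.
    return std::adjacent_find(v.begin(), v.end(), std::greater_equal<std::size_t>()) == v.end();
}

// Sorts and deduplicates only when the fast check fails.
inline void normalize_indices(std::vector<std::size_t>& indices) {
    if (is_strictly_increasing(indices)) {
        return;
    }
    std::sort(indices.begin(), indices.end());
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
}
```

Whether the extra linear pass pays off depends on how often the selection already arrives sorted, which would need measuring.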

Copilot · 2025-11-28T02:49:01Z

src/cpp/src/visual_language/cdpruner/cdpruner.cpp

+    // Round up to the next even number of tokens to keep
+    // This is required for DPP OpenCL implementation
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;


[nitpick] The requirement for even token counts is specific to OpenCL DPP implementation, but this rounding is applied unconditionally even when OpenCL is not being used. Consider making this adjustment conditional on the actual backend being used (m_config.use_cl_kernel) to maintain more accurate pruning ratios when using CPU-only mode.

Suggested change

-    // Round up to the next even number of tokens to keep
-    // This is required for DPP OpenCL implementation
-    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+    // Round up to the next even number of tokens to keep only if using OpenCL DPP implementation
+    // This is required for DPP OpenCL implementation
+    size_t num_tokens_to_keep = m_config.use_cl_kernel
+        ? ((raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1)
+        : raw_tokens_to_keep;

Copilot · 2025-11-28T02:49:01Z

src/cpp/src/visual_language/cdpruner/conditional_kernel.cpp

+    // CRITICAL: Create a deep copy to avoid data corruption when InferRequest is reused
+    // The InferRequest may reuse or modify its output tensor on the next inference call,
+    // which would invalidate any references to this tensor
+    ov::Tensor conditional_kernel(conditional_kernel_ref.get_element_type(), conditional_kernel_ref.get_shape());
+    std::memcpy(conditional_kernel.data(), conditional_kernel_ref.data(), conditional_kernel_ref.get_byte_size());


[nitpick] Deep copies of InferRequest output tensors are performed in multiple locations (here and line 129). Consider implementing a tensor pool or reusing pre-allocated tensors across calls to reduce memory allocation overhead in performance-critical paths.

yangwang201911 · 2025-11-28T07:19:44Z

Replaced by PR #3084, which has a clean review history.

…genai into ywang2/vlm-cdpruner

…pport in InputsEmbedder

liangali and others added 30 commits August 1, 2025 10:06

[POC] implement cdpruner for qwen2.5-vl

3afb35b

Enhance CDPruner and RelevanceCalculator to support negative relevanc…

38879b5

…e configuration.

Update CDPruner configuration to enable negative relevance for CLIP-b…

5bedef4

…ased models

Add support for subgraph in CDPruner and ConditionalKernelBuilder

c81af98

Update L2 normalization function

4c7e1c0

Skip updating marginal gains for already selected tokens in FastGreed…

5c1b678

…yDPP

Enhance ConditionalKernelBuilder to precompile models and create infe…

4ac2a1c

…r requests for performance optimization

Enhance CDPruner and RelevanceCalculator to support negative relevanc…

1d2ff66

…e configuration. Refactor CDPruner to use visual tokens percentage instead of count for pruning configuration

Add CDPruner configuration parameters to GenerationConfig

95c243f

Implement GPU model compilation in constructor.

221456b

Refactor CDPruner configuration: rename debug_mode to pruning_debug_m…

79d7955

…ode and remove unused visual token pruning methods

Enhance CDPruner configuration: add pruning parameters to command-lin…

79529fa

…e arguments and update GenerationConfig structure

Merge remote-tracking branch 'upstream' into ywang2/enable_cdpruner_c…

99f55b0

…onfig

Refactor CDPruner configuration: remove unused settings and streamlin…

6a4a332

…e vision config handling

update format

5ba4d7d

Merge branch 'master' of https://github.com/openvinotoolkit/openvino.…

b618463

…genai into ywang2/vlm-cdpruner

Merge branch 'ywang2/enable_cdpruner_config' into ywang2/vlm-cdpruner

e63d071

Refactor pruning debug mode checks and enable ops model by default

ebf1a18

Add logging for CDPruner configuration

087d1c8

Add logging for CDPruner configuration settings

81fcf68

Rename visual_tokens_percentage to viusal_tokens_retain_percentage across codebase for consistency in CDPruner configuration

cc89a26

Initialize CDPruner with default configuration in VisionEncoder constructor

05e7e65

Add debug logging for conditional kernel matrix and marginal gains in FastGreedyDPP

c1e1f45

update.

2cb1e8f

Merge branch 'master' of https://github.com/openvinotoolkit/openvino.genai into ywang2/vlm-cdpruner

572b251

[visual_language_chat] Add CDPruner options and update usage instructions

26b29f7

Enhance CDPruner functionality with new ops model option and update related configurations

c0280eb

Refactor CDPruner debug output for consistency and clarity in logging

b2f2601

Optimize orthogonal vector computation: reduce memory access and improve performance

9452f2f

Copilot AI review requested due to automatic review settings November 17, 2025 05:48

Copilot AI reviewed Nov 17, 2025

yangwang201911 requested review from peterchen-intel and xipingyan November 17, 2025 07:17

yangwang201911 added 3 commits November 17, 2025 21:51

[CDPruner] Implement default methods and pipeline for visual token pruning

3a6011a

update.

a8f2403

Merge branch 'master' into ywang2/vlm-cdpruner

cf08b8f

Copilot AI review requested due to automatic review settings November 21, 2025 02:14

Copilot AI reviewed Nov 21, 2025

xipingyan reviewed Nov 21, 2025

yangwang201911 added 2 commits November 21, 2025 16:24

Update CDPruner configuration and logging to improve clarity and consistency

5cfa27d

Merge branch 'master' into ywang2/vlm-cdpruner

8bc3cbf

Copilot AI review requested due to automatic review settings November 24, 2025 02:47

Copilot AI reviewed Nov 24, 2025

github-actions bot added the category: GGUF GGUF file reader label Nov 26, 2025

yangwang201911 added 2 commits November 27, 2025 00:46

Refactor CDPruner and add VLM pipeline test case with CDPruner enabled.

5e57f14

Implement OpenCL-accelerated DPP with new OpenCLDPP class and kernel integration

98446ea

Copilot AI review requested due to automatic review settings November 27, 2025 03:21

Copilot AI reviewed Nov 27, 2025

Copilot AI review requested due to automatic review settings November 27, 2025 09:18

Copilot AI reviewed Nov 27, 2025

yangwang201911 added 2 commits November 27, 2025 22:06

Add timing logs for cdpruner and pruning processes summary in CDPruner

1dbf4f3

Enhance CDPruner and related components with input validation, performance optimizations, and default warning messages for unimplemented methods

933f19a

Copilot AI review requested due to automatic review settings November 28, 2025 02:46

Copilot AI reviewed Nov 28, 2025

yangwang201911 closed this Nov 28, 2025

yangwang201911 mentioned this pull request Nov 28, 2025

Conditional visual token pruning for QWen-VL models. #3084

Open

yangwang201911 added 2 commits November 28, 2025 17:20

Merge branch 'master' of https://github.com/openvinotoolkit/openvino.genai into ywang2/vlm-cdpruner

eea0a0d

Remove default implementations and add error handling for CDPruner support in InputsEmbedder

09ad8c9

		auto encoded_vision_tokens = m_tokenizer.encode(m_vlm_config.vision_start_token + m_vlm_config.vision_end_token +
		m_vlm_config.image_pad_token + m_vlm_config.video_pad_token,

-    size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
+    size_t chunk_count = 1;
+    if (current_pruning_config->enable_frame_chunking) {
+        // Sum up the frame count for each encoded image
+        chunk_count = 0;
+        for (const auto& img : images) {
+            // TODO: Replace with actual frame count extraction for each image
+            // If img has a method get_frame_count(), use it; otherwise, assume 1 for now
+            // chunk_count += img.get_frame_count();
+            chunk_count += 1; // Placeholder: assumes each image is single-frame
+        }
+        // If multi-frame images are possible, implement frame count extraction above
+    }

-    use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
+    use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
+    use_cl_kernel == other.use_cl_kernel &&

-    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+    num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens);
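
A small sketch of how the round-up-then-clamp logic in the suggestion above behaves. The function names are hypothetical and the second variant rests on an assumption: the thread does not say whether an even count is a hard requirement. If it is, clamping to an odd `total_tokens` can undo the rounding, so the second function rounds down again after the clamp.

```cpp
#include <algorithm>
#include <cstddef>

// Mirrors the suggested change: round the requested count up to an even
// number, then clamp it to the number of tokens that actually exist.
std::size_t keep_count_as_suggested(std::size_t raw_tokens_to_keep, std::size_t total_tokens) {
    std::size_t n = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
    return std::min(n, total_tokens);
}

// Assumption, not from the PR: if the result must stay even, drop one token
// whenever the clamp produced an odd count.
std::size_t keep_count_even(std::size_t raw_tokens_to_keep, std::size_t total_tokens) {
    const std::size_t n = keep_count_as_suggested(raw_tokens_to_keep, total_tokens);
    return n - (n % 2);
}
```

For example, with `raw_tokens_to_keep = 7` and `total_tokens = 7`, the first function returns 7 (odd) and the second returns 6.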

-    if (visual_features_list.size() == 1) {
+    if (visual_features_list.size() == 1u) {

		* - CDPRUNER_PRUNING_RATIO: Percentage of visual tokens to prune (integer, 0-100).
		* - CDPRUNER_DEBUG_MODE: Enable debug output (boolean, "0" or "1").

		std::string device_name;
		m_state->device.getInfo(CL_DEVICE_NAME, &device_name);

-    result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
+    result_data[idx] = (input_data[idx] - min_val) / range;
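
Dividing by `range` alone leaves a zero denominator when every input value is equal. An earlier review comment on kernel_builder.cpp proposed adding the epsilon to the denominator instead. Below is a minimal, self-contained sketch of that variant; `min_max_normalize` and its `eps` parameter are illustrative names standing in for the PR's code and `m_config.numerical_threshold`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative min-max normalization with the epsilon in the denominator.
// Values map to [0, 1); a constant input maps to all zeros instead of
// dividing by zero.
std::vector<float> min_max_normalize(const std::vector<float>& input, float eps = 1e-6f) {
    std::vector<float> result(input.size());
    if (input.empty()) {
        return result;
    }
    const auto mm = std::minmax_element(input.begin(), input.end());
    const float min_val = *mm.first;
    const float range = *mm.second - min_val;
    for (std::size_t i = 0; i < input.size(); ++i) {
        result[i] = (input[i] - min_val) / (range + eps);
    }
    return result;
}
```

The trade-off is that the maximum no longer maps to exactly 1, only to `range / (range + eps)`; an explicit `range == 0` check would avoid that at the cost of a branch.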

-    EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
+    EXPECT_TRUE(result.get_shape().empty()) << "Empty frames should result in empty output tensor";

-    // extract_text_features_for_cdpruner() that acquires a request from the same queue.
+    // extract_text_features_for_pruning() that acquires a request from the same queue.

-        batch_results.push_back(std::move(merged_selection));
+            size_t adjusted_idx = idx + split_point;
+            if (adjusted_idx < tokens_first_half + tokens_second_half) {
+                merged_selection.push_back(adjusted_idx);
+            } else {
+                OV_GENAI_LOG_WARNING("Adjusted index out of bounds in split selection: {} (idx={}, split_point={}, total_tokens={})",
+                                     adjusted_idx, idx, split_point, tokens_first_half + tokens_second_half);
+            }

-    GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
+    GENAI_INFO("\tRelevance Weight: %.1f", current_pruning_config->relevance_weight);

-    std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
+    output_tensor.copy_from(output_tensor_ref);

-    // Sort and deduplicate each region's indices
-    for (auto& indices : normalized_indices) {
-        std::sort(indices.begin(), indices.end());
-        indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+    // Helper to check if a vector is sorted and unique
+    auto is_sorted_and_unique = [](const std::vector<size_t>& v) -> bool {
+        if (v.empty()) return true;
+        for (size_t i = 1; i < v.size(); ++i) {
+            if (v[i-1] >= v[i]) return false;
+        }
+        return true;
+    };
+    // Sort and deduplicate each region's indices only if needed
+    for (auto& indices : normalized_indices) {
+        if (!is_sorted_and_unique(indices)) {
+            std::sort(indices.begin(), indices.end());
+            indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+        }
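
The hand-written loop in the suggestion above checks that the vector is strictly increasing. The same check can be expressed with `std::adjacent_find`, as in the sketch below. `sort_and_dedup_if_needed` is a hypothetical wrapper added for illustration and is not code from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// A vector is sorted and duplicate-free exactly when no adjacent pair (a, b)
// satisfies a >= b. Empty and single-element vectors pass trivially.
bool is_sorted_and_unique(const std::vector<std::size_t>& v) {
    return std::adjacent_find(v.begin(), v.end(), std::greater_equal<std::size_t>()) == v.end();
}

// Hypothetical wrapper: sorts and deduplicates only when the O(n) check fails.
void sort_and_dedup_if_needed(std::vector<std::size_t>& indices) {
    if (is_sorted_and_unique(indices)) {
        return;
    }
    std::sort(indices.begin(), indices.end());
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
}
```

The pre-check only helps when inputs are usually already ordered; otherwise it adds one linear pass before the sort.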

yangwang201911 commented Sep 9, 2025 • edited by peterchen-intel
