Conditional visual token pruning for QWen-VL models. #2714
Conversation
…r requests for performance optimization
…e configuration. Refactor CDPruner to use visual tokens percentage instead of count for pruning configuration
…ode and remove unused visual token pruning methods
…e arguments and update GenerationConfig structure
…e vision config handling
…genai into ywang2/vlm-cdpruner
…ross codebase for consistency in CDPruner configuration
…in_percentage" - Updated Python scripts to reflect the corrected parameter name in argument parsing and configuration settings. 2. Added unit tests for the FastGreedyDPP class to ensure proper functionality and selection behavior based on the visual tokens retention percentage.
…genai into ywang2/vlm-cdpruner
…elated configurations
Pull Request Overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.
Pull Request Overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/cdpruner/kernel_builder.cpp:1
- Adding numerical_threshold to the numerator in min-max normalization is incorrect. The epsilon should be added to the denominator (range) to prevent division by zero, not to the numerator. This will shift all normalized values incorrectly. Change to:
result_data[idx] = (input_data[idx] - min_val) / (range + m_config.numerical_threshold);
    auto encoded_vision_tokens = m_tokenizer.encode(m_vlm_config.vision_start_token + m_vlm_config.vision_end_token +
                                                    m_vlm_config.image_pad_token + m_vlm_config.video_pad_token,
Copilot AI · Nov 21, 2025
The token encoding is now done with multiple tokens concatenated as a single string, which assumes specific tokenizer behavior. Consider adding a validation check to ensure the tokenizer produces exactly 4 separate token IDs as expected by the subsequent array indexing (lines 989-992), or document this assumption clearly.
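A minimal sketch of the suggested validation, assuming encoded_vision_tokens.input_ids is the int64 ov::Tensor from the quoted diff (the message text is illustrative):

    // Guard the positional indexing below: the concatenated special tokens must
    // encode to exactly four IDs (vision_start, vision_end, image_pad, video_pad).
    OPENVINO_ASSERT(encoded_vision_tokens.input_ids.get_size() == 4,
                    "Expected 4 vision token IDs, got ", encoded_vision_tokens.input_ids.get_size());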
    // Acquire request, run inference, then copy the result to safeguard against later reuse
    CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
    EmbeddingsRequest& req = embeddings_request_guard.get();
    ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

    // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
    // extract_text_features_for_cdpruner() that acquires a request from the same queue.
    // Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
    text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
    std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
    } // Request released here
Copilot AI · Nov 21, 2025
Deep copying text_embeds introduces unnecessary memory allocation and copy overhead. Consider using a different approach such as maintaining separate inference request queues for different purposes, or ensuring the lifetime of tmp_embeds extends beyond the second inference to avoid this defensive copy.
    // Step 2: Convert visual features to CDPruner format
    // May split by frames/chunks for video or multi-image scenarios
    size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
Copilot AI · Nov 21, 2025
Using images.size() for chunk_count when enable_frame_chunking is true may not correctly represent the actual frame count in multi-frame scenarios. The images parameter represents encoded images, not individual frames. Consider verifying this logic aligns with the actual multi-frame processing requirements.
Suggested change:
-     size_t chunk_count = current_pruning_config->enable_frame_chunking ? images.size() : 1;
+     size_t chunk_count = 1;
+     if (current_pruning_config->enable_frame_chunking) {
+         // Sum up the frame count for each encoded image
+         chunk_count = 0;
+         for (const auto& img : images) {
+             // TODO: Replace with actual frame count extraction for each image
+             // If img has a method get_frame_count(), use it; otherwise, assume 1 for now
+             // chunk_count += img.get_frame_count();
+             chunk_count += 1;  // Placeholder: assumes each image is single-frame
+         }
+         // If multi-frame images are possible, implement frame count extraction above
+     }
    bool Config::operator==(const Config& other) const {
        return pruning_ratio == other.pruning_ratio && std::abs(relevance_weight - other.relevance_weight) < 1e-6f &&
               device == other.device && std::abs(numerical_threshold - other.numerical_threshold) < 1e-9f &&
               use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
Copilot AI · Nov 21, 2025
The equality operator is missing a comparison for use_cl_kernel field defined in the Config struct (line 47 in cdpruner_config.hpp). This could lead to two Config objects being considered equal even when they have different OpenCL settings.
Suggested change:
      use_negative_relevance == other.use_negative_relevance && split_threshold == other.split_threshold &&
+     use_cl_kernel == other.use_cl_kernel &&
    size_t raw_tokens_to_keep = static_cast<size_t>(std::round(total_tokens * (1.0 - m_config.pruning_ratio / 100.0)));
    // Round up to the next even number of tokens to keep
    // This is required for DPP OpenCL implementation
    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copilot AI · Nov 21, 2025
Forcing an even number of tokens by rounding up could exceed the original total_tokens count when pruning_ratio is very small. Add a validation check: num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens); after line 86.
Suggested change:
      size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+     num_tokens_to_keep = std::min(num_tokens_to_keep, total_tokens);
xipingyan left a comment
I think we also need a Python test for the pipeline.
For example:
https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/test_vlm_pipeline.py
    // This is required for DPP OpenCL implementation
    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;

    validate_input_tensors(visual_features, text_features);
Please move validate_input_tensors to the beginning of this function.
    for (size_t t = 0; t < batch_selected.size(); ++t) {
        size_t src_token_idx = batch_selected[t];

        for (size_t f = 0; f < feature_dim; ++f) {
            size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
            size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
            output_data[dst_idx] = input_data[src_idx];
        }
    }
Suggested change:
-     for (size_t t = 0; t < batch_selected.size(); ++t) {
-         size_t src_token_idx = batch_selected[t];
-         for (size_t f = 0; f < feature_dim; ++f) {
-             size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim + f;
-             size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim + f;
-             output_data[dst_idx] = input_data[src_idx];
-         }
-     }
+     for (size_t t = 0; t < batch_selected.size(); ++t) {
+         size_t src_token_idx = batch_selected[t];
+         size_t src_idx = b * total_tokens * feature_dim + src_token_idx * feature_dim;
+         size_t dst_idx = b * actual_selected_tokens * feature_dim + t * feature_dim;
+         for (size_t f = 0; f < feature_dim; ++f) {
+             output_data[dst_idx + f] = input_data[src_idx + f];
+         }
+     }
Just move the complex index computation outside of the inner loop.
    }

    // Handle single feature case by calling existing method
    if (visual_features_list.size() == 1) {
Suggested change:
-     if (visual_features_list.size() == 1) {
+     if (visual_features_list.size() == 1u) {
    if (aggregated_selected.empty()) {
        aggregated_selected.resize(frame_selected.size());
    }
m_last_selected_tokens is a known variable before the for loop; please move this resize before the loop.
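A sketch of the requested restructuring, treating m_last_selected_tokens as the selection size known before the loop (the loop variable frame_selections is illustrative):

    // Size the aggregate once, up front, instead of lazily on the first iteration.
    aggregated_selected.resize(m_last_selected_tokens);
    for (const auto& frame_selected : frame_selections) {
        // ... accumulate frame_selected into aggregated_selected ...
    }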
    * - CDPRUNER_PRUNING_RATIO: Percentage of visual tokens to prune (integer, 0-100).
    * - CDPRUNER_DEBUG_MODE: Enable debug output (boolean, "0" or "1").
These are not referenced anywhere else; please remove them.
    std::string device_name;
    m_state->device.getInfo(CL_DEVICE_NAME, &device_name);
You get the device name, but it does not seem to be used; please remove it.
    const float* input_data = input.data<const float>();
    float* result_data = result.data<float>();

    if (shape.size() == 2) {
Suggested change:
-     if (shape.size() == 2) {
+     if (shape.size() == 2u) {
    // Compute L2 norm for token i
    float norm = 0.0f;
    for (size_t j = 0; j < feature_dim; ++j) {
        size_t idx = i * feature_dim + j;
Maybe you can move the idx calculation outside of the inner for loop.
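A sketch of that hoisting; names follow the quoted diff, and num_tokens is assumed to bound the outer loop:

    for (size_t i = 0; i < num_tokens; ++i) {
        const size_t row = i * feature_dim;  // row base computed once per token
        float norm = 0.0f;
        for (size_t j = 0; j < feature_dim; ++j) {
            norm += input_data[row + j] * input_data[row + j];  // idx = row + j
        }
        // ... std::sqrt(norm) is the L2 norm for token i ...
    }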
    const float* input_data = input.data<const float>();
    float* result_data = result.data<float>();

    if (shape.size() == 2) {
Suggested change:
-     if (shape.size() == 2) {
+     if (shape.size() == 2u) {
    bool InputsEmbedder::IInputsEmbedder::has_token_type_ids() const { return false; }

    // Default implementations for CDPruner-related methods
    ov::Tensor InputsEmbedder::IInputsEmbedder::extract_text_features_for_pruning(const ov::Tensor& text_embeds,
If there is no implementation, please print a warning when entering this branch.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
    // Normalize batch b
    for (size_t i = 0; i < num_tokens; ++i) {
        size_t idx = b * num_tokens + i;
        result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
Copilot AI · Nov 24, 2025
Adding numerical_threshold to the numerator in min-max normalization is incorrect. The threshold should only be added to the denominator to prevent division by zero. This will shift all normalized values upward, producing incorrect results. Change to: (input_data[idx] - min_val) / range
Suggested change:
-     result_data[idx] = (input_data[idx] - min_val + m_config.numerical_threshold) / range;
+     result_data[idx] = (input_data[idx] - min_val) / range;
    results.push_back(id);
    if (batch_size == 1) {
        std::sort(results.begin(), results.end());
        results.erase(std::unique(results.begin(), results.end()), results.end());
Copilot AI · Nov 24, 2025
Calling std::unique without first sorting the vector only removes consecutive duplicates, not all duplicates. The vector should be sorted before calling std::unique, or use a different approach to remove duplicates. Add std::sort(results.begin(), results.end()); before this line.
    size_t num_patches = original_shape[0];
    size_t embedding_dim = original_shape[1];
    size_t new_patches = num_patches / chunk_count;
    OPENVINO_ASSERT(original_shape[0] == new_patches * chunk_count, "Inconsistent number of patches per image");
Copilot AI · Nov 24, 2025
The error message 'Inconsistent number of patches per image' is unclear about what values are inconsistent. Include the actual values in the error message: \"Inconsistent patches: expected \" + std::to_string(new_patches * chunk_count) + \", got \" + std::to_string(original_shape[0])
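Note that OPENVINO_ASSERT accepts additional message arguments and streams them, so the values can be embedded without std::to_string; a sketch:

    OPENVINO_ASSERT(original_shape[0] == new_patches * chunk_count,
                    "Inconsistent patches: expected ", new_patches * chunk_count,
                    ", got ", original_shape[0]);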
    // This should either return empty tensor or handle gracefully
    ov::Tensor result;
    EXPECT_NO_THROW(result = cdpruner.apply_pruning(empty_frames, text_features));
    EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
Copilot AI · Nov 24, 2025
The expression !result tests if the tensor is invalid/empty, but the correct method to check if an ov::Tensor is empty is result.get_shape().empty() or checking byte size. The current check may not work as intended. Change to: EXPECT_TRUE(result.get_shape().empty()) << \"Empty frames should result in empty output tensor\";
Suggested change:
-     EXPECT_TRUE(!result) << "Empty frames should result in empty output tensor";
+     EXPECT_TRUE(result.get_shape().empty()) << "Empty frames should result in empty output tensor";
    m_vision_token_ids["vision_end"] = encoded_vision_tokens.input_ids.data<int64_t>()[1];
    m_vision_token_ids["image_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[2];
    m_vision_token_ids["video_pad"] = encoded_vision_tokens.input_ids.data<int64_t>()[3];
Copilot AI · Nov 24, 2025
Accessing array indices 1, 2, and 3 without verifying that encoded_vision_tokens.input_ids has at least 4 elements could cause out-of-bounds access. Add validation: OPENVINO_ASSERT(encoded_vision_tokens.input_ids.get_size() >= 4, \"Expected at least 4 vision tokens\"); before line 990.
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
tests/cpp/test_cdpruner_dpp.cpp:1
- SIMD capability logging appears in test file but the actual SIMD code is in fast_dpp.cpp. This logging should be in the production code (fast_dpp.cpp line 439-445) where it already exists, not duplicated in tests.
    ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

    // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
    // extract_text_features_for_cdpruner() that acquires a request from the same queue.
Copilot AI · Nov 27, 2025
Function name in the comment should be 'extract_text_features_for_pruning', not 'extract_text_features_for_cdpruner'.
Suggested change:
-     // extract_text_features_for_cdpruner() that acquires a request from the same queue.
+     // extract_text_features_for_pruning() that acquires a request from the same queue.
    // Sort final result to maintain order
    std::sort(merged_selection.begin(), merged_selection.end());

    batch_results.push_back(std::move(merged_selection));
Copilot AI · Nov 27, 2025
Index adjustment logic for split selection should validate that adjusted indices don't exceed original token count to prevent out-of-bounds access in downstream processing.
Suggested change:
-     batch_results.push_back(std::move(merged_selection));
+     size_t adjusted_idx = idx + split_point;
+     if (adjusted_idx < tokens_first_half + tokens_second_half) {
+         merged_selection.push_back(adjusted_idx);
+     } else {
+         OV_GENAI_LOG_WARNING("Adjusted index out of bounds in split selection: {} (idx={}, split_point={}, total_tokens={})",
+                              adjusted_idx, idx, split_point, tokens_first_half + tokens_second_half);
+     }
    // CDPruner Overview
    GENAI_INFO("CDPruner configuration:");
    GENAI_INFO("\tPruning Ratio: %d%%", current_pruning_config->pruning_ratio);
    GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
Copilot AI · Nov 27, 2025
Label 'Use Relevance Weight' is misleading - should be just 'Relevance Weight' since the value represents the weight itself, not a boolean indicating usage.
Suggested change:
-     GENAI_INFO("\tUse Relevance Weight: %.1f", current_pruning_config->relevance_weight);
+     GENAI_INFO("\tRelevance Weight: %.1f", current_pruning_config->relevance_weight);
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 10 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/cdpruner/fast_dpp.cpp:1
- Document why AVX512 is explicitly disabled in favor of AVX2. The comment should explain the rationale (e.g., compatibility, performance characteristics, or testing constraints).
    @parametrize_cdpruner_models
    def test_cdpruner_basic_functionality(
        ov_pipe_model: VlmModelInfo,
Copilot AI · Nov 27, 2025
Remove trailing whitespace from function parameter.
Suggested change:
-     ov_pipe_model: VlmModelInfo,    (with trailing whitespace)
+     ov_pipe_model: VlmModelInfo,
    return kept_indices_per_image;

    // Handle single aggregated vector case
    OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
                    "Kept token indices layout does not match vision regions");
Copilot AI · Nov 27, 2025
[nitpick] The normalization logic assumes either per-region indices or a single aggregated vector. Document this assumption in the function header or add a comment explaining when each format is expected.
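A sketch of the documentation the nitpick asks for, phrased as a function-header comment (wording is illustrative, derived from the quoted assert):

    // Normalizes kept-token indices. Accepts either one index vector per vision
    // region, or a single aggregated vector covering all regions; in the latter
    // case kept_indices_per_image.size() == 1 while region_count > 1, and the
    // aggregated indices are split back out per region.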
    int64_t vision_end_token_id) const {
    // Default implementation: return empty tensor
    // Models that support CDPruner should override this method
    GENAI_WARN("extract_text_features_for_pruning not implemented for this model");
Copilot AI · Nov 27, 2025
[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.
    const ov::Tensor& vision_embeds,
    size_t chunk_count) const {
    // Default implementation: return single tensor in vector
    GENAI_WARN("convert_visual_features_for_pruning not implemented for this model");
Copilot AI · Nov 27, 2025
[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.
    const std::vector<std::vector<bool>>& keep_flags_per_region,
    const std::vector<std::array<size_t, 3>>& grid_thw_per_region) const {
    // Default implementation: return as-is
    GENAI_WARN("apply_visual_token_pruning not implemented for this model");
Copilot AI · Nov 27, 2025
[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.
    const std::vector<size_t>& images_sequence,
    std::vector<std::vector<bool>>& keep_flags_per_region_out) const {
    // Default implementation: do nothing (position_ids remain unchanged)
    GENAI_WARN("adjust_position_ids_after_pruning not implemented for this model");
Copilot AI · Nov 27, 2025
[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.
    int64_t vision_start_token_id,
    int64_t vision_end_token_id) const {
    // Default implementation: return as-is
    GENAI_WARN("generate_pruned_input_ids not implemented for this model");
Copilot AI · Nov 27, 2025
[nitpick] These warning messages for unimplemented CDPruner methods should clarify that CDPruner is not supported for the current model type, rather than implying a missing implementation. Consider: 'CDPruner not supported for this model type'.
    if (total_tokens < 16) {
        return select_cpu_internal(kernel, num_tokens);
    }
Copilot AI · Nov 27, 2025
The threshold value of 16 for GPU vs CPU selection should be extracted to a configurable parameter (e.g., in Config) or at minimum documented as a constant with an explanatory comment about the empirical determination.
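A sketch of the named-constant variant; the value 16 comes from the quoted diff and the constant name is illustrative:

    // Below this token count the CPU path outperforms the GPU kernel
    // (empirically determined, per the review comment).
    static constexpr size_t kMinTokensForGpuDpp = 16;

    if (total_tokens < kMinTokensForGpuDpp) {
        return select_cpu_internal(kernel, num_tokens);
    }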
    // Create deep copy to avoid data corruption when InferRequest is reused
    ov::Tensor output_tensor(output_tensor_ref.get_element_type(), output_tensor_ref.get_shape());
    std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
Copilot AI · Nov 27, 2025
[nitpick] Similar to comment 3, consider if OpenVINO provides an optimized tensor copy method instead of manual std::memcpy.
Suggested change:
-     std::memcpy(output_tensor.data(), output_tensor_ref.data(), output_tensor_ref.get_byte_size());
+     output_tensor.copy_from(output_tensor_ref);
    // Round up to the next even number of tokens to keep
    // This is required for DPP OpenCL implementation
    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copilot AI · Nov 27, 2025
Clarify why the OpenCL DPP implementation requires an even number of tokens. This constraint should be documented in the DPP implementation or Config as well.
…mance optimizations, and default warning messages for unimplemented methods
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
    # ifdef OV_GPU_USE_OPENCL_HPP
    #     include <CL/opencl.hpp>
    # else
    #     include <CL/cl2.hpp>
    # endif
Copilot AI · Nov 28, 2025
[nitpick] The header selection logic for OpenCL is duplicated in multiple locations. The same pattern appears in the main CMakeLists.txt (lines 52-62). Consider creating a single OpenCL compatibility header that centralizes this logic.
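A sketch of that centralization; the header name ocl_compat.hpp is hypothetical, while the #ifdef logic is exactly the duplicated pattern from the quoted diff:

    // ocl_compat.hpp (hypothetical): include this instead of repeating the
    // OpenCL header selection in every translation unit and in CMake.
    #pragma once
    #ifdef OV_GPU_USE_OPENCL_HPP
    #    include <CL/opencl.hpp>
    #else
    #    include <CL/cl2.hpp>
    #endif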
    }

    // Sort final result to maintain order
    std::sort(merged_selection.begin(), merged_selection.end());
Copilot AI · Nov 28, 2025
The merged selection is sorted here, but the selections from both halves are already sorted (lines 217-224). Consider using std::merge instead of std::sort for better performance when merging two already-sorted sequences.
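A self-contained sketch of that approach; the function and variable names are illustrative, not from the PR:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Merge two already-sorted index lists in O(n), avoiding an O(n log n) re-sort.
    std::vector<size_t> merge_sorted_halves(const std::vector<size_t>& first_half,
                                            const std::vector<size_t>& second_half) {
        std::vector<size_t> merged;
        merged.reserve(first_half.size() + second_half.size());
        std::merge(first_half.begin(), first_half.end(),
                   second_half.begin(), second_half.end(),
                   std::back_inserter(merged));
        return merged;
    }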
    ov::Tensor text_embeds;
    {
        // Acquire request, run inference, then copy the result to safeguard against later reuse
        CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(m_embedding->get_request_queue().get());
        EmbeddingsRequest& req = embeddings_request_guard.get();
        ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

        // Deep-copy necessary: Returned InferRequest's internal memory will be reused in
        // extract_text_features_for_cdpruner() that acquires a request from the same queue.
        // Without deep-copy, the second inference would overwrite this data, corrupting text_embeds.
        text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
        std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
    } // Request released here
Copilot AI · Nov 28, 2025
The deep copy of text_embeds is performed for every inference call, even when CDPruner is disabled (pruning_ratio=0). Consider deferring this copy until after checking if CDPruner is active (line 1153) to avoid unnecessary memory allocation and copying in the common case where pruning is disabled.
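A sketch of the deferral, under the assumption that a predicate such as is_cdpruner_active (hypothetical name, e.g. pruning_ratio > 0) is available and that the request guard's scope is widened so the zero-copy alias stays valid in the non-pruning path:

    ov::Tensor text_embeds = tmp_embeds;  // zero-copy alias; valid while the request is held
    if (is_cdpruner_active) {  // hypothetical check; only then does a second inference reuse the queue
        text_embeds = ov::Tensor(tmp_embeds.get_element_type(), tmp_embeds.get_shape());
        std::memcpy(text_embeds.data(), tmp_embeds.data(), tmp_embeds.get_byte_size());
    }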
    // Sort and deduplicate each region's indices
    for (auto& indices : normalized_indices) {
        std::sort(indices.begin(), indices.end());
        indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
Copilot AI · Nov 28, 2025
[nitpick] The sort and unique operations are performed on each region's indices separately. If the indices are expected to be mostly sorted or unique already (which is likely from DPP selection), consider using a set or checking for duplicates before sorting to optimize for the common case.
Suggested change:
-     // Sort and deduplicate each region's indices
-     for (auto& indices : normalized_indices) {
-         std::sort(indices.begin(), indices.end());
-         indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+     // Helper to check if a vector is sorted and unique
+     auto is_sorted_and_unique = [](const std::vector<size_t>& v) -> bool {
+         if (v.empty()) return true;
+         for (size_t i = 1; i < v.size(); ++i) {
+             if (v[i-1] >= v[i]) return false;
+         }
+         return true;
+     };
+     // Sort and deduplicate each region's indices only if needed
+     for (auto& indices : normalized_indices) {
+         if (!is_sorted_and_unique(indices)) {
+             std::sort(indices.begin(), indices.end());
+             indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
+         }
    // Round up to the next even number of tokens to keep
    // This is required for DPP OpenCL implementation
    size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
Copilot AI · Nov 28, 2025
[nitpick] The requirement for even token counts is specific to OpenCL DPP implementation, but this rounding is applied unconditionally even when OpenCL is not being used. Consider making this adjustment conditional on the actual backend being used (m_config.use_cl_kernel) to maintain more accurate pruning ratios when using CPU-only mode.
Suggested change:
-     // Round up to the next even number of tokens to keep
-     // This is required for DPP OpenCL implementation
-     size_t num_tokens_to_keep = (raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1;
+     // Round up to the next even number of tokens to keep only if using OpenCL DPP implementation
+     // This is required for DPP OpenCL implementation
+     size_t num_tokens_to_keep = m_config.use_cl_kernel
+         ? ((raw_tokens_to_keep % 2 == 0) ? raw_tokens_to_keep : raw_tokens_to_keep + 1)
+         : raw_tokens_to_keep;
    // CRITICAL: Create a deep copy to avoid data corruption when InferRequest is reused
    // The InferRequest may reuse or modify its output tensor on the next inference call,
    // which would invalidate any references to this tensor
    ov::Tensor conditional_kernel(conditional_kernel_ref.get_element_type(), conditional_kernel_ref.get_shape());
    std::memcpy(conditional_kernel.data(), conditional_kernel_ref.data(), conditional_kernel_ref.get_byte_size());
Copilot AI · Nov 28, 2025
[nitpick] Deep copies of InferRequest output tensors are performed in multiple locations (here and line 129). Consider implementing a tensor pool or reusing pre-allocated tensors across calls to reduce memory allocation overhead in performance-critical paths.
Replaced by PR#3084 with clean review history. |
…genai into ywang2/vlm-cdpruner
…pport in InputsEmbedder
- Paper: CDPruner (arXiv)
- Code: GitHub Repository
Tickets: CVS-173220
Related PRs: