---
sidebar_position: 5
---

# Visual Token Pruning (CDPruner)

## Overview
Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs). It identifies and removes redundant or less informative visual tokens to improve inference efficiency without significant quality degradation. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.

Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner maximizes the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations. This shrinks the KV cache footprint, reduces Time-To-First-Token (TTFT), and improves throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.

## Conceptual Model
CDPruner operates on the sequence of visual token embeddings produced by the vision encoder before they are passed to the language model. Instead of forwarding all tokens, it selects a subset based on conditional diversity, combining token similarity and instruction relevance.

### Token Partitioning

The visual tokens are conceptually divided into:
* Retained Tokens: A selected subset that provides diverse and instruction-relevant visual information.
* Pruned Tokens: Tokens excluded from further processing because they contribute redundant or low-relevance information.

High-level flow:
1. Encode the image, producing N visual token embeddings.
2. Compute pairwise token similarity and per-token relevance scores with respect to the instruction.
3. Combine relevance and similarity into a conditional kernel. A greedy DPP-based MAP algorithm identifies the least important tokens to discard according to `pruning_ratio`, adjusting scores using `relevance_weight` to control the trade-off between diversity and relevance (a minimal sketch of this selection follows the list).
4. Build the reduced token set; subsequent generation attends only to the retained tokens.
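
The following Python sketch illustrates the kernel construction and greedy DPP MAP selection described in steps 2–3. It is a simplified illustration rather than the OpenVINO GenAI implementation: the exact kernel formula and the way `relevance_weight` enters the scores are assumptions made for clarity.

```python
import numpy as np

def build_conditional_kernel(visual_feats, text_feats, relevance_weight=0.5):
    """Illustrative conditional kernel: pairwise visual-token similarity,
    modulated by each token's relevance to the instruction embeddings."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)  # (N, D)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)      # (T, D)
    similarity = v @ v.T                          # (N, N) cosine similarity between visual tokens
    relevance = (v @ t.T).mean(axis=1)            # (N,) mean similarity to instruction tokens
    # Assumed weighting: blend a uniform prior with instruction relevance.
    r = (1.0 - relevance_weight) + relevance_weight * relevance
    return r[:, None] * similarity * r[None, :]   # conditional DPP kernel (remains PSD)

def dpp_greedy_map(kernel, k):
    """Fast greedy MAP inference for a DPP: pick k indices that maximize
    conditional diversity under the given kernel."""
    n = kernel.shape[0]
    cis = np.zeros((k, n))
    gains = np.diag(kernel).astype(float).copy()  # marginal gain of each candidate token
    selected = []
    for i in range(k):
        j = int(np.argmax(gains))
        if gains[j] <= 1e-12:                     # nothing diverse left to add
            break
        selected.append(j)
        eis = (kernel[j, :] - cis[:i, :].T @ cis[:i, j]) / np.sqrt(gains[j])
        cis[i, :] = eis
        gains -= eis ** 2                         # discount tokens similar to the new pick
        gains[j] = -np.inf                        # never select the same token twice
    return sorted(selected)
```

For example, with `pruning_ratio = 70` and `N` visual tokens, the number of retained tokens would be `N - N * 70 // 100`, and the indices returned by `dpp_greedy_map` would replace the full visual token sequence before generation.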

Improvement beyond the paper's approach:
1. In step 3, when applying the DPP-based token selection algorithm, this implementation provides a splitting strategy option in addition to the original CDPruner approach. While the original approach processes the entire kernel matrix at once, the splitting strategy divides the kernel matrix into two separate blocks for parallel processing when the visual token count exceeds a threshold (default: 1, configurable via the environment variable `CDPRUNER_SPLIT_THRESHOLD`).

2. *Note:* The split variant is not semantically equivalent to running DPP on the full kernel. In the split approach, an equal number of tokens is selected from each half and the results are merged, whereas a single full-kernel DPP call may select all of the top-K tokens from one half if those tokens are the most diverse and relevant. This constraint can change the selected token set and may affect accuracy differently depending on the model and input. In practice, the splitting strategy has shown (a) improved accuracy when evaluated on Qwen2.5-VL models and (b) significantly faster GPU execution with OpenCL kernels due to better parallelization (2-3x speedup with large token counts). The splitting strategy is enabled by default; advanced users can disable it by setting the environment variable `CDPRUNER_SPLIT_THRESHOLD=0` to fall back to the original approach (see the sketch below).
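
A minimal sketch of the split-and-merge variant, reusing the `dpp_greedy_map` helper from the earlier sketch. The splitting and merge logic shown here is illustrative only; it demonstrates the equal-per-half selection constraint described in the note, not the exact production code.

```python
import os

def select_tokens_with_split(kernel, k):
    """Illustrative split variant: run greedy DPP MAP on two kernel blocks
    independently and merge the selections (roughly k/2 tokens from each half)."""
    n = kernel.shape[0]
    threshold = int(os.environ.get("CDPRUNER_SPLIT_THRESHOLD", "1"))
    if threshold == 0 or n <= threshold:
        return dpp_greedy_map(kernel, k)              # original full-kernel approach
    half = n // 2
    left = dpp_greedy_map(kernel[:half, :half], k - k // 2)
    right = dpp_greedy_map(kernel[half:, half:], k // 2)
    return sorted(left + [half + j for j in right])   # re-index the second half and merge
```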

**Effect:** Pruning less important visual tokens reduces memory usage and can speed up generation; extremely high pruning may degrade answer quality for complex visual queries.

## Configuration Interface
Visual Token Pruning is exposed through fields of `ov::genai::GenerationConfig`:

* `pruning_ratio` (integer, 0–99): Portion of visual tokens to prune, specified as an integer percentage. A value of 0 disables pruning. For example, `25` means prune 25% of the visual tokens (keep 75%). Out-of-range values (negative or >=100) are treated as 0 (disabled) to avoid eliminating the entire visual context.
* `relevance_weight` (float): Weighting factor that controls how strongly instruction relevance influences token selection. **Recommended range:** 0.0–1.0. A value of 0 disables relevance weighting (tokens are selected based on diversity alone), while higher values (up to 1.0) emphasize instruction relevance, making pruning more conservative on borderline tokens. Values above 1.0 are accepted but may have diminishing or unpredictable effects, so they are not recommended; negative values are also not recommended. The default in the sample is `0.5f`. A small sketch of the documented `pruning_ratio` semantics follows below.
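
The snippet below only illustrates the validation rule documented above for `pruning_ratio` (out-of-range values disable pruning); the helper name is hypothetical and not part of the API.

```python
def resolve_kept_tokens(num_visual_tokens: int, pruning_ratio: int) -> int:
    """Hypothetical helper illustrating the documented `pruning_ratio` semantics."""
    if pruning_ratio < 0 or pruning_ratio >= 100:
        pruning_ratio = 0                                # out-of-range values disable pruning
    num_pruned = num_visual_tokens * pruning_ratio // 100
    return num_visual_tokens - num_pruned

# Example: with 1024 visual tokens and pruning_ratio=25, 768 tokens are retained.
print(resolve_kept_tokens(1024, 25))   # -> 768
```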

### Sample Usage (Python Benchmark Script)
[samples/python/visual_language_chat/benchmark_vlm.py](https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/visual_language_chat/benchmark_vlm.py) provides a convenient way to measure the performance impact of pruning.

Minimal example (prune 70% of visual tokens on GPU):
```bash
python benchmark_vlm.py \
    -m ./models/vlm \
    -i ./data/example.jpg \
    -p "What is on the image?" \
    -d GPU \
    --pruning_ratio 70 \
    --relevance_weight 0.6
```

Relevant configuration excerpt:
```python
config = ov_genai.GenerationConfig()
config.max_new_tokens = args.max_new_tokens
config.pruning_ratio = args.pruning_ratio        # percentage of visual tokens to prune (0 disables)
config.relevance_weight = args.relevance_weight  # balance between diversity and instruction relevance
```

Pipeline creation and generation:
```python
pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config)
res = pipe.generate(prompt, images=images, generation_config=config)
```

The script prints performance metrics (TTFT, throughput, and per-stage durations). Compare runs with different `--pruning_ratio` values to quantify latency improvements and memory savings.

## Performance & Benefits
* Reduced KV cache memory for visual tokens -> enables larger batch sizes or longer text generation within the same memory budget.
* Fewer per-step attention computations involving image tokens -> improved latency.
* Helpful for edge or GPU memory-constrained deployments (e.g., running a VLM on an integrated GPU with limited VRAM).

## Current Limitations
* Current implementation assumes a standard image encoder output; exotic hierarchical or sparse encoders might require adjusted scoring strategies.
* Pruning is applied only after the initial image encoding; pruned tokens are not dynamically re-introduced later.
* Score computation details are internal; no per-token debug API is exposed yet.
* The current implementation supports Qwen-VL models only; support for other models will be added in a subsequent release.