[Doc] Visual Token Pruning #2861

peterchen-intel · 2025-10-18T12:16:35Z

Description

Document for Visual Token Pruning
Ticket: CVS-173220, CVS-170139

Implementation is in #2714
Doc build: https://github.com/openvinotoolkit/openvino.genai/actions/runs/18670224384?pr=2861

Signed-off-by: Chen, Peter <[email protected]>

Copilot

Pull Request Overview

Adds a new documentation page describing the Visual Token Pruning (CDPruner) feature for VLMs, including conceptual overview, configuration parameters, and a sample usage snippet.

* Introduces pruning concepts and workflow.
* Documents new GenerationConfig fields (pruning_ratio, relevance_weight) and their effects.
* Provides a benchmark script usage example for measuring performance impact.


site/docs/concepts/optimization-techniques/visual-token-pruning.md

Co-authored-by: Copilot <[email protected]>

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Signed-off-by: Chen, Peter <[email protected]>

Copilot

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


site/docs/concepts/optimization-techniques/visual-token-pruning.md

peterchen-intel · 2025-10-19T01:14:00Z

@liangali

Co-authored-by: Copilot <[email protected]>

Wovchena · 2025-10-20T06:21:03Z

Build your docs at https://github.com/peterchen-intel/openvino.genai/actions/workflows/deploy_gh_pages.yml to confirm that everything renders correctly, then add a link to the resulting docs to the PR description.

rkazants · 2025-10-20T06:32:08Z

Should be merged only after #2714

Signed-off-by: Chen, Peter <[email protected]>

site/docs/concepts/optimization-techniques/visual-token-pruning.md

l-bat · 2025-10-20T11:20:58Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+The visual token sequence extracted from the image encoder can be partitioned into:
+
+* Retained Tokens: Subset judged most relevant by dominance scoring.
+* Pruned Tokens: Dropped from future decoding (no longer participate in cross-attention or self-attention depending on architecture).
+
+Pruning is controlled by a ratio (percentage of tokens to remove) and a relevance weight scaling that influences importance estimation.


CDPruner operates on the sequence of visual token embeddings produced by the vision encoder before they are passed to the language model. Instead of forwarding all tokens, it selects a subset based on conditional diversity, combining token similarity and instruction relevance.

Token Partitioning

The visual tokens are conceptually divided into:

* Retained Tokens: A selected subset that provides diverse and instruction-relevant visual information.
* Pruned Tokens: Tokens excluded from further processing because they contribute redundant or low-relevance information.
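To make "combining token similarity and instruction relevance" concrete, here is a minimal sketch of one common way to build a relevance-conditioned kernel for diversity-based selection: a cosine-similarity matrix scaled on both sides by a per-token relevance score. This is an illustration under stated assumptions, not the code from #2714. The function names, the plain-list representation, and the exact scaling are hypothetical, and how `relevance_weight` enters the real computation is not shown.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity of token embeddings (assumes non-zero vectors)."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj) for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def conditional_kernel(tokens, relevance):
    """Scale similarity by relevance on both sides: K[i][j] = r[i] * S[i][j] * r[j].

    Assumption: `relevance` holds one non-negative score per visual token,
    e.g. its similarity to the text instruction.
    """
    sim = cosine_similarity_matrix(tokens)
    n = len(tokens)
    return [[relevance[i] * sim[i][j] * relevance[j] for j in range(n)] for i in range(n)]


# Tokens 0 and 1 point in the same direction (redundant); token 2 is orthogonal.
kernel = conditional_kernel(
    [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]],
    relevance=[1.0, 0.5, 0.2],
)
```

In this toy kernel, the large off-diagonal entry between tokens 0 and 1 marks them as redundant, while the small diagonal entry for token 2 marks it as low-relevance. A diversity-maximizing selector weighs both signals when deciding which tokens to retain.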

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Co-authored-by: Liubov Talamanova <[email protected]>


Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Copilot · 2025-12-17T02:34:57Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+## Overview
+Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs) that aims to enhance inference efficiency without significant performance degradation by identifying and removing redundant or less informative tokens. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.
+
+Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner focuses on maximizing the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations, shrinking KV cache footprint, reducing TTFT and improving throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.


TTFT is used as an acronym without being defined. Consider defining it on first use as 'Time To First Token (TTFT)' for readers unfamiliar with the term.

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Copilot · 2025-12-17T02:34:58Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+res = pipe.generate(prompt, images=images, generation_config=config)
+```
+
+The script prints performance metrics (time-to-first-token TTFT, throughput, per-stage durations). Compare runs with different `--pruning_ratio` to quantify latency improvements and memory savings.


TTFT appears again here but was not defined earlier in the document. Ensure the acronym is defined on its first use in line 10.

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Copilot · 2025-12-17T02:37:22Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+
+**Implementation Enhancement:** In step 3, when applying the DPP-based token selection algorithm, this implementation provides a splitting strategy option in addition to the original CDPruner approach. While the original approach processes the entire kernel matrix at once, the splitting strategy divides the kernel matrix into two separate blocks for parallel processing when the visual token count exceeds a threshold (default is 1, can be set via environment variable `CDPRUNER_SPLIT_THRESHOLD`). 
+
+**Important Note:** The split variant is not semantically equivalent to running DPP on the full kernel. In the split approach, an equal number of tokens are selected from each half and then merged, whereas a single full-kernel DPP call may select all the top-K tokens from one half if those tokens are most diverse/relevant. This constraint in the split variant can change the token selection set and may affect accuracy differently depending on the model and input. In practice, this splitting strategy has shown: (a) improved accuracy when evaluated on Qwen2.5-VL models, and (b) significantly faster GPU execution with OpenCL kernels due to better parallelization (2-3x speedup with large token counts). By default, the splitting strategy is enabled. Advanced users can disable it by setting the environment variable CDPRUNER_SPLIT_THRESHOLD=0 to use the original approach.


This paragraph is very dense and difficult to parse. Consider breaking it into multiple paragraphs: one explaining the semantic difference, one for the practical observations, and one for the default configuration.
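The semantic difference described in the quoted hunk can be shown with a small sketch. It uses a standard greedy MAP selection for a DPP kernel, followed by a split variant that selects from each half independently. This is not the implementation from the PR: the function names are hypothetical, the halves are assumed to be contiguous, an odd `k` is assumed to give the extra token to the second half, and the threshold handling (0 disables splitting, otherwise split when the token count exceeds the threshold) is only a reading of the text above.

```python
import math


def greedy_dpp(kernel, k):
    """Greedy MAP selection on a PSD kernel using incremental Cholesky updates."""
    n = len(kernel)
    gains = [kernel[i][i] for i in range(n)]  # current marginal gain of each item
    chol = [[] for _ in range(n)]             # partial Cholesky row per item
    selected = []
    while len(selected) < min(k, n):
        j = max((i for i in range(n) if i not in selected), key=lambda i: gains[i])
        if gains[j] <= 1e-12:                 # nothing informative left to add
            break
        selected.append(j)
        dj = math.sqrt(gains[j])
        for i in range(n):
            if i in selected:
                continue
            e = (kernel[j][i] - sum(a * b for a, b in zip(chol[j], chol[i]))) / dj
            chol[i].append(e)
            gains[i] -= e * e
    return sorted(selected)


def split_dpp(kernel, k):
    """Select from each contiguous half independently, then merge (assumed layout)."""
    n = len(kernel)
    mid = n // 2
    halves = [list(range(mid)), list(range(mid, n))]
    quotas = [k // 2, k - k // 2]
    merged = []
    for idx, quota in zip(halves, quotas):
        sub = [[kernel[i][j] for j in idx] for i in idx]
        merged += [idx[p] for p in greedy_dpp(sub, quota)]
    return sorted(merged)


def select_tokens(kernel, k, split_threshold=1):
    """Dispatch between the two variants; threshold semantics are an assumption."""
    if split_threshold > 0 and len(kernel) > split_threshold:
        return split_dpp(kernel, k)
    return greedy_dpp(kernel, k)


# Toy kernel: both strong tokens (0 and 1) sit in the first half.
toy = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
full_choice = select_tokens(toy, 2, split_threshold=0)  # full-kernel selection
split_choice = select_tokens(toy, 2)                    # one token per half
```

On this toy kernel the full-kernel call keeps tokens 0 and 1, while the split call is forced to keep one token from each half (0 and 2). That is the non-equivalence the note warns about.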

Copilot · 2025-12-17T02:37:22Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+## Configuration Interface
+Visual Token Pruning is exposed through fields of `ov::genai::GenerationConfig`:
+
+* `pruning_ratio` (integer, 0–99): Portion of visual tokens to prune, specified as an integer percentage. A value of 0 disables pruning. For example, `25` means prune 25% of the visual tokens (keep 75%). Out-of-range values (negative or >=100) are treated as 0 (disabled) to avoid eliminating the entire visual context.


The behavior of treating out-of-range values (negative or >=100) as 0 is potentially confusing. A value of 100 would reasonably be expected to prune all tokens, not disable pruning. Consider documenting why this design choice was made or if the API should be adjusted to handle these edge cases more intuitively.
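For reference, the documented `pruning_ratio` handling can be written out as a short sketch. The helper names are hypothetical, and rounding the pruned count down is an assumption; the quoted text only specifies the valid range and the fallback to 0.

```python
def effective_pruning_ratio(pruning_ratio: int) -> int:
    """Map out-of-range values (negative or >= 100) to 0, i.e. pruning disabled."""
    if pruning_ratio < 0 or pruning_ratio >= 100:
        return 0
    return pruning_ratio


def retained_token_count(total_tokens: int, pruning_ratio: int) -> int:
    """Number of visual tokens kept for a given integer percentage.

    Assumption: the pruned count is rounded down.
    """
    ratio = effective_pruning_ratio(pruning_ratio)
    pruned = total_tokens * ratio // 100
    return total_tokens - pruned


kept_at_25 = retained_token_count(1000, 25)    # prune 25%, keep 75%
kept_at_100 = retained_token_count(1000, 100)  # out of range, nothing is pruned
```

Written this way, the edge case raised in the comment is easy to see: 99 is the most aggressive accepted setting, and 100 falls back to keeping every token instead of removing all of them.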

Copilot · 2025-12-17T02:37:23Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+Visual Token Pruning is exposed through fields of `ov::genai::GenerationConfig`:
+
+* `pruning_ratio` (integer, 0–99): Portion of visual tokens to prune, specified as an integer percentage. A value of 0 disables pruning. For example, `25` means prune 25% of the visual tokens (keep 75%). Out-of-range values (negative or >=100) are treated as 0 (disabled) to avoid eliminating the entire visual context.
+* `relevance_weight` (float): Weighting factor applied when aggregating or scaling dominance scores. **Recommended range:** 0.0–1.0. A value of 0 disables relevance weighting (pruning is based solely on raw dominance scores), while higher values (up to 1.0) emphasize relevance, making pruning more conservative on borderline tokens. Values above 1.0 are allowed but may have diminishing or unpredictable effects; negative values are not recommended. Default in the sample is `0.5f`.


The description states 'Values above 1.0 are allowed but may have diminishing or unpredictable effects' without explaining what happens. Consider clarifying the expected behavior or providing guidance on when values above 1.0 might be appropriate, or recommend against using them if the effects are truly unpredictable.

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Copilot · 2025-12-17T02:49:53Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+## Overview
+Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs) that aims to enhance inference efficiency without significant performance degradation by identifying and removing redundant or less informative tokens. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.
+
+Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner focuses on maximizing the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations, shrinking KV cache footprint, reducing Time To First Token (TTFT) and improving throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.


Corrected spelling of 'recieve' to 'receive'.

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Copilot · 2025-12-17T02:49:54Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+res = pipe.generate(prompt, images=images, generation_config=config)
+```
+
+The script prints performance metrics (time-to-first-token TTFT, throughput, per-stage durations). Compare runs with different `--pruning_ratio` to quantify latency improvements and memory savings.


The documentation mentions that the script prints 'per-stage durations' but doesn't explain what stages are tracked or how to interpret these metrics. Consider adding a brief explanation of the key performance metrics users should focus on when evaluating pruning effectiveness.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Copilot · 2025-12-17T02:50:18Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+## Overview
+Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs) that aims to enhance inference efficiency without significant performance degradation by identifying and removing redundant or less informative tokens. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.
+
+Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner focuses on maximizing the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations, shrinking KV cache footprint, reducing Time To First Token (TTFT) and improving throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.


This paragraph contains multiple long, complex sentences that could be difficult to parse. Consider breaking it into shorter sentences for improved readability. For example, split the second sentence after 'computations' to separate the performance benefits.

Copilot · 2025-12-17T02:50:19Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+res = pipe.generate(prompt, images=images, generation_config=config)
+```
+
+The script prints performance metrics (time-to-first-token TTFT, throughput, per-stage durations). Compare runs with different `--pruning_ratio` to quantify latency improvements and memory savings.


The acronym 'TTFT' is defined parenthetically as 'time-to-first-token', but it was already introduced and defined on line 10. Remove the redundant definition here to avoid repetition.

site/docs/concepts/optimization-techniques/visual-token-pruning.md

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Copilot · 2025-12-17T02:54:43Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+## Overview
+Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs) that aims to enhance inference efficiency without significant performance degradation by identifying and removing redundant or less informative tokens. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.
+
+Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner focuses on maximizing the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations, shrinking KV cache footprint, reducing Time To First Token (TTFT) and improving throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.


The phrase 'Time To First Token' should use hyphens to match the abbreviation format.

yangwang201911

fix format issue.

yangwang201911 · 2025-12-17T06:08:05Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+4. Build reduced token set; subsequent generation attends only to retained tokens.
+
+Improvement beyond the paper's approach:
+1. In step 3, when applying the DPP-based token selection algorithm, this implementation provides a splitting strategy option in addition to the original CDPruner approach. While the original approach processes the entire kernel matrix at once, the splitting strategy divides the kernel matrix into two separate blocks for parallel processing when the visual token count exceeds a threshold (default : 1, can be set via environment variable `CDPRUNER_SPLIT_THRESHOLD`). 


Suggested change

1. In step 3, when applying the DPP-based token selection algorithm, this implementation provides a splitting strategy option in addition to the original CDPruner approach. While the original approach processes the entire kernel matrix at once, the splitting strategy divides the kernel matrix into two separate blocks for parallel processing when the visual token count exceeds a threshold (default : 1, can be set via environment variable `CDPRUNER_SPLIT_THRESHOLD`).


yangwang201911 · 2025-12-17T06:09:08Z

site/docs/concepts/optimization-techniques/visual-token-pruning.md

+* Current implementation assumes a standard image encoder output; exotic hierarchical or sparse encoders might require adjusted scoring strategies.
+* Pruning is applied only after the initial image encoding; does not dynamically re-introduce pruned tokens later.
+* Score computation details are internal; no per-token debug API is exposed yet.
+* The current implementation supports Qwen-VL models only; support for other models will be added in a subsequent release.


Suggested change

* The current implementation supports Qwen-VL models only; support for other models will be added in a subsequent release.


[Doc] Visual Token Pruning

a3f6d32

Signed-off-by: Chen, Peter <[email protected]>

Copilot AI review requested due to automatic review settings October 18, 2025 12:16

github-actions bot added the "category: GH Pages Docs" label (Github Pages documentation) Oct 18, 2025

peterchen-intel requested review from l-bat and yangwang201911 October 18, 2025 12:17

Copilot AI reviewed Oct 18, 2025


peterchen-intel requested a review from xipingyan October 18, 2025 12:17

peterchen-intel assigned Wovchena Oct 18, 2025

Wording update per copolit suggestion

674f7ed

Co-authored-by: Copilot <[email protected]>

peterchen-intel mentioned this pull request Oct 18, 2025

Conditional visual token pruning for QWen-VL models. #2714

Closed

peterchen-intel added this to the 2025.4 milestone Oct 18, 2025

peterchen-intel commented Oct 18, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

Limit the model supported.

03fe7da

peterchen-intel commented Oct 18, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

Update wording and conf code

ef02325

Signed-off-by: Chen, Peter <[email protected]>

peterchen-intel requested a review from Copilot October 19, 2025 01:13

Copilot AI reviewed Oct 19, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

Update wording

664348e

Co-authored-by: Copilot <[email protected]>

rkazants added the do_not_merge label Oct 20, 2025

peterchen-intel added 2 commits October 20, 2025 14:37

Trigger build for pull_request

655d2a6

Signed-off-by: Chen, Peter <[email protected]>

Add read permission

92c84b8

Signed-off-by: Chen, Peter <[email protected]>

github-actions bot added the "category: GHA" label (CI based on Github actions) Oct 20, 2025

peterchen-intel had a problem deploying to github-pages October 20, 2025 07:30 — with GitHub Actions Failure

peterchen-intel had a problem deploying to github-pages October 20, 2025 07:52 — with GitHub Actions Failure

l-bat reviewed Oct 20, 2025


Update wording per review comments

fd3cf32

Co-authored-by: Liubov Talamanova <[email protected]>

peterchen-intel had a problem deploying to github-pages October 21, 2025 00:29 — with GitHub Actions Failure

Remove the changes to .github/workflows/deploy_gh_pages.yml

03b9aca

github-actions bot removed the "category: GHA" label (CI based on Github actions) Oct 27, 2025

Merge branch 'master' into doc/visual/token/pruning

cde7a6c

yangwang201911 mentioned this pull request Nov 28, 2025

Conditional visual token pruning for QWen-VL models. #3084

Open

yangwang201911 and others added 6 commits December 16, 2025 16:55

Enhance implementation of DPP-based token selection in Visual Token P…

f3e926e

…runing with a new splitting strategy for improved efficiency and accuracy.

update

ae26b63

Update.

124bba3

Update site/docs/concepts/optimization-techniques/visual-token-prunin…

0eb2e12

…g.md Co-authored-by: Chen Peter <[email protected]>

Apply suggestions from code review

b251d44

Merge pull request #1 from yangwang201911/ywang2/add_splitting_strate…

4fd80d1

…gy_description [CDPruner] Add splitting strategy description

Copilot AI review requested due to automatic review settings December 17, 2025 02:34

Copilot AI reviewed Dec 17, 2025


peterchen-intel commented Dec 17, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

TTFT

dfe2a12

Copilot AI review requested due to automatic review settings December 17, 2025 02:37

Copilot AI reviewed Dec 17, 2025


peterchen-intel commented Dec 17, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

peterchen-intel commented Dec 17, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

Update format

03b17e1

Copilot AI review requested due to automatic review settings December 17, 2025 02:49

Copilot AI reviewed Dec 17, 2025


Apply suggestions from code review

da7bc36

Copilot AI review requested due to automatic review settings December 17, 2025 02:49

Copilot AI reviewed Dec 17, 2025


peterchen-intel commented Dec 17, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

peterchen-intel commented Dec 17, 2025

site/docs/concepts/optimization-techniques/visual-token-pruning.md (outdated, resolved)

Formatting

14606bd

Copilot AI review requested due to automatic review settings December 17, 2025 02:54

Copilot AI reviewed Dec 17, 2025


yangwang201911 reviewed Dec 17, 2025



