`modules/genai_optimizations/README.md`

This module provides experimental optimizations for GenAI models in PyTorch.

## Supported Generative AI Scenarios

- Text generation using LLMs
- Visual-language text generation using VLMs

## Supported Generative AI Optimization Methods

- [**Visual Token Pruning**](./visual_token_pruning.py):
Designed to accelerate inference in VLMs, where the number of input visual tokens is often significantly larger than that of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy. In this repository, we implement a visual token pruning method called [CDPruner](https://arxiv.org/pdf/2506.10967), which maximizes the conditional diversity of retained tokens. It can reduce FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy.
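
  The selection idea can be sketched as follows. This is a minimal illustration of greedy conditional-diversity selection, not the module's actual API: `cdpruner_select` and its signature are hypothetical, and the naive greedy log-determinant search stands in for the paper's faster inference procedure.

  ```python
  import torch

  def cdpruner_select(visual_feats, text_feats, num_keep):
      """Greedy conditional-diversity token selection (CDPruner-style sketch).

      visual_feats: (N, d) visual token features
      text_feats:   (T, d) text token features used to condition relevance
      Returns sorted indices of the num_keep retained visual tokens.
      """
      v = torch.nn.functional.normalize(visual_feats, dim=-1)
      t = torch.nn.functional.normalize(text_feats, dim=-1)

      # Relevance of each visual token to the prompt (mean cosine similarity),
      # rescaled to [0, 1] so it can modulate the diversity kernel.
      relevance = (v @ t.T).mean(dim=-1)
      relevance = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-6)

      # Conditional kernel: pairwise similarity weighted by relevance on both sides.
      sim = v @ v.T
      kernel = relevance[:, None] * sim * relevance[None, :]

      # Naive greedy MAP: repeatedly add the token with the largest
      # log-det gain, balancing relevance against redundancy.
      selected, remaining = [], list(range(v.shape[0]))
      for _ in range(num_keep):
          best, best_gain = None, -float("inf")
          for i in remaining:
              idx = selected + [i]
              sub = kernel[idx][:, idx]
              gain = torch.logdet(sub + 1e-6 * torch.eye(len(idx)))
              if gain > best_gain:
                  best, best_gain = i, gain
          selected.append(best)
          remaining.remove(best)
      return torch.tensor(sorted(selected))
  ```

  Keeping the retained indices sorted preserves the original spatial order of the visual tokens, which matters when they are fed back into the language model.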

- [**Sparse Attention**](./sparse_attention.py):
Designed to accelerate the prefill stage in LLMs and MLLMs with long prompts, high-resolution images, or videos by attending only to the most relevant query-key blocks. This block-wise attention mechanism reduces memory usage and FLOPs while preserving model accuracy. Supported modes:
- **Tri-Shape Mode** – A static block-sparse attention pattern that preserves the initial tokens, local windows, and the final segment of the query, forming a triangular structure to capture critical tokens while maintaining instruction-following performance in both turn-0 and multi-request scenarios. Paper: https://arxiv.org/pdf/2412.10319
- **XAttention Mode** – A dynamic block-sparse attention mechanism that accelerates inference by focusing computation on the most important regions of the attention matrix using antidiagonal block scoring, reducing FLOPs and memory usage without significant loss of accuracy. Paper: https://arxiv.org/pdf/2503.16428
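
The Tri-Shape pattern above can be sketched as a block-level boolean mask. This is an illustrative sketch, not the module's implementation; block granularity and the sink/local/last-segment widths are hypothetical parameters.

```python
import torch

def tri_shape_mask(num_blocks, sink_blocks=1, local_blocks=2, last_q_blocks=1):
    """Static Tri-Shape block-sparse attention mask (sketch, block granularity).

    A query block attends to the initial "sink" blocks, a local causal
    window, and, for the final query blocks, all previous key blocks,
    forming the triangular structure described in the paper.
    """
    q = torch.arange(num_blocks)[:, None]
    k = torch.arange(num_blocks)[None, :]
    causal = k <= q                          # never attend to future blocks
    sink = k < sink_blocks                   # vertical stripe of initial tokens
    local = (q - k) < local_blocks           # diagonal band of recent tokens
    last = q >= num_blocks - last_q_blocks   # dense rows for the query tail
    return causal & (sink | local | last)
```

Only the `True` blocks are computed during prefill, so the cost scales with the number of kept blocks rather than the full quadratic attention matrix.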

## Supported and Tested Models

Large Language Models:

- [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

Multimodal Large Language Models:

- [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
source env/bin/activate  # On Windows: env\Scripts\activate.bat

### 2. Installation

You can install the package directly from the repository. To avoid running out of memory during the build, you can limit the number of parallel build jobs with `MAX_JOBS=4`:

```bash
pip install "git+https://github.com/openvinotoolkit/openvino_contrib.git#egg=genai_opt&subdirectory=modules/genai_optimizations"
`modules/genai_optimizations/benchmarks/README.md`
This folder provides examples for evaluating and optimizing Generative AI models across different scenarios.


<details>
<summary><b>Large Language Models Optimization Example: LongBench</b></summary>

This [example](./longbench.py) demonstrates how to evaluate and optimize LLMs using [LongBench](https://arxiv.org/pdf/2308.14508), a bilingual, multi-task benchmark designed to assess long-context understanding. LongBench includes 21 datasets across six task categories (single-document QA, multi-document QA, summarization, few-shot learning, synthetic reasoning, and code completion) in both English and Chinese.

Sparse attention speeds up the prefill stage in LLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts.

### Run Example

```bash
python longbench.py \
--subset samsum \
--model meta-llama/Llama-3.2-1B-Instruct \
--use_custom_attention \
--prefill_impl tri-shape
```
This will automatically:

- Download the selected model and dataset
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>

<details>
<summary><b>Multimodal Large Language Models Optimization Example: MME Benchmark</b></summary>

This [example](./mmebench.py) demonstrates how to evaluate and optimize MLLMs using the [MME benchmark](https://arxiv.org/pdf/2306.13394), which measures both perception and cognition abilities across 14 subtasks. Its concise instruction design enables fair comparison of MLLMs without the need for extensive prompt engineering.

Visual token pruning enables significant acceleration of inference in VLMs, where the number of input visual tokens is often much larger than the number of textual tokens. By pruning these tokens, we reduce first-token latency and overall FLOPs while preserving accuracy.

Sparse attention speeds up the prefill stage in LLMs and MLLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts, high-resolution images, and multi-frame videos.
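
XAttention's antidiagonal block scoring can be sketched as follows. This is a simplified illustration that scores blocks of an already-materialized attention matrix; the actual method estimates block importance cheaply before computing full attention, and `antidiagonal_block_scores` is a hypothetical name.

```python
import torch

def antidiagonal_block_scores(attn_weights, block):
    """Score attention blocks by summing their antidiagonals (XAttention-style sketch).

    attn_weights: (L, L) dense attention weights, for illustration only.
    Returns an (L//block, L//block) score matrix; high-scoring blocks are
    kept for computation and the rest are skipped.
    """
    L = attn_weights.shape[0]
    nb = L // block
    blocks = attn_weights[: nb * block, : nb * block]
    # Reshape into a (nb, nb, block, block) grid of blocks.
    blocks = blocks.reshape(nb, block, nb, block).permute(0, 2, 1, 3)
    # Flip the last dim so the antidiagonal (i + j == block - 1)
    # becomes the main diagonal, then sum it per block.
    antidiag = torch.flip(blocks, dims=[-1]).diagonal(dim1=-2, dim2=-1)
    return antidiag.sum(dim=-1)
```

The antidiagonal crosses every row and column of a block exactly once, which is why it serves as a cheap proxy for the block's overall attention mass.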

### Run Example

```bash
python mmebench.py \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--enable_visual_pruning \
--num_keep_tokens 128 \
--theta 0.5 \
--use_custom_attention \
--prefill_impl x-attention
```
This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>
<details>
<summary><b>Multimodal Large Language Models Optimization Example: MileBench</b></summary>

```bash
python milebench.py \
--model Qwen/Qwen2-VL-2B-Instruct \
--enable_visual_pruning \
--num_keep_tokens 64 \
--theta 0.5 \
--use_custom_attention \
--prefill_impl tri-shape
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>