Hello, X-Attention team!
After reading your paper, I became very interested in your work and proceeded to integrate X-Attention into vLLM. You can find my integration branch here: https://github.com/GITHUBear/vllm/tree/experiment , along with the relevant commit GITHUBear/vllm@a3abc73
In this implementation:
- I replaced the prefill phase of vLLM's Flash-Attention backend with X-Attention's prefill interface
- Addressed several compatibility issues between X-Attention and vLLM while retaining its core implementation, namely:
  - limited support for batch_size > 1
  - incompatibility with GQA when the query head count exceeds the key head count
  - lack of support for KV-cache block tables
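For context on the GQA issue above: one common workaround when a prefill kernel assumes equal query and key head counts is to repeat each KV head across its query-head group before calling the kernel. This is a minimal sketch of that idea (my own illustration, not code from either repository; the function name and layout are assumptions):

```python
import torch

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    """Repeat KV heads so a kernel that assumes num_q_heads == num_kv_heads
    can run under GQA. k, v: [num_kv_heads, seq_len, head_dim]."""
    num_kv_heads = k.shape[0]
    assert num_q_heads % num_kv_heads == 0, "GQA requires an integer group size"
    group = num_q_heads // num_kv_heads
    # repeat_interleave keeps each query-head group adjacent to its original KV head
    return k.repeat_interleave(group, dim=0), v.repeat_interleave(group, dim=0)
```

This trades extra memory for compatibility; a block-table-aware kernel would avoid the copy.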
I ran tests using Qwen2.5-7B-Instruct-1M, with full attention as the baseline; the testing environment and results are below. Given the significant discrepancy between the baseline results and X-Attention's performance, I would greatly appreciate your team reviewing this and suggesting actionable fixes.
Env Info
- Python: 3.11.11
- vLLM: 0.9.1
- CUDA: 12.8
- GPU: NVIDIA L20
- torch: 2.7.0+cu126
Benchmark
You can run the Needle-in-a-Haystack benchmark with run_needle.sh
- Baseline: FullAttention with FA2
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4
- X-Attention mode with default hyperparameters:
  - stride: int = 16
  - threshold: float = 0.9
  - block_size: int = 128
  - chunk_size: int = 2048
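To make my understanding of `threshold` and `block_size` explicit: as I read the paper, X-Attention scores key blocks per query block and keeps the smallest set whose attention mass reaches `threshold`. Below is a simplified sketch of that selection step for one head, using block mean-pooling as a stand-in for the paper's strided antidiagonal scoring (the pooling proxy, function name, and single-head layout are my assumptions, not the authors' implementation):

```python
import torch

def select_key_blocks(q, k, threshold=0.9, block_size=128):
    """Pick, per query block, the fewest key blocks whose estimated
    softmax mass reaches `threshold`. q, k: [seq_len, head_dim],
    seq_len divisible by block_size."""
    d = q.shape[-1]
    qb = q.view(-1, block_size, d).mean(dim=1)   # [n_q_blocks, d]
    kb = k.view(-1, block_size, d).mean(dim=1)   # [n_k_blocks, d]
    probs = torch.softmax((qb @ kb.T) / d ** 0.5, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # keep a block while the mass accumulated before it is still < threshold,
    # so the kept prefix is the first to reach at least `threshold`
    keep = (cum - sorted_p) < threshold
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(1, idx, keep)
    return mask                                   # [n_q_blocks, n_k_blocks]
```

If my reading is right, raising `threshold` toward 1.0 should make the result converge to full attention, which could help isolate whether the accuracy gap comes from block selection or from the integration itself.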
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4 --enforce_eager --sparse_prefill_type 1 --run_name x_attn
Benchmark Result
- Baseline: (needle-in-a-haystack result figure)
- X-Attention: (needle-in-a-haystack result figure)