Hello, X-Attention team!
After reading your paper, I became very interested in your work and proceeded to integrate X-Attention into vLLM. You can find my integration branch here: https://github.com/GITHUBear/vllm/tree/experiment , along with the relevant commit GITHUBear/vllm@a3abc73
In this implementation:
- I replaced the prefill phase of vLLM's Flash-Attention backend with X-Attention's prefill interface
- Addressed several compatibility issues between X-Attention and vLLM while retaining its core implementation, namely:
  - limited support for batch_size > 1
  - incompatibility with GQA when the query head count exceeds the key head count
  - lack of support for KV-cache block tables
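For context on the GQA issue above: one common workaround when a prefill kernel assumes equal query and key head counts is to repeat each KV head across its query-head group before calling the kernel. This is a minimal sketch of that idea (my own illustration, not code from either repository; the function name and layout are assumptions):

```python
import torch

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    """Repeat KV heads so a kernel that assumes num_q_heads == num_kv_heads
    can run under GQA. k, v: [num_kv_heads, seq_len, head_dim]."""
    num_kv_heads = k.shape[0]
    assert num_q_heads % num_kv_heads == 0, "GQA requires an integer group size"
    group = num_q_heads // num_kv_heads
    # repeat_interleave keeps each query-head group adjacent to its original KV head
    return k.repeat_interleave(group, dim=0), v.repeat_interleave(group, dim=0)
```

This trades extra memory for compatibility; a block-table-aware kernel would avoid the copy.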
I ran tests using Qwen2.5-7B-Instruct-1M, with full attention as the baseline; the testing environment and results are below. Given the significant discrepancy between the baseline results and X-Attention's performance, I would greatly appreciate your team reviewing this and suggesting actionable fixes.
Env Info
- Python: 3.11.11
- vLLM: 0.9.1
- CUDA: 12.8
- GPU: NVIDIA L20
- torch: 2.7.0+cu126
Benchmark
You can run the Needle-in-a-Haystack benchmark with run_needle.sh
- Baseline: FullAttention with FA2
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4
- X-Attention mode with default hyperparameters:
  - stride: int = 16
  - threshold: float = 0.9
  - block_size: int = 128
  - chunk_size: int = 2048
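To make my understanding of `threshold` and `block_size` explicit: as I read the paper, X-Attention scores key blocks per query block and keeps the smallest set whose attention mass reaches `threshold`. Below is a simplified sketch of that selection step for one head, using block mean-pooling as a stand-in for the paper's strided antidiagonal scoring (the pooling proxy, function name, and single-head layout are my assumptions, not the authors' implementation):

```python
import torch

def select_key_blocks(q, k, threshold=0.9, block_size=128):
    """Pick, per query block, the fewest key blocks whose estimated
    softmax mass reaches `threshold`. q, k: [seq_len, head_dim],
    seq_len divisible by block_size."""
    d = q.shape[-1]
    qb = q.view(-1, block_size, d).mean(dim=1)   # [n_q_blocks, d]
    kb = k.view(-1, block_size, d).mean(dim=1)   # [n_k_blocks, d]
    probs = torch.softmax((qb @ kb.T) / d ** 0.5, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # keep a block while the mass accumulated before it is still < threshold,
    # so the kept prefix is the first to reach at least `threshold`
    keep = (cum - sorted_p) < threshold
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(1, idx, keep)
    return mask                                   # [n_q_blocks, n_k_blocks]
```

If my reading is right, raising `threshold` toward 1.0 should make the result converge to full attention, which could help isolate whether the accuracy gap comes from block selection or from the integration itself.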
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4 --enforce_eager --sparse_prefill_type 1 --run_name x_attn
Benchmark Result
- Baseline: (needle-in-a-haystack result figure)
- X-Attention: (needle-in-a-haystack result figure)