X-Attention Appears to Be Underperforming in Needle-in-a-Haystack Benchmark #14

@GITHUBear

Hello, X-Attention team!
After reading your paper, I became very interested in your work and integrated X-Attention into vLLM. My integration branch is at https://github.com/GITHUBear/vllm/tree/experiment, and the relevant commit is GITHUBear/vllm@a3abc73.

In this implementation:

  1. I replaced the prefill phase of vLLM's Flash-Attention backend with X-Attention's prefill interface.
  2. I addressed several compatibility issues between X-Attention and vLLM while retaining X-Attention's core implementation:
     • Limited support for batch_size > 1
     • Incompatibility with GQA scenarios where query head count exceeds key head count
     • Lack of support for KVCache block tables
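
To make the GQA point concrete: one common way to reconcile the head counts is to map each query head onto the KV head of its group, which is equivalent to repeating every KV head group_size times. The snippet below is only a minimal sketch of that idea, not the code in my branch; the function names and the head counts (8 query heads, 2 KV heads) are hypothetical and chosen for illustration.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head that serves a given query head under GQA."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise IndexError("q_head out of range")
    group_size = num_q_heads // num_kv_heads  # query heads sharing one KV head
    return q_head // group_size


def expand_kv_heads(kv_heads: list, num_q_heads: int) -> list:
    """Repeat each KV head so the result has one entry per query head."""
    num_kv_heads = len(kv_heads)
    return [
        kv_heads[kv_head_for_query_head(h, num_q_heads, num_kv_heads)]
        for h in range(num_q_heads)
    ]


# Illustrative head counts only (hypothetical, not taken from the model config).
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With real tensors the same expansion is usually done along the head dimension before calling a kernel that expects equal query and key head counts.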

I ran the tests on Qwen2.5-7B-Instruct-1M, with full attention (FA2) as the baseline. The testing environment and results are below. The X-Attention results differ significantly from the baseline, so I would greatly appreciate your team reviewing this and recommending what to change.

Env Info

  • Python: Python 3.11.11
  • vLLM: 0.9.1
  • cuda: cuda_12.8
  • GPU: NVIDIA L20
  • torch: 2.7.0+cu126

Benchmark

You can run the Needle-in-a-Haystack benchmark with run_needle.sh. The two configurations I compared are:

  1. Baseline: FullAttention with FA2
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4
  2. X-Attention mode with default hyperparameters:
  • stride: int = 16
  • threshold: float = 0.9
  • block_size: int = 128
  • chunk_size: int = 2048
python ./needle_test.py --model_name Qwen/Qwen2.5-7B-Instruct-1M --max_length 500000 --min_length 10000 --trust_remote_code --enable_chunked_prefill --tensor_parallel_size 4 --enforce_eager --sparse_prefill_type 1 --run_name x_attn
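
For a sense of scale, the default hyperparameters above imply the following block and chunk counts at the shortest and longest sequence lengths in this benchmark. This is only a rough sketch: the class name XAttnDefaults and its helper methods are hypothetical, and it assumes plain ceiling division, which may not match how X-Attention actually pads or splits sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XAttnDefaults:
    """Hypothetical container for the default hyperparameters listed above."""
    stride: int = 16
    threshold: float = 0.9
    block_size: int = 128
    chunk_size: int = 2048

    def num_blocks(self, seq_len: int) -> int:
        # Ceiling division: blocks needed to cover seq_len tokens.
        return -(-seq_len // self.block_size)

    def num_chunks(self, seq_len: int) -> int:
        # Ceiling division: chunks needed to cover seq_len tokens.
        return -(-seq_len // self.chunk_size)

    @property
    def blocks_per_chunk(self) -> int:
        return self.chunk_size // self.block_size


cfg = XAttnDefaults()
for seq_len in (10_000, 500_000):  # --min_length and --max_length from the commands
    print(seq_len, cfg.num_blocks(seq_len), cfg.num_chunks(seq_len))
# 10000 79 5
# 500000 3907 245
print(cfg.blocks_per_chunk)  # 16
```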

Benchmark Result

  • Baseline:

Image

  • X-Attention:

Image
