Phase 3 hybrid attention #30664
base: main
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Documentation preview: https://vllm--30664.org.readthedocs.build/en/30664/
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI, which covers only a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of `fastcheck`. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a significant new feature: a hybrid attention mechanism combining sliding-window attention with a State Space Model (SSM) branch. This is a substantial contribution that includes new core attention backends, model layer implementations for LLaMA, Qwen, and Step3, and an extensive suite of benchmarks, tests, and documentation. The overall architecture is well-designed, cleverly integrating the dual-memory paradigms of PagedAttention and SSM state management. The use of a "prefix-sum" mode for verification is a great approach for ensuring the correctness of the complex state-passing logic.
My review focuses on the core implementation and benchmark scripts. I've found a critical issue in the SSM adapter related to tensor dtypes that would cause a runtime error, and several high-severity issues in the benchmark scripts that would lead to incorrect memory reporting and failures in summarizing results. After addressing these points, this will be a very strong addition to vLLM, enabling more efficient inference for long-context models.
```python
# cache_indices is (batch, max_blocks).
# We want cache_indices[range(batch), block_idx_last].
# But wait, block_idx_last is int32 tensor.
real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)
```
The torch.gather operation requires the index tensor to be of type LongTensor (int64). However, the block_idx_last tensor (from block_idx_last_scheduled_token) is created with dtype=torch.int32. This will cause a runtime error when this code path is executed with a 2D cache_indices tensor (e.g., when prefix caching is enabled). You should cast the index tensor to long() before the gather operation.
Suggested change:

```diff
-real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)
+real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
```
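For reference, a minimal standalone sketch of the dtype requirement (the shapes and values are hypothetical, not taken from the PR):

```python
# Minimal repro sketch: torch.gather requires an int64 (Long) index tensor,
# so an int32 index such as block_idx_last must be cast with .long() first.
import torch

cache_indices = torch.arange(8, dtype=torch.int32).reshape(2, 4)  # (batch, max_blocks)
block_idx_last = torch.tensor([1, 3], dtype=torch.int32)          # last block per request

# cache_indices.gather(1, block_idx_last.unsqueeze(1))  # raises a RuntimeError: index must be int64
real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
print(real_indices)  # tensor([1, 7], dtype=torch.int32)
```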
```python
def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB.

    Note: vLLM runs in a separate process and takes exclusive GPU control.
    We cannot call CUDA functions from the parent process.
    Memory info will be obtained from vLLM's engine API instead.
    """
    return {"free_memory_gib": 0, "total_memory_gib": 0, "used_memory_gib": 0}
```
The get_gpu_memory_info function is currently a stub that returns zeros, which will result in incorrect memory usage reporting in the benchmark results. For single-GPU benchmarks (the default for this script), it's possible to directly query CUDA for memory information. This should be implemented to ensure the memory metrics are accurate.
A correct implementation can be found in benchmarks/streaming_video_context.py.
```python
def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB."""
    try:
        import torch

        if not torch.cuda.is_available():
            return {"available": False}
        device = torch.cuda.current_device()
        free_memory, total_memory = torch.cuda.mem_get_info(device)
        used_memory = total_memory - free_memory
        return {
            "available": True,
            "free_memory_gib": free_memory / (1024**3),
            "total_memory_gib": total_memory / (1024**3),
            "used_memory_gib": used_memory / (1024**3),
        }
    except Exception as e:
        return {"available": False, "error": str(e)}
```

```python
def get_torch_memory_stats() -> dict[str, float]:
    """Get PyTorch CUDA memory statistics.

    Note: vLLM runs in a separate process - CUDA stats are not available here.
    """
    return {}
```
The get_torch_memory_stats function is a stub and returns an empty dictionary. This prevents the benchmark from collecting and reporting PyTorch-specific memory statistics, which are valuable for debugging and performance analysis. This function should be implemented to call the relevant torch.cuda.memory_* functions.
Suggested change:

```diff
-def get_torch_memory_stats() -> dict[str, float]:
-    """Get PyTorch CUDA memory statistics.
-
-    Note: vLLM runs in a separate process - CUDA stats are not available here.
-    """
-    return {}
+def get_torch_memory_stats() -> dict[str, float]:
+    """Get PyTorch CUDA memory statistics."""
+    try:
+        import torch
+
+        if not torch.cuda.is_available():
+            return {}
+        return {
+            "allocated_bytes": torch.cuda.memory_allocated(),
+            "max_allocated_bytes": torch.cuda.max_memory_allocated(),
+            "reserved_bytes": torch.cuda.memory_reserved(),
+            "max_reserved_bytes": torch.cuda.max_memory_reserved(),
+        }
+    except Exception:
+        return {}
```
```bash
# Default configuration
MODEL_PATH="${1:-meta-llama/Llama-3.2-1B}"
OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
```
The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.
Suggested change:

```diff
-OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
+export OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
```
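To illustrate why the export matters, a minimal sketch of how an inline Python summary step would read the variable (the file layout and summary logic are assumptions, not the PR's actual script):

```python
# Sketch: a child Python process sees OUTPUT_DIR only if the shell exported it;
# otherwise os.environ has no entry and the code falls back to the default path.
import json
import os
from pathlib import Path

output_dir = Path(os.environ.get("OUTPUT_DIR", "./hybrid_benchmark_results"))

results = []
for path in sorted(output_dir.glob("*.json")):  # hypothetical result files
    with path.open() as f:
        results.append(json.load(f))

print(f"Loaded {len(results)} result file(s) from {output_dir}")
```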
```bash
# Default configuration
MODEL_PATH="${1:-Qwen/Qwen2.5-VL-3B-Instruct}"
OUTPUT_DIR="${2:-./streaming_benchmark_results}"
```
The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.
Suggested change:

```diff
-OUTPUT_DIR="${2:-./streaming_benchmark_results}"
+export OUTPUT_DIR="${2:-./streaming_benchmark_results}"
```
Hi @RGBmarya, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hooks will run automatically before each commit.
No description provided.