
Conversation

@RGBmarya

No description provided.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify

mergify bot commented Dec 15, 2025

Documentation preview: https://vllm--30664.org.readthedocs.build/en/30664/

@mergify mergify bot added the documentation (Improvements or additions to documentation), llama (Related to Llama models), new-model (Requests to new models), performance (Performance-related issues), qwen (Related to Qwen models), speculative-decoding, and v1 labels on Dec 15, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: a hybrid attention mechanism combining sliding-window attention with a State Space Model (SSM) branch. This is a substantial contribution that includes new core attention backends, model layer implementations for LLaMA, Qwen, and Step3, and an extensive suite of benchmarks, tests, and documentation. The overall architecture is well-designed, cleverly integrating the dual-memory paradigms of PagedAttention and SSM state management. The use of a "prefix-sum" mode for verification is a great approach for ensuring the correctness of the complex state-passing logic.

My review focuses on the core implementation and benchmark scripts. I've found a critical issue in the SSM adapter related to tensor dtypes that would cause a runtime error, and several high-severity issues in the benchmark scripts that would lead to incorrect memory reporting and failures in summarizing results. After addressing these points, this will be a very strong addition to vLLM, enabling more efficient inference for long-context models.

# cache_indices is (batch, max_blocks).
# We want cache_indices[range(batch), block_idx_last].
# But wait, block_idx_last is int32 tensor.
real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)

critical

The torch.gather operation requires the index tensor to be of type LongTensor (int64). However, the block_idx_last tensor (from block_idx_last_scheduled_token) is created with dtype=torch.int32. This will cause a runtime error when this code path is executed with a 2D cache_indices tensor (e.g., when prefix caching is enabled). You should cast the index tensor to long() before the gather operation.

Suggested change
real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)
real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
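
As a standalone illustration of the dtype requirement (a minimal sketch with hypothetical shapes, not code from this PR), torch.gather raises a RuntimeError for an int32 index and succeeds once the index is cast to int64:

import torch

# Hypothetical shapes: 4 sequences, up to 8 cache blocks each.
cache_indices = torch.arange(32).reshape(4, 8)
block_idx_last = torch.tensor([3, 0, 7, 2], dtype=torch.int32)

try:
    cache_indices.gather(1, block_idx_last.unsqueeze(1))
except RuntimeError as err:
    print(f"int32 index fails: {err}")  # gather expects an int64 index

# Casting to int64 (LongTensor) satisfies torch.gather.
real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
print(real_indices)  # tensor([ 3,  8, 23, 26])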

Comment on lines +61 to +68
def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB.

    Note: vLLM runs in a separate process and takes exclusive GPU control.
    We cannot call CUDA functions from the parent process.
    Memory info will be obtained from vLLM's engine API instead.
    """
    return {"free_memory_gib": 0, "total_memory_gib": 0, "used_memory_gib": 0}

high

The get_gpu_memory_info function is currently a stub that returns zeros, which will result in incorrect memory usage reporting in the benchmark results. For single-GPU benchmarks (the default for this script), it's possible to directly query CUDA for memory information. This should be implemented to ensure the memory metrics are accurate.

A correct implementation can be found in benchmarks/streaming_video_context.py.

def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB."""
    try:
        import torch

        if not torch.cuda.is_available():
            return {"available": False}

        device = torch.cuda.current_device()
        free_memory, total_memory = torch.cuda.mem_get_info(device)
        used_memory = total_memory - free_memory

        return {
            "available": True,
            "free_memory_gib": free_memory / (1024**3),
            "total_memory_gib": total_memory / (1024**3),
            "used_memory_gib": used_memory / (1024**3),
        }
    except Exception as e:
        return {"available": False, "error": str(e)}

Comment on lines +71 to +76
def get_torch_memory_stats() -> dict[str, float]:
    """Get PyTorch CUDA memory statistics.

    Note: vLLM runs in a separate process - CUDA stats are not available here.
    """
    return {}

high

The get_torch_memory_stats function is a stub and returns an empty dictionary. This prevents the benchmark from collecting and reporting PyTorch-specific memory statistics, which are valuable for debugging and performance analysis. This function should be implemented to call the relevant torch.cuda.memory_* functions.

Suggested change
def get_torch_memory_stats() -> dict[str, float]:
    """Get PyTorch CUDA memory statistics.

    Note: vLLM runs in a separate process - CUDA stats are not available here.
    """
    return {}

def get_torch_memory_stats() -> dict[str, float]:
    """Get PyTorch CUDA memory statistics."""
    try:
        import torch

        if not torch.cuda.is_available():
            return {}
        return {
            "allocated_bytes": torch.cuda.memory_allocated(),
            "max_allocated_bytes": torch.cuda.max_memory_allocated(),
            "reserved_bytes": torch.cuda.memory_reserved(),
            "max_reserved_bytes": torch.cuda.max_memory_reserved(),
        }
    except Exception:
        return {}
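
If the concern in the original docstring holds and vLLM does run in a separate process, an alternative (a sketch only, not part of the suggested change; the helper name is hypothetical) is to query device-wide memory through nvidia-smi, which reports usage across processes:

import subprocess

def get_gpu_memory_via_nvidia_smi() -> dict[str, float]:
    """Query device-wide GPU memory use via nvidia-smi (visible across processes)."""
    try:
        out = subprocess.check_output(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            text=True,
        )
        used_mib, total_mib = (float(x) for x in out.splitlines()[0].split(","))
        return {
            "used_memory_gib": used_mib / 1024,
            "total_memory_gib": total_mib / 1024,
            "free_memory_gib": (total_mib - used_mib) / 1024,
        }
    except (OSError, subprocess.CalledProcessError, ValueError):
        return {}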


# Default configuration
MODEL_PATH="${1:-meta-llama/Llama-3.2-1B}"
OUTPUT_DIR="${2:-./hybrid_benchmark_results}"

high

The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.

Suggested change
OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
export OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
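
To see why the export matters, here is a minimal sketch (assuming the inline summarizer reads the directory from the environment, as described above; the fallback path mirrors the script's default):

import os

# Without `export`, OUTPUT_DIR stays shell-local, os.environ.get() misses it,
# and the summary silently falls back to the default directory.
output_dir = os.environ.get("OUTPUT_DIR", "./hybrid_benchmark_results")
print(f"Summarizing results from: {output_dir}")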


# Default configuration
MODEL_PATH="${1:-Qwen/Qwen2.5-VL-3B-Instruct}"
OUTPUT_DIR="${2:-./streaming_benchmark_results}"

high

The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.

Suggested change
OUTPUT_DIR="${2:-./streaming_benchmark_results}"
export OUTPUT_DIR="${2:-./streaming_benchmark_results}"

@mergify

mergify bot commented Dec 15, 2025

Hi @RGBmarya, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
