Phase 3 hybrid attention #30664
base: main
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Documentation preview: https://vllm--30664.org.readthedocs.build/en/30664/
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI, which covers only a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of `fastcheck`. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a significant new feature: a hybrid attention mechanism combining sliding-window attention with a State Space Model (SSM) branch. This is a substantial contribution that includes new core attention backends, model layer implementations for LLaMA, Qwen, and Step3, and an extensive suite of benchmarks, tests, and documentation. The overall architecture is well-designed, cleverly integrating the dual-memory paradigms of PagedAttention and SSM state management. The use of a "prefix-sum" mode for verification is a great approach for ensuring the correctness of the complex state-passing logic.
My review focuses on the core implementation and benchmark scripts. I've found a critical issue in the SSM adapter related to tensor dtypes that would cause a runtime error, and several high-severity issues in the benchmark scripts that would lead to incorrect memory reporting and failures in summarizing results. After addressing these points, this will be a very strong addition to vLLM, enabling more efficient inference for long-context models.
```python
# cache_indices is (batch, max_blocks).
# We want cache_indices[range(batch), block_idx_last].
# But wait, block_idx_last is int32 tensor.
real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)
```
The torch.gather operation requires the index tensor to be of type LongTensor (int64). However, the block_idx_last tensor (from block_idx_last_scheduled_token) is created with dtype=torch.int32. This will cause a runtime error when this code path is executed with a 2D cache_indices tensor (e.g., when prefix caching is enabled). You should cast the index tensor to long() before the gather operation.
Suggested change:

```diff
-real_indices = cache_indices.gather(1, block_idx_last.unsqueeze(1)).squeeze(1)
+real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
```
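For reference, a minimal standalone sketch of the dtype requirement (the shapes and values are hypothetical, not taken from the PR):

```python
# Minimal repro sketch: torch.gather requires an int64 (Long) index tensor,
# so an int32 index such as block_idx_last must be cast with .long() first.
import torch

cache_indices = torch.arange(8, dtype=torch.int32).reshape(2, 4)  # (batch, max_blocks)
block_idx_last = torch.tensor([1, 3], dtype=torch.int32)          # last block per request

# cache_indices.gather(1, block_idx_last.unsqueeze(1))  # raises a RuntimeError: index must be int64
real_indices = cache_indices.gather(1, block_idx_last.long().unsqueeze(1)).squeeze(1)
print(real_indices)  # tensor([1, 7], dtype=torch.int32)
```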
```python
def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB.

    Note: vLLM runs in a separate process and takes exclusive GPU control.
    We cannot call CUDA functions from the parent process.
    Memory info will be obtained from vLLM's engine API instead.
    """
    return {"free_memory_gib": 0, "total_memory_gib": 0, "used_memory_gib": 0}
```
The get_gpu_memory_info function is currently a stub that returns zeros, which will result in incorrect memory usage reporting in the benchmark results. For single-GPU benchmarks (the default for this script), it's possible to directly query CUDA for memory information. This should be implemented to ensure the memory metrics are accurate.
A correct implementation can be found in benchmarks/streaming_video_context.py.
```python
def get_gpu_memory_info() -> dict[str, float]:
    """Get current GPU memory statistics in GiB."""
    try:
        import torch

        if not torch.cuda.is_available():
            return {"available": False}
        device = torch.cuda.current_device()
        free_memory, total_memory = torch.cuda.mem_get_info(device)
        used_memory = total_memory - free_memory
        return {
            "available": True,
            "free_memory_gib": free_memory / (1024**3),
            "total_memory_gib": total_memory / (1024**3),
            "used_memory_gib": used_memory / (1024**3),
        }
    except Exception as e:
        return {"available": False, "error": str(e)}
```

```python
def get_torch_memory_stats() -> dict[str, float]:
    """Get PyTorch CUDA memory statistics.

    Note: vLLM runs in a separate process - CUDA stats are not available here.
    """
    return {}
```
The get_torch_memory_stats function is a stub and returns an empty dictionary. This prevents the benchmark from collecting and reporting PyTorch-specific memory statistics, which are valuable for debugging and performance analysis. This function should be implemented to call the relevant torch.cuda.memory_* functions.
Suggested change:

```diff
-def get_torch_memory_stats() -> dict[str, float]:
-    """Get PyTorch CUDA memory statistics.
-
-    Note: vLLM runs in a separate process - CUDA stats are not available here.
-    """
-    return {}
+def get_torch_memory_stats() -> dict[str, float]:
+    """Get PyTorch CUDA memory statistics."""
+    try:
+        import torch
+
+        if not torch.cuda.is_available():
+            return {}
+        return {
+            "allocated_bytes": torch.cuda.memory_allocated(),
+            "max_allocated_bytes": torch.cuda.max_memory_allocated(),
+            "reserved_bytes": torch.cuda.memory_reserved(),
+            "max_reserved_bytes": torch.cuda.max_memory_reserved(),
+        }
+    except Exception:
+        return {}
```
```bash
# Default configuration
MODEL_PATH="${1:-meta-llama/Llama-3.2-1B}"
OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
```
The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.
Suggested change:

```diff
-OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
+export OUTPUT_DIR="${2:-./hybrid_benchmark_results}"
```
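To illustrate why the export matters, a minimal sketch of how an inline Python summary step would read the variable (the file layout and summary logic are assumptions, not the PR's actual script):

```python
# Sketch: a child Python process sees OUTPUT_DIR only if the shell exported it;
# otherwise os.environ has no entry and the code falls back to the default path.
import json
import os
from pathlib import Path

output_dir = Path(os.environ.get("OUTPUT_DIR", "./hybrid_benchmark_results"))

results = []
for path in sorted(output_dir.glob("*.json")):  # hypothetical result files
    with path.open() as f:
        results.append(json.load(f))

print(f"Loaded {len(results)} result file(s) from {output_dir}")
```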
```bash
# Default configuration
MODEL_PATH="${1:-Qwen/Qwen2.5-VL-3B-Instruct}"
OUTPUT_DIR="${2:-./streaming_benchmark_results}"
```
The OUTPUT_DIR variable is used by the inline Python script at the end of this file to locate the benchmark results. However, it is not exported as an environment variable. If a user provides a custom output directory as the second argument, the Python script will not be able to find it and will fall back to the default, potentially failing to generate the summary. You should export OUTPUT_DIR to make it available to the child process.
Suggested change:

```diff
-OUTPUT_DIR="${2:-./streaming_benchmark_results}"
+export OUTPUT_DIR="${2:-./streaming_benchmark_results}"
```
Hi @RGBmarya, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hooks will run automatically before each commit.
No description provided.