
[Future] [P/D] support hybrid attention for mooncake connector#8850

Open
liziyu179 wants to merge 13 commits into vllm-project:main from liziyu179:support_hybrid_atten_mooncake_connector

Collaborator

@liziyu179 liziyu179 commented Apr 30, 2026

What this PR does / why we need it?

Support hybrid attention for the Mooncake connector.

Does this PR introduce any user-facing change?

How was this patch tested?

By the nightly tests.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for hybrid attention and Mamba models within the Mooncake connector. The changes enable the connector to handle diverse KV cache specifications and improve the robustness of data transfers by accounting for varying block lengths and specific model requirements. These updates ensure better compatibility with advanced model architectures while maintaining existing functionality.

Highlights

  • Hybrid Attention Support: Implemented hybrid attention (HMA) support for the Mooncake connector, allowing for more flexible KV cache management across different attention specifications.
  • Mamba Integration: Added support for Mamba models within the Mooncake connector, including specific metadata handling for SSM sizes and specialized transfer logic.
  • Connector Refactoring: Refactored the Mooncake connector to support multiple KV cache groups and updated internal data structures to handle block lengths per address.


@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements support for hybrid attention (HMA), specifically integrating Mamba into the Mooncake connector for KV cache transfer. It updates metadata structures, introduces multi-group transfer logic, and refines cache registration. Feedback highlights critical bugs including a TypeError in Mamba block initialization, swapped source and destination addresses in transfer logic, and overly restrictive HMA checks that could crash non-Mamba hybrid models.

Suggested PR Title:

[Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for hybrid attention in the Mooncake connector for KV cache transfer. It updates metadata and implements a new transfer method for multiple KV cache groups.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly.

Comment on lines +537 to +538
grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
grouped_local_block_ids = [local_block_ids[i][0]]

critical

Suggested PR Title:

[Future][Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for hybrid attention (e.g., Attention + Mamba) in the Mooncake connector for KV cache transfer. It updates the metadata exchange to include Mamba-specific information and implements a new transfer method that can handle multiple KV cache groups simultaneously.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly.

CRITICAL BUG: The initialization of grouped_remote_block_ids and grouped_local_block_ids for Mamba groups will cause a TypeError at runtime. They are assigned as lists of integers, but the subsequent logic at lines 545-547 expects them to be lists of lists (e.g., it accesses local_block_id[0] and calls len(local_block_id)).

Suggested change
- grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
- grouped_local_block_ids = [local_block_ids[i][0]]
+ grouped_remote_block_ids = [[remote_block_ids[i][transfer_block_idx]]]
+ grouped_local_block_ids = [[local_block_ids[i][0]]]
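The shape mismatch described in this comment can be reproduced in isolation. The sketch below is only an illustration: `describe_groups` is a hypothetical stand-in for the logic the review attributes to lines 545-547 (indexing `[0]` and calling `len()` on each group), and the block ID values are made up.

```python
def describe_groups(grouped_block_ids):
    """Hypothetical stand-in: return (first_block_id, block_count) per group."""
    summary = []
    for block_ids in grouped_block_ids:
        # Both operations require `block_ids` to be a sequence, not an int.
        summary.append((block_ids[0], len(block_ids)))
    return summary


# Illustrative values only; not taken from the PR.
remote_block_ids = [[10, 11, 12], [7, 8, 9]]
i, transfer_block_idx = 1, 2

flat = [remote_block_ids[i][transfer_block_idx]]      # list of ints
nested = [[remote_block_ids[i][transfer_block_idx]]]  # list of lists

try:
    describe_groups(flat)
    flat_error = None
except TypeError as exc:
    # Indexing an int raises TypeError ("'int' object is not subscriptable").
    flat_error = exc

nested_summary = describe_groups(nested)
```

With the extra pair of brackets, each Mamba group has the same list-of-lists shape as the other groups, so the shared loop can treat them uniformly.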

Comment on lines +540 to +541
for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)

critical

CRITICAL BUG: The source and destination addresses for the KV transfer appear to be swapped. In the KVCacheRecvingThread (consumer side), the read operation should pull data from the remote producer into the local consumer. However, the code currently assigns local addresses to src and remote addresses to dst in the zip and subsequent calculations. This would mean the consumer is attempting to read from its own memory and write to the producer's memory.
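As a rough sketch of the direction the reviewer expects for a consumer-side read, the snippet below computes per-layer addresses with the remote (producer) side as the source and the local (consumer) side as the destination. `ReadOp`, `build_read_ops`, the uniform `block_len`, and all addresses are hypothetical; whether the PR's naming is actually a bug depends on the argument order of the underlying transfer call, which is not shown in this thread.

```python
from typing import NamedTuple


class ReadOp(NamedTuple):
    src_addr: int  # remote (producer) address to read from
    dst_addr: int  # local (consumer) address to write into
    length: int


def build_read_ops(local_base_addrs, remote_base_addrs,
                   local_block_id, remote_block_id, block_len):
    """Hypothetical helper: one read op per layer, remote -> local."""
    ops = []
    for local_base, remote_base in zip(local_base_addrs, remote_base_addrs):
        ops.append(ReadOp(
            src_addr=remote_base + remote_block_id * block_len,
            dst_addr=local_base + local_block_id * block_len,
            length=block_len,
        ))
    return ops


# Illustrative addresses only.
ops = build_read_ops(
    local_base_addrs=[0x1000, 0x2000],
    remote_base_addrs=[0x9000, 0xA000],
    local_block_id=2,
    remote_block_id=5,
    block_len=0x100,
)
```

Naming the tuple fields after their role (source, destination) rather than their location (local, remote) makes a swap of this kind easier to spot in review.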

Comment on lines 631 to 632
for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)

critical

CRITICAL BUG: Similar to _transfer_kv_cache_all_groups, the source and destination addresses in _transfer_kv_cache appear to be swapped. The consumer should be reading from the remote producer's memory into its local memory, but the zip assigns local addresses to the source side of the transfer.

Comment on lines +571 to +572
remote_block_ids = req_meta["remote_block_ids"][0]
local_block_ids = req_meta["local_block_ids"][0]

high

HIGH SEVERITY: The modification to use [0] for remote_block_ids and local_block_ids assumes there is only one KV cache group when is_hma_required is false. If multiple FullAttentionSpec groups exist, this will only transfer the first group, leading to incomplete KV cache transfer for the other groups.
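The difference between taking only group `[0]` and walking every group can be sketched as follows. Both helper functions and the sample `req_meta` values are hypothetical; only the `remote_block_ids` and `local_block_ids` keys come from the quoted diff, and the sketch assumes each is a list with one inner list of block IDs per KV cache group.

```python
def pair_block_ids_first_group(req_meta):
    """Mirrors the [0] indexing: only the first group is paired."""
    remote = req_meta["remote_block_ids"][0]
    local = req_meta["local_block_ids"][0]
    return list(zip(remote, local))


def pair_block_ids_all_groups(req_meta):
    """Hypothetical alternative: pair block IDs across every group."""
    pairs = []
    for remote, local in zip(req_meta["remote_block_ids"],
                             req_meta["local_block_ids"]):
        pairs.extend(zip(remote, local))
    return pairs


# Illustrative metadata with two groups.
req_meta = {
    "remote_block_ids": [[4, 5], [8]],
    "local_block_ids": [[0, 1], [2]],
}

first_only = pair_block_ids_first_group(req_meta)
all_groups = pair_block_ids_all_groups(req_meta)
```

With two groups present, `first_only` silently omits the `(8, 2)` pair, which is the incomplete transfer the comment warns about.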

Comment on lines +1378 to +1379
else:
    raise TypeError("Mooncake connector does not support this type kv_cache now.")

high

HIGH SEVERITY: This implementation raises a TypeError if hybrid attention is required but Mamba is not present. This unnecessarily restricts hybrid attention support to only Mamba-based models and will crash for other hybrid configurations.
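One way to avoid rejecting every non-Mamba hybrid configuration is to dispatch on each group's spec type and raise only for types that are truly unhandled. The sketch below is an assumption-laden illustration, not the connector's code: the three dataclasses are simplified stand-ins defined locally (they are not imported from vLLM), and `plan_group_transfer` and its return labels are hypothetical.

```python
from dataclasses import dataclass


# Simplified local stand-ins for KV cache spec types (hypothetical fields).
@dataclass
class FullAttentionSpec:
    block_size: int


@dataclass
class SlidingWindowSpec:
    block_size: int
    window: int


@dataclass
class MambaSpec:
    state_size: int


def plan_group_transfer(spec):
    """Hypothetical dispatcher: pick a transfer strategy per group spec."""
    if isinstance(spec, MambaSpec):
        return "mamba_state"
    if isinstance(spec, (FullAttentionSpec, SlidingWindowSpec)):
        return "paged_blocks"
    # Only unknown spec types are rejected.
    raise TypeError(f"unsupported KV cache spec: {type(spec).__name__}")


# A hybrid model without Mamba: full attention plus sliding window.
plans = [plan_group_transfer(s)
         for s in (FullAttentionSpec(16), SlidingWindowSpec(16, 1024))]
```

Under this arrangement a hybrid model made only of attention variants gets a plan for each group, while the `TypeError` is reserved for spec types the connector has no strategy for.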

@liziyu179 liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 5e527fb to 5d4c64b Compare May 8, 2026 08:32
@github-actions
Contributor

github-actions Bot commented May 8, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@liziyu179 liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from bd74e07 to 81da92b Compare May 9, 2026 09:57
liziyu179 and others added 11 commits May 10, 2026 23:15
Ensure consistent return statement formatting.
Removed license comment and multiple import statements.
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
@liziyu179 liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 81da92b to fd6657d Compare May 21, 2026 02:57
Signed-off-by: liziyu <liziyu16@huawei.com>
@liziyu179 liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch 5 times, most recently from 5ded65a to f4cee00 Compare May 25, 2026 09:52
Signed-off-by: liziyu <liziyu16@huawei.com>
@liziyu179 liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from f4cee00 to e38e282 Compare May 25, 2026 13:02