[Future] [P/D] support hybrid attention for mooncake connector#8850
[Future] [P/D] support hybrid attention for mooncake connector#8850liziyu179 wants to merge 13 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces support for hybrid attention and Mamba models within the Mooncake connector. The changes enable the connector to handle diverse KV cache specifications and improve the robustness of data transfers by accounting for varying block lengths and specific model requirements. These updates ensure better compatibility with advanced model architectures while maintaining existing functionality. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request implements support for hybrid attention (HMA), specifically integrating Mamba into the Mooncake connector for KV cache transfer. It updates metadata structures, introduces multi-group transfer logic, and refines cache registration. Feedback highlights critical bugs including a TypeError in Mamba block initialization, swapped source and destination addresses in transfer logic, and overly restrictive HMA checks that could crash non-Mamba hybrid models. Suggested PR Title: [Ops][Feature] Support hybrid attention for mooncake connector Suggested PR Summary: ### What this PR does / why we need it? This PR adds support for hybrid attention in the Mooncake connector for KV cache transfer. It updates metadata and implements a new transfer method for multiple KV cache groups. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By nightly.
| grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]] | ||
| grouped_local_block_ids = [local_block_ids[i][0]] |
There was a problem hiding this comment.
Suggested PR Title:
[Future][Ops][Feature] Support hybrid attention for mooncake connectorSuggested PR Summary:
### What this PR does / why we need it?
This PR adds support for hybrid attention (e.g., Attention + Mamba) in the Mooncake connector for KV cache transfer. It updates the metadata exchange to include Mamba-specific information and implements a new transfer method that can handle multiple KV cache groups simultaneously.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By nightly.CRITICAL BUG: The initialization of grouped_remote_block_ids and grouped_local_block_ids for Mamba groups will cause a TypeError at runtime. They are assigned as lists of integers, but the subsequent logic at lines 545-547 expects them to be lists of lists (e.g., it accesses local_block_id[0] and calls len(local_block_id)).
| grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]] | |
| grouped_local_block_ids = [local_block_ids[i][0]] | |
| grouped_remote_block_ids = [[remote_block_ids[i][transfer_block_idx]]] | |
| grouped_local_block_ids = [[local_block_ids[i][0]]] |
| for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate( | ||
| zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs) |
There was a problem hiding this comment.
CRITICAL BUG: The source and destination addresses for the KV transfer appear to be swapped. In the KVCacheRecvingThread (consumer side), the read operation should pull data from the remote producer into the local consumer. However, the code currently assigns local addresses to src and remote addresses to dst in the zip and subsequent calculations. This would mean the consumer is attempting to read from its own memory and write to the producer's memory.
| for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate( | ||
| zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs) |
There was a problem hiding this comment.
| remote_block_ids = req_meta["remote_block_ids"][0] | ||
| local_block_ids = req_meta["local_block_ids"][0] |
There was a problem hiding this comment.
HIGH SEVERITY: The modification to use [0] for remote_block_ids and local_block_ids assumes there is only one KV cache group when is_hma_required is false. If multiple FullAttentionSpec groups exist, this will only transfer the first group, leading to incomplete KV cache transfer for the other groups.
| else: | ||
| raise TypeError("Mooncake connector does not support this type kv_cache now.") |
5e527fb to
5d4c64b
Compare
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
bd74e07 to
81da92b
Compare
Ensure consistent return statement formatting.
Removed license comment and multiple import statements.
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
81da92b to
fd6657d
Compare
5ded65a to
f4cee00
Compare
Signed-off-by: liziyu <liziyu16@huawei.com>
f4cee00 to
e38e282
Compare
What this PR does / why we need it?
support hybrid attention for mooncake connector
Does this PR introduce any user-facing change?
How was this patch tested?
by nightly