[Feature] [P/D] support hybrid attention for mooncake connector#8850
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for hybrid attention and Mamba models within the Mooncake connector. The changes enable the connector to handle diverse KV cache specifications and improve the robustness of data transfers by accounting for varying block lengths and specific model requirements. These updates ensure better compatibility with advanced model architectures while maintaining existing functionality.
👋 Hi! Thank you for contributing to the vLLM Ascend project.
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request implements support for hybrid attention (HMA), specifically integrating Mamba into the Mooncake connector for KV cache transfer. It updates metadata structures, introduces multi-group transfer logic, and refines cache registration. Feedback highlights critical bugs, including a TypeError in Mamba block initialization, swapped source and destination addresses in the transfer logic, and overly restrictive HMA checks that could crash non-Mamba hybrid models.

Suggested PR Title: [Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:
### What this PR does / why we need it?
This PR adds support for hybrid attention in the Mooncake connector for KV cache transfer. It updates metadata and implements a new transfer method for multiple KV cache groups.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By nightly.
    grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
    grouped_local_block_ids = [local_block_ids[i][0]]
Suggested PR Title:
[Future][Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:
### What this PR does / why we need it?
This PR adds support for hybrid attention (e.g., Attention + Mamba) in the Mooncake connector for KV cache transfer. It updates the metadata exchange to include Mamba-specific information and implements a new transfer method that can handle multiple KV cache groups simultaneously.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By nightly.

CRITICAL BUG: The initialization of `grouped_remote_block_ids` and `grouped_local_block_ids` for Mamba groups will cause a TypeError at runtime. They are assigned as lists of integers, but the subsequent logic at lines 545-547 expects them to be lists of lists (e.g., it accesses `local_block_id[0]` and calls `len(local_block_id)`).
Suggested change:

    - grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
    - grouped_local_block_ids = [local_block_ids[i][0]]
    + grouped_remote_block_ids = [[remote_block_ids[i][transfer_block_idx]]]
    + grouped_local_block_ids = [[local_block_ids[i][0]]]
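The shape mismatch described in this comment can be reproduced in isolation. The following is a minimal sketch, not the connector's code: `count_blocks` is a hypothetical stand-in for the downstream logic that, according to the review, indexes `local_block_id[0]` and calls `len(local_block_id)`, and the block IDs are made up.

```python
def count_blocks(grouped_block_ids):
    """Hypothetical consumer that expects a list of lists of block IDs."""
    total = 0
    for block_id_run in grouped_block_ids:
        _first = block_id_run[0]    # raises TypeError if block_id_run is an int
        total += len(block_id_run)  # would also raise TypeError for an int
    return total


remote_block_ids = [[7, 8, 9]]  # made-up IDs for a single Mamba group
i, transfer_block_idx = 0, 2

flat = [remote_block_ids[i][transfer_block_idx]]      # list of ints
nested = [[remote_block_ids[i][transfer_block_idx]]]  # list of lists

try:
    count_blocks(flat)
    flat_error = None
except TypeError as exc:
    flat_error = exc

nested_count = count_blocks(nested)
```

Wrapping the single Mamba block ID in an inner list keeps the Mamba path shape-compatible with the attention path, so the same downstream loop can serve both.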
    for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
        zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)
CRITICAL BUG: The source and destination addresses for the KV transfer appear to be swapped. In the KVCacheRecvingThread (consumer side), the read operation should pull data from the remote producer into the local consumer. However, the code currently assigns local addresses to src and remote addresses to dst in the zip and subsequent calculations. This would mean the consumer is attempting to read from its own memory and write to the producer's memory.
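One way to make the direction of a pull unambiguous is to name addresses by role (local or remote) rather than by `src`/`dst`. The sketch below is an illustration under assumptions: `ReadOp` and `build_read_ops` are hypothetical names, the addresses are made up, and no Mooncake transfer-engine API is called. Whether the original code is actually wrong depends on how the engine's read call interprets its arguments, which this excerpt does not show.

```python
from dataclasses import dataclass


@dataclass
class ReadOp:
    """Hypothetical descriptor for one pull on the consumer side."""
    local_addr: int   # destination: consumer memory that receives the data
    remote_addr: int  # source: producer memory that is read
    length: int       # number of bytes to copy


def build_read_ops(local_base_addrs, remote_base_addrs, block_len,
                   local_block_id, remote_block_id):
    """Build one ReadOp per layer, pairing local and remote base addresses."""
    ops = []
    for local_base, remote_base in zip(local_base_addrs, remote_base_addrs):
        ops.append(ReadOp(
            local_addr=local_base + local_block_id * block_len,
            remote_addr=remote_base + remote_block_id * block_len,
            length=block_len,
        ))
    return ops


# Made-up base addresses for two layers and a 16-byte block.
ops = build_read_ops([1000, 5000], [20000, 60000], 16,
                     local_block_id=2, remote_block_id=3)
```

With role-based names, a reviewer can check the direction by reading the field names alone, without knowing which side of the `zip` was meant to be the source.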
    remote_block_ids = req_meta["remote_block_ids"][0]
    local_block_ids = req_meta["local_block_ids"][0]
HIGH SEVERITY: The modification to use [0] for remote_block_ids and local_block_ids assumes there is only one KV cache group when is_hma_required is false. If multiple FullAttentionSpec groups exist, this will only transfer the first group, leading to incomplete KV cache transfer for the other groups.
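The difference between indexing group `[0]` and iterating over all groups can be shown with a toy request. This is a sketch under assumptions: the `req_meta` layout (one list of block IDs per KV cache group) is inferred from the snippet above, and both helper functions are hypothetical.

```python
def pairs_first_group_only(req_meta):
    """Mirror the reviewed behavior: only group 0 is paired for transfer."""
    remote_ids = req_meta["remote_block_ids"][0]
    local_ids = req_meta["local_block_ids"][0]
    return [(0, r, l) for r, l in zip(remote_ids, local_ids)]


def pairs_all_groups(req_meta):
    """Pair remote and local block IDs for every KV cache group."""
    pairs = []
    groups = zip(req_meta["remote_block_ids"], req_meta["local_block_ids"])
    for group_id, (remote_ids, local_ids) in enumerate(groups):
        if len(remote_ids) != len(local_ids):
            raise ValueError(f"group {group_id}: block count mismatch")
        pairs.extend((group_id, r, l) for r, l in zip(remote_ids, local_ids))
    return pairs


# Made-up request with two attention groups.
req_meta = {
    "remote_block_ids": [[10, 11], [20, 21]],
    "local_block_ids": [[1, 2], [3, 4]],
}
first_only = pairs_first_group_only(req_meta)
all_groups = pairs_all_groups(req_meta)
```

In this toy case the first helper yields two pairs and the second yields four, which is the incomplete-transfer scenario the comment warns about.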
    else:
        raise TypeError("Mooncake connector does not support this type kv_cache now.")
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: zzzzzmeng <810924837@qq.com>
Signed-off-by: liziyu179 <liziyu16@huawei.com>
…-project#8850)

### What this PR does / why we need it?

This PR adapts `MooncakeConnector` KV transfer metadata handling for hybrid KV cache layouts.

The core change is to make Mooncake KV transfer operate on KV-cache-group-aware metadata instead of assuming a single uniform attention-only KV layout. In `register_kv_caches`, this PR builds and records per-group/per-layer metadata used by the sender and receiver:

- `self.kv_group2layeridx`: maps each KV cache group to its serialized group spec and physical layer indices, e.g. `{group_id: (group_spec, [layer_idx0, layer_idx1, ...])}`.
- `self.block_size_scale`: stores per-layer cache block scaling, e.g. `[layer_idx][cache_idx] -> cache tensor num_blocks / logical num_blocks`.
- `self.block_len_per_addr`: stores per-layer byte length for each cache tensor block, e.g. `[layer_idx][cache_idx] -> cache block byte length`.
- `self.kv_caches_base_addr`: stores per-layer base addresses for each registered cache tensor, e.g. `[layer_idx][cache_idx] -> data_ptr`.

Based on this metadata, `_get_kv_split_metadata` now prepares transfer splits that can represent hybrid KV cache groups, including non-uniform group layouts. `_get_group_pulls_metadata` then builds per-remote-port group pull descriptors so each transfer task knows which KV cache group, remote TP offset, and prefill PP rank it should pull from.

This allows Mooncake connector to support hybrid KV cache transfer paths while keeping the existing non-hybrid behavior compatible.

- vLLM version: v0.20.2
- vLLM main: vllm-project/vllm@39910f2

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: zzzzzmeng <810924837@qq.com>
Signed-off-by: liziyu179 <liziyu16@huawei.com>
Co-authored-by: zzzzzmeng <810924837@qq.com>
Signed-off-by: liyishi <1252651434@qq.com>
What this PR does / why we need it?
This PR adapts `MooncakeConnector` KV transfer metadata handling for hybrid KV cache layouts.

The core change is to make Mooncake KV transfer operate on KV-cache-group-aware metadata instead of assuming a single uniform attention-only KV layout. In `register_kv_caches`, this PR builds and records per-group/per-layer metadata used by the sender and receiver:

- `self.kv_group2layeridx`: maps each KV cache group to its serialized group spec and physical layer indices, e.g. `{group_id: (group_spec, [layer_idx0, layer_idx1, ...])}`.
- `self.block_size_scale`: stores per-layer cache block scaling, e.g. `[layer_idx][cache_idx] -> cache tensor num_blocks / logical num_blocks`.
- `self.block_len_per_addr`: stores per-layer byte length for each cache tensor block, e.g. `[layer_idx][cache_idx] -> cache block byte length`.
- `self.kv_caches_base_addr`: stores per-layer base addresses for each registered cache tensor, e.g. `[layer_idx][cache_idx] -> data_ptr`.

Based on this metadata, `_get_kv_split_metadata` now prepares transfer splits that can represent hybrid KV cache groups, including non-uniform group layouts. `_get_group_pulls_metadata` then builds per-remote-port group pull descriptors so each transfer task knows which KV cache group, remote TP offset, and prefill PP rank it should pull from.

This allows the Mooncake connector to support hybrid KV cache transfer paths while keeping the existing non-hybrid behavior compatible.
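To make the four metadata structures concrete, here is a minimal sketch of how they might fit together when locating the bytes of one logical block. It is an illustration under assumptions, not the connector's implementation: `ToyKVMetadata` and `block_spans` are hypothetical names, the group specs are plain strings, all values are made up, and the address formula (one logical block covers `scale` consecutive tensor blocks, with an integer scale) is an assumption that this description does not confirm.

```python
from dataclasses import dataclass


@dataclass
class ToyKVMetadata:
    """Hypothetical container mirroring the four attributes described above."""
    kv_group2layeridx: dict    # {group_id: (group_spec, [layer_idx, ...])}
    block_size_scale: list     # [layer_idx][cache_idx] -> tensor blocks per logical block
    block_len_per_addr: list   # [layer_idx][cache_idx] -> bytes per tensor block
    kv_caches_base_addr: list  # [layer_idx][cache_idx] -> base address


def block_spans(meta, group_id, logical_block_id):
    """Return (layer_idx, cache_idx, address, length) for one logical block.

    Assumption: a logical block maps to `scale` consecutive tensor blocks.
    """
    _spec, layer_indices = meta.kv_group2layeridx[group_id]
    spans = []
    for layer_idx in layer_indices:
        for cache_idx, base in enumerate(meta.kv_caches_base_addr[layer_idx]):
            scale = meta.block_size_scale[layer_idx][cache_idx]
            block_len = meta.block_len_per_addr[layer_idx][cache_idx]
            length = scale * block_len
            spans.append((layer_idx, cache_idx,
                          base + logical_block_id * length, length))
    return spans


# Made-up layout: layers 0 and 2 are attention, layer 1 is Mamba.
meta = ToyKVMetadata(
    kv_group2layeridx={0: ("full_attention", [0, 2]), 1: ("mamba", [1])},
    block_size_scale=[[1, 1], [1, 1], [2, 2]],
    block_len_per_addr=[[4096, 4096], [1024, 8192], [2048, 2048]],
    kv_caches_base_addr=[[0x10000, 0x20000], [0x30000, 0x40000],
                         [0x50000, 0x60000]],
)
attention_spans = block_spans(meta, group_id=0, logical_block_id=3)
mamba_spans = block_spans(meta, group_id=1, logical_block_id=0)
```

Indexing everything by `[layer_idx][cache_idx]` lets a Mamba layer carry two differently sized state tensors while an attention layer carries equally sized K and V tensors, which is the kind of non-uniform layout the description refers to.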
Does this PR introduce any user-facing change?
No user-facing API change.
This changes internal KV transfer metadata construction and scheduling behavior for Mooncake connector, especially for hybrid KV cache layouts.
How was this patch tested?