Commit 0f5660d
[Feature] [P/D] support hybrid attention for mooncake connector (vllm-project#8850)
### What this PR does / why we need it?
This PR adapts `MooncakeConnector` KV transfer metadata handling for
hybrid KV cache layouts.
The core change is to make Mooncake KV transfer operate on
KV-cache-group-aware metadata instead of assuming a single uniform
attention-only KV layout. In `register_kv_caches`, this PR builds and
records per-group/per-layer metadata used by the sender and receiver:
- `self.kv_group2layeridx`: maps each KV cache group to its serialized
group spec and physical layer indices, e.g. `{group_id: (group_spec,
[layer_idx0, layer_idx1, ...])}`.
- `self.block_size_scale`: stores the per-layer block-count scale of each
cache tensor, e.g. `[layer_idx][cache_idx] -> cache tensor num_blocks /
logical num_blocks`.
- `self.block_len_per_addr`: stores per-layer byte length for each cache
tensor block, e.g. `[layer_idx][cache_idx] -> cache block byte length`.
- `self.kv_caches_base_addr`: stores per-layer base addresses for each
registered cache tensor, e.g. `[layer_idx][cache_idx] -> data_ptr`.
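
As a rough illustration of the shapes listed above, the following minimal sketch builds the four structures from plain Python stand-ins. `CacheTensor`, `build_kv_metadata`, the layer names, and the group spec strings are hypothetical and are not taken from the PR; the real `register_kv_caches` works on actual cache tensors and serialized group specs, and may compute the scale differently (for example with integer division).

```python
from dataclasses import dataclass


@dataclass
class CacheTensor:
    """Hypothetical stand-in for one registered cache tensor."""
    data_ptr: int     # base address of the tensor
    num_blocks: int   # blocks physically present in the tensor
    block_bytes: int  # byte length of one block


def build_kv_metadata(groups, layer_caches, logical_num_blocks):
    """Build group-aware metadata (illustrative only).

    groups:       {group_id: (group_spec, [layer_name, ...])}
    layer_caches: {layer_name: [CacheTensor, ...]} in physical layer order
    """
    layer_idx = {name: i for i, name in enumerate(layer_caches)}
    # {group_id: (group_spec, [layer_idx0, layer_idx1, ...])}
    kv_group2layeridx = {
        gid: (spec, [layer_idx[name] for name in names])
        for gid, (spec, names) in groups.items()
    }
    layers = list(layer_caches.values())
    # [layer_idx][cache_idx] -> cache tensor num_blocks / logical num_blocks
    block_size_scale = [[c.num_blocks / logical_num_blocks for c in layer]
                        for layer in layers]
    # [layer_idx][cache_idx] -> cache block byte length
    block_len_per_addr = [[c.block_bytes for c in layer] for layer in layers]
    # [layer_idx][cache_idx] -> data_ptr
    kv_caches_base_addr = [[c.data_ptr for c in layer] for layer in layers]
    return (kv_group2layeridx, block_size_scale,
            block_len_per_addr, kv_caches_base_addr)


# One attention layer (two cache tensors) and one linear-attention layer
# (a single state tensor holding twice the logical number of blocks).
layer_caches = {
    "layers.0.attn": [CacheTensor(0x1000, 8, 4096),
                      CacheTensor(0x9000, 8, 4096)],
    "layers.1.linear_attn": [CacheTensor(0x20000, 16, 512)],
}
groups = {
    0: ("full_attention", ["layers.0.attn"]),
    1: ("linear_attention", ["layers.1.linear_attn"]),
}
group2layer, scale, block_len, base_addr = build_kv_metadata(
    groups, layer_caches, logical_num_blocks=8)
```

Indexing every per-layer list as `[layer_idx][cache_idx]` lets layers with different numbers of cache tensors share the same lookup path.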
Based on this metadata, `_get_kv_split_metadata` now prepares transfer
splits that can represent hybrid KV cache groups, including non-uniform
group layouts. `_get_group_pulls_metadata` then builds per-remote-port
group pull descriptors so each transfer task knows which KV cache group,
remote TP offset, and prefill PP rank it should pull from.
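
The per-remote-port descriptors described above can be pictured roughly as follows. This is only a sketch under assumptions: the `GroupPull` dataclass, the `group_pulls_by_port` helper, and the port numbers are hypothetical, and the way `_get_group_pulls_metadata` actually derives ports and offsets is not shown in this description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPull:
    """Hypothetical descriptor: what one transfer task should pull."""
    group_id: int          # KV cache group to transfer
    remote_tp_offset: int  # TP offset on the remote (prefill) side
    prefill_pp_rank: int   # prefill pipeline-parallel rank to pull from


def group_pulls_by_port(pulls):
    """Bucket (remote_port, GroupPull) pairs by remote port."""
    by_port = defaultdict(list)
    for port, pull in pulls:
        by_port[port].append(pull)
    return dict(by_port)


# Two KV cache groups pulled from one remote port, one group from another.
pulls = [
    (30000, GroupPull(group_id=0, remote_tp_offset=0, prefill_pp_rank=0)),
    (30000, GroupPull(group_id=1, remote_tp_offset=0, prefill_pp_rank=0)),
    (30001, GroupPull(group_id=0, remote_tp_offset=1, prefill_pp_rank=0)),
]
pulls_by_port = group_pulls_by_port(pulls)
```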
This allows the Mooncake connector to support hybrid KV cache transfer
paths while remaining compatible with the existing non-hybrid behavior.
- vLLM version: v0.20.2
- vLLM main: vllm-project/vllm@39910f2
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: zzzzzmeng <810924837@qq.com>
Signed-off-by: liziyu179 <liziyu16@huawei.com>
Co-authored-by: zzzzzmeng <810924837@qq.com>
1 parent 7ba2934 commit 0f5660d
2 files changed
Lines changed: 1359 additions & 340 deletions
File tree
- tests/ut/kv_connector
- vllm_ascend/distributed/kv_transfer/kv_p2p