[Feature] [P/D] support hybrid attention for mooncake connector by liziyu179 · Pull Request #8850 · vllm-project/vllm-ascend

liziyu179 · 2026-04-30T08:32:52Z

What this PR does / why we need it?

This PR adapts MooncakeConnector KV transfer metadata handling for hybrid KV cache layouts.

The core change is to make Mooncake KV transfer operate on KV-cache-group-aware metadata instead of assuming a single uniform attention-only KV layout. In register_kv_caches, this PR builds and records per-group/per-layer metadata used by the sender and receiver:

self.kv_group2layeridx: maps each KV cache group to its serialized group spec and physical layer indices, e.g. {group_id: (group_spec, [layer_idx0, layer_idx1, ...])}.
self.block_size_scale: stores per-layer cache block scaling, e.g. [layer_idx][cache_idx] -> cache tensor num_blocks / logical num_blocks.
self.block_len_per_addr: stores per-layer byte length for each cache tensor block, e.g. [layer_idx][cache_idx] -> cache block byte length.
self.kv_caches_base_addr: stores per-layer base addresses for each registered cache tensor, e.g. [layer_idx][cache_idx] -> data_ptr.
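
As a rough illustration of the four tables listed above, the sketch below builds them from toy inputs. `FakeCache`, `build_metadata`, and the use of integer division for the scale are assumptions made for this example; the real connector derives these values from the registered cache tensors and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class FakeCache:
    """Hypothetical stand-in for one registered cache tensor."""
    data_ptr: int     # base address of the tensor
    num_blocks: int   # physical blocks held by this tensor
    block_bytes: int  # byte length of one physical block


def build_metadata(groups, layers, logical_num_blocks):
    """Build the per-group / per-layer tables described in the PR text.

    groups: {group_id: (group_spec, [layer_idx, ...])}
    layers: list indexed by layer_idx; each entry is a list of FakeCache
    logical_num_blocks: number of logical blocks (assumed to divide num_blocks)
    """
    kv_group2layeridx = {
        gid: (spec, list(layer_idxs)) for gid, (spec, layer_idxs) in groups.items()
    }
    # [layer_idx][cache_idx] -> cache tensor num_blocks / logical num_blocks
    block_size_scale = [
        [cache.num_blocks // logical_num_blocks for cache in caches]
        for caches in layers
    ]
    # [layer_idx][cache_idx] -> cache block byte length
    block_len_per_addr = [[cache.block_bytes for cache in caches] for caches in layers]
    # [layer_idx][cache_idx] -> data_ptr
    kv_caches_base_addr = [[cache.data_ptr for cache in caches] for caches in layers]
    return kv_group2layeridx, block_size_scale, block_len_per_addr, kv_caches_base_addr


# Toy layout: layer 0 has two cache tensors, layer 1 has a single one.
toy_layers = [
    [FakeCache(0x1000, 8, 256), FakeCache(0x2000, 8, 256)],
    [FakeCache(0x3000, 4, 1024)],
]
toy_groups = {0: ("full_attention", [0]), 1: ("mamba", [1])}
group_map, scale, block_len, base_addr = build_metadata(toy_groups, toy_layers, 4)
```

Indexing everything as `[layer_idx][cache_idx]` lets layers with a different number of cache tensors, or different block sizes, sit side by side in the same tables.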

Based on this metadata, _get_kv_split_metadata now prepares transfer splits that can represent hybrid KV cache groups, including non-uniform group layouts. _get_group_pulls_metadata then builds per-remote-port group pull descriptors so each transfer task knows which KV cache group, remote TP offset, and prefill PP rank it should pull from.
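
To make the idea of per-remote-port group pull descriptors concrete, here is a minimal sketch. The `GroupPull` dataclass, its field names, and `group_pulls_by_port` are hypothetical and are not taken from the connector's source; they only mirror the three pieces of information the description says each transfer task needs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPull:
    """Hypothetical descriptor for one pull of one KV cache group."""
    group_id: int          # which KV cache group to pull
    remote_tp_offset: int  # offset into the remote tensor-parallel layout
    prefill_pp_rank: int   # prefill pipeline-parallel rank to pull from


def group_pulls_by_port(pulls):
    """Group (remote_port, GroupPull) pairs into {remote_port: [GroupPull, ...]}."""
    by_port = defaultdict(list)
    for remote_port, pull in pulls:
        by_port[remote_port].append(pull)
    return dict(by_port)


# Made-up ports and ranks, purely for illustration.
example = group_pulls_by_port([
    (30000, GroupPull(group_id=0, remote_tp_offset=0, prefill_pp_rank=0)),
    (30000, GroupPull(group_id=1, remote_tp_offset=0, prefill_pp_rank=0)),
    (30001, GroupPull(group_id=0, remote_tp_offset=1, prefill_pp_rank=1)),
])
```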

This allows the Mooncake connector to support hybrid KV cache transfer paths while keeping the existing non-hybrid behavior compatible.

Does this PR introduce any user-facing change?

No user-facing API change.

This changes the internal KV transfer metadata construction and scheduling behavior of the Mooncake connector, especially for hybrid KV cache layouts.

How was this patch tested?

Unit tests added/updated for Mooncake connector metadata split and group pull construction.
Verified formatting and linting:

python -m ruff check vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py tests/ut/kv_connector/test_mooncake_connector.py
python -m ruff format --check vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py tests/ut/kv_connector/test_mooncake_connector.py


- vLLM version: v0.20.2
- vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8

gemini-code-assist · 2026-04-30T08:33:01Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for hybrid attention and Mamba models within the Mooncake connector. The changes enable the connector to handle diverse KV cache specifications and improve the robustness of data transfers by accounting for varying block lengths and specific model requirements. These updates ensure better compatibility with advanced model architectures while maintaining existing functionality.

Highlights

Hybrid Attention Support: Implemented hybrid attention (HMA) support for the Mooncake connector, allowing for more flexible KV cache management across different attention specifications.
Mamba Integration: Added support for Mamba models within the Mooncake connector, including specific metadata handling for SSM sizes and specialized transfer logic.
Connector Refactoring: Refactored the Mooncake connector to support multiple KV cache groups and updated internal data structures to handle block lengths per address.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

github-actions · 2026-04-30T08:33:20Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request implements support for hybrid attention (HMA), specifically integrating Mamba into the Mooncake connector for KV cache transfer. It updates metadata structures, introduces multi-group transfer logic, and refines cache registration. Feedback highlights critical bugs including a TypeError in Mamba block initialization, swapped source and destination addresses in transfer logic, and overly restrictive HMA checks that could crash non-Mamba hybrid models.

Suggested PR Title: [Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for hybrid attention in the Mooncake connector for KV cache transfer. It updates metadata and implements a new transfer method for multiple KV cache groups.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly.

gemini-code-assist · 2026-04-30T08:35:40Z

+                grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
+                grouped_local_block_ids = [local_block_ids[i][0]]


Suggested PR Title:

[Future][Ops][Feature] Support hybrid attention for mooncake connector

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for hybrid attention (e.g., Attention + Mamba) in the Mooncake connector for KV cache transfer. It updates the metadata exchange to include Mamba-specific information and implements a new transfer method that can handle multiple KV cache groups simultaneously.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly.

CRITICAL BUG: The initialization of grouped_remote_block_ids and grouped_local_block_ids for Mamba groups will cause a TypeError at runtime. They are assigned as lists of integers, but the subsequent logic at lines 545-547 expects them to be lists of lists (e.g., it accesses local_block_id[0] and calls len(local_block_id)).

Suggested change

-                grouped_remote_block_ids = [remote_block_ids[i][transfer_block_idx]]
-                grouped_local_block_ids = [local_block_ids[i][0]]
+                grouped_remote_block_ids = [[remote_block_ids[i][transfer_block_idx]]]
+                grouped_local_block_ids = [[local_block_ids[i][0]]]
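
The shape mismatch the review describes can be reproduced in isolation. In the sketch below, `summarize_blocks` is a hypothetical stand-in for the downstream logic (it only indexes each entry and takes its length, as the comment says the real code does), and the block ids are made-up values.

```python
def summarize_blocks(grouped_local_block_ids):
    """Toy consumer: each entry must itself be a list of block ids."""
    first_ids = []
    total = 0
    for local_block_id in grouped_local_block_ids:
        first_ids.append(local_block_id[0])  # fails if the entry is a plain int
        total += len(local_block_id)
    return first_ids, total


local_block_ids = [[7, 8, 9]]  # one group with three block ids
i = 0

flat = [local_block_ids[i][0]]      # [7]: a list of ints (original form)
nested = [[local_block_ids[i][0]]]  # [[7]]: a list of lists (suggested form)

try:
    summarize_blocks(flat)
    flat_error = None
except TypeError as exc:
    # An int supports neither indexing nor len().
    flat_error = exc

nested_result = summarize_blocks(nested)
```

With the extra level of nesting, the single Mamba block id has the same list-of-lists shape as the attention groups, so one code path can handle both.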

gemini-code-assist · 2026-04-30T08:35:40Z

+            for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
+                zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)


CRITICAL BUG: The source and destination addresses for the KV transfer appear to be swapped. In the KVCacheRecvingThread (consumer side), the read operation should pull data from the remote producer into the local consumer. However, the code currently assigns local addresses to src and remote addresses to dst in the zip and subsequent calculations. This would mean the consumer is attempting to read from its own memory and write to the producer's memory.
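
For reference, the direction the review expects can be written out with explicit names. This is only a sketch: `build_read_descriptors` and its parameters are hypothetical, a single uniform `block_len` is assumed, and no transfer-engine call is made. Whether the original code is actually wrong depends on the argument order the transfer engine's read call expects, which the variable names alone do not settle.

```python
def build_read_descriptors(local_base_addrs, remote_base_addrs, block_len,
                           local_block_ids, remote_block_ids):
    """Return (remote_src, local_dst, length) triples for a consumer-side read.

    Data flows from the remote producer (source) into the local consumer
    (destination), so the remote address is the one being read.
    """
    descriptors = []
    for local_base, remote_base in zip(local_base_addrs, remote_base_addrs):
        for local_id, remote_id in zip(local_block_ids, remote_block_ids):
            remote_src = remote_base + remote_id * block_len
            local_dst = local_base + local_id * block_len
            descriptors.append((remote_src, local_dst, block_len))
    return descriptors


# One layer, one block: remote block 3 is read into local block 2.
read_plan = build_read_descriptors(
    local_base_addrs=[1000],
    remote_base_addrs=[5000],
    block_len=16,
    local_block_ids=[2],
    remote_block_ids=[3],
)
```

Naming the tuple fields by role (`remote_src`, `local_dst`) rather than by position makes this kind of mix-up easier to spot in review.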

gemini-code-assist · 2026-04-30T08:35:40Z

        for k, (src_layer_base_addr, dst_layer_base_addr) in enumerate(
            zip(local_kv_caches_base_addrs, remote_kv_caches_base_addrs)


CRITICAL BUG: Similar to _transfer_kv_cache_all_groups, the source and destination addresses in _transfer_kv_cache appear to be swapped. The consumer should be reading from the remote producer's memory into its local memory, but the zip assigns local addresses to the source side of the transfer.

gemini-code-assist · 2026-04-30T08:35:40Z

+        remote_block_ids = req_meta["remote_block_ids"][0]
+        local_block_ids = req_meta["local_block_ids"][0]


HIGH SEVERITY: The modification to use [0] for remote_block_ids and local_block_ids assumes there is only one KV cache group when is_hma_required is false. If multiple FullAttentionSpec groups exist, this will only transfer the first group, leading to incomplete KV cache transfer for the other groups.
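
The concern can be illustrated with a toy request. The `req_meta` layout below (one list of block ids per KV cache group) follows the snippet above, but the values and both helper functions are made up for the example.

```python
req_meta = {
    "remote_block_ids": [[10, 11], [20, 21]],  # two KV cache groups
    "local_block_ids": [[0, 1], [2, 3]],
}


def pairs_first_group_only(meta):
    """Mirror of the `[0]` indexing: only group 0 is considered."""
    remote = meta["remote_block_ids"][0]
    local = meta["local_block_ids"][0]
    return list(zip(remote, local))


def pairs_all_groups(meta):
    """Walk every group and yield (group_id, remote_id, local_id) triples."""
    pairs = []
    groups = zip(meta["remote_block_ids"], meta["local_block_ids"])
    for group_id, (remote, local) in enumerate(groups):
        for remote_id, local_id in zip(remote, local):
            pairs.append((group_id, remote_id, local_id))
    return pairs


first_only = pairs_first_group_only(req_meta)
all_groups = pairs_all_groups(req_meta)
```

With two groups present, the first helper silently omits blocks 20 and 21, which is the incomplete transfer the comment warns about.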

gemini-code-assist · 2026-04-30T08:35:40Z

+        else:
+            raise TypeError("Mooncake connector does not support this type kv_cache now.")


HIGH SEVERITY: This implementation raises a TypeError if hybrid attention is required but Mamba is not present. This unnecessarily restricts hybrid attention support to only Mamba-based models and will crash for other hybrid configurations.

github-actions · 2026-05-08T13:40:41Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2026-05-30T06:16:52Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: liziyu <liziyu16@huawei.com>

Signed-off-by: zzzzzmeng <810924837@qq.com>

Signed-off-by: liziyu179 <liziyu16@huawei.com>

liziyu179 requested review from LCAIZJ and MengqingCao as code owners April 30, 2026 08:32

gemini-code-assist Bot reviewed Apr 30, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 5e527fb to 5d4c64b Compare May 8, 2026 08:32

github-actions Bot added the merge-conflicts label May 8, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from bd74e07 to 81da92b Compare May 9, 2026 09:57

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 81da92b to fd6657d Compare May 21, 2026 02:57

github-actions Bot removed the merge-conflicts label May 21, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch 11 times, most recently from 0a8c606 to 4518a94 Compare May 26, 2026 09:48

liziyu179 changed the title from "[Future] [P/D] support hybrid attention for mooncake connector" to "[Feature] [P/D] support hybrid attention for mooncake connector" May 27, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 4518a94 to 5fa3afe Compare May 27, 2026 02:38

liziyu179 requested a review from wangxiyuan as a code owner May 27, 2026 02:38

github-actions Bot added the module:tests label May 27, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch 2 times, most recently from cc14832 to b478d13 Compare May 27, 2026 03:45

liziyu179 added the ready (enable e2e test for PR) and ready-for-test labels May 27, 2026

github-actions Bot added the merge-conflicts label May 30, 2026

support hybrid attention

ca6318c

Signed-off-by: liziyu <liziyu16@huawei.com>

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from a8c4a8e to 2d42a93 Compare June 1, 2026 02:06

github-actions Bot removed the merge-conflicts label Jun 1, 2026

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 2d42a93 to f38440f Compare June 1, 2026 02:08

mergify Bot mentioned this pull request Jun 1, 2026

[Feature][P/D] Mooncake Connector Support Hybrid PCP/DCP for QWen3.5 #9809

Merged

MengqingCao approved these changes Jun 1, 2026

zzzzzmeng and others added 7 commits June 1, 2026 11:53

fix mooncake connector ut

54f00b8

Signed-off-by: zzzzzmeng <810924837@qq.com>

fix lint

1c1ff7b

Signed-off-by: liziyu179 <liziyu16@huawei.com>

fix vl model config

f212aa2

Signed-off-by: liziyu179 <liziyu16@huawei.com>

fix lint

0153fd2

Signed-off-by: liziyu179 <liziyu16@huawei.com>

fix local block scale greater than remote block scale

3cf7fce

Signed-off-by: liziyu179 <liziyu16@huawei.com>

fix mtp addr

bfbc0f9

Signed-off-by: liziyu179 <liziyu16@huawei.com>

fix lint

f19fcba

Signed-off-by: liziyu179 <liziyu16@huawei.com>

liziyu179 force-pushed the support_hybrid_atten_mooncake_connector branch from 7533d67 to f19fcba Compare June 1, 2026 03:54

MengqingCao merged commit fd9c3cb into vllm-project:main Jun 1, 2026
54 of 57 checks passed

realliujiaxu mentioned this pull request Jun 4, 2026

[Release]: Release checklist for v0.21.0rc1 #9971

Open

33 tasks

MengqingCao merged 8 commits into vllm-project:main from liziyu179:support_hybrid_atten_mooncake_connector
