Revert "[KV Cache][Feature] Support Layerwise KV Pooling (#10077)" by ader47 · Pull Request #11021 · vllm-project/vllm-ascend

ader47 · 2026-06-26T11:21:37Z

This PR reverts #10077.
PR #10077 introduced new functionality for MooncakeLayerwiseConnector. However, this change causes compatibility issues in some model scenarios.
For example, in models such as Qwen3.5, attn_metadata can be a dict whose keys are layer_name. In attention_v1.py, this variable may default to ascend_metadata, which can cause the new layerwise connector path to handle the attention metadata incorrectly and lead to runtime errors.
Since #10077 contains a relatively large set of changes around Layerwise KV Pooling and MooncakeLayerwiseConnector, this PR reverts the entire change set instead of applying a partial fix.

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.23.0
vLLM main: vllm-project/vllm@967c5c3

gemini-code-assist · 2026-06-26T11:21:52Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reverts the previously introduced 'Layerwise KV Pooling' feature. The change involves removing the feature's documentation, associated backend logic, and cleaning up the test infrastructure that supported it. The primary goal is to roll back this feature to maintain codebase stability and remove unused components.

Highlights

Feature Reversion: This pull request reverts the support for Layerwise KV Pooling, removing the associated documentation and code implementations.
Code Cleanup: Removed vllm_ascend/memcache_comm_fence.py and cleaned up related dependencies and imports across the codebase, including in attention modules and test suites.
Test Suite Updates: Updated various unit tests to reflect the removal of layerwise-specific logic and dependencies, ensuring test stability.

New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files

Ignored by pattern: .github/workflows/** (1)
- .github/workflows/scripts/test_config.yaml

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2026-06-26T11:22:46Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Tip

💡 Consider Linking a Related Issue or RFC

Your PR title contains the [Feature] tag, indicating a bug fix or new feature.

Linking a related issue or RFC in the PR description is strongly encouraged — it gives reviewers helpful context and speeds up the review. You can use any of these keywords:

Fixes #<issue_number>
Closes #<issue_number>
Resolves #<issue_number>
Refs #<rfc_or_issue_number> (for RFCs)

🙏 Thanks for helping us keep the project well-organized!

gemini-code-assist

Code Review

Suggested PR Title:\n\nmarkdown\n[Attention][Misc] Refactor and simplify layerwise KV pool implementation\n\n\nSuggested PR Summary:\n\nmarkdown\n### What this PR does / why we need it?\nThis PR refactors and simplifies the layerwise KV pool implementation by removing key-based layerwise transfer threads, deleting the attention compute start gate synchronization mechanism, and cleaning up associated documentation, tests, and mock dependencies.\n\nSeveral critical issues were identified in the changes:\n- In `config_data.py`, a bug in `prepare_value_layer` uses `group_addrs[layer_id * length]` instead of indexing with `i`, causing key and value caches to resolve to the same base address and leading to memory overwrites.\n- A typo `use_layerwize` (with a 'z') was introduced in `pool_worker.py` and its tests.\n- In `ascend_store_connector.py`, the `close()` method of `LookupKeyServer` does not set `self.running = False`, which can cause the background thread to leak or hang during shutdown.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nTested via updated unit tests, though some assertions were incorrectly modified to match the base address indexing bug.\n

gemini-code-assist · 2026-06-26T11:24:05Z

        for i in range(length):
            block_stride = group_block_stride[i] if group_block_stride else group_block_len[i]
-            addr = group_addrs[layer_id * length + i] + block_id * block_stride
+            addr = group_addrs[layer_id * length] + block_id * block_stride


Using group_addrs[layer_id * length] for all caches in the group ignores the cache index i. This causes both key and value caches (and any other caches in the group) to resolve to the same base address, leading to memory overwrites and silent data corruption. It should be group_addrs[layer_id * length + i] to correctly index each cache's base address.

Suggested change

addr = group_addrs[layer_id * length] + block_id * block_stride

addr = group_addrs[layer_id * length + i] + block_id * block_stride

gemini-code-assist · 2026-06-26T11:24:05Z

+        # layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length]
        self.assertEqual(addr[0], 1000 + 5 * 160)
-        self.assertEqual(addr[1], 2000 + 5 * 320)
+        self.assertEqual(addr[1], 1000 + 5 * 320)


The test assertion was updated to match the buggy implementation where both key and value caches resolve to the same base address (1000). Once the indexing bug in prepare_value_layer is fixed, the second cache should correctly resolve to base address 2000.

Suggested change

# layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length]

self.assertEqual(addr[0], 1000 + 5 * 160)

self.assertEqual(addr[1], 2000 + 5 * 320)

self.assertEqual(addr[1], 1000 + 5 * 320)

# layer_id=0 => kv_caches_base_addr[0*2 + i] per cache i

self.assertEqual(addr[0], 1000 + 5 * 160)

self.assertEqual(addr[1], 2000 + 5 * 320)

gemini-code-assist · 2026-06-26T11:24:05Z

        self,
        vllm_config: VllmConfig,
-        use_layerwise: bool,
+        use_layerwize: bool,


The parameter name use_layerwize contains a typo ('z' instead of 's'). It should be use_layerwise to be consistent with the rest of the codebase and to avoid TypeError when instantiating KVPoolWorker with keyword arguments.

Suggested change

use_layerwize: bool,

use_layerwise: bool,

gemini-code-assist · 2026-06-26T11:24:06Z

            self.use_mla = True
        self.use_sparse = hasattr(model_config.hf_text_config, "index_topk")
-
+        self.use_layerwise = use_layerwize


Update the assignment to use use_layerwise instead of the typo use_layerwize.

Suggested change

self.use_layerwise = use_layerwize

self.use_layerwise = use_layerwise

gemini-code-assist · 2026-06-26T11:24:06Z

        from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

-        worker = KVPoolWorker(config, use_layerwise=False)
+        worker = KVPoolWorker(config, use_layerwize=False)


Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change

worker = KVPoolWorker(config, use_layerwize=False)

worker = KVPoolWorker(config, use_layerwise=False)

gemini-code-assist · 2026-06-26T11:24:06Z

        from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

-        worker = KVPoolWorker(config, use_layerwise=False)
+        worker = KVPoolWorker(config, use_layerwize=False)


Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change

worker = KVPoolWorker(config, use_layerwize=False)

worker = KVPoolWorker(config, use_layerwise=False)

gemini-code-assist · 2026-06-26T11:24:06Z

        from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

-        worker = KVPoolWorker(config, use_layerwise=False)
+        worker = KVPoolWorker(config, use_layerwize=False)


Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change

worker = KVPoolWorker(config, use_layerwize=False)

worker = KVPoolWorker(config, use_layerwise=False)

gemini-code-assist · 2026-06-26T11:24:06Z

    def close(self):
        self.socket.close(linger=0)
+        # TODO: close the thread!


The background thread process_request runs a loop while self.running:. However, close() does not set self.running = False, which can cause the thread to leak or hang during shutdown. Setting self.running = False before closing the socket ensures the thread terminates cleanly.

Suggested change

def close(self):

self.socket.close(linger=0)

# TODO: close the thread!

def close(self):

self.running = False

self.socket.close(linger=0)

wangxiyuan · 2026-06-26T13:52:51Z

wait CI before merging

…t#10077)" This reverts commit 5e39074. Signed-off-by: F.Liu <1661888967@qq.com>

…llm-project#10077 Reverting vllm-project#10077 removed the pool_scheduler request-processing methods where vllm-project#10565 activated its mamba pooling (the RequestTracker mamba kwargs and the update(num_computed_tokens) call). Re-attach them to the pre-vllm-project#10077 construction/update sites so vllm-project#10565's mamba path stays active. RequestTracker's mamba fields, update_mamba_spec_blocks, and the pool_worker _align_kv_ptrs fix from vllm-project#10565 were already preserved by the revert; only the activation wiring in pool_scheduler was missing. Signed-off-by: F.Liu <1661888967@qq.com>

ader47 requested review from LCAIZJ, MengqingCao, Yikun, wangxiyuan, weijinqian0 and whx-sjtu as code owners June 26, 2026 11:21

github-actions Bot added documentation Improvements or additions to documentation module:tests module:core labels Jun 26, 2026

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

ader47 force-pushed the revert/layerwise-kv-pool-10077 branch from bb95f4e to 9b57704 Compare June 26, 2026 11:26

wangxiyuan added the ready enable e2e test for PR label Jun 26, 2026

wangxiyuan approved these changes Jun 26, 2026

View reviewed changes

Revert "[KV Cache][Feature] Support Layerwise KV Pooling (vllm-projec…

894038d

…t#10077)" This reverts commit 5e39074. Signed-off-by: F.Liu <1661888967@qq.com>

ader47 force-pushed the revert/layerwise-kv-pool-10077 branch from 9b57704 to 894038d Compare June 26, 2026 15:19

wangxiyuan merged commit 4431251 into vllm-project:main Jun 27, 2026
35 checks passed

	addr = group_addrs[layer_id * length] + block_id * block_stride
	addr = group_addrs[layer_id * length + i] + block_id * block_stride

	self.use_layerwise = use_layerwize
	self.use_layerwise = use_layerwise

	worker = KVPoolWorker(config, use_layerwize=False)
	worker = KVPoolWorker(config, use_layerwise=False)

Uh oh!

Conversation

ader47 commented Jun 26, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gemini-code-assist Bot commented Jun 26, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

github-actions Bot commented Jun 26, 2026

💡 Consider Linking a Related Issue or RFC

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

wangxiyuan commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ader47 commented Jun 26, 2026 •

edited by github-actions Bot

Loading