Revert "[KV Cache][Feature] Support Layerwise KV Pooling (#10077)" #11021

Merged
wangxiyuan merged 2 commits into vllm-project:main from ader47:revert/layerwise-kv-pool-10077
Jun 27, 2026

Conversation

@ader47

@ader47 ader47 commented Jun 26, 2026

Contributor

This PR reverts #10077.
PR #10077 introduced new functionality for MooncakeLayerwiseConnector. However, this change causes compatibility issues in some model scenarios.
For example, in models such as Qwen3.5, attn_metadata can be a dict whose keys are layer_name. In attention_v1.py, this variable may default to ascend_metadata, which can cause the new layerwise connector path to handle the attention metadata incorrectly and lead to runtime errors.
Since #10077 contains a relatively large set of changes around Layerwise KV Pooling and MooncakeLayerwiseConnector, this PR reverts the entire change set instead of applying a partial fix.
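
The failure mode described above can be sketched in a few lines. This is a minimal illustration, assuming only what the description states: `attn_metadata` may be a single metadata object or a dict keyed by layer name. The helper `resolve_layer_metadata` and the `FakeMetadata` class are hypothetical names invented for this example and are not part of vllm-ascend.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for a per-layer attention metadata object."""
    num_actual_tokens: int


def resolve_layer_metadata(attn_metadata: Any, layer_name: str) -> Optional[Any]:
    """Return the metadata for one layer, whether or not it is wrapped in a dict."""
    if attn_metadata is None:
        return None
    if isinstance(attn_metadata, dict):
        # Dict form: keys are layer names, values are per-layer metadata.
        return attn_metadata.get(layer_name)
    # Object form: a single metadata object shared by all layers.
    return attn_metadata


shared = FakeMetadata(num_actual_tokens=8)
per_layer = {"model.layers.0.attn": FakeMetadata(num_actual_tokens=4)}

print(resolve_layer_metadata(shared, "model.layers.0.attn").num_actual_tokens)
print(resolve_layer_metadata(per_layer, "model.layers.0.attn").num_actual_tokens)
```

Code that treats the dict form as if it were the object form would fail on attribute access, which is the kind of runtime error the description refers to.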

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reverts the previously introduced 'Layerwise KV Pooling' feature. The change involves removing the feature's documentation, associated backend logic, and cleaning up the test infrastructure that supported it. The primary goal is to roll back this feature to maintain codebase stability and remove unused components.

Highlights

  • Feature Reversion: This pull request reverts the support for Layerwise KV Pooling, removing the associated documentation and code implementations.
  • Code Cleanup: Removed vllm_ascend/memcache_comm_fence.py and cleaned up related dependencies and imports across the codebase, including in attention modules and test suites.
  • Test Suite Updates: Updated various unit tests to reflect the removal of layerwise-specific logic and dependencies, ensuring test stability.

Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/scripts/test_config.yaml

@github-actions github-actions Bot added the documentation, module:tests, and module:core labels Jun 26, 2026
@github-actions

Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description, so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally by following the Contributing and Testing guides.


Tip

💡 Consider Linking a Related Issue or RFC

Your PR title contains the [Feature] tag, indicating a bug fix or new feature.

Linking a related issue or RFC in the PR description is strongly encouraged — it gives reviewers helpful context and speeds up the review. You can use any of these keywords:

  • Fixes #<issue_number>
  • Closes #<issue_number>
  • Resolves #<issue_number>
  • Refs #<rfc_or_issue_number> (for RFCs)

🙏 Thanks for helping us keep the project well-organized!

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

Suggested PR Title:

[Attention][Misc] Refactor and simplify layerwise KV pool implementation

Suggested PR Summary:

### What this PR does / why we need it?
This PR refactors and simplifies the layerwise KV pool implementation by removing key-based layerwise transfer threads, deleting the attention compute start gate synchronization mechanism, and cleaning up associated documentation, tests, and mock dependencies.

Several critical issues were identified in the changes:
- In `config_data.py`, a bug in `prepare_value_layer` uses `group_addrs[layer_id * length]` instead of indexing with `i`, causing key and value caches to resolve to the same base address and leading to memory overwrites.
- A typo `use_layerwize` (with a 'z') was introduced in `pool_worker.py` and its tests.
- In `ascend_store_connector.py`, the `close()` method of `LookupKeyServer` does not set `self.running = False`, which can cause the background thread to leak or hang during shutdown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested via updated unit tests, though some assertions were incorrectly modified to match the base address indexing bug.

  for i in range(length):
      block_stride = group_block_stride[i] if group_block_stride else group_block_len[i]
-     addr = group_addrs[layer_id * length + i] + block_id * block_stride
+     addr = group_addrs[layer_id * length] + block_id * block_stride

critical

Using group_addrs[layer_id * length] for all caches in the group ignores the cache index i. This causes both key and value caches (and any other caches in the group) to resolve to the same base address, leading to memory overwrites and silent data corruption. It should be group_addrs[layer_id * length + i] to correctly index each cache's base address.

Suggested change
- addr = group_addrs[layer_id * length] + block_id * block_stride
+ addr = group_addrs[layer_id * length + i] + block_id * block_stride
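
The effect of the indexing issue flagged above can be reproduced with a small standalone sketch. It assumes that `group_addrs` is a flat list laid out as `[layer0_cache0, layer0_cache1, layer1_cache0, ...]` and that `length` is the number of caches per layer; the function `block_addrs` and its `use_cache_index` switch are hypothetical and only mirror the shape of the loop in the diff, not the real `prepare_value_layer`. The numbers are the ones used in the unit test discussed in this review.

```python
from typing import List, Optional


def block_addrs(
    group_addrs: List[int],
    layer_id: int,
    block_id: int,
    group_block_len: List[int],
    group_block_stride: Optional[List[int]] = None,
    use_cache_index: bool = True,
) -> List[int]:
    """Compute one block address per cache in a layer's group.

    Assumes group_addrs is flat: [layer0_cache0, layer0_cache1, layer1_cache0, ...].
    """
    length = len(group_block_len)  # caches per layer, e.g. 2 for key and value
    addrs = []
    for i in range(length):
        stride = group_block_stride[i] if group_block_stride else group_block_len[i]
        # Correct form offsets by i; the flagged form always reads cache 0's base.
        base_index = layer_id * length + (i if use_cache_index else 0)
        addrs.append(group_addrs[base_index] + block_id * stride)
    return addrs


bases = [1000, 2000]  # key and value base addresses for layer 0
fixed = block_addrs(bases, layer_id=0, block_id=5, group_block_len=[160, 320])
flagged = block_addrs(bases, layer_id=0, block_id=5, group_block_len=[160, 320],
                      use_cache_index=False)
print(fixed)    # [1800, 3600]
print(flagged)  # [1800, 2600]
```

With the cache index dropped, the second cache is addressed relative to the first cache's base (1000 instead of 2000), so its blocks land inside the first cache's region.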

Comment on lines +205 to +207
  # layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length]
  self.assertEqual(addr[0], 1000 + 5 * 160)
- self.assertEqual(addr[1], 2000 + 5 * 320)
+ self.assertEqual(addr[1], 1000 + 5 * 320)

critical

The test assertion was updated to match the buggy implementation where both key and value caches resolve to the same base address (1000). Once the indexing bug in prepare_value_layer is fixed, the second cache should correctly resolve to base address 2000.

Suggested change
- # layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length]
- self.assertEqual(addr[0], 1000 + 5 * 160)
- self.assertEqual(addr[1], 2000 + 5 * 320)
- self.assertEqual(addr[1], 1000 + 5 * 320)
+ # layer_id=0 => kv_caches_base_addr[0*2 + i] per cache i
+ self.assertEqual(addr[0], 1000 + 5 * 160)
+ self.assertEqual(addr[1], 2000 + 5 * 320)

  self,
  vllm_config: VllmConfig,
- use_layerwise: bool,
+ use_layerwize: bool,

high

The parameter name use_layerwize contains a typo ('z' instead of 's'). It should be use_layerwise to be consistent with the rest of the codebase and to avoid TypeError when instantiating KVPoolWorker with keyword arguments.

Suggested change
- use_layerwize: bool,
+ use_layerwise: bool,
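
The `TypeError` mentioned in the comment above follows from how Python binds keyword arguments: the name in the call must match the parameter name exactly. The sketch below is illustrative only; `Worker` and `try_build` are hypothetical stand-ins and do not reproduce `KVPoolWorker`.

```python
class Worker:
    """Hypothetical stand-in for a class whose parameter name was misspelled."""

    def __init__(self, config: dict, use_layerwize: bool):  # note the 'z'
        self.config = config
        self.use_layerwise = use_layerwize


def try_build(**kwargs):
    """Return the worker, or the TypeError message if the keyword is rejected."""
    try:
        return Worker({}, **kwargs)
    except TypeError as exc:
        return str(exc)


result_ok = try_build(use_layerwize=False)   # matches the misspelled parameter
result_err = try_build(use_layerwise=False)  # the spelling other callers expect
print(type(result_ok).__name__)
print(result_err)
```

Any existing caller that passes the correctly spelled keyword fails at construction time, which is why the rename also forced the keyword changes in the tests.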

self.use_mla = True
self.use_sparse = hasattr(model_config.hf_text_config, "index_topk")

self.use_layerwise = use_layerwize

high

Update the assignment to use use_layerwise instead of the typo use_layerwize.

Suggested change
- self.use_layerwise = use_layerwize
+ self.use_layerwise = use_layerwise

  from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

- worker = KVPoolWorker(config, use_layerwise=False)
+ worker = KVPoolWorker(config, use_layerwize=False)

high

Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change
- worker = KVPoolWorker(config, use_layerwize=False)
+ worker = KVPoolWorker(config, use_layerwise=False)

  from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

- worker = KVPoolWorker(config, use_layerwise=False)
+ worker = KVPoolWorker(config, use_layerwize=False)

high

Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change
- worker = KVPoolWorker(config, use_layerwize=False)
+ worker = KVPoolWorker(config, use_layerwise=False)

  from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker

- worker = KVPoolWorker(config, use_layerwise=False)
+ worker = KVPoolWorker(config, use_layerwize=False)

high

Update the keyword argument to use use_layerwise instead of the typo use_layerwize.

Suggested change
- worker = KVPoolWorker(config, use_layerwize=False)
+ worker = KVPoolWorker(config, use_layerwise=False)

Comment on lines 319 to +321
  def close(self):
      self.socket.close(linger=0)
+     # TODO: close the thread!

high

The background thread process_request runs a loop while self.running:. However, close() does not set self.running = False, which can cause the thread to leak or hang during shutdown. Setting self.running = False before closing the socket ensures the thread terminates cleanly.

Suggested change
- def close(self):
-     self.socket.close(linger=0)
-     # TODO: close the thread!
+ def close(self):
+     self.running = False
+     self.socket.close(linger=0)
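
The shutdown ordering suggested above (signal the loop first, then release resources) can be sketched in a self-contained form. This is not the `LookupKeyServer` implementation: the class `LookupServer` is hypothetical, a `queue.Queue` stands in for the real socket, and a `threading.Event` replaces the `self.running` boolean. The sketch also assumes the loop polls with a timeout so that it can observe the stop signal; whether the real `process_request` loop does so is not shown in this review.

```python
import queue
import threading


class LookupServer:
    """Hypothetical server; a queue.Queue stands in for the real socket."""

    def __init__(self) -> None:
        self.requests: "queue.Queue[str]" = queue.Queue()
        self.handled: list = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._process_request, daemon=True)
        self._thread.start()

    def _process_request(self) -> None:
        # Poll with a timeout so the stop flag is rechecked regularly.
        while not self._stop.is_set():
            try:
                item = self.requests.get(timeout=0.05)
            except queue.Empty:
                continue
            self.handled.append(item)

    def close(self) -> None:
        self._stop.set()                 # signal the loop to exit first
        self._thread.join(timeout=1.0)   # then wait for the thread to finish


server = LookupServer()
server.close()
print(server._thread.is_alive())
```

Joining the thread in `close()` is an addition beyond the suggested change; it makes a leaked thread visible at shutdown rather than leaving it running in the background.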

@ader47 ader47 force-pushed the revert/layerwise-kv-pool-10077 branch from bb95f4e to 9b57704 Compare June 26, 2026 11:26
@wangxiyuan wangxiyuan added the ready (enable e2e test for PR) label Jun 26, 2026
@wangxiyuan

Collaborator

Wait for CI before merging.

…t#10077)"

This reverts commit 5e39074.

Signed-off-by: F.Liu <1661888967@qq.com>
@ader47 ader47 force-pushed the revert/layerwise-kv-pool-10077 branch from 9b57704 to 894038d Compare June 26, 2026 15:19
…llm-project#10077

Reverting vllm-project#10077 removed the pool_scheduler request-processing methods
where vllm-project#10565 activated its mamba pooling (the RequestTracker mamba kwargs
and the update(num_computed_tokens) call). Re-attach them to the
pre-vllm-project#10077 construction/update sites so vllm-project#10565's mamba path stays active.

RequestTracker's mamba fields, update_mamba_spec_blocks, and the
pool_worker _align_kv_ptrs fix from vllm-project#10565 were already preserved by the
revert; only the activation wiring in pool_scheduler was missing.

Signed-off-by: F.Liu <1661888967@qq.com>
@wangxiyuan wangxiyuan merged commit 4431251 into vllm-project:main Jun 27, 2026
35 checks passed
Labels

documentation, module:core, module:tests, ready

2 participants