Revert "[KV Cache][Feature] Support Layerwise KV Pooling (#10077)"#11021
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request reverts the previously introduced 'Layerwise KV Pooling' feature. The change involves removing the feature's documentation, associated backend logic, and cleaning up the test infrastructure that supported it. The primary goal is to roll back this feature to maintain codebase stability and remove unused components. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. Tip 💡 Consider Linking a Related Issue or RFCYour PR title contains the [Feature] tag, indicating a bug fix or new feature. Linking a related issue or RFC in the PR description is strongly encouraged — it gives reviewers helpful context and speeds up the review. You can use any of these keywords:
🙏 Thanks for helping us keep the project well-organized! |
There was a problem hiding this comment.
Code Review
Suggested PR Title:\n\nmarkdown\n[Attention][Misc] Refactor and simplify layerwise KV pool implementation\n\n\nSuggested PR Summary:\n\nmarkdown\n### What this PR does / why we need it?\nThis PR refactors and simplifies the layerwise KV pool implementation by removing key-based layerwise transfer threads, deleting the attention compute start gate synchronization mechanism, and cleaning up associated documentation, tests, and mock dependencies.\n\nSeveral critical issues were identified in the changes:\n- In `config_data.py`, a bug in `prepare_value_layer` uses `group_addrs[layer_id * length]` instead of indexing with `i`, causing key and value caches to resolve to the same base address and leading to memory overwrites.\n- A typo `use_layerwize` (with a 'z') was introduced in `pool_worker.py` and its tests.\n- In `ascend_store_connector.py`, the `close()` method of `LookupKeyServer` does not set `self.running = False`, which can cause the background thread to leak or hang during shutdown.\n\n### Does this PR introduce _any_ user-facing change?\nNo.\n\n### How was this patch tested?\nTested via updated unit tests, though some assertions were incorrectly modified to match the base address indexing bug.\n
| for i in range(length): | ||
| block_stride = group_block_stride[i] if group_block_stride else group_block_len[i] | ||
| addr = group_addrs[layer_id * length + i] + block_id * block_stride | ||
| addr = group_addrs[layer_id * length] + block_id * block_stride |
There was a problem hiding this comment.
Using group_addrs[layer_id * length] for all caches in the group ignores the cache index i. This causes both key and value caches (and any other caches in the group) to resolve to the same base address, leading to memory overwrites and silent data corruption. It should be group_addrs[layer_id * length + i] to correctly index each cache's base address.
| addr = group_addrs[layer_id * length] + block_id * block_stride | |
| addr = group_addrs[layer_id * length + i] + block_id * block_stride |
| # layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length] | ||
| self.assertEqual(addr[0], 1000 + 5 * 160) | ||
| self.assertEqual(addr[1], 2000 + 5 * 320) | ||
| self.assertEqual(addr[1], 1000 + 5 * 320) |
There was a problem hiding this comment.
The test assertion was updated to match the buggy implementation where both key and value caches resolve to the same base address (1000). Once the indexing bug in prepare_value_layer is fixed, the second cache should correctly resolve to base address 2000.
| # layer_id=0 => kv_caches_base_addr[0*2] and [0*2+... index mod length] | |
| self.assertEqual(addr[0], 1000 + 5 * 160) | |
| self.assertEqual(addr[1], 2000 + 5 * 320) | |
| self.assertEqual(addr[1], 1000 + 5 * 320) | |
| # layer_id=0 => kv_caches_base_addr[0*2 + i] per cache i | |
| self.assertEqual(addr[0], 1000 + 5 * 160) | |
| self.assertEqual(addr[1], 2000 + 5 * 320) |
| self, | ||
| vllm_config: VllmConfig, | ||
| use_layerwise: bool, | ||
| use_layerwize: bool, |
There was a problem hiding this comment.
| self.use_mla = True | ||
| self.use_sparse = hasattr(model_config.hf_text_config, "index_topk") | ||
|
|
||
| self.use_layerwise = use_layerwize |
| from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker | ||
|
|
||
| worker = KVPoolWorker(config, use_layerwise=False) | ||
| worker = KVPoolWorker(config, use_layerwize=False) |
| from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker | ||
|
|
||
| worker = KVPoolWorker(config, use_layerwise=False) | ||
| worker = KVPoolWorker(config, use_layerwize=False) |
| from vllm_ascend.distributed.kv_transfer.kv_pool.ascend_store.pool_worker import KVPoolWorker | ||
|
|
||
| worker = KVPoolWorker(config, use_layerwise=False) | ||
| worker = KVPoolWorker(config, use_layerwize=False) |
| def close(self): | ||
| self.socket.close(linger=0) | ||
| # TODO: close the thread! |
There was a problem hiding this comment.
The background thread process_request runs a loop while self.running:. However, close() does not set self.running = False, which can cause the thread to leak or hang during shutdown. Setting self.running = False before closing the socket ensures the thread terminates cleanly.
| def close(self): | |
| self.socket.close(linger=0) | |
| # TODO: close the thread! | |
| def close(self): | |
| self.running = False | |
| self.socket.close(linger=0) |
bb95f4e to
9b57704
Compare
|
wait CI before merging |
9b57704 to
894038d
Compare
…llm-project#10077 Reverting vllm-project#10077 removed the pool_scheduler request-processing methods where vllm-project#10565 activated its mamba pooling (the RequestTracker mamba kwargs and the update(num_computed_tokens) call). Re-attach them to the pre-vllm-project#10077 construction/update sites so vllm-project#10565's mamba path stays active. RequestTracker's mamba fields, update_mamba_spec_blocks, and the pool_worker _align_kv_ptrs fix from vllm-project#10565 were already preserved by the revert; only the activation wiring in pool_scheduler was missing. Signed-off-by: F.Liu <1661888967@qq.com>
This PR reverts #10077.
PR #10077 introduced new functionality for MooncakeLayerwiseConnector. However, this change causes compatibility issues in some model scenarios.
For example, in models such as Qwen3.5, attn_metadata can be a dict whose keys are layer_name. In attention_v1.py, this variable may default to ascend_metadata, which can cause the new layerwise connector path to handle the attention metadata incorrectly and lead to runtime errors.
Since #10077 contains a relatively large set of changes around Layerwise KV Pooling and MooncakeLayerwiseConnector, this PR reverts the entire change set instead of applying a partial fix.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?