Description
Motivation.
Using a KV pool for prefix caching has significantly expanded the amount of KV cache we can store, but it still has some problems:
- Capacity is limited by the size of the pooled memory. Compute and storage are tied together, so capacity cannot be scaled independently.
- It lacks the persistence and reliability guarantees that prefix caching actually needs in real-world production systems.
For prefix caching, the required capacity and read/write speeds fall squarely into the sweet spot of traditional storage services. If we switch to a tiered cache setup, using DRAM on each node as a cache and a unified file system (3FS, or an enterprise-grade centralized storage) as the main store, then the second issue above is something the storage backend naturally handles. Check out the diagram below for how this works:
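The tiered lookup described above can be sketched as follows. This is a minimal illustration, not UCM's actual implementation: the class name, hashing scheme, and the use of a local directory to stand in for the shared file system (3FS) are all assumptions made for the example.

```python
import hashlib
import os
import tempfile

class TieredKVCache:
    """Minimal sketch of a two-tier prefix cache: tier 1 is node-local
    DRAM (a dict), tier 2 is a shared file system (here a local
    directory standing in for 3FS or centralized storage)."""

    def __init__(self, storage_dir, dram_capacity=2):
        self.dram = {}                  # block hash -> KV bytes
        self.dram_capacity = dram_capacity
        self.storage_dir = storage_dir

    def _key(self, token_ids):
        # Content-address a prefix by hashing its token ids.
        return hashlib.sha256(bytes(token_ids)).hexdigest()

    def put(self, token_ids, kv_bytes):
        key = self._key(token_ids)
        # Always persist to the storage tier; DRAM is only a cache,
        # so persistence/reliability come from the backend.
        with open(os.path.join(self.storage_dir, key), "wb") as f:
            f.write(kv_bytes)
        if len(self.dram) < self.dram_capacity:
            self.dram[key] = kv_bytes
        return key

    def get(self, token_ids):
        key = self._key(token_ids)
        if key in self.dram:            # DRAM hit: fast path
            return self.dram[key]
        path = os.path.join(self.storage_dir, key)
        if os.path.exists(path):        # storage hit: reload into DRAM
            with open(path, "rb") as f:
                kv = f.read()
            self.dram[key] = kv
            return kv
        return None                     # miss: prefill must recompute
```

Because every block is persisted to the storage tier, a DRAM eviction (or node restart) only costs a storage read rather than a full prefill recompute, which is the point of the tiered design.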

Proposed Change.
Auto Prefix Cache
We want to contribute UCM (Unified Cache Manager) to the vllm-ascend community to support the Prefix Cache feature:
- add a new UCMConnector class; this class only delivers requests to UCMConnectorImpl (following how LMCache integrates with vLLM)
- add documentation on how to enable prefix caching with UCM: User Guide
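The proposed split above can be sketched as a thin delegation layer. The method names below are illustrative assumptions, not vLLM's actual connector interface; the point is only that UCMConnector holds no caching logic itself and forwards every call to UCMConnectorImpl.

```python
class UCMConnectorImpl:
    """Stand-in implementation backed by an in-memory dict
    (the real impl would manage the DRAM + storage tiers)."""

    def __init__(self):
        self._cache = {}

    def lookup(self, token_ids):
        return self._cache.get(tuple(token_ids))

    def store(self, token_ids, kv_blocks):
        self._cache[tuple(token_ids)] = kv_blocks


class UCMConnector:
    """Hypothetical sketch of the proposed thin connector:
    it only delivers requests to the impl, mirroring how
    LMCache plugs into vLLM through a connector shim."""

    def __init__(self, impl):
        self._impl = impl

    def lookup(self, token_ids):
        # Pure delegation: the impl decides which tier serves the hit.
        return self._impl.lookup(token_ids)

    def store(self, token_ids, kv_blocks):
        return self._impl.store(token_ids, kv_blocks)
```

Keeping the connector this thin means vllm-ascend only needs to review and maintain a small adapter, while UCM's caching logic evolves independently.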
P-D Transfer
More importantly, UCM can support P-D (prefill-decode) transfer. We can use vLLM-Ascend + UCM + 3FS to build a prefill-decode disaggregated inference system. User Guide:
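The transfer flow can be illustrated as below: the prefill node publishes KV to the shared store and the decode node pulls it instead of recomputing the prompt. All names here are hypothetical, and the in-memory store is only a stand-in for 3FS.

```python
import hashlib

class SharedKVStore:
    """Stand-in for a shared storage backend such as 3FS (illustrative)."""

    def __init__(self):
        self._blocks = {}

    def write(self, key, kv):
        self._blocks[key] = kv

    def read(self, key):
        return self._blocks.get(key)


def prefill_node(store, prompt_tokens):
    # Prefill computes KV for the prompt and publishes it to shared storage.
    key = hashlib.sha256(bytes(prompt_tokens)).hexdigest()
    kv = f"kv-for-{prompt_tokens}"   # placeholder for the real KV tensors
    store.write(key, kv)
    return key                       # handed to the decode node


def decode_node(store, key):
    # Decode pulls the transferred KV instead of re-running prefill.
    kv = store.read(key)
    assert kv is not None, "KV must land in the store before decode starts"
    return kv
```

Routing the transfer through the storage backend (rather than a direct device-to-device link) is what lets prefill and decode nodes scale independently.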
Thanks for reviewing our pull request: #4411
Feedback Period.
2025/11/30
CC List.
Any Other Things.
No response