Background
Mooncake Store can keep object replicas in both distributed memory and SSD offload storage. When SSD offload is enabled, an object may remain available only as a LOCAL_DISK replica after its MEMORY replica has been evicted.
Today, exist is a metadata-style query. It checks whether the key has at least one complete replica and returns whether the object exists. It does not change replica placement. This keeps the API lightweight, but it also means a subsequent get may still need to read from SSD even if the caller has just probed the key and is likely to access it soon.
For workloads that use exist as a cache probe before a later read, an SSD-only hit is a strong signal that the object may become hot again. This RFC proposes an optional behavior: when exist finds that an object exists on SSD but has no DRAM replica, Mooncake Store can prefetch that object from SSD back into DRAM.
This is especially valuable for frameworks with asynchronous scheduling, which is enabled by default in vLLM. In those systems, an exist(prefetch=True) probe for a future request or future block can overlap with the current forward pass. The SSD read and DRAM materialization latency can therefore be hidden behind ongoing compute. If the prefetch completes before the later get, the final access observes a DRAM hit even though the original probe found the object only on SSD. In this mode, SSD hits can approach DRAM-hit behavior from the application's perspective, provided the scheduler issues probes early enough and the memory tier has enough capacity to hold the promoted objects.
The behavior is opt-in and disabled by default.
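The overlap described above can be illustrated with a toy model. This is only a sketch: `ToyStore`, `get_tier`, and `forward_pass` are hypothetical stand-ins, the sleeps are placeholder latencies, and the toy performs the prefetch synchronously inside the call, so the overlap comes from issuing the probe on a worker thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyStore:
    """Hypothetical stand-in for a Mooncake Store client; not the real API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.tier = {"block-7": "LOCAL_DISK"}  # key -> tier holding the object

    def is_exist(self, key, prefetch=False):
        with self._lock:
            tier = self.tier.get(key)
        if tier is None:
            return 0
        if prefetch and tier == "LOCAL_DISK":
            time.sleep(0.01)  # simulated SSD read + DRAM materialization
            with self._lock:
                self.tier[key] = "MEMORY"
        return 1

    def get_tier(self, key):
        with self._lock:
            return self.tier.get(key)


def forward_pass():
    time.sleep(0.02)  # simulated compute for the current step
    return "logits"


store = ToyStore()
with ThreadPoolExecutor(max_workers=1) as pool:
    probe = pool.submit(store.is_exist, "block-7", True)  # probe a future block
    output = forward_pass()                               # runs while the probe is in flight
    found = probe.result()

tier_at_read = store.get_tier("block-7")  # tier the later get would observe
```

In this toy run the later read sees the object in the memory tier, although the probe originally found it only on SSD.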
Goals
Allow Python users to opt in to SSD-to-DRAM prefetch on exist.
Preserve current exist behavior by default.
Only prefetch when the key exists on SSD, does not already have a complete DRAM replica, and does not have an in-flight DRAM put.
Prefer prefetching into the local DRAM segment of the requesting real client.
Fall back to another available memory segment when the local segment has no space.
Reuse normal memory allocation and eviction behavior. If global DRAM is full, normal eviction should be triggered.
Keep the exist return value compatible: 1 means exists, 0 means not exists, negative values remain errors.
Non-Goals
This RFC does not propose changing default exist semantics.
This RFC does not require exist to wait for prefetch completion in all modes.
This RFC does not introduce a new persistent replica type.
This RFC does not change SSD offload write policy, including offload_on_evict.
This RFC does not require prefetch for keys that already have a complete DRAM replica or an in-flight DRAM replica.
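API Proposal
Add an optional boolean flag, prefetch, to Python is_exist.
Default behavior remains unchanged: is_exist(key) is equivalent to is_exist(key, prefetch=False).
Opt-in prefetch: is_exist(key, prefetch=True).
When prefetch=True, is_exist should:
Return 0 if the key does not exist.
Return 1 immediately or after a best-effort prefetch attempt if the key exists.
Trigger prefetch only when the key has a complete LOCAL_DISK replica, no complete MEMORY replica, and no in-flight MEMORY replica.
The same flag can be added to batch exist as a follow-up.
Batch support is useful for KV cache block probes, but the single-key API is sufficient for the first implementation.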
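The call shape and return contract of the proposed flag can be sketched with a stub. `StubStore` and its `broken` switch are hypothetical, and the `-1` is a placeholder for whatever negative error code the real client returns; only the `is_exist(key, prefetch=...)` signature comes from this RFC.

```python
class StubStore:
    """Hypothetical stub; only mirrors the proposed is_exist signature."""

    def __init__(self, keys, broken=False):
        self._keys = set(keys)
        self._broken = broken
        self.prefetch_requests = []

    def is_exist(self, key, prefetch=False):
        if self._broken:
            return -1  # placeholder negative error code, not a real Mooncake value
        if key not in self._keys:
            return 0
        if prefetch:
            # A real client would start a best-effort promotion here.
            self.prefetch_requests.append(key)
        return 1


store = StubStore({"kv-block-1"})
default_result = store.is_exist("kv-block-1")                 # unchanged default
opt_in_result = store.is_exist("kv-block-1", prefetch=True)   # opt-in prefetch
missing_result = store.is_exist("kv-block-2", prefetch=True)  # missing key
error_result = StubStore(set(), broken=True).is_exist("kv-block-1")
```

Callers that ignore the flag see the same three-way result as before: 1, 0, or a negative value.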
Semantics
Existing Behavior
Current exist checks master metadata:
Key missing: return false.
Key exists but no complete replica: return false.
Key has at least one complete replica: grant lease and return true.
No object data is transferred.
Proposed Behavior with prefetch=False
No behavior change.
Proposed Behavior with prefetch=True
When the caller enables prefetch, the client should inspect replica placement after confirming the key exists:
If a complete MEMORY replica exists, return success without prefetch.
If a MEMORY replica is being written or otherwise in flight, return success without prefetch.
If no complete MEMORY replica exists, but a complete LOCAL_DISK replica exists, prefetch the object from SSD into DRAM.
If only a legacy DISK replica exists, do not prefetch in the initial implementation unless explicitly extended later.
If the key is missing or has no complete replica, return not-exist.
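The rules above reduce to a small pure function over the replica list. The sketch below assumes a simplified replica descriptor with a kind and a status; `Replica`, `Action`, and `decide` are illustrative names, not Mooncake types.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOT_EXIST = "return 0"
    NO_PREFETCH = "return 1 without prefetch"
    PREFETCH = "return 1 and prefetch from SSD"


@dataclass(frozen=True)
class Replica:
    kind: str    # "MEMORY", "LOCAL_DISK", or "DISK"
    status: str  # "COMPLETE" or "PROCESSING" (illustrative status names)


def decide(replicas):
    complete = [r for r in replicas if r.status == "COMPLETE"]
    if not complete:
        return Action.NOT_EXIST
    if any(r.kind == "MEMORY" for r in replicas):
        # A complete or in-flight MEMORY replica means there is nothing to promote.
        return Action.NO_PREFETCH
    if any(r.kind == "LOCAL_DISK" for r in complete):
        return Action.PREFETCH
    return Action.NO_PREFETCH  # e.g. only a legacy DISK replica
```

Keeping the decision free of side effects makes it easy to unit-test separately from allocation and transfer.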
Prefetch should be best effort from an API compatibility perspective:
If the key exists but prefetch fails due to transient allocation or transfer failure, exist may still return 1.
The failure should be logged and counted in metrics.
If metadata query itself fails, return the existing negative error code.
This keeps exist as an existence API rather than making it a strict data movement API.
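A minimal sketch of the best-effort contract, assuming the metadata query and the prefetch are passed in as callables. `exist_with_prefetch`, the `metrics` dictionary, and the counter name are hypothetical, and the placement check is omitted for brevity.

```python
import logging

logger = logging.getLogger("prefetch_sketch")
metrics = {"prefetch_failures": 0}  # hypothetical counter name


def exist_with_prefetch(query_metadata, prefetch_fn, key):
    """query_metadata returns a negative code, 0, or 1; prefetch_fn may raise."""
    status = query_metadata(key)
    if status <= 0:
        return status  # error code or not-exist passes through unchanged
    try:
        prefetch_fn(key)
    except Exception as exc:  # allocation or transfer failure
        metrics["prefetch_failures"] += 1
        logger.warning("prefetch failed for %s: %s", key, exc)
    return 1  # the object exists regardless of the prefetch outcome


def failing_prefetch(key):
    raise MemoryError("no DRAM segment could satisfy the allocation")
```

Only a failed metadata query changes the return value; a failed prefetch is visible through the log and the counter.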
Placement Policy
Prefetch should use one memory replica by default.
The target placement should follow this order:
Prefer the local memory segment of the requesting real client.
If local allocation fails, allocate from any available memory segment.
If DRAM is globally full, rely on the normal allocation path to trigger eviction.
If allocation still fails after eviction, treat prefetch as failed but keep the existence result.
This mirrors the intent of the existing "prefer local segment" behavior used when putting data from HBM or local buffers: local placement is preferred for read locality, but the system should still make progress when local DRAM is full.
Implementation-wise, the prefetch allocation should use ReplicateConfig with:
replica_num = 1
preferred_segment = local_hostname
or an equivalent preferred-segment list. The allocation strategy should try the preferred segment first, then fall back to other segments if the preferred segment cannot satisfy the allocation.
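The preferred-then-fallback order, with one retry after eviction, might look like the sketch below. It models segments as a plain mapping of free bytes; `allocate` and the `evict` callback are hypothetical and do not reflect the real allocator interface.

```python
def allocate(free_bytes, size, preferred, evict=None):
    """Return the segment chosen for one replica, or None if allocation fails.

    free_bytes maps segment name -> free bytes; evict is an optional callback
    that frees space in free_bytes and returns True if it changed anything.
    """
    def try_once():
        # Preferred segment first, then every other segment in a stable order.
        order = [preferred] + sorted(s for s in free_bytes if s != preferred)
        for segment in order:
            if free_bytes.get(segment, 0) >= size:
                free_bytes[segment] -= size
                return segment
        return None

    chosen = try_once()
    if chosen is None and evict is not None and evict(free_bytes):
        chosen = try_once()  # one retry after the eviction callback ran
    return chosen
```

A `None` result corresponds to the last rule in the list: the prefetch is treated as failed while the existence result is kept.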
Data Flow
The high-level decision flow is:
flowchart TD
A["Python calls is_exist(key, prefetch)"] --> B{"prefetch enabled?"}
B -- "No" --> C["Use existing ExistKey path"]
C --> Z["Return existing result"]
B -- "Yes" --> D["RealClient queries metadata"]
D --> E{"Key has any complete replica?"}
E -- "No" --> F["Return 0"]
E -- "Yes" --> G{"Has complete MEMORY replica?"}
G -- "Yes" --> H["Return 1 without prefetch"]
G -- "No" --> I{"Has in-flight MEMORY replica?"}
I -- "Yes" --> J["Return 1 without prefetch"]
I -- "No" --> K{"Has complete LOCAL_DISK replica?"}
K -- "No" --> L["Return 1 without prefetch"]
K -- "Yes" --> M["Allocate MEMORY replica, prefer local segment"]
M --> N{"Local segment has space?"}
N -- "Yes" --> P["Use local DRAM target"]
N -- "No" --> O["Try another memory segment"]
O --> Q{"Global DRAM needs eviction?"}
Q -- "Yes" --> R["Trigger normal eviction path"]
Q -- "No" --> S["Use remote DRAM target if allocated"]
R --> S
P --> T["Read object from LOCAL_DISK"]
S --> T
T --> U["Write object into allocated MEMORY replica"]
U --> V["Mark MEMORY replica complete"]
V --> W["Return 1"]
M --> X{"Allocation or transfer failed?"}
X -- "Yes" --> Y["Log and count prefetch failure; return 1"]
The concrete SSD-to-DRAM prefetch path is:
Python
|
| is_exist(key, prefetch=True)
v
RealClient
|
| Query metadata
v
Master
|
| replicas contain COMPLETE LOCAL_DISK
| and do not contain COMPLETE MEMORY
| and do not contain in-flight MEMORY
v
RealClient
|
| allocate MEMORY replica, prefer local segment
v
Master
|
| PutStart/PrefetchStart allocates DRAM
v
RealClient
|
| read from LOCAL_DISK via offload RPC
| write into allocated MEMORY replica
v
Master
|
| PutEnd/PrefetchEnd marks MEMORY replica complete
v
Python
|
| returns 1
The prefetch path can be implemented as a specialized internal copy from LOCAL_DISK to MEMORY:
Query replicas for the key.
Select a complete LOCAL_DISK source replica only if no complete or in-flight MEMORY replica exists.
Allocate a new MEMORY replica for the same key using preferred-local placement.
Read the object from SSD into the allocated DRAM buffer.
Mark the new MEMORY replica complete.
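The five steps can be traced in a toy model where the master is a dictionary of replica states and the SSD and DRAM tiers are dictionaries of bytes. `ToyMaster` and `prefetch_from_ssd` are hypothetical names, and the status strings are illustrative.

```python
class ToyMaster:
    """Hypothetical metadata model: key -> {replica kind: status}."""

    def __init__(self):
        self.replicas = {}

    def query(self, key):
        return dict(self.replicas.get(key, {}))

    def start_memory_replica(self, key):
        self.replicas[key]["MEMORY"] = "PROCESSING"

    def finish_memory_replica(self, key):
        self.replicas[key]["MEMORY"] = "COMPLETE"


def prefetch_from_ssd(master, ssd, dram, key):
    """Return True if this call copied the object from SSD into DRAM."""
    replicas = master.query(key)                # 1. query replicas
    if replicas.get("LOCAL_DISK") != "COMPLETE" or "MEMORY" in replicas:
        return False                            # 2. no eligible source
    master.start_memory_replica(key)            # 3. allocate a MEMORY replica
    dram[key] = bytes(ssd[key])                 # 4. read SSD into the DRAM buffer
    master.finish_memory_replica(key)           # 5. mark the replica complete
    return True
```

A second call for the same key returns False because a MEMORY replica is already recorded, which matches the "no complete or in-flight MEMORY replica" precondition.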
Master-Side Requirements
The current master ExistKey API only returns a boolean. To implement prefetch, the caller needs replica placement information. There are two possible approaches:
Option A: Client-side query before or after ExistKey
Keep master ExistKey unchanged. When prefetch=True, the real client uses the normal query path to fetch replica descriptors, then decides whether prefetch is needed.
Pros:
Minimal change to existing ExistKey.
Keeps default exist fast.
Reuses existing replica selection helpers.
Cons:
exist(prefetch=True) may require an additional metadata query.
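Option B: Extend ExistKey response
Introduce a richer RPC response for prefetch-capable exist.
Recommendation: start with Option A. It is simpler and keeps the existing ExistKey RPC stable.
Client-Side Requirements
Add new client-layer methods, for example an isExist overload that takes the prefetch flag.
For Python binding:
.def(
    "is_exist",
    [](MooncakeStorePyWrapper& self, const std::string& key, bool prefetch) {
        py::gil_scoped_release release;
        return self.store_->isExist(key, prefetch);
    },
    py::arg("key"), py::arg("prefetch") = false)
The existing overload without the flag should continue to work.
Prefetch Operation
The real client should provide an internal helper for the SSD-to-DRAM copy. The helper should:
Check whether a complete or in-flight MEMORY replica already exists.
Select a complete LOCAL_DISK replica as the source only when no MEMORY replica is complete or in flight.
Allocate one new memory replica, preferring the local segment.
Use the existing SSD read path to load data from LOCAL_DISK.
Transfer the loaded data into the allocated memory replica.
Complete the memory replica in master metadata.
Handle duplicate races idempotently.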
Races are expected. For example, another client may prefetch or put the same key concurrently. If a complete or in-flight memory replica appears before completion, the prefetch should be treated as successful or safely revoked. In particular, a key with a MEMORY replica in a PROCESSING state should not start SSD-to-DRAM prefetch, because the normal put path is already materializing that key in DRAM.
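One way to make concurrent prefetches idempotent is to let exactly one caller claim the key before copying. The sketch below assumes a master-side table guarded by a lock; `MemoryReplicaTable` and `try_start` are hypothetical names, not existing Mooncake interfaces.

```python
import threading


class MemoryReplicaTable:
    """Hypothetical master-side table of MEMORY replica states per key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # key -> "PROCESSING" or "COMPLETE"

    def try_start(self, key):
        """Claim the key for materialization; False if it is already claimed."""
        with self._lock:
            if key in self._state:
                return False
            self._state[key] = "PROCESSING"
            return True

    def finish(self, key):
        with self._lock:
            self._state[key] = "COMPLETE"

    def state(self, key):
        with self._lock:
            return self._state.get(key)


table = MemoryReplicaTable()
winners = []
barrier = threading.Barrier(4)


def worker(name):
    barrier.wait()  # release all contenders together
    if table.try_start("block-7"):
        winners.append(name)  # only the winner would copy from SSD
        table.finish("block-7")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Callers that lose the claim can simply report success, since a MEMORY replica is already complete or being materialized.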
Eviction Behavior
Prefetch allocation should use the same memory allocation path as normal puts. Therefore:
If local DRAM is full, allocation should try other eligible segments.
If global DRAM is full, normal eviction should be triggered.
If eviction selects objects for SSD offload, existing offload_on_evict behavior should apply.
The prefetched object should receive the usual lease or soft-pin treatment to avoid immediate eviction while the caller is likely to use it.
The RFC does not propose a special eviction policy for prefetched objects. They should participate in normal LRU/lease-based eviction after being materialized in memory.
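The interaction between a lease and ordinary eviction can be illustrated with a toy cache that counts objects rather than bytes. `LeaseLRU` is a hypothetical model: it evicts in insertion order, skips entries whose lease has not expired, and does not represent soft pinning or the actual Mooncake eviction code.

```python
from collections import OrderedDict


class LeaseLRU:
    """Toy cache; entries with an unexpired lease are skipped during eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> lease expiry time, oldest first

    def put(self, key, now, lease_ttl=0.0):
        self.entries.pop(key, None)
        while len(self.entries) >= self.capacity:
            # Oldest entry whose lease has expired, if any.
            victim = next(
                (k for k, expiry in self.entries.items() if expiry <= now), None
            )
            if victim is None:
                return False  # everything is leased; allocation fails
            del self.entries[victim]
        self.entries[key] = now + lease_ttl
        return True
```

Once the lease expires, the prefetched entry is evicted like any other, which is the "no special eviction policy" behavior stated above.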
Error Handling
Recommended behavior:
Metadata query error: return negative error code.
Key not found: return 0.
Key exists in memory: return 1.
Key has an in-flight memory replica: return 1 without prefetch.
Key exists only on SSD and prefetch succeeds: return 1.
Key exists only on SSD and prefetch fails: return 1, log the prefetch failure, and update a metric.
Returning 1 on prefetch failure is intentional because the object does exist. Applications that require strict materialization should use a future explicit prefetch API or call get.
Metrics
TBD
Compatibility
This change is backward compatible:
Python is_exist(key) keeps the same behavior.
Default prefetch=False avoids unexpected data movement.
Existing users who rely on exist as a cheap metadata check are not affected.
The behavior is opt-in per call, so higher-level systems can enable it only for cache probes likely to be followed by reads.
Summary
This RFC proposes an opt-in exist prefetch mode for Mooncake Store. When memory and SSD offload are both enabled, is_exist(key, prefetch=True) can promote SSD-only objects back into DRAM, preferring the caller's local segment and falling back to other memory segments when local DRAM is full.
The default remains unchanged. The feature improves cache-warming behavior for callers that use exist as a predictor of near-future access, while preserving the lightweight semantics of existing exist calls.
cc List: @ykwd @ascend-direct-dev @LCAIZJ @LujhCoconut
Feedback is welcome!