Skip to content

[Store] Prefer LOCAL_DISK replica when both LOCAL_DISK and DISK types exist#1963

Open
ertcmm wants to merge 1 commit into
kvcache-ai:mainfrom
ertcmm:feat-localdisk_first
Open

[Store] Prefer LOCAL_DISK replica when both LOCAL_DISK and DISK types exist#1963
ertcmm wants to merge 1 commit into
kvcache-ai:mainfrom
ertcmm:feat-localdisk_first

Conversation

@ertcmm

@ertcmm ertcmm commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Description

This PR fixes a data retrieval failure and routing bug in RealClient during SSD Offloading when an object simultaneously possesses both LOCAL_DISK and DISK replicas. It ensures proper data retrieval functionality when LOCAL_DISK and DISK mutually coexist after an object's memory replica has been evicted.

The Issue:

  1. Ignored Priority System (Blind Indexing): In RealClient::batch_get_into_internal (and identically across batch_get_buffer_internal), the fetch dispatcher skipped GetPreferredReplica() entirely and naively locked onto query_result_values.replicas[0].
  2. The system was coerced to bypass the dedicated high-efficiency SSD offline proxy (batch_get_into_offload_object_internal) and push the LOCAL_DISK item into the standard Client::BatchGet pipeline mappings. Standard backend transfers failed to parse or resolve LocalDiskDescriptor schemas concurrently, resulting in broken routes and failures.

Fix

  • Removed Hardcoded Indexing (Blind Indexing): Rewrote RealClient fetch pipelines to strictly call client_->GetPreferredReplica() uniformly across all batch operations instead of blindly seizing replicas[0].
  • Removed Restrictive Routing Constraints: Completely eliminated the replicas.size() == 1 precondition. The path dispatcher now seamlessly routes data internally based purely on whether the optimal replica evaluates to .is_local_disk_replica().
  • Secured Target Fetch Accuracy (Key Operations): Updated KeyOp properties to meticulously cache and utilize the specifically chosen preferred_replica. This guarantees that configurations with [LOCAL_DISK, DISK] naturally and safely detour into the batch_get_into_offload_object_internal channel, averting parsing collisions.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

The modifications were verified by forcing local memory eviction policies triggering simultaneous LOCAL_DISK and global DISK entries bound to target objects.

  • Assured the RealClient arrays reliably generated payload size arrays via combinations mimicking [LOCAL_DISK, DISK].
  • Validated that GetPreferredReplica consistently selects the .is_local_disk_replica().
  • Successfully validated that eliminating the size == 1 check correctly delegates these payloads back natively into batch_get_into_offload_object_internal, securing safe operations away from legacy paths.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@ertcmm ertcmm changed the title feat: Prefer LOCAL_DISK replica when both LOCAL_DISK and DISK types exist [Store] Prefer LOCAL_DISK replica when both LOCAL_DISK and DISK types exist Apr 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for local disk replicas within the mooncake-store. Key changes include the addition of a completion status check for replicas and a revised selection strategy in GetPreferredReplica that prioritizes local memory, remote memory, local disk, and global disk in descending order. The RealClient has been updated to handle local disk offloading for single and batch retrieval operations. Review feedback identifies several critical issues: the local disk offload path currently lacks support for multi-slice objects, which could lead to data loss, and the use of key-based maps for tracking operations fails to account for duplicate keys in input vectors, potentially leaving buffers uninitialized. Additionally, an optimization was suggested to improve the efficiency of processing batch results from multiple storage nodes.

Comment on lines +1843 to +1845
if (replica.is_local_disk_replica()) {
std::unordered_map<std::string, Slice> slices_map;
slices_map.emplace(key, slices.at(0));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The LOCAL_DISK offload path currently only supports a single slice per key. If the object size exceeds kMaxSliceSize, allocateSlices will produce multiple slices, but only the first one is fetched here, leading to incomplete data retrieval. A check should be added to ensure slices.size() == 1 before proceeding.

    if (replica.is_local_disk_replica()) {
        if (slices.size() != 1) {
            LOG(ERROR) << "Local disk offload currently only supports 1 slice per key, given: "
                       << slices.size() << " for key: " << key;
            return nullptr;
        }
        std::unordered_map<std::string, Slice> slices_map;
        slices_map.emplace(key, slices.at(0));

Comment on lines +2180 to +2185
for (const auto &op : valid_local_disk_ops) {
const auto &replica = op.preferred_replica;
auto [it, _] = offload_objects.try_emplace(
replica.get_local_disk_descriptor().transport_endpoint);
it->second.emplace(op.key, op.slices.at(0));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two issues here:

  1. Similar to get_buffer_internal, this path only fetches the first slice (slices.at(0)). If an object is split into multiple slices, data will be lost. A check for op.slices.size() == 1 is needed.
  2. If the input keys vector contains duplicate entries, it->second.emplace will only store the first occurrence. Consequently, only the buffer for the first occurrence will be filled, while subsequent occurrences will remain uninitialized but still be marked as success in the result processing loop (lines 2195-2201).

Comment on lines +3301 to 3302
if (replica.is_local_disk_replica()) {
valid_local_disk_operations.emplace(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

valid_local_disk_operations is a std::unordered_map, which means if the input keys vector contains duplicate entries, only the first occurrence will be recorded and fetched. However, results[i] is set to success for all occurrences (lines 3310 and 3323), leading to uninitialized buffers for duplicate keys. Positional tracking (e.g., using indices) should be used instead of a key-based map for destination buffers.

Comment on lines +3651 to +3652
valid_local_disk_operations.emplace(
key,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to batch_get_into_internal, using a map for valid_local_disk_operations causes issues when duplicate keys are present in the input. Only the first occurrence is fetched, but all are marked as success in the results vector.

Comment on lines +2195 to +2201
for (auto &op : valid_local_disk_ops) {
if (offload_objects_it.second.count(op.key)) {
final_results[op.original_index] =
std::make_shared<BufferHandle>(
std::move(*op.buffer_handle));
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This nested loop has $O(N_{endpoints} \times N_{ops})$ complexity. For large batches and multiple storage nodes, this can be inefficient. Consider grouping valid_local_disk_ops by endpoint during the initial pass or using a mapping from key to operations to speed up result processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant