[Store] Prefer LOCAL_DISK replica when both LOCAL_DISK and DISK types exist #1963
ertcmm wants to merge 1 commit
Code Review
This pull request introduces support for local disk replicas within the mooncake-store. Key changes include the addition of a completion status check for replicas and a revised selection strategy in GetPreferredReplica that prioritizes local memory, remote memory, local disk, and global disk in descending order. The RealClient has been updated to handle local disk offloading for single and batch retrieval operations. Review feedback identifies several critical issues: the local disk offload path currently lacks support for multi-slice objects, which could lead to data loss, and the use of key-based maps for tracking operations fails to account for duplicate keys in input vectors, potentially leaving buffers uninitialized. Additionally, an optimization was suggested to improve the efficiency of processing batch results from multiple storage nodes.
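The selection order described in the summary can be illustrated with a minimal sketch. The names below (`ReplicaKind`, `ReplicaSketch`, `Rank`, `PickPreferred`) are hypothetical stand-ins and not the mooncake-store types, and the sketch assumes that replicas failing the completion check are simply skipped, which the summary does not state explicitly.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; not the mooncake-store types.
enum class ReplicaKind { LocalMemory, RemoteMemory, LocalDisk, GlobalDisk };

struct ReplicaSketch {
    ReplicaKind kind;
    bool completed;  // models the completion status check (assumed semantics)
};

// Lower rank means higher preference, following the order in the summary:
// local memory, remote memory, local disk, global disk.
int Rank(ReplicaKind kind) {
    switch (kind) {
        case ReplicaKind::LocalMemory:  return 0;
        case ReplicaKind::RemoteMemory: return 1;
        case ReplicaKind::LocalDisk:    return 2;
        case ReplicaKind::GlobalDisk:   return 3;
    }
    return 4;
}

// Returns the index of the best completed replica, or nullopt if none is usable.
std::optional<std::size_t> PickPreferred(
    const std::vector<ReplicaSketch>& replicas) {
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < replicas.size(); ++i) {
        if (!replicas[i].completed) continue;  // assumption: skip incomplete replicas
        if (!best || Rank(replicas[i].kind) < Rank(replicas[*best].kind)) {
            best = i;
        }
    }
    return best;
}
```

With a replica list of `[GlobalDisk, LocalDisk]`, this sketch returns index 1, which mirrors the case the PR targets: the local disk replica wins even when it is not first in the list.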
if (replica.is_local_disk_replica()) {
    std::unordered_map<std::string, Slice> slices_map;
    slices_map.emplace(key, slices.at(0));
The LOCAL_DISK offload path currently only supports a single slice per key. If the object size exceeds kMaxSliceSize, allocateSlices will produce multiple slices, but only the first one is fetched here, leading to incomplete data retrieval. A check should be added to ensure slices.size() == 1 before proceeding.
if (replica.is_local_disk_replica()) {
    if (slices.size() != 1) {
        LOG(ERROR) << "Local disk offload currently only supports 1 slice per key, given: "
                   << slices.size() << " for key: " << key;
        return nullptr;
    }
    std::unordered_map<std::string, Slice> slices_map;
    slices_map.emplace(key, slices.at(0));

for (const auto &op : valid_local_disk_ops) {
    const auto &replica = op.preferred_replica;
    auto [it, _] = offload_objects.try_emplace(
        replica.get_local_disk_descriptor().transport_endpoint);
    it->second.emplace(op.key, op.slices.at(0));
}
There are two issues here:
- Similar to get_buffer_internal, this path only fetches the first slice (slices.at(0)). If an object is split into multiple slices, data will be lost. A check for op.slices.size() == 1 is needed.
- If the input keys vector contains duplicate entries, it->second.emplace will only store the first occurrence. Consequently, only the buffer for the first occurrence will be filled, while subsequent occurrences will remain uninitialized but still be marked as success in the result processing loop (lines 2195-2201).
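The duplicate-key behavior mentioned in the second point follows from how `std::unordered_map::emplace` works: it does nothing when the key is already present. A minimal standalone sketch, using a hypothetical helper `FirstIndexByKey` that is not part of the PR:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper for illustration: records the input position of each key.
// Because emplace does not overwrite an existing entry, every duplicate after
// the first occurrence is silently dropped.
std::unordered_map<std::string, std::size_t> FirstIndexByKey(
    const std::vector<std::string>& keys) {
    std::unordered_map<std::string, std::size_t> first_index;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        first_index.emplace(keys[i], i);  // no-op if keys[i] is already stored
    }
    return first_index;
}
```

For the input `{"a", "b", "a"}` the map ends up with two entries, and the entry for `"a"` refers to position 0 only. Position 2 is never represented, so a per-key lookup cannot tell that a second destination buffer was requested.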
if (replica.is_local_disk_replica()) {
    valid_local_disk_operations.emplace(
valid_local_disk_operations is a std::unordered_map, which means if the input keys vector contains duplicate entries, only the first occurrence will be recorded and fetched. However, results[i] is set to success for all occurrences (lines 3310 and 3323), leading to uninitialized buffers for duplicate keys. Positional tracking (e.g., using indices) should be used instead of a key-based map for destination buffers.
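One way to read the positional-tracking suggestion is sketched below. This is an assumption about the shape of a fix, not the PR's code: `FillByPosition` is a hypothetical function, and plain `std::string` values stand in for the real destination buffers. The idea is to iterate over input positions rather than unique keys, and to mark a position as successful only when its own buffer was actually filled.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: fill one destination buffer per input position.
// `fetched` holds the data retrieved once per unique key.
std::vector<bool> FillByPosition(
    const std::vector<std::string>& keys,
    const std::unordered_map<std::string, std::string>& fetched,
    std::vector<std::string>& buffers) {
    buffers.assign(keys.size(), std::string());
    std::vector<bool> ok(keys.size(), false);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        auto it = fetched.find(keys[i]);
        if (it == fetched.end()) continue;  // nothing fetched: stays a failure
        buffers[i] = it->second;            // duplicates each get their own copy
        ok[i] = true;                       // success only after the buffer is filled
    }
    return ok;
}
```

With keys `{"a", "b", "a", "c"}` and fetched data for `"a"` and `"b"` only, positions 0 through 2 are filled and reported as successful, while position 3 is reported as a failure instead of being left uninitialized behind a success flag.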
valid_local_disk_operations.emplace(
    key,

for (auto &op : valid_local_disk_ops) {
    if (offload_objects_it.second.count(op.key)) {
        final_results[op.original_index] =
            std::make_shared<BufferHandle>(
                std::move(*op.buffer_handle));
    }
}
Description
This PR fixes a data retrieval failure and routing bug in RealClient during SSD offloading when an object has both LOCAL_DISK and DISK replicas. It ensures that retrieval works when LOCAL_DISK and DISK replicas coexist after the object's memory replica has been evicted.

The Issue:
- In RealClient::batch_get_into_internal (and likewise in batch_get_buffer_internal), the fetch dispatcher skipped GetPreferredReplica() entirely and always used query_result_values.replicas[0].
- [...] batch_get_into_offload_object_internal) and push the LOCAL_DISK item into the standard Client::BatchGet pipeline mappings. The standard backend transfers could not parse or resolve LocalDiskDescriptor schemas, resulting in broken routes and failures.

Fix
- [...] RealClient fetch pipelines to call client_->GetPreferredReplica() uniformly across all batch operations instead of using replicas[0].
- [...] replicas.size() == 1 precondition. The path dispatcher now routes data based only on whether the preferred replica reports .is_local_disk_replica().
- [...] KeyOp properties to cache and use the chosen preferred_replica. This ensures that configurations with [LOCAL_DISK, DISK] take the batch_get_into_offload_object_internal path, avoiding the parsing failures.

Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl

Type of Change
How Has This Been Tested?
The changes were verified by forcing local memory eviction so that the target objects held both LOCAL_DISK and global DISK replicas.
- [...] [LOCAL_DISK, DISK].
- GetPreferredReplica consistently selects the [...] .is_local_disk_replica() [...] .size == 1 check correctly delegates these payloads to batch_get_into_offload_object_internal, keeping them off the legacy paths.

Checklist
- [...] ./scripts/code_format.sh before submitting.