feat(store): route NoF replicas through put and get#2247
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for SPDK NoF (NVMe-oF) replicas, including NUMA socket auto-detection, NoF-specific transfer submissions, and enhanced batch finalization logic to handle flexible dual-replica write modes. The review feedback highlights critical security and correctness issues: potential buffer overflows and memory corruption across several transfer paths (BatchGet, SubmitTransfers, and TransferData) if multiple non-contiguous slices are passed for NoF transfers, a resource leak in FinalizeBatchPut where allocated replicas are not revoked for early-failed operations, and an issue where early failure error codes are overwritten during finalization.
b5aede1 to
810da11
Compare
810da11 to
8f7c39e
Compare
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
I noticed your changes to if (client_cfg.prefer_alloc_in_same_node) {
if (client_cfg.replica_num != 1) { // ← only checks memory replica
LOG(ERROR) << "prefer_alloc_in_same_node is not supported with "
"replica_num != 1";
return std::vector<<tl::expected<void, ErrorCode>>(
keys.size(), tl::unexpected(ErrorCode::INVALID_PARAMS));
}
StartBatchPut(ops, client_cfg);
return BatchPutWhenPreferSameNode(ops); // ← enters the MEMORY-only path
} |
| VLOG(1) << "Successfully completed put for key " | ||
| << successful_keys[i]; | ||
| } | ||
| return; |
There was a problem hiding this comment.
When responses.size() != group.keys.size(), the function returns after setting finalize_rpc_errors but does not decrement pending_finalize_actions. The RPC has already completed (just with a wrong response count), so the counter should be decremented. Otherwise the final loop sees pending_finalize_actions[i] != 0 and reports "Operation has unfinished finalize actions" — which is misleading, since the RPC did finish, it just returned unexpected results.
Suggested fix. add --pending_finalize_actions[idx] inside the mismatch branch:
if (responses.size() != group.keys.size()) {
for (size_t idx : group.indices) {
finalize_rpc_errors[idx] = ErrorCode::RPC_FAIL;
--pending_finalize_actions[idx];
}
return;
} | LOG(INFO) << "Successfully revoked failed put for key " | ||
| << failed_keys[i]; | ||
| } | ||
| return; |
There was a problem hiding this comment.
same as upper comments. add --pending_finalize_actions[idx] inside the mismatch branch:
| auto add_finalize_action = | ||
| [&](const std::optional<ReplicaType>& replica_type, bool is_end, | ||
| const std::string& key, size_t index) { | ||
| if (!replica_type.has_value()) { | ||
| return; | ||
| } | ||
| ++pending_finalize_actions[index]; | ||
| switch (*replica_type) { | ||
| case ReplicaType::ALL: | ||
| add_group_entry(is_end ? end_all_group : revoke_all_group, | ||
| key, index); | ||
| break; | ||
| case ReplicaType::MEMORY: | ||
| add_group_entry( | ||
| is_end ? end_memory_group : revoke_memory_group, key, | ||
| index); | ||
| break; | ||
| case ReplicaType::NOF_SSD: | ||
| add_group_entry(is_end ? end_nof_group : revoke_nof_group, | ||
| key, index); | ||
| break; | ||
| default: | ||
| LOG(ERROR) << "Unexpected replica type in batch finalize: " | ||
| << *replica_type; | ||
| finalize_rpc_errors[index] = ErrorCode::INVALID_PARAMS; | ||
| break; | ||
| } | ||
| }; |
There was a problem hiding this comment.
++pending_finalize_actions[index] is executed before the switch, but the default branch never decrements it. If a new ReplicaType is added in the future without a corresponding case here, the count will never reach zero and the final loop will report "unfinished finalize actions".
Suggested fix (pick one):
- Add
--pending_finalize_actions[index]in the default branch - Remove the default branch entirely and let the compiler enforce exhaustive coverage (the compiler will warn when a new enum value is added):
switch (*replica_type) {
case ReplicaType::ALL: ... break;
case ReplicaType::MEMORY: ... break;
case ReplicaType::NOF_SSD: ... break;
// no default — compiler warns on new enum values
} | struct FinalizeDecision { | ||
| std::optional<ReplicaType> end_type; | ||
| std::optional<ReplicaType> revoke_type; | ||
| bool success = false; | ||
| ErrorCode error = ErrorCode::OK; | ||
| }; | ||
|
|
||
| FinalizeDecision DetermineFinalizeDecision( | ||
| const ReplicateConfig& config, const ReplicaTransferSummary& summary) { | ||
| const auto write_mode = DetermineReplicaWriteMode(config); | ||
| const bool allocation_satisfied = | ||
| HasExpectedReplicaAllocation(config, summary); | ||
|
|
There was a problem hiding this comment.
The success field has different meanings depending on the write mode:
Non-FLEXIBLE: success = truemeans all replicas transferred successfullyFLEXIBLE_DUAL_REPLICA: success = truemeans at least one replica type succeeded (bothend_typeandrevoke_typemay be set simultaneously)
Suggest adding a comment on the FinalizeDecision struct or DetermineFinalizeDecision function to clarify this distinction for future maintainers.
| PutOperation(std::string_view k, const std::vector<Slice>& s) | ||
| : key(k), slices(s) { | ||
| value_length = CalculateSliceSize(slices); | ||
| ptr = ((!s.empty()) ? slices[0].ptr : nullptr); |
There was a problem hiding this comment.
The ptr field is assigned in the constructor, but it appears to be never read anywhere in this PR—it looks like dead code. Will a subsequent PR use it? If not, I suggest removing it.
|
Glad to review this follow-up PR — it's overall clean. My main concerns are in the comments above. Thanks for your patience. |
Description
This PR is the third split-out part of the original NVMe-oF SSD cache support patch.
It builds on #2143 and #2172. #2143 introduced master-side NoF segment metadata and control-plane support, and #2172 added the SPDK/NVMe-oF worker pool plus low-level data-plane primitives. This PR wires those pieces into the regular Mooncake Store put/get paths so NoF replicas can be selected, written, read, finalized, and revoked through the existing client workflow.
This PR does not add the NoF end-to-end test suite or deployment/registration tooling. Those will be handled in follow-up PRs.
Changes
NOF_SSDreplicas through regular StoreGet/BatchGetpaths.NOF_SSDreplicas through regular StorePut/BatchPutpaths.MEMORYandNOF_SSDreplicas.ReplicaType::ALLReplicaType::MEMORYReplicaType::NOF_SSDMEMORYNOF_SSDMC_STORE_NUMA_SOCKET_IDor current CPU NUMA node.Scope
This PR only connects NoF replicas to the Store client put/get data path.
The remaining NoF SSD support is intentionally left for follow-up PRs:
Related: #2084, #2143, #2172, #1940
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-pg)mooncake-rl)Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.shbefore submitting.