feat(store): route NoF replicas through put and get#2247

Open
Enigmo-x wants to merge 1 commit into
kvcache-ai:mainfrom
Enigmo-x:dev_nof_ssd_split

@Enigmo-x
Contributor

Description

This PR is the third split-out part of the original NVMe-oF SSD cache support patch.

It builds on #2143 and #2172. #2143 introduced master-side NoF segment metadata and control-plane support, and #2172 added the SPDK/NVMe-oF worker pool plus low-level data-plane primitives. This PR wires those pieces into the regular Mooncake Store put/get paths so NoF replicas can be selected, written, read, finalized, and revoked through the existing client workflow.

This PR does not add the NoF end-to-end test suite or deployment/registration tooling. Those will be handled in follow-up PRs.

Changes

  • Route NOF_SSD replicas through regular Store Get / BatchGet paths.
  • Route NOF_SSD replicas through regular Store Put / BatchPut paths.
  • Pass NoF transfer buffer pointer and total size into the transfer submitter so the SPDK NoF path can issue I/O correctly.
  • Add per-replica transfer accounting for MEMORY and NOF_SSD replicas.
  • Finalize put operations with the proper replica type:
    • ReplicaType::ALL
    • ReplicaType::MEMORY
    • ReplicaType::NOF_SSD
  • Revoke failed or partially failed replica writes with the matching replica type.
  • Support flexible dual-replica write behavior where either memory or NoF writes may complete independently.
  • Prefer local replicas in get selection order:
    • local MEMORY
    • local NOF_SSD
    • fallback replica
  • Initialize the NoF transfer submitter with a NUMA socket id, taken either from MC_STORE_NUMA_SOCKET_ID or from the current CPU's NUMA node.
  • Keep existing memory, disk, local-disk, and non-NoF paths unchanged.
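The get selection order listed above can be sketched as follows. This is a minimal illustration, not the code in this PR: `ReplicaInfo`, its `host` field, and `PickReplicaForGet` are hypothetical names, and "fallback replica" is assumed here to mean the first replica in the list.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the Store's replica descriptors.
enum class ReplicaType { MEMORY, NOF_SSD };

struct ReplicaInfo {
    ReplicaType type;
    std::string host;  // node that owns the replica (assumed field)
};

// Returns the index of the replica to read from, preferring local MEMORY,
// then local NOF_SSD, then the first replica as a fallback.
std::optional<std::size_t> PickReplicaForGet(
    const std::vector<ReplicaInfo>& replicas, const std::string& local_host) {
    if (replicas.empty()) return std::nullopt;
    std::optional<std::size_t> local_nof;
    for (std::size_t i = 0; i < replicas.size(); ++i) {
        if (replicas[i].host != local_host) continue;
        // A local MEMORY replica always wins, so return immediately.
        if (replicas[i].type == ReplicaType::MEMORY) return i;
        // Remember the first local NOF_SSD replica as the second choice.
        if (!local_nof) local_nof = i;
    }
    if (local_nof) return local_nof;
    return 0;  // assumed fallback: first replica in the list
}
```

Returning an index rather than a copy keeps the caller free to look up whatever descriptor type the real client uses.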

Scope

This PR only connects NoF replicas to the Store client put/get data path.

The remaining NoF SSD support is intentionally left for follow-up PRs:

  1. NoF end-to-end tests and benchmark coverage
  2. Deployment and SSD registration tools

Related: #2084, #2143, #2172, #1940

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for SPDK NoF (NVMe-oF) replicas, including NUMA socket auto-detection, NoF-specific transfer submissions, and enhanced batch finalization logic to handle flexible dual-replica write modes. The review feedback highlights critical security and correctness issues: potential buffer overflows and memory corruption across several transfer paths (BatchGet, SubmitTransfers, and TransferData) if multiple non-contiguous slices are passed for NoF transfers, a resource leak in FinalizeBatchPut where allocated replicas are not revoked for early-failed operations, and an issue where early failure error codes are overwritten during finalization.

Comment thread mooncake-store/src/client_service.cpp
Comment thread mooncake-store/src/client_service.cpp
Comment thread mooncake-store/src/client_service.cpp
Comment thread mooncake-store/src/client_service.cpp
Comment thread mooncake-store/src/client_service.cpp
@Enigmo-x Enigmo-x force-pushed the dev_nof_ssd_split branch from b5aede1 to 810da11 Compare May 28, 2026 03:04
@Enigmo-x Enigmo-x force-pushed the dev_nof_ssd_split branch from 810da11 to 8f7c39e Compare May 28, 2026 06:27
@codecov-commenter

Codecov Report

❌ Patch coverage is 47.74011% with 185 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
mooncake-store/src/client_service.cpp | 47.74% | 185 Missing ⚠️

@LujhCoconut
Collaborator

I noticed your changes to BatchPutWhenPreferSameNode. After tracing the function call chain, I found that this function is invoked from BatchPut. However, BatchPut does not validate NOF replicas, so the combination of nof_replica_num=1 and prefer_alloc_in_same_node=true can pass the entry checks and enter BatchPutWhenPreferSameNode. Inside that function, all non-MEMORY replicas are immediately rejected by if (!replica.is_memory_replica()). This is problematic.

if (client_cfg.prefer_alloc_in_same_node) {
    if (client_cfg.replica_num != 1) {              // ← only checks memory replica
        LOG(ERROR) << "prefer_alloc_in_same_node is not supported with "
                      "replica_num != 1";
        return std::vector<tl::expected<void, ErrorCode>>(
            keys.size(), tl::unexpected(ErrorCode::INVALID_PARAMS));
    }
    StartBatchPut(ops, client_cfg);
    return BatchPutWhenPreferSameNode(ops);         // ← enters the MEMORY-only path
}
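One way to close this gap is to reject the combination at the entry check, before the MEMORY-only path is reached. The sketch below is an assumption about how such a check could look, not code from this PR: `ValidatePreferSameNode` is a hypothetical helper, and the struct models only the three config fields named in this comment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical subset of the client config; only the fields mentioned in
// the comment above are modelled.
struct ReplicateConfig {
    std::size_t replica_num = 1;
    std::size_t nof_replica_num = 0;
    bool prefer_alloc_in_same_node = false;
};

// Returns an empty string when the config may enter the same-node path,
// otherwise a human-readable reason for rejecting it.
std::string ValidatePreferSameNode(const ReplicateConfig& cfg) {
    if (!cfg.prefer_alloc_in_same_node) return "";
    if (cfg.replica_num != 1) {
        return "prefer_alloc_in_same_node is not supported with "
               "replica_num != 1";
    }
    // Assumed extra check: the same-node path handles MEMORY replicas only.
    if (cfg.nof_replica_num != 0) {
        return "prefer_alloc_in_same_node is not supported with "
               "NoF replicas";
    }
    return "";
}
```

A caller could map any non-empty result to `ErrorCode::INVALID_PARAMS` for every key, mirroring the existing `replica_num` branch.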

        VLOG(1) << "Successfully completed put for key "
                << successful_keys[i];
    }
    return;
Collaborator

When responses.size() != group.keys.size(), the function returns after setting finalize_rpc_errors but does not decrement pending_finalize_actions. The RPC has already completed (just with a wrong response count), so the counter should be decremented. Otherwise the final loop sees pending_finalize_actions[i] != 0 and reports "Operation has unfinished finalize actions" — which is misleading, since the RPC did finish, it just returned unexpected results.

Suggested fix: add --pending_finalize_actions[idx] inside the mismatch branch:

  if (responses.size() != group.keys.size()) {
      for (size_t idx : group.indices) {
          finalize_rpc_errors[idx] = ErrorCode::RPC_FAIL;
          --pending_finalize_actions[idx];
      }
      return;
  }
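The accounting rule behind this suggestion (every completed RPC decrements the counter, whatever its result) can be shown in a self-contained form. This is a simplified sketch: `FinalizeGroup` and `ApplyGroupResponses` are hypothetical names, and the real callback in client_service.cpp carries more state.

```cpp
#include <cstddef>
#include <vector>

enum class ErrorCode { OK, RPC_FAIL };

// Hypothetical stand-in for one finalize group.
struct FinalizeGroup {
    std::vector<std::size_t> indices;  // operation index for each key
};

// Applies the RPC results for one group. Every path decrements the pending
// counter, because the RPC has completed whether or not the response count
// matched the number of keys.
void ApplyGroupResponses(const FinalizeGroup& group,
                         const std::vector<ErrorCode>& responses,
                         std::vector<int>& pending,
                         std::vector<ErrorCode>& errors) {
    const bool mismatch = responses.size() != group.indices.size();
    for (std::size_t i = 0; i < group.indices.size(); ++i) {
        const std::size_t idx = group.indices[i];
        --pending[idx];
        // On a count mismatch no per-key response can be trusted.
        const ErrorCode code = mismatch ? ErrorCode::RPC_FAIL : responses[i];
        if (code != ErrorCode::OK) errors[idx] = code;
    }
}
```

Folding the mismatch case into the same loop means there is a single place where the counter changes, which makes the imbalance described above harder to reintroduce.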

        LOG(INFO) << "Successfully revoked failed put for key "
                  << failed_keys[i];
    }
    return;
Collaborator

Same as the comment above: add --pending_finalize_actions[idx] inside the mismatch branch.

Comment on lines +2040 to +2067
auto add_finalize_action =
    [&](const std::optional<ReplicaType>& replica_type, bool is_end,
        const std::string& key, size_t index) {
        if (!replica_type.has_value()) {
            return;
        }
        ++pending_finalize_actions[index];
        switch (*replica_type) {
            case ReplicaType::ALL:
                add_group_entry(is_end ? end_all_group : revoke_all_group,
                                key, index);
                break;
            case ReplicaType::MEMORY:
                add_group_entry(
                    is_end ? end_memory_group : revoke_memory_group, key,
                    index);
                break;
            case ReplicaType::NOF_SSD:
                add_group_entry(is_end ? end_nof_group : revoke_nof_group,
                                key, index);
                break;
            default:
                LOG(ERROR) << "Unexpected replica type in batch finalize: "
                           << *replica_type;
                finalize_rpc_errors[index] = ErrorCode::INVALID_PARAMS;
                break;
        }
    };
Collaborator

++pending_finalize_actions[index] is executed before the switch, but the default branch never decrements it. If a new ReplicaType is added in the future without a corresponding case here, the count will never reach zero and the final loop will report "unfinished finalize actions".

Suggested fix (pick one):

  1. Add --pending_finalize_actions[index] in the default branch
  2. Remove the default branch entirely and let the compiler enforce exhaustive coverage (the compiler will warn when a new enum value is added):
  switch (*replica_type) {
      case ReplicaType::ALL: ... break;
      case ReplicaType::MEMORY: ... break;
      case ReplicaType::NOF_SSD: ... break;
      // no default — compiler warns on new enum values
  }
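A third variant combines both ideas: keep the switch exhaustive and increment the counter only after the action has actually been queued. The following is a sketch under assumptions, not the PR's code: `FinalizeState`, `GroupFor`, and `AddFinalizeAction` are hypothetical names, and the end/revoke split is omitted for brevity.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

enum class ReplicaType { ALL, MEMORY, NOF_SSD };

// Hypothetical, reduced finalize bookkeeping.
struct FinalizeState {
    std::vector<int> pending;  // outstanding actions per operation
    std::vector<std::size_t> all_group, memory_group, nof_group;
};

// Maps a replica type to its group. The switch has no default, so compilers
// with -Wswitch flag a newly added enumerator that is not handled here.
std::vector<std::size_t>* GroupFor(FinalizeState& s, ReplicaType t) {
    switch (t) {
        case ReplicaType::ALL: return &s.all_group;
        case ReplicaType::MEMORY: return &s.memory_group;
        case ReplicaType::NOF_SSD: return &s.nof_group;
    }
    return nullptr;  // reached only for a value outside the enumerators
}

// Queues one finalize action. The counter is incremented only after the
// entry is queued, so a rejected type can never leave it unbalanced.
bool AddFinalizeAction(FinalizeState& s, std::optional<ReplicaType> t,
                       std::size_t index) {
    if (!t.has_value()) return false;
    std::vector<std::size_t>* group = GroupFor(s, *t);
    if (group == nullptr) return false;
    group->push_back(index);
    ++s.pending[index];
    return true;
}
```

The boolean result lets the caller record an error such as `ErrorCode::INVALID_PARAMS` for the rejected case without touching the counter.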

Comment on lines +183 to +195
struct FinalizeDecision {
    std::optional<ReplicaType> end_type;
    std::optional<ReplicaType> revoke_type;
    bool success = false;
    ErrorCode error = ErrorCode::OK;
};

FinalizeDecision DetermineFinalizeDecision(
    const ReplicateConfig& config, const ReplicaTransferSummary& summary) {
    const auto write_mode = DetermineReplicaWriteMode(config);
    const bool allocation_satisfied =
        HasExpectedReplicaAllocation(config, summary);

Collaborator

The success field has different meanings depending on the write mode:

  • Non-FLEXIBLE: success = true means all replicas transferred successfully
  • FLEXIBLE_DUAL_REPLICA: success = true means at least one replica type succeeded (both end_type and revoke_type may be set simultaneously)

Suggest adding a comment on the FinalizeDecision struct or DetermineFinalizeDecision function to clarify this distinction for future maintainers.
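A documented version might look like the sketch below. The decision logic is an assumption made for illustration, since the body of `DetermineFinalizeDecision` is not shown here: `Decide` and its three boolean parameters are hypothetical, and the `error` field is left out.

```cpp
#include <optional>

enum class ReplicaType { ALL, MEMORY, NOF_SSD };

// Outcome of finalizing one put.
//
// Meaning of `success`:
//   - strict (non-flexible) modes: every requested replica transferred;
//   - flexible dual-replica mode: at least one replica type transferred.
//     In that case end_type and revoke_type may both be set, one naming the
//     replica type to finalize and the other the type to revoke.
struct FinalizeDecision {
    std::optional<ReplicaType> end_type;
    std::optional<ReplicaType> revoke_type;
    bool success = false;
};

// Assumed, simplified decision for a put with one MEMORY and one NOF_SSD
// replica.
FinalizeDecision Decide(bool flexible, bool memory_ok, bool nof_ok) {
    FinalizeDecision d;
    if (memory_ok && nof_ok) {
        d.end_type = ReplicaType::ALL;
        d.success = true;
    } else if (!flexible || (!memory_ok && !nof_ok)) {
        // Strict mode with any failure, or nothing succeeded at all.
        d.revoke_type = ReplicaType::ALL;
    } else {
        // Flexible mode with exactly one side written.
        d.end_type = memory_ok ? ReplicaType::MEMORY : ReplicaType::NOF_SSD;
        d.revoke_type = memory_ok ? ReplicaType::NOF_SSD : ReplicaType::MEMORY;
        d.success = true;
    }
    return d;
}
```

Spelling out the per-mode meaning on the struct keeps callers from treating `success == true` as "all replicas are durable" in the flexible mode.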

PutOperation(std::string_view k, const std::vector<Slice>& s)
    : key(k), slices(s) {
    value_length = CalculateSliceSize(slices);
    ptr = ((!s.empty()) ? slices[0].ptr : nullptr);
Collaborator

The ptr field is assigned in the constructor, but it appears to be never read anywhere in this PR—it looks like dead code. Will a subsequent PR use it? If not, I suggest removing it.

@LujhCoconut
Collaborator

Glad to review this follow-up PR — it's overall clean. My main concerns are in the comments above. Thanks for your patience.
