[Store] Implement tenant metadata map isolation #2232

Open
Lin-z-w wants to merge 1 commit into kvcache-ai:main from Lin-z-w:feat/tenant-metadata-map

Contributor

@Lin-z-w Lin-z-w commented May 26, 2026

Description

This PR introduces the first stage of tenant-aware metadata isolation in Mooncake Store.

The main change is to refactor MasterService metadata from a single-level key -> ObjectMetadata map into a tenant-aware namespace:

MetadataShard -> tenant_id -> user_key -> ObjectMetadata

This provides the internal metadata foundation needed for multi-tenancy while preserving legacy/default-tenant behavior. The default tenant keeps the old shard mapping so snapshots written before this change can still restore keys that are reachable by legacy APIs.
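
The nesting described above can be illustrated with a minimal sketch. It is not the code in this PR: `ObjectMetadata` is reduced to a single field, the method names `Put` and `Get` are hypothetical, and the default tenant is assumed to be named "default" (the name the review summary mentions).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real ObjectMetadata; only a size is kept.
struct ObjectMetadata {
    std::size_t size = 0;
};

// Assumed name of the tenant that legacy (no-tenant) APIs map to.
inline const std::string kDefaultTenant = "default";

// One shard: tenant_id -> user_key -> ObjectMetadata.
struct MetadataShard {
    std::unordered_map<std::string,
                       std::unordered_map<std::string, ObjectMetadata>>
        tenants;

    // Tenant-aware write: the same user key may exist under several tenants.
    void Put(const std::string& tenant_id, const std::string& key,
             ObjectMetadata metadata) {
        tenants[tenant_id][key] = metadata;
    }

    // Legacy-style overload: no tenant given means the default tenant.
    void Put(const std::string& key, ObjectMetadata metadata) {
        Put(kDefaultTenant, key, metadata);
    }

    // Tenant-aware read: never falls through to another tenant's map.
    std::optional<ObjectMetadata> Get(const std::string& tenant_id,
                                      const std::string& key) const {
        auto tenant_it = tenants.find(tenant_id);
        if (tenant_it == tenants.end()) return std::nullopt;
        auto it = tenant_it->second.find(key);
        if (it == tenant_it->second.end()) return std::nullopt;
        return it->second;
    }

    // Legacy-style overload scoped to the default tenant.
    std::optional<ObjectMetadata> Get(const std::string& key) const {
        return Get(kDefaultTenant, key);
    }
};
```

In this sketch, two tenants can each store a key named "k" without seeing each other's value, and a call without a tenant only ever touches the default tenant's map.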

This PR also updates related master paths to avoid incorrect cross-tenant behavior:

  • Adds tenant-aware internal overloads for put/get/exist/regex/remove paths.
  • Keeps no-tenant APIs scoped to the default tenant.
  • Serializes tenant IDs in new metadata snapshots while remaining compatible with old snapshot entries.
  • Prevents non-default tenant objects from entering the existing offload queue until the offload callback protocol carries tenant identity.
  • Makes drain completion account for replicas from all tenants so a segment is not incorrectly marked as drained while non-default tenant replicas remain.
  • Adds focused tests for same-user-key tenant isolation and tenant-scoped regex behavior.
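
The snapshot compatibility point can be sketched as follows. This is an illustration under stated assumptions, not the PR's restore code: a decoded snapshot array is modeled as a plain `std::vector<std::string>`, the serialized `ObjectMetadata` is modeled as an opaque string, and `SnapshotEntry`, `DecodeEntry`, and `kLegacyTenantId` are hypothetical names. The two entry shapes, `[key, metadata_object]` and `[tenant_id, key, metadata_object]`, are the ones named in the diff comments of this PR.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical decoded form of one snapshot entry.
struct SnapshotEntry {
    std::string tenant_id;
    std::string key;
    std::string metadata_blob;  // stand-in for the serialized ObjectMetadata
};

// Assumption: entries written before tenant IDs existed belong to "default".
inline const std::string kLegacyTenantId = "default";

// Accepts both the old 2-element and the new 3-element entry layout.
std::optional<SnapshotEntry> DecodeEntry(
    const std::vector<std::string>& fields) {
    if (fields.size() == 3) {
        // New layout: [tenant_id, key, metadata_object].
        return SnapshotEntry{fields[0], fields[1], fields[2]};
    }
    if (fields.size() == 2) {
        // Old layout: [key, metadata_object]; assign the default tenant.
        return SnapshotEntry{kLegacyTenantId, fields[0], fields[1]};
    }
    // Any other arity is treated as a malformed entry.
    return std::nullopt;
}
```

Dispatching on the array length lets a reader restore old snapshots without a separate format version field, at the cost of reserving the 2-element shape for legacy entries.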

This PR does not implement full server-side tenant authorization. In particular, client-to-tenant binding, RPC-level tenant identity propagation, and tenant validation will be implemented in follow-up PRs. The planned staging is:

  1. This PR: stabilize the tenant-aware metadata data model and preserve legacy behavior.
  2. Follow-up PR: pass tenant identity from client/MasterClient through RPC and establish a trusted client_id -> tenant_id binding in master.
  3. Follow-up PR: make async/cross-component flows such as offload, promotion, copy/move, and drain fully tenant-aware with authorization checks.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Built the focused test targets with Ninja:

ninja -C build master_service_test offload_on_evict_test promotion_on_hit_test snapshot_child_process_test

Ran the focused CTest suite:

ctest --test-dir build -R "^(master_service_test|offload_on_evict_test|promotion_on_hit_test|snapshot_child_process_test)$" --output-on-failure

Result:

100% tests passed, 0 tests failed out of 4

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces multi-tenancy support to the MasterService by scoping metadata, processing keys, and tasks under a new TenantState structure within each metadata shard. Key APIs such as PutStart, PutEnd, GetReplicaList, and Remove have been updated to accept an optional tenant_id parameter, defaulting to "default". The review feedback highlights a performance optimization opportunity in the serialization logic, suggesting the use of a custom struct with direct pointers to avoid redundant hash map lookups during key sorting.

Comment thread mooncake-store/src/master_service.cpp Outdated
Comment on lines 5622 to 5639
      std::vector<std::pair<std::string, std::string>> sorted_keys;
      sorted_keys.reserve(metadata_count);
      for (const auto& [tenant_id, tenant_state] : shard.tenants) {
          for (const auto& [key, metadata] : tenant_state.metadata) {
              sorted_keys.emplace_back(tenant_id, key);
          }
      }
      std::sort(sorted_keys.begin(), sorted_keys.end());

    - for (const auto& key : sorted_keys) {
    -     const auto& metadata = shard.metadata.at(key);
    -     // Each metadata item format: [key, metadata_object]
    -     packer.pack_array(2);
    + for (const auto& [tenant_id, key] : sorted_keys) {
    +     const auto& tenant_state = shard.tenants.at(tenant_id);
    +     const auto& metadata = tenant_state.metadata.at(key);
    +     // Each metadata item format: [tenant_id, key, metadata_object].
    +     packer.pack_array(3);
    +     packer.pack(tenant_id);
          packer.pack(key);

          auto result = SerializeMetadata(metadata, packer);

medium

During serialization, sorting the keys and then looking up each tenant_state and metadata using .at() introduces significant overhead because it performs two hash map lookups (shard.tenants.at and tenant_state.metadata.at) for every single metadata item.

We can optimize this by storing pointers to the ObjectMetadata objects directly in the sorted collection, sorting them with a custom comparator, and then accessing the metadata directly without any map lookups.

    struct SortedEntry {
        std::string tenant_id;
        std::string key;
        const ObjectMetadata* metadata;
    };
    std::vector<SortedEntry> sorted_entries;
    sorted_entries.reserve(metadata_count);
    for (const auto& [tenant_id, tenant_state] : shard.tenants) {
        for (const auto& [key, metadata] : tenant_state.metadata) {
            sorted_entries.push_back({tenant_id, key, &metadata});
        }
    }
    std::sort(sorted_entries.begin(), sorted_entries.end(),
              [](const SortedEntry& a, const SortedEntry& b) {
                  return a.tenant_id != b.tenant_id ? a.tenant_id < b.tenant_id
                                                    : a.key < b.key;
              });

    for (const auto& entry : sorted_entries) {
        // Each metadata item format: [tenant_id, key, metadata_object].
        packer.pack_array(3);
        packer.pack(entry.tenant_id);
        packer.pack(entry.key);

        auto result = SerializeMetadata(*entry.metadata, packer);
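
A self-contained variation on this suggestion is sketched below. It goes one step further than the snippet above by storing pointers to the tenant ID and key as well, which avoids copying the strings; that is a design choice of this sketch, not something the review asks for. `ObjectMetadata`, `TenantState`, and `Shard` are reduced stand-ins, and `CollectSorted` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <unordered_map>
#include <vector>

// Reduced stand-ins for the real types.
struct ObjectMetadata {
    std::size_t size = 0;
};
struct TenantState {
    std::unordered_map<std::string, ObjectMetadata> metadata;
};
struct Shard {
    std::unordered_map<std::string, TenantState> tenants;
};

// Non-owning view of one (tenant_id, key, metadata) triple.
struct SortedEntry {
    const std::string* tenant_id;
    const std::string* key;
    const ObjectMetadata* metadata;
};

// Collects every entry once and sorts by (tenant_id, key) without any
// further map lookups. The pointers stay valid only while no element is
// erased from the maps, so the caller must hold the shard's lock (or
// otherwise prevent mutation) until serialization finishes.
std::vector<SortedEntry> CollectSorted(const Shard& shard) {
    std::vector<SortedEntry> entries;
    for (const auto& [tenant_id, tenant_state] : shard.tenants) {
        for (const auto& [key, metadata] : tenant_state.metadata) {
            entries.push_back({&tenant_id, &key, &metadata});
        }
    }
    std::sort(entries.begin(), entries.end(),
              [](const SortedEntry& a, const SortedEntry& b) {
                  // Lexicographic order on (tenant_id, key).
                  return std::tie(*a.tenant_id, *a.key) <
                         std::tie(*b.tenant_id, *b.key);
              });
    return entries;
}
```

The serialization loop can then pack `*entry.tenant_id`, `*entry.key`, and `*entry.metadata` directly for each element of the returned vector.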

codecov-commenter commented May 26, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 87.39377% with 89 lines in your changes missing coverage. Please review.

Files with missing lines                   Patch %   Lines
mooncake-store/src/master_service.cpp      84.51%    83 Missing ⚠️
mooncake-store/include/master_service.h    94.00%    6 Missing ⚠️

@yokinoshitayoki
Collaborator

One small compatibility note with the grouped-lifecycle work in #2127/#2180: this PR moves object metadata into tenant-scoped TenantState, while the grouped routing path tracks object-to-group and group members. When these two changes are combined, the group routing/indexes probably need to be tenant-scoped as well, e.g. route grouped objects by (tenant_id, group_id) and ungrouped objects by (tenant_id, key), and keep group members under the same tenant boundary.

Otherwise tenants using the same user key or group_id could accidentally share routing or lifecycle state. This does not look blocking for this PR now, but it would be good to keep in mind for the later PR.
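
One way to picture the tenant-scoped routing suggested here is a composite routing key. The sketch below is purely illustrative and does not reflect the code in #2127/#2180 or in this PR: `RouteKey`, `MakeRouteKey`, `RouteKeyHash`, and `RouteTable` are hypothetical names, and the table's `int` value is a placeholder for whatever routing or lifecycle state is tracked.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical routing key: grouped objects route by (tenant_id, group_id),
// ungrouped objects by (tenant_id, key).
struct RouteKey {
    std::string tenant_id;
    bool grouped;
    std::string id;  // group_id when grouped, user key otherwise

    bool operator==(const RouteKey& other) const {
        return tenant_id == other.tenant_id && grouped == other.grouped &&
               id == other.id;
    }
};

// Builds the routing key for an object; group_id is empty when ungrouped.
RouteKey MakeRouteKey(const std::string& tenant_id, const std::string& key,
                      const std::optional<std::string>& group_id) {
    if (group_id) return RouteKey{tenant_id, true, *group_id};
    return RouteKey{tenant_id, false, key};
}

// Combines the hashes of all three fields so that equal user keys or
// group IDs under different tenants land in different table slots.
struct RouteKeyHash {
    std::size_t operator()(const RouteKey& k) const {
        std::size_t h = std::hash<std::string>{}(k.tenant_id);
        auto combine = [&h](std::size_t v) {
            h ^= v + 0x9e3779b9 + (h << 6) + (h >> 2);
        };
        combine(std::hash<bool>{}(k.grouped));
        combine(std::hash<std::string>{}(k.id));
        return h;
    }
};

// Placeholder table; the int stands in for real routing/lifecycle state.
using RouteTable = std::unordered_map<RouteKey, int, RouteKeyHash>;
```

With a key of this shape, members of the same group within one tenant share an entry, while a group ID or user key reused by another tenant maps to a separate entry.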

@Lin-z-w Lin-z-w force-pushed the feat/tenant-metadata-map branch 2 times, most recently from b987ec1 to 566b3f4 Compare May 28, 2026 11:52
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Lin-z-w Lin-z-w force-pushed the feat/tenant-metadata-map branch from c59f4d5 to 4ed418e Compare May 28, 2026 13:01