[Store] rebuild offset-allocator metadata on restart by zxpdemonio · Pull Request #2215 · kvcache-ai/Mooncake

zxpdemonio · 2026-05-25T07:18:05Z

Motivation

OffsetAllocatorStorageBackend stores all objects in one shared data file and relies on in-memory metadata plus allocator state to locate objects. Before this change, the backend did not have a complete restart-recovery path: a process restart could preserve kv_cache.data, but the backend could not reliably reconstruct the allocator state and key index needed to serve reads and replay local-disk metadata back to the master.

This PR closes that gap by persisting recovery metadata alongside the data file and rebuilding the backend's in-memory state during startup. The goal is to make OffsetAllocatorStorageBackend follow the existing synchronous startup recovery model already used by the storage stack, instead of introducing a new async rebuild path in this PR.

Scenario

This PR targets the existing startup flow for local-disk recovery:

1. FileStorage::Init() starts the backend.
2. OffsetAllocatorStorageBackend::Init() restores local allocator/index state from persisted recovery metadata.
3. ScanMeta() iterates the rebuilt in-memory object view.
4. NotifyOffloadSuccess() re-registers recovered objects with the master.

With this change, restart recovery for offset-allocator-backed storage works the same way as the existing synchronous startup recovery skeleton: startup succeeds only after the backend's local readable state is rebuilt.
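The startup sequence above can be sketched roughly as follows. This is a minimal illustration, not Mooncake's code: `SketchBackend`, `ObjectMeta`, `StartStorage`, the batch size of 2, and the `registered` vector (standing in for NotifyOffloadSuccess() calls to the master) are all hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one recovered object; not the real metadata type.
struct ObjectMeta {
    std::string key;
    uint64_t offset;
    uint64_t size;
};

// Hypothetical backend: `persisted_` plays the role of the on-disk snapshot.
class SketchBackend {
   public:
    SketchBackend(std::vector<ObjectMeta> persisted, bool snapshot_ok)
        : persisted_(std::move(persisted)), snapshot_ok_(snapshot_ok) {}

    // Synchronous rebuild: either the in-memory view is complete, or Init fails.
    bool Init() {
        if (!snapshot_ok_) return false;  // fail fast on unusable metadata
        live_ = persisted_;
        return true;
    }

    // Hand the rebuilt view to the caller in fixed-size batches.
    void ScanMeta(std::size_t batch_size,
                  const std::function<void(const std::vector<ObjectMeta>&)>&
                      on_batch) const {
        if (batch_size == 0) return;
        for (std::size_t i = 0; i < live_.size(); i += batch_size) {
            const std::size_t end = std::min(i + batch_size, live_.size());
            on_batch(std::vector<ObjectMeta>(
                live_.begin() + static_cast<std::ptrdiff_t>(i),
                live_.begin() + static_cast<std::ptrdiff_t>(end)));
        }
    }

   private:
    std::vector<ObjectMeta> persisted_;
    std::vector<ObjectMeta> live_;
    bool snapshot_ok_;
};

// Startup succeeds only after the backend's local state is rebuilt.
inline bool StartStorage(SketchBackend& backend,
                         std::vector<std::string>& registered) {
    if (!backend.Init()) return false;
    backend.ScanMeta(2, [&registered](const std::vector<ObjectMeta>& batch) {
        // Stand-in for re-registering each recovered object with the master.
        for (const auto& meta : batch) registered.push_back(meta.key);
    });
    return true;
}
```

The point of the sketch is the ordering: no object is re-registered unless Init() has already returned successfully.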

Description

This PR adds restart recovery for OffsetAllocatorStorageBackend through three pieces:

Recovery metadata persistence

Persist generation-scoped recovery files next to kv_cache.data:
- kv_cache.allocator.<generation>
- kv_cache.index.<generation>
- kv_cache.manifest
Persist allocator state separately from object index state.
Use the manifest as the authoritative pointer to the active recovery generation.
Remove obsolete allocator/index snapshot generations after a successful manifest update.
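One way to picture the publish order described above is the sketch below. It assumes the manifest simply holds the active generation number as text, which is a guess for illustration; the real manifest format is not shown in this PR description. `WriteFileAtomically` and `PublishGeneration` are hypothetical helper names, and the sketch omits the fsync calls a durable implementation would need.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Write via a temporary file plus rename so a reader sees either the old
// file or the new one. NOTE: durability would also require fsync on the
// file and on its directory; that is left out of this sketch.
inline bool WriteFileAtomically(const fs::path& target,
                                const std::string& content) {
    fs::path tmp = target;
    tmp += ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << content;
        out.flush();
        if (!out) return false;
    }
    std::error_code ec;
    fs::rename(tmp, target, ec);
    return !ec;
}

// Publish one generation: snapshots first, manifest last, cleanup after.
inline bool PublishGeneration(const fs::path& dir, uint64_t generation,
                              const std::string& allocator_bytes,
                              const std::string& index_bytes) {
    const std::string gen = std::to_string(generation);
    const std::string alloc_prefix = "kv_cache.allocator.";
    const std::string index_prefix = "kv_cache.index.";
    if (!WriteFileAtomically(dir / (alloc_prefix + gen), allocator_bytes))
        return false;
    if (!WriteFileAtomically(dir / (index_prefix + gen), index_bytes))
        return false;
    // The manifest flips the active generation only after both snapshots exist.
    if (!WriteFileAtomically(dir / "kv_cache.manifest", gen)) return false;

    // Collect obsolete generations first, then remove them.
    std::vector<fs::path> obsolete;
    std::error_code ec;
    fs::directory_iterator it(dir, ec);
    const fs::directory_iterator end;
    while (!ec && it != end) {
        const std::string name = it->path().filename().string();
        const bool is_snapshot = name.rfind(alloc_prefix, 0) == 0 ||
                                 name.rfind(index_prefix, 0) == 0;
        if (is_snapshot && name.substr(name.rfind('.') + 1) != gen)
            obsolete.push_back(it->path());
        it.increment(ec);
    }
    for (const auto& path : obsolete) {
        std::error_code rm_ec;
        fs::remove(path, rm_ec);  // best effort; stale files are harmless
    }
    return true;
}
```

Writing the manifest last means a crash between steps leaves the previous manifest pointing at a generation whose files have not yet been cleaned up.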

Restart validation and rebuild

Load manifest, allocator snapshot, and index snapshot during backend init.
Fail fast when:
- the manifest is missing but snapshot files exist,
- manifest-referenced snapshot files are missing,
- allocator/index snapshots cannot be deserialized,
- persisted allocation state is invalid,
- duplicate keys or duplicate allocation states appear in the snapshot,
- the backing data file is missing or truncated relative to configured capacity.
Rebuild shard maps, object metadata, aggregate counters, and allocator-backed handles from persisted recovery state.
Keep the startup flow synchronous; this PR does not introduce background async rebuild.
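A few of the fail-fast checks listed above can be illustrated with the sketch below. The types and names (`SnapshotEntry`, `SnapshotError`, `ValidateSnapshot`) are hypothetical, and the sketch interprets "duplicate allocation states" as two entries claiming the same offset, which is an assumption rather than something the PR description spells out.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical index-snapshot entry: one key and the region it occupies.
struct SnapshotEntry {
    std::string key;
    uint64_t offset;
    uint64_t size;
};

enum class SnapshotError {
    kInvalidAllocation,
    kDuplicateKey,
    kDuplicateAllocation,
};

// Returns std::nullopt when every entry passes; otherwise the first failure.
inline std::optional<SnapshotError> ValidateSnapshot(
    const std::vector<SnapshotEntry>& entries, uint64_t capacity) {
    std::unordered_set<std::string> seen_keys;
    std::unordered_set<uint64_t> seen_offsets;
    for (const auto& entry : entries) {
        // Overflow-safe bounds check: the region must fit inside the data file.
        if (entry.size == 0 || entry.offset > capacity ||
            entry.size > capacity - entry.offset) {
            return SnapshotError::kInvalidAllocation;
        }
        if (!seen_keys.insert(entry.key).second) {
            return SnapshotError::kDuplicateKey;
        }
        // Assumption: two entries sharing an offset count as a duplicate
        // allocation.
        if (!seen_offsets.insert(entry.offset).second) {
            return SnapshotError::kDuplicateAllocation;
        }
    }
    return std::nullopt;
}
```

Returning on the first violation matches the fail-fast intent: a snapshot that fails any check is rejected as a whole rather than partially loaded.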

Write-path integration and recovery correctness

Update recovery metadata on successful batch offload so restart state tracks overwrites and partial-success batches.
Preserve overwrite recovery correctness through stale-allocation tracking.
Keep recovery snapshot state rollback-safe when recovery persistence fails during publish.
Reduce reviewer-facing failure ambiguity by tightening snapshot validation and rebuild invariants.
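The rollback-safe publish idea can be sketched with an undo log, as below. This is only one possible shape under stated assumptions: `RecoveryState`, `Allocation`, `PublishBatch`, and the `persist` callback are hypothetical, and the PR's actual stale-allocation bookkeeping may differ. The sketch records what each key held before the batch so that a failed persist can restore the previous in-memory state without copying the whole index.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Allocation {
    uint64_t offset;
    uint64_t size;
};

// Hypothetical recovery state: key index plus allocations replaced by overwrites.
struct RecoveryState {
    std::unordered_map<std::string, Allocation> index;
    std::vector<Allocation> stale_allocations;
};

// Apply `batch`, try to persist, and undo everything if persistence fails.
inline bool PublishBatch(
    RecoveryState& state,
    const std::vector<std::pair<std::string, Allocation>>& batch,
    const std::function<bool(const RecoveryState&)>& persist) {
    // Undo log: each key with its previous value (nullopt if it was new).
    std::vector<std::pair<std::string, std::optional<Allocation>>> undo;
    const std::size_t stale_before = state.stale_allocations.size();

    for (const auto& item : batch) {
        auto it = state.index.find(item.first);
        if (it != state.index.end()) {
            undo.emplace_back(item.first, it->second);
            state.stale_allocations.push_back(it->second);  // overwritten
            it->second = item.second;
        } else {
            undo.emplace_back(item.first, std::nullopt);
            state.index.emplace(item.first, item.second);
        }
    }

    if (persist(state)) return true;

    // Roll back in reverse order so repeated keys in one batch unwind correctly.
    for (auto it = undo.rbegin(); it != undo.rend(); ++it) {
        if (it->second.has_value()) {
            state.index[it->first] = *it->second;
        } else {
            state.index.erase(it->first);
        }
    }
    state.stale_allocations.resize(stale_before);
    return false;
}
```

The cost of the undo log is proportional to the batch size, not to the total number of keys.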

Documentation

Update docs/source/design/ssd-offload.md to describe:
- the recovery file layout,
- the startup rebuild path,
- fail-fast recovery semantics for the offset-allocator backend.

How Has This Been Tested?

/root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_test --gtest_filter="StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartOverwriteRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecoveryAfterOverwriteBatchFailure:StorageBackendTest.OffsetAllocatorStorageBackend_RestartEmptyBaseline:StorageBackendTest.OffsetAllocatorStorageBackend_CleansObsoleteRecoverySnapshots:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsWhenSnapshotPairMissing:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnCorruptedIndexSnapshot:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotKeys:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotAllocations:StorageBackendTest.OffsetAllocatorStorageBackend_ScanMetaBatchesAfterRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_IsEnableOffloadingRestoredAfterRecovery"
/root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_test

Notes

This PR intentionally keeps rebuild synchronous during startup to match the existing storage recovery model.
Async/background rebuild is intentionally deferred to a follow-up PR.
A pre-existing stale-task gap was exposed in the promotion-on-hit path: PromotionAllocStart and NotifyPromotionSuccess only checked whether a promotion task still existed, not whether it had already expired. In the window before the async reaper swept the task, late RPCs could still allocate or commit stale promotions, potentially leaving orphaned staged memory replicas.

Module

Type of Change

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

Persist allocator and index snapshots with a manifest so OffsetAllocatorStorageBackend can restore its in-memory state after restart, fail fast on corrupted recovery metadata, and keep the recovery path covered by focused restart tests and docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements restart recovery for the OffsetAllocatorStorageBackend by introducing a mechanism to persist and restore allocator and index snapshots. The implementation includes atomic file operations, snapshot validation, and comprehensive unit tests. Feedback identifies several high-severity issues: the memory limit for snapshots is dangerously high, risking OOM crashes; BatchOffload suffers from significant performance bottlenecks due to O(N) index copies and synchronous I/O performed under an exclusive lock; and the directory fsync logic may fail for paths without parent components. It is also recommended to use the error-code-based directory_iterator to avoid exceptions.
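Two of the points in this summary, the directory fsync for paths without a parent component and the error-code-based directory_iterator, can be illustrated with the POSIX-oriented sketch below. The helper names (`DirectoryToSync`, `FsyncDirectoryOf`, `ListFileNames`) are hypothetical and are not taken from the PR; the sketch only shows the general technique.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// A bare filename such as "kv_cache.manifest" has an empty parent_path();
// fall back to "." so the directory can still be opened.
inline fs::path DirectoryToSync(const fs::path& file) {
    const fs::path parent = file.parent_path();
    return parent.empty() ? fs::path(".") : parent;
}

// fsync the directory containing `file` so a preceding rename is durable.
inline bool FsyncDirectoryOf(const fs::path& file) {
    const fs::path dir = DirectoryToSync(file);
    const int fd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (fd < 0) return false;
    const bool ok = (::fsync(fd) == 0);
    ::close(fd);
    return ok;
}

// Iterate with std::error_code overloads so a missing or unreadable
// directory is reported through `ec` instead of a thrown exception.
inline std::vector<std::string> ListFileNames(const fs::path& dir,
                                              std::error_code& ec) {
    std::vector<std::string> names;
    fs::directory_iterator it(dir, ec);
    const fs::directory_iterator end;
    while (!ec && it != end) {
        names.push_back(it->path().filename().string());
        it.increment(ec);
    }
    return names;
}
```

Both the constructor and `increment` take the `std::error_code`, which keeps the whole iteration exception-free.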

gemini-code-assist · 2026-05-25T07:22:49Z

+        snapshot.stale_allocations = std::move(stale_allocations);
+        uint64_t generation =
+            next_snapshot_generation_.load(std::memory_order_relaxed);
+        auto persist_result = PersistRecoverySnapshots(snapshot, generation);


Calling PersistRecoverySnapshots synchronously inside BatchOffload while holding the exclusive recovery_state_mutex_ is a major performance bottleneck. This operation involves serializing the entire index (O(N)) and performing multiple fsync calls. All other concurrent BatchOffload calls will be blocked from updating their metadata until this I/O completes. Consider moving the persistence to a background thread or implementing an incremental logging mechanism (WAL) to keep the write path O(batch_size) instead of O(total_keys).

My plan intentionally defers async/background rebuild to a follow-up PR. Should I implement it in this PR instead?

I think it's fine to address this in a follow-up PR. No problem with that.

Harden snapshot file handling with a fixed metadata size cap and safer directory fsync/iteration paths, and avoid copying the full recovery snapshot state on every batch persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reject and synchronously clean up expired promotion-on-hit tasks before stale AllocStart or success RPCs can leave orphaned memory replicas, and harden the promotion tests around real reap state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-05-25T10:04:32Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 82.54620% with 170 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mooncake-store/src/storage_backend.cpp | 71.21% | 154 Missing ⚠️ |
| mooncake-store/tests/storage_backend_test.cpp | 97.01% | 8 Missing ⚠️ |
| mooncake-store/src/master_service.cpp | 91.17% | 3 Missing ⚠️ |
| mooncake-store/src/offset_allocator.cpp | 88.46% | 3 Missing ⚠️ |
| mooncake-store/tests/promotion_on_hit_test.cpp | 98.16% | 2 Missing ⚠️ |


LujhCoconut · 2026-05-25T11:39:13Z

-                    close(fd);
+        recovery_manifest_path_ = GetRecoveryManifestPath();
+
+        const bool recovery_mode = fs::exists(recovery_manifest_path_);


Have you considered adding an explicit enable_recovery flag instead of inferring from manifest existence? Default true to keep compatibility, but allow false for the old "clean slate" semantics. This would give operators an explicit switch in production to bypass the automatic logic during incidents or abnormal situations.

if (!recovery_mode) {
    for (const auto& entry :
         fs::directory_iterator(storage_path_, ec_dir)) {
        // ...existing iteration...
        const auto filename = entry.path().filename().string();
        if (filename.rfind(kAllocatorSnapshotPrefix, 0) == 0 ||
            filename.rfind(kIndexSnapshotPrefix, 0) == 0) {
            if (!enable_recovery) {
                LOG(WARNING) << "Recovery disabled, removing stale snapshot: "
                             << filename;
                fs::remove(entry.path(), ec_dir);
                ec_dir.clear();
            } else {
                LOG(ERROR) << "Recovery snapshot mismatch: manifest "
                              "missing but recovery snapshot file exists: "
                           << filename;
                return tl::make_unexpected(ErrorCode::FILE_NOT_FOUND);
            }
        }
    }
}

Building on this idea, we could also add graceful degradation for capacity mismatches: when enable_recovery=true but LoadRecoverySnapshots() fails with SNAPSHOT_INCOMPATIBLE (e.g., capacity changed), fall back to a fresh start instead of hard failure.

if (recovery_mode) {
    auto load_result = LoadRecoverySnapshots();
    if (!load_result) {
        if (load_result.error() == ErrorCode::SNAPSHOT_INCOMPATIBLE &&
            enable_recovery_) {
            LOG(WARNING) << "Snapshot capacity/config mismatch, "
                         << "discarding old snapshots and starting fresh";
            recovery_mode = false;
            CleanupObsoleteRecoverySnapshots(/*all=*/0);
        } else {
            return load_result;
        }
    }
}

Thanks for the suggestion, I'll fix it.

LujhCoconut · 2026-05-25T11:44:38Z

+        if (recovery_mode) {
+            auto load_result = LoadRecoverySnapshots();
+            if (!load_result) {
+                return load_result;
+            }


This is the code location referenced in my previous comment. The issue is that changing the capacity configuration and then restarting causes a hard startup failure: the system does not come up at all. It would be cleaner to unify both fixes behind a single flag.

Thanks for the suggestion, I'll fix it.

zxpdemonio requested review from ShangmingCai, XucSh, YiXR, stmatengss and ykwd as code owners May 25, 2026 07:18

github-actions Bot added run-ci Store labels May 25, 2026

gemini-code-assist Bot reviewed May 25, 2026


ykwd requested a review from LujhCoconut May 25, 2026 07:24

zxpdemonio and others added 2 commits May 25, 2026 15:39

[Store] tighten restart recovery snapshot handling

81c4bfd


LujhCoconut reviewed May 25, 2026


zxpdemonio wants to merge 3 commits into kvcache-ai:main from openanolis:cruz/data-rebuild


3 participants
