[Store] rebuild offset-allocator metadata on restart #2215

Open
zxpdemonio wants to merge 3 commits into kvcache-ai:main from openanolis:cruz/data-rebuild

Collaborator

@zxpdemonio zxpdemonio commented May 25, 2026

Motivation

OffsetAllocatorStorageBackend stores all objects in one shared data file and relies on in-memory metadata plus allocator state to locate objects. Before this change, that backend did not have a complete restart-recovery path: process restart could preserve kv_cache.data, but it could not reliably reconstruct the allocator state and key index needed to serve reads and replay local-disk metadata back to the master.

This PR closes that gap by persisting recovery metadata alongside the data file and rebuilding the backend's in-memory state during startup. The goal is to make OffsetAllocatorStorageBackend follow the existing synchronous startup recovery model already used by the storage stack, instead of introducing a new async rebuild path in this PR.

Scenario

This PR targets the existing startup flow for local-disk recovery:

  1. FileStorage::Init() starts the backend.
  2. OffsetAllocatorStorageBackend::Init() restores local allocator/index state from persisted recovery metadata.
  3. ScanMeta() iterates the rebuilt in-memory object view.
  4. NotifyOffloadSuccess() re-registers recovered objects with the master.
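
The four steps can be sketched as one synchronous sequence. This is only an illustration: `SketchBackend`, `RecoveredObject`, `BatchCallback`, and `StartStorage` are hypothetical stand-ins, not the real Mooncake interfaces, and the batch size of 2 is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; the real Mooncake interfaces differ.
struct RecoveredObject {
    std::string key;
    std::size_t size;
};

using BatchCallback = std::function<void(const std::vector<RecoveredObject>&)>;

struct SketchBackend {
    std::vector<RecoveredObject> persisted;  // what the recovery metadata describes
    std::vector<RecoveredObject> index;      // in-memory view, empty until Init()
    bool metadata_readable = true;

    // Step 2: rebuild the in-memory view, or fail without touching it.
    bool Init() {
        if (!metadata_readable) return false;
        index = persisted;
        return true;
    }

    // Step 3: hand the rebuilt view to the caller in fixed-size batches.
    void ScanMeta(std::size_t batch_size, const BatchCallback& on_batch) const {
        for (std::size_t i = 0; i < index.size(); i += batch_size) {
            const std::size_t end = std::min(index.size(), i + batch_size);
            on_batch(std::vector<RecoveredObject>(index.begin() + i,
                                                  index.begin() + end));
        }
    }
};

// Steps 1-4 in order. Returns false if recovery fails; nothing is registered then.
bool StartStorage(SketchBackend& backend, std::vector<std::string>& registered) {
    if (!backend.Init()) return false;  // startup fails fast
    backend.ScanMeta(2, [&registered](const std::vector<RecoveredObject>& batch) {
        // Step 4: stands in for NotifyOffloadSuccess().
        for (const auto& obj : batch) registered.push_back(obj.key);
    });
    return true;
}
```

The point of the sketch is the ordering: no object is re-registered with the master unless the local rebuild has already succeeded.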

With this change, restart recovery for offset-allocator-backed storage works the same way as the existing synchronous startup recovery skeleton: startup succeeds only after the backend's local readable state is rebuilt.

Description

This PR adds restart recovery for OffsetAllocatorStorageBackend through three pieces:

Recovery metadata persistence

  • Persist generation-scoped recovery files next to kv_cache.data:
    • kv_cache.allocator.<generation>
    • kv_cache.index.<generation>
    • kv_cache.manifest
  • Persist allocator state separately from object index state.
  • Use the manifest as the authoritative pointer to the active recovery generation.
  • Remove obsolete allocator/index snapshot generations after a successful manifest update.
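
A minimal sketch of the publish order implied by the list above, assuming the manifest simply stores the active generation number as text. The file-name prefixes come from the PR description; `SnapshotPath`, `PublishGeneration`, and the `kv_cache.manifest.tmp` temp name are hypothetical, and the fsync calls a durable implementation needs are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

const std::string kAllocatorPrefix = "kv_cache.allocator.";
const std::string kIndexPrefix = "kv_cache.index.";

fs::path SnapshotPath(const fs::path& dir, const std::string& prefix,
                      uint64_t gen) {
    return dir / (prefix + std::to_string(gen));
}

// Makes `gen` the active generation. Both snapshot files must already exist.
// Assumption: the manifest holds only the generation number.
bool PublishGeneration(const fs::path& dir, uint64_t gen) {
    std::error_code ec;
    if (!fs::exists(SnapshotPath(dir, kAllocatorPrefix, gen), ec) ||
        !fs::exists(SnapshotPath(dir, kIndexPrefix, gen), ec)) {
        return false;
    }
    // Write a temp file, then rename it over the manifest so readers see
    // either the old generation or the new one, never a partial write.
    const fs::path tmp = dir / "kv_cache.manifest.tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        out << gen << '\n';
        out.flush();
        if (!out) return false;
    }
    fs::rename(tmp, dir / "kv_cache.manifest", ec);
    if (ec) return false;

    // Only after the manifest points at `gen`: collect, then delete, every
    // snapshot file that belongs to another generation.
    std::vector<fs::path> obsolete;
    const std::string keep = std::to_string(gen);
    fs::directory_iterator it(dir, ec);
    const fs::directory_iterator end;
    while (!ec && it != end) {
        const std::string name = it->path().filename().string();
        for (const std::string& prefix : {kAllocatorPrefix, kIndexPrefix}) {
            if (name.rfind(prefix, 0) == 0 &&
                name.substr(prefix.size()) != keep) {
                obsolete.push_back(it->path());
            }
        }
        it.increment(ec);
    }
    for (const auto& p : obsolete) fs::remove(p, ec);  // cleanup is best effort
    return true;
}
```

Deleting old generations only after the rename keeps a crash at any point recoverable: the manifest always names a generation whose files are still on disk.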

Restart validation and rebuild

  • Load manifest, allocator snapshot, and index snapshot during backend init.
  • Fail fast when:
    • the manifest is missing but snapshot files exist,
    • manifest-referenced snapshot files are missing,
    • allocator/index snapshots cannot be deserialized,
    • persisted allocation state is invalid,
    • duplicate keys or duplicate allocation states appear in the snapshot,
    • the backing data file is missing or truncated relative to configured capacity.
  • Rebuild shard maps, object metadata, aggregate counters, and allocator-backed handles from persisted recovery state.
  • Keep the startup flow synchronous; this PR does not introduce background async rebuild.
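
Several of the fail-fast checks above can be illustrated with a small validator. This is a sketch under assumptions, not the PR's code: `SnapshotEntry`, `ValidateResult`, and `ValidateSnapshot` are hypothetical names, and a duplicate allocation is modeled here as two entries whose byte ranges overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical snapshot record: one key and the byte range it occupies.
struct SnapshotEntry {
    std::string key;
    uint64_t offset;
    uint64_t size;
};

enum class ValidateResult { kOk, kDuplicateKey, kOutOfRange, kOverlap };

// Rejects duplicate keys, ranges outside [0, capacity), and overlapping ranges.
ValidateResult ValidateSnapshot(std::vector<SnapshotEntry> entries,
                                uint64_t capacity) {
    std::unordered_set<std::string> seen;
    for (const auto& e : entries) {
        if (!seen.insert(e.key).second) return ValidateResult::kDuplicateKey;
        // Written to avoid overflow in offset + size.
        if (e.size == 0 || e.offset > capacity || e.size > capacity - e.offset) {
            return ValidateResult::kOutOfRange;
        }
    }
    std::sort(entries.begin(), entries.end(),
              [](const SnapshotEntry& a, const SnapshotEntry& b) {
                  return a.offset < b.offset;
              });
    for (std::size_t i = 1; i < entries.size(); ++i) {
        // Safe to add: every range was already checked against capacity.
        if (entries[i - 1].offset + entries[i - 1].size > entries[i].offset) {
            return ValidateResult::kOverlap;
        }
    }
    return ValidateResult::kOk;
}
```

Running all checks before any in-memory state is built means a rejected snapshot leaves nothing half-restored.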

Write-path integration and recovery correctness

  • Update recovery metadata on successful batch offload so restart state tracks overwrites and partial-success batches.
  • Preserve overwrite recovery correctness through stale-allocation tracking.
  • Keep recovery snapshot state rollback-safe when recovery persistence fails during publish.
  • Reduce reviewer-facing failure ambiguity by tightening snapshot validation and rebuild invariants.
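
One way to keep in-memory recovery state rollback-safe is an undo log: apply the batch, try to persist, and restore the previous entries if persistence fails. The sketch below uses that technique as an assumption; it is not necessarily how the PR implements it, and `Location`, `Index`, and `ApplyBatchWithRollback` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Location {
    uint64_t offset;
    uint64_t size;
};
using Index = std::unordered_map<std::string, Location>;
using Batch = std::vector<std::pair<std::string, Location>>;

// Applies `batch` to `index`, then calls `persist`. If persist reports failure,
// every touched key is restored to its previous value (or removed if new).
bool ApplyBatchWithRollback(Index& index, const Batch& batch,
                            const std::function<bool(const Index&)>& persist) {
    std::vector<std::pair<std::string, std::optional<Location>>> undo;
    undo.reserve(batch.size());
    for (const auto& [key, loc] : batch) {
        std::optional<Location> previous;
        if (auto it = index.find(key); it != index.end()) previous = it->second;
        undo.emplace_back(key, previous);
        index.insert_or_assign(key, loc);
    }
    if (persist(index)) return true;

    // Undo in reverse so repeated keys inside one batch unwind correctly.
    for (auto it = undo.rbegin(); it != undo.rend(); ++it) {
        if (it->second) {
            index.insert_or_assign(it->first, *it->second);
        } else {
            index.erase(it->first);
        }
    }
    return false;
}
```

The undo log is proportional to the batch, not to the whole index, so a failed publish does not require a full copy of the recovery state.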

Documentation

  • Update docs/source/design/ssd-offload.md to describe:
    • the recovery file layout,
    • the startup rebuild path,
    • fail-fast recovery semantics for the offset-allocator backend.

How Has This Been Tested?

  • /root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_test --gtest_filter="StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartOverwriteRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecoveryAfterOverwriteBatchFailure:StorageBackendTest.OffsetAllocatorStorageBackend_RestartEmptyBaseline:StorageBackendTest.OffsetAllocatorStorageBackend_CleansObsoleteRecoverySnapshots:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsWhenSnapshotPairMissing:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnCorruptedIndexSnapshot:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotKeys:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotAllocations:StorageBackendTest.OffsetAllocatorStorageBackend_ScanMetaBatchesAfterRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_IsEnableOffloadingRestoredAfterRecovery"
  • /root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_test

Notes

  • This PR intentionally keeps rebuild synchronous during startup to match the existing storage recovery model.
  • Async/background rebuild is intentionally deferred to a follow-up PR.
  • A pre-existing stale-task gap was exposed in the promotion-on-hit path: PromotionAllocStart and NotifyPromotionSuccess only checked whether a promotion task still existed, not whether it had already expired. In the window before the async reaper swept the task, late RPCs could still allocate or commit stale promotions, potentially leaving orphaned staged memory replicas.
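
The stale-task gap in the last note comes down to checking existence without checking expiry. A minimal sketch of the stricter check follows; `PromotionTask`, `PromotionTable`, and `CheckAndReap` are hypothetical names, not the identifiers used in master_service.cpp.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

struct PromotionTask {
    Clock::time_point deadline;
};

enum class TaskState { kMissing, kExpired, kLive };

class PromotionTable {
  public:
    void Add(const std::string& key, Clock::time_point deadline) {
        tasks_.insert_or_assign(key, PromotionTask{deadline});
    }

    // Existence alone is not enough: a task past its deadline is removed on
    // the spot instead of waiting for a background reaper, and the caller is
    // told to reject the RPC.
    TaskState CheckAndReap(const std::string& key, Clock::time_point now) {
        auto it = tasks_.find(key);
        if (it == tasks_.end()) return TaskState::kMissing;
        if (now >= it->second.deadline) {
            tasks_.erase(it);
            return TaskState::kExpired;
        }
        return TaskState::kLive;
    }

    std::size_t size() const { return tasks_.size(); }

  private:
    std::unordered_map<std::string, PromotionTask> tasks_;
};
```

An allocate or commit handler would proceed only on `kLive`; both `kMissing` and `kExpired` would be treated as rejections.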

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Persist allocator and index snapshots with a manifest so OffsetAllocatorStorageBackend can restore its in-memory state after restart, fail fast on corrupted recovery metadata, and keep the recovery path covered by focused restart tests and docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements restart recovery for the OffsetAllocatorStorageBackend by introducing a mechanism to persist and restore allocator and index snapshots. The implementation includes atomic file operations, snapshot validation, and comprehensive unit tests. Feedback identifies several high-severity issues: the memory limit for snapshots is dangerously high, risking OOM crashes; BatchOffload suffers from significant performance bottlenecks due to O(N) index copies and synchronous I/O performed under an exclusive lock; and the directory fsync logic may fail for paths without parent components. It is also recommended to use the error-code-based directory_iterator to avoid exceptions.
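
Two of the points in this review, the parent-less path in the directory fsync logic and the error-code-based `directory_iterator`, can be sketched with standard `std::filesystem` calls. `DirToSync` and `ListWithPrefix` are hypothetical helper names, not functions from the PR.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Directory to fsync after creating or renaming `file`. A bare file name such
// as "kv_cache.data" has an empty parent_path(), so fall back to ".".
fs::path DirToSync(const fs::path& file) {
    const fs::path parent = file.parent_path();
    return parent.empty() ? fs::path(".") : parent;
}

// Collects names in `dir` that start with `prefix` without throwing.
// Returns false and leaves the reason in `ec` on any filesystem error.
bool ListWithPrefix(const fs::path& dir, const std::string& prefix,
                    std::vector<std::string>& out, std::error_code& ec) {
    fs::directory_iterator it(dir, ec);
    if (ec) return false;
    const fs::directory_iterator end;
    while (it != end) {
        const std::string name = it->path().filename().string();
        if (name.rfind(prefix, 0) == 0) out.push_back(name);
        it.increment(ec);  // non-throwing advance
        if (ec) return false;
    }
    return true;
}
```

Note that a range-based `for` over `directory_iterator` still advances with the throwing `operator++`, even when the iterator was constructed with an `error_code`, which is why the loop above calls `increment(ec)` explicitly.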

Comment thread mooncake-store/src/storage_backend.cpp Outdated
snapshot.stale_allocations = std::move(stale_allocations);
uint64_t generation =
    next_snapshot_generation_.load(std::memory_order_relaxed);
auto persist_result = PersistRecoverySnapshots(snapshot, generation);
Contributor

high

Calling PersistRecoverySnapshots synchronously inside BatchOffload while holding the exclusive recovery_state_mutex_ is a major performance bottleneck. This operation involves serializing the entire index (O(N)) and performing multiple fsync calls. All other concurrent BatchOffload calls will be blocked from updating their metadata until this I/O completes. Consider moving the persistence to a background thread or implementing an incremental logging mechanism (WAL) to keep the write path O(batch_size) instead of O(total_keys).
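
To make the incremental-logging suggestion concrete, here is a minimal in-memory sketch of an append-only record log with replay. The record layout, `AppendRecord`, and `Replay` are assumptions for illustration only; a real WAL would also need checksums, a fixed byte order, and fsync batching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

struct Location {
    uint64_t offset;
    uint64_t size;
};

// Hypothetical record layout (native byte order):
// [u32 key_len][u64 offset][u64 size][key bytes]
constexpr std::size_t kHeaderSize = sizeof(uint32_t) + 2 * sizeof(uint64_t);

// Appends one record; cost depends on the record, not on the index size.
void AppendRecord(std::vector<uint8_t>& log, const std::string& key,
                  Location loc) {
    auto put = [&log](const void* p, std::size_t n) {
        const auto* bytes = static_cast<const uint8_t*>(p);
        log.insert(log.end(), bytes, bytes + n);
    };
    const uint32_t key_len = static_cast<uint32_t>(key.size());
    put(&key_len, sizeof key_len);
    put(&loc.offset, sizeof loc.offset);
    put(&loc.size, sizeof loc.size);
    put(key.data(), key.size());
}

// Rebuilds the index; later records win. Stops at a truncated tail record,
// which is what a crash in the middle of an append would leave behind.
std::unordered_map<std::string, Location> Replay(const std::vector<uint8_t>& log) {
    std::unordered_map<std::string, Location> index;
    std::size_t pos = 0;
    while (log.size() - pos >= kHeaderSize) {
        uint32_t key_len = 0;
        Location loc{};
        std::memcpy(&key_len, log.data() + pos, sizeof key_len);
        std::memcpy(&loc.offset, log.data() + pos + 4, sizeof loc.offset);
        std::memcpy(&loc.size, log.data() + pos + 12, sizeof loc.size);
        if (log.size() - pos - kHeaderSize < key_len) break;  // torn record
        std::string key(reinterpret_cast<const char*>(log.data()) + pos + kHeaderSize,
                        key_len);
        index.insert_or_assign(std::move(key), loc);
        pos += kHeaderSize + key_len;
    }
    return index;
}
```

With this shape, each BatchOffload appends only its own records, and a periodic full snapshot bounds how much log has to be replayed at startup.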

Collaborator Author

In my plan, async/background rebuild is intentionally deferred to a follow-up PR, should I implement it in this PR?

Collaborator

> In my plan, async/background rebuild is intentionally deferred to a follow-up PR, should I implement it in this PR?

I think it's fine to address this in a follow-up PR. No problem with that.

@ykwd ykwd requested a review from LujhCoconut May 25, 2026 07:24
zxpdemonio and others added 2 commits May 25, 2026 15:39
Harden snapshot file handling with a fixed metadata size cap and safer directory fsync/iteration paths, and avoid copying the full recovery snapshot state on every batch persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject and synchronously clean up expired promotion-on-hit tasks before stale AllocStart or success RPCs can leave orphaned memory replicas, and harden the promotion tests around real reap state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 82.54620% with 170 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mooncake-store/src/storage_backend.cpp | 71.21% | 154 Missing ⚠️ |
| mooncake-store/tests/storage_backend_test.cpp | 97.01% | 8 Missing ⚠️ |
| mooncake-store/src/master_service.cpp | 91.17% | 3 Missing ⚠️ |
| mooncake-store/src/offset_allocator.cpp | 88.46% | 3 Missing ⚠️ |
| mooncake-store/tests/promotion_on_hit_test.cpp | 98.16% | 2 Missing ⚠️ |


close(fd);
recovery_manifest_path_ = GetRecoveryManifestPath();

const bool recovery_mode = fs::exists(recovery_manifest_path_);
Collaborator

@LujhCoconut LujhCoconut May 25, 2026

Have you considered adding an explicit enable_recovery flag instead of inferring from manifest existence? Default true to keep compatibility, but allow false for the old "clean slate" semantics. This would give operators an explicit switch in production to bypass the automatic logic during incidents or abnormal situations.

if (!recovery_mode) {
    for (const auto& entry : fs::directory_iterator(storage_path_, ec_dir)) {
        // ...existing iteration...
        const auto filename = entry.path().filename().string();
        if (filename.rfind(kAllocatorSnapshotPrefix, 0) == 0 ||
            filename.rfind(kIndexSnapshotPrefix, 0) == 0) {
            if (!enable_recovery_) {
                // Recovery disabled: orphaned snapshots are stale, delete them.
                LOG(WARNING) << "Recovery disabled, removing stale snapshot: "
                             << filename;
                fs::remove(entry.path(), ec_dir);
                ec_dir.clear();
            } else {
                // Recovery enabled: snapshots without a manifest are an error.
                LOG(ERROR) << "Recovery snapshot mismatch: manifest "
                              "missing but recovery snapshot file exists: "
                           << filename;
                return tl::make_unexpected(ErrorCode::FILE_NOT_FOUND);
            }
        }
    }
}

Building on this idea, we could also add graceful degradation for capacity mismatches: when enable_recovery=true but LoadRecoverySnapshots() fails with SNAPSHOT_INCOMPATIBLE (e.g., capacity changed), fall back to a fresh start instead of hard failure.

// Note: recovery_mode must be declared non-const for the fallback below.
if (recovery_mode) {
    auto load_result = LoadRecoverySnapshots();
    if (!load_result) {
        if (enable_recovery_ &&
            load_result.error() == ErrorCode::SNAPSHOT_INCOMPATIBLE) {
            // Incompatible snapshot (e.g. capacity changed): start fresh.
            LOG(WARNING) << "Snapshot capacity/config mismatch, "
                         << "discarding old snapshots and starting fresh";
            recovery_mode = false;
            CleanupObsoleteRecoverySnapshots(/*all=*/0);
        } else {
            return load_result;
        }
    }
}
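
The combined behaviour proposed in this comment can be written down as a small decision function. This is one possible reading, not agreed-upon semantics: `LoadStatus`, `StartupAction`, and `DecideStartup` are hypothetical, and treating `enable_recovery=false` as "always discard existing recovery files" is an assumption about what "clean slate" means when a manifest is present.

```cpp
#include <cassert>

enum class LoadStatus { kOk, kIncompatible, kCorrupted };
enum class StartupAction { kRecover, kFreshStart, kFreshStartAfterCleanup, kFail };

// Maps the on-disk situation plus the operator flag to a startup action.
// `load_status` is only consulted when recovery is enabled and a manifest exists.
StartupAction DecideStartup(bool enable_recovery, bool manifest_exists,
                            bool orphan_snapshots, LoadStatus load_status) {
    if (!enable_recovery) {
        // Assumed clean-slate semantics: ignore and remove any recovery files.
        return (manifest_exists || orphan_snapshots)
                   ? StartupAction::kFreshStartAfterCleanup
                   : StartupAction::kFreshStart;
    }
    if (!manifest_exists) {
        // Snapshots without a manifest stay a hard error when recovery is on.
        return orphan_snapshots ? StartupAction::kFail
                                : StartupAction::kFreshStart;
    }
    switch (load_status) {
        case LoadStatus::kOk:
            return StartupAction::kRecover;
        case LoadStatus::kIncompatible:
            // Graceful degradation, e.g. after a capacity change.
            return StartupAction::kFreshStartAfterCleanup;
        case LoadStatus::kCorrupted:
            return StartupAction::kFail;
    }
    return StartupAction::kFail;
}
```

Writing the cases out this way makes it easy to see that corruption still fails fast, while only a configuration mismatch degrades to a fresh start.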

Collaborator Author

Thanks for the suggestion, I'll fix it.

Comment on lines +3377 to +3381
if (recovery_mode) {
    auto load_result = LoadRecoverySnapshots();
    if (!load_result) {
        return load_result;
    }
Collaborator

This is the code location additionally referenced in the previous comment. The issue is that changing the capacity configuration and then restarting causes a hard startup failure: the backend does not come up at all. It would be cleaner to unify both fixes behind a single flag.

Collaborator Author

Thanks for the suggestion, I'll fix it.
