[Store] rebuild offset-allocator metadata on restart#2215
Conversation
Persist allocator and index snapshots with a manifest so OffsetAllocatorStorageBackend can restore its in-memory state after restart, fail fast on corrupted recovery metadata, and keep the recovery path covered by focused restart tests and docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements restart recovery for the OffsetAllocatorStorageBackend by introducing a mechanism to persist and restore allocator and index snapshots. The implementation includes atomic file operations, snapshot validation, and comprehensive unit tests. Feedback identifies several high-severity issues: the memory limit for snapshots is dangerously high, risking OOM crashes; BatchOffload suffers from significant performance bottlenecks due to O(N) index copies and synchronous I/O performed under an exclusive lock; and the directory fsync logic may fail for paths without parent components. It is also recommended to use the error-code-based directory_iterator to avoid exceptions.
| snapshot.stale_allocations = std::move(stale_allocations); | ||
| uint64_t generation = | ||
| next_snapshot_generation_.load(std::memory_order_relaxed); | ||
| auto persist_result = PersistRecoverySnapshots(snapshot, generation); |
There was a problem hiding this comment.
Calling PersistRecoverySnapshots synchronously inside BatchOffload while holding the exclusive recovery_state_mutex_ is a major performance bottleneck. This operation involves serializing the entire index (O(N)) and performing multiple fsync calls. All other concurrent BatchOffload calls will be blocked from updating their metadata until this I/O completes. Consider moving the persistence to a background thread or implementing an incremental logging mechanism (WAL) to keep the write path O(batch_size) instead of O(total_keys).
There was a problem hiding this comment.
In my plan, async/background rebuild is intentionally deferred to a follow-up PR, should I implement it in this PR?
There was a problem hiding this comment.
In my plan, async/background rebuild is intentionally deferred to a follow-up PR, should I implement it in this PR?
I think it's fine to address this in a follow-up PR. No problem with that.
Harden snapshot file handling with a fixed metadata size cap and safer directory fsync/iteration paths, and avoid copying the full recovery snapshot state on every batch persist. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject and synchronously clean up expired promotion-on-hit tasks before stale AllocStart or success RPCs can leave orphaned memory replicas, and harden the promotion tests around real reap state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| close(fd); | ||
| recovery_manifest_path_ = GetRecoveryManifestPath(); | ||
|
|
||
| const bool recovery_mode = fs::exists(recovery_manifest_path_); |
There was a problem hiding this comment.
Have you considered adding an explicit enable_recovery flag instead of inferring from manifest existence? Default true to keep compatibility, but allow false for the old "clean slate" semantics. This would give operators an explicit switch in production to bypass the automatic logic during incidents or abnormal situations.
if (!recovery_mode) {
for (const auto& entry : fs::directory_iterator(storage_path_, ec_dir)) {
// ...existing iteration...
const auto filename = entry.path().filename().string();
if (filename.rfind(kAllocatorSnapshotPrefix, 0) == 0 ||
filename.rfind(kIndexSnapshotPrefix, 0) == 0) {
if (!enable_recovery) {
LOG(WARNING) << "Recovery disabled, removing stale snapshot: "
<< filename;
fs::remove(entry.path(), ec_dir);
ec_dir.clear();
} else {
LOG(ERROR) << "Recovery snapshot mismatch: manifest "
"missing but recovery snapshot file exists: "
<< filename;
return tl::make_unexpected(ErrorCode::FILE_NOT_FOUND);
}
}
}
}
Building on this idea, we could also add graceful degradation for capacity mismatches: when enable_recovery=true but LoadRecoverySnapshots() fails with SNAPSHOT_INCOMPATIBLE (e.g., capacity changed), fall back to a fresh start instead of hard failure.
if (recovery_mode) {
auto load_result = LoadRecoverySnapshots();
if (!load_result) {
if (load_result.error() == ErrorCode::SNAPSHOT_INCOMPATIBLE
&& enable_recovery_) {
LOG(WARNING) << "Snapshot capacity/config mismatch, "
<< "discarding old snapshots and starting fresh";
recovery_mode = false;
CleanupObsoleteRecoverySnapshots(/*all=*/0);
} else {
return load_result;
}
}
}There was a problem hiding this comment.
Thanks for suggestion, I'll fix it.
| if (recovery_mode) { | ||
| auto load_result = LoadRecoverySnapshots(); | ||
| if (!load_result) { | ||
| return load_result; | ||
| } |
There was a problem hiding this comment.
The code locations additionally referenced in the previous comment. The issue is that changing the capacity configuration and restarting causes a hard startup failure — the system fails to come up completely. It feels elegant to unify both fixes behind a single flag.
There was a problem hiding this comment.
Thanks for suggestion, I'll fix it.
Motivation
OffsetAllocatorStorageBackendstores all objects in one shared data file and relies on in-memory metadata plus allocator state to locate objects. Before this change, that backend did not have a complete restart-recovery path: process restart could preservekv_cache.data, but it could not reliably reconstruct the allocator state and key index needed to serve reads and replay local-disk metadata back to the master.This PR closes that gap by persisting recovery metadata alongside the data file and rebuilding the backend's in-memory state during startup. The goal is to make
OffsetAllocatorStorageBackendfollow the existing synchronous startup recovery model already used by the storage stack, instead of introducing a new async rebuild path in this PR.Scenario
This PR targets the existing startup flow for local-disk recovery:
FileStorage::Init()starts the backend.OffsetAllocatorStorageBackend::Init()restores local allocator/index state from persisted recovery metadata.ScanMeta()iterates the rebuilt in-memory object view.NotifyOffloadSuccess()re-registers recovered objects with the master.With this change, restart recovery for offset-allocator-backed storage works the same way as the existing synchronous startup recovery skeleton: startup succeeds only after the backend's local readable state is rebuilt.
Description
This PR adds restart recovery for
OffsetAllocatorStorageBackendthrough three pieces:Recovery metadata persistence
kv_cache.data:kv_cache.allocator.<generation>kv_cache.index.<generation>kv_cache.manifestRestart validation and rebuild
Write-path integration and recovery correctness
Documentation
docs/source/design/ssd-offload.mdto describe:How Has This Been Tested?
/root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_test --gtest_filter="StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartOverwriteRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_RestartRecoveryAfterOverwriteBatchFailure:StorageBackendTest.OffsetAllocatorStorageBackend_RestartEmptyBaseline:StorageBackendTest.OffsetAllocatorStorageBackend_CleansObsoleteRecoverySnapshots:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsWhenSnapshotPairMissing:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnCorruptedIndexSnapshot:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotKeys:StorageBackendTest.OffsetAllocatorStorageBackend_InitFailsOnDuplicateSnapshotAllocations:StorageBackendTest.OffsetAllocatorStorageBackend_ScanMetaBatchesAfterRecovery:StorageBackendTest.OffsetAllocatorStorageBackend_IsEnableOffloadingRestoredAfterRecovery"/root/Mooncake/build-min-check/mooncake-store/tests/storage_backend_testNotes
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-pg)mooncake-rl)Type of Change
Checklist
./scripts/code_format.shbefore submitting.