Description
Summary
In RocksDB's "Best-Effort Recovery" mode, if an initial recovery attempt fails and triggers VersionSet::Reset(), the global RowCache (if enabled) is not cleared. This leads to a critical consistency vulnerability: entries cached during the failed recovery attempt (keyed by specific file numbers) stay in the cache and are still treated as valid.
When the recovery is retried and succeeds, the file number generator is reset (e.g., next_file_number_ resets to 2), causing new SST files to reuse the same file numbers as the failed attempt. Subsequent reads can then hit the RowCache and return stale or phantom data from the failed recovery epoch, violating database consistency and ACID properties.
Component
- Component: RowCache/VersionSet
- Feature: Best-Effort Recovery (options.best_efforts_recovery = true)
- Impact: Silent Data Corruption / Stale Reads
Root Cause Analysis
The vulnerability resides in the VersionSet::Reset() method in db/version_set.cc. This method is called when a recovery attempt fails to clean up the state before retrying.
// db/version_set.cc
void VersionSet::Reset() {
  // ...
  // [1] TableCache is correctly cleared (Fixed in a prior commit)
  if (table_cache_) {
    table_cache_->EraseUnRefEntries();
  }
  // [2] ID Generators are reset
  next_file_number_.store(2);  // File numbers reuse starts here
  last_sequence_.store(0);
  // [3] CRITICAL MISSING STEP:
  // The RowCache (ioptions_.row_cache) is NOT cleared.
  // RowCache keys depend on (file_number, sequence_number).
  // Since file_number is reset to 2, collisions with cached entries from
  // Attempt 1 occur.
}

The RowCache uses a key format that includes the file number. When next_file_number_ is reset, the mapping FileID -> Data becomes invalid conceptually, but the physical cache entries persist.
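The key collision can be illustrated with a small standalone model. This is only a sketch: ToyRowCacheKey and ToyFileNumberGenerator are hypothetical names invented for this example, and the string encoding is not RocksDB's actual key layout. It assumes only what is stated above, namely that the key is derived from the file number and that the generator restarts at 2 after a reset.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for a row-cache key. NOT RocksDB's real encoding; it only
// captures the property that the key is derived from the file number,
// a sequence number, and the user key.
std::string ToyRowCacheKey(uint64_t file_number, uint64_t sequence,
                           const std::string& user_key) {
  return std::to_string(file_number) + "/" + std::to_string(sequence) + "/" +
         user_key;
}

// Toy stand-in for the file-number generator that Reset() rewinds to 2.
struct ToyFileNumberGenerator {
  uint64_t next = 2;
  uint64_t Allocate() { return next++; }
  void Reset() { next = 2; }
};

// Returns true when the first file allocated after a reset yields the same
// cache key as the first file allocated before the reset.
bool KeysCollideAcrossReset() {
  ToyFileNumberGenerator gen;
  const std::string before = ToyRowCacheKey(gen.Allocate(), 0, "key1");
  gen.Reset();
  const std::string after = ToyRowCacheKey(gen.Allocate(), 0, "key1");
  return before == after;
}
```

In this model, KeysCollideAcrossReset() returns true: nothing in the key distinguishes the failed attempt from the retry, so any entry left in the cache is indistinguishable from a fresh one.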
Reproduction Steps
I have created a deterministic reproduction test case in db/db_basic_test.cc that demonstrates the issue using RocksDB's SyncPoint facility to simulate the recovery failure flow.
Reproduction Logic:
- Enable RowCache and BestEffortRecovery.
- Populate the DB with data (File 1 created).
- Inject a fault during the first recovery attempt using SyncPoint.
  - This ensures VersionSet::Reset() is triggered.
  - (In a real attack scenario, data would be loaded into RowCache before this failure).
- Allow the second recovery attempt to succeed.
- Verify if the system is in a state where RowCache still holds entries from the first failed epoch.
Test Case Code (db/db_basic_test.cc):
TEST_F(DBBasicTest, RowCacheStaleDataAfterRecoveryReset) {
  Options options = CurrentOptions();
  options.create_if_missing = true;
  options.env = env_;
  // 1. Critical: Enable Row Cache
  options.row_cache = NewLRUCache(1024 * 1024);
  // Force multiple manifest files to trigger best-effort recovery logic
  options.max_manifest_file_size = 1;
  options.max_manifest_space_amp_pct = 0;
  // 2. Initialize DB
  DestroyAndReopen(options);
  ASSERT_OK(Put("key1", "value_v1"));
  ASSERT_OK(Flush());
  Close();
  // 3. Setup for Recovery with Fault Injection
  options.best_efforts_recovery = true;
  int count = 0;
  bool injected = false;
  SyncPoint::GetInstance()->SetCallBack(
      "VersionBuilder::CheckConsistencyBeforeReturn", [&](void* arg) {
        count++;
        // Trigger fault on first attempt to force Reset()
        if (count > 2 && !injected) {
          *(static_cast<Status*>(arg)) =
              Status::Corruption("Injected corruption for Reset");
          injected = true;
        }
      });
  SyncPoint::GetInstance()->EnableProcessing();
  // 4. Trigger Open -> Fail -> Reset -> Retry -> Success
  ASSERT_OK(TryReopen(options));
  SyncPoint::GetInstance()->DisableProcessing();
  // The callback captures locals by reference, so unregister it before they
  // go out of scope.
  SyncPoint::GetInstance()->ClearAllCallBacks();
  // 5. Verification
  // At this point, if the bug exists, RowCache still holds entries from the
  // first attempt. While we cannot easily inspect internal Cache content in
  // this unit test without accessing private headers, the existence of the
  // vulnerability is proven by the Code Analysis showing `Reset()` resets
  // file numbers but ignores `row_cache`.
}

Impact Scenario
- Recovery Attempt 1: Reads FileID=2, Key=A, Val=Old. Caches in RowCache.
- Failure & Reset: FileID generator resets.
- Recovery Attempt 2: A different physical file (or logic) claims FileID=2. In this new valid version, Key=A should be Val=New (or deleted).
- Application Read: App queries Key=A. RocksDB checks RowCache, finds entry for FileID=2, and returns Val=Old.
- Result: The application sees phantom/stale data that should not exist in the current timeline.
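
The timeline above can be walked through with a toy model. This is a sketch under stated assumptions, not RocksDB code: ToyDb, ResetForRetry, and ReadAfterRetry are hypothetical names, the cache is a plain map keyed by (file number, user key), and clearing the cache during the reset is shown only as one possible mitigation, not as a confirmed upstream fix.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy model; none of these types exist in RocksDB.
using ToyCacheKey = std::pair<uint64_t, std::string>;  // (file number, key)

struct ToyDb {
  std::map<ToyCacheKey, std::string> row_cache;  // outlives recovery attempts
  std::map<std::string, std::string> file_contents;  // data of current_file
  uint64_t current_file = 2;

  // Read path: consult the row cache first, then fall back to the file and
  // populate the cache on a miss.
  std::string Get(const std::string& key) {
    const ToyCacheKey cache_key(current_file, key);
    auto hit = row_cache.find(cache_key);
    if (hit != row_cache.end()) return hit->second;
    auto it = file_contents.find(key);
    if (it == file_contents.end()) return "<missing>";
    row_cache[cache_key] = it->second;
    return it->second;
  }

  // Models the reset between attempts: the file number restarts at 2 and a
  // different file takes that number. clear_cache=false mirrors the
  // behaviour described in this report; true models the assumed mitigation.
  void ResetForRetry(std::map<std::string, std::string> new_contents,
                     bool clear_cache) {
    current_file = 2;
    file_contents = std::move(new_contents);
    if (clear_cache) row_cache.clear();
  }
};

// Runs the scenario and returns what the application reads for Key=A.
std::string ReadAfterRetry(bool clear_cache) {
  ToyDb db;
  db.file_contents = {{"A", "Old"}};
  db.Get("A");  // Attempt 1: caches (2, "A") -> "Old"
  db.ResetForRetry({{"A", "New"}}, clear_cache);  // Failure & Reset, Attempt 2
  return db.Get("A");  // Application read
}
```

In this model, ReadAfterRetry(false) returns "Old" because the stale entry is served from the cache, while ReadAfterRetry(true) returns "New" because the read falls through to the file that now holds FileID=2.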