Skip to content

Commit 6ba34cd

Browse files
committed
fix: resolve full-codebase audit findings across all subsystems
Full-codebase audit (/x-overhaul): 25 findings fixed across the write/read path, compaction, iterator, SST, WAL, cache, manifest, and memtable. 2 HIGH findings (CRC-protected block decode, defensive ValueType decode) recorded as Won't Fix with rationale. Crash safety: - Fail-stop via bg_error on any post-freeze flush failure so a later flush cannot advance log_number past an unflushed memtable (stale reads + lost recovery on restart). - Fsync the DB directory in log_and_apply before any MANIFEST record can reference a newly created SST (covers flush, both compaction paths, and the maybe_compact_manifest snapshot). - Make WAL recovery crash-idempotent: install the recovered SST and advance log_number in one edit, and never reuse a file number already on disk. - Tolerate any corrupt-tail WAL record on recovery instead of failing open. Concurrency: - Serialize snapshot acquisition with compaction's snapshot-list capture (snapshot_seq under the DB lock; capture after picking inputs) so a compaction cannot GC a version a concurrently-registered snapshot needs. Compaction correctness: - Cut output files only at user-key boundaries (never split a key's versions, avoiding overlapping L1+ ranges) in both sub-compaction and force_merge_level. - Apply range-tombstone filtering to snapshot-retained older versions to prevent resurrection when a covering tombstone is dropped at the bottom level. - Delete orphaned output SSTs when installing a compaction edit fails. Iterator correctness: - Fix prefix-successor truncation, prefix seek_for_prev clamping, lazy next_back re-yield, and LevelIterator::seek_to_last on filtered files. - Apply block_property_filters to prefix iterators. Data integrity / robustness: - Reject empty/inverted range deletions (avoid mis-deleting the begin key). - Reject oversized SST entries the reader would reject; cap bloom filter size; reject u32 varint overflow. - Bound the block-cache reverse index via a moka eviction listener; honor block_cache_capacity == 0 as a disabled cache. - Validate num_levels >= 2. Style/docs: grouped re-exports, error imports in sst::format, fixed-width footer docs, test SAFETY comments. Adds 4 regression tests.
1 parent a2283da commit 6ba34cd

18 files changed

Lines changed: 613 additions & 186 deletions

File tree

.claude/audit.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -10,20 +10,30 @@
1010

1111
## Won't Fix
1212

13+
### [HIGH] sst: corrupt block entry could read restart-array bytes / iteration degrades to EOF
14+
- **Where**: src/sst/block.rs:73-97, 411-528; src/sst/table_reader.rs:379-392
15+
- **What**: Entry decoders bound key/value lengths against `data.len()` rather than the data-region end (`restart_offset`), and a mid-block decode failure surfaces as iterator EOF rather than an error.
16+
- **Reason**: Data blocks are CRC32-verified before they are ever decoded (`table_reader.rs:446-452` rejects any block whose checksum does not match), so a corrupt block never reaches the decoders. The decoders already guard against out-of-bounds with `checked_add` + `value_end > data.len()`. Reaching the reported state requires a CRC32 collision (~2^-32). The suggested remedy (thread `restart_offset` through `decode_entry_at`/`decode_entry_reuse` to ~6 call sites and convert block iteration to a `Result`-returning protocol) is a large, hot-path API change disproportionate to a CRC-collision-only scenario.
17+
18+
### [HIGH] types: invalid ValueType byte decodes as Deletion instead of erroring
19+
- **Where**: src/types.rs:54-62, 127-130, 178-181
20+
- **What**: `unpack_sequence_and_type` maps an unknown type tag to `Deletion` (invisible) rather than surfacing corruption.
21+
- **Reason**: This is a documented, intentional defensive choice (src/types.rs:58-60): all persisted internal keys come from CRC32-protected storage (WAL records and SST blocks are checksummed before decode), so an invalid tag is unreachable in practice; when it is reached (memory corruption / future format drift) defaulting to the *invisible* Deletion is strictly safer than the *phantom-data* Value. `value_type()` is on the hot read path (decoded on every key comparison/lookup); making it fallible across all SST/iterator/compaction call sites is disproportionate to a CRC-collision-only scenario.
22+
1323
### [MEDIUM] db: `do_compaction` holds db_mutex during blocking I/O
14-
- **Where**: src/db.rs:2405-2446
24+
- **Where**: src/db.rs:2438-2493
1525
- **What**: `do_compaction()` holds the inner mutex across `execute_compaction_with_cache` and `force_merge_level`, blocking all writers for the entire compaction duration.
1626
- **Reason**: Refactoring to a 3-phase pattern (lock → I/O → lock) requires `do_compaction` to not take `&self` with the lock, which is a deep architectural change. Only triggered by explicit `compact()`/`flush()` API calls, not the background compaction path.
1727

1828
### [MEDIUM] db: `freeze_and_flush` holds db_mutex during SST write I/O
19-
- **Where**: src/db.rs:2288-2299
29+
- **Where**: src/db.rs:2295-2320
2030
- **What**: Writes an entire SST file while the caller holds the DB lock.
2131
- **Reason**: Only called from `close()` (shutdown path). Refactoring is disproportionate risk for a non-hot path. The background flush path correctly uses the 3-phase pattern.
2232

2333
### [MEDIUM] manifest: `log_and_apply` performs I/O via `maybe_compact_manifest` while caller holds DB mutex
24-
- **Where**: src/manifest/version_set.rs:316, 406-414
25-
- **What**: Every 1000th edit, `maybe_compact_manifest` fires from inside `log_and_apply`, performing file create, write, fsync, rename, directory fsync, and file removal — all while the caller holds the main DB mutex.
26-
- **Reason**: Fixing requires separating `VersionSet` from the DB mutex (new manifest-specific lock), which is a deep architectural refactor. The current behavior causes a periodic write-stall spike (~10-100ms every 1000 edits) but does not affect correctness.
34+
- **Where**: src/manifest/version_set.rs:317, 421-466
35+
- **What**: Every 1000th edit, `maybe_compact_manifest` fires from inside `log_and_apply`, performing file create, write, fsync, rename, directory fsync, and file removal — all while the caller holds the main DB mutex. (`log_and_apply` now also fsyncs the DB directory when the edit adds files, for crash-safety.)
36+
- **Reason**: Fixing requires separating `VersionSet` from the DB mutex (new manifest-specific lock), which is a deep architectural refactor. The current behavior causes a periodic write-stall spike but does not affect correctness.
2737

2838
### [LOW] db: `max_immutable_memtables` option declared but not enforced
2939
- **Where**: src/options.rs:18-19
@@ -36,6 +46,6 @@
3646
- **Reason**: Public API for external consumers; cannot determine if unused without checking downstream crates.
3747

3848
### [LOW] types: "Safety:" comment in non-unsafe context
39-
- **Where**: src/types.rs:49-52
49+
- **Where**: src/types.rs:54-62
4050
- **What**: Comment uses "Safety:" but no unsafe block is involved.
4151
- **Reason**: Comment is a design note, not a safety justification. Cosmetic only.

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "mmdb"
3-
version = "3.2.4"
3+
version = "3.2.5"
44
edition = "2024"
55
description = "The storage engine behind vsdb — a pure-Rust LSM-Tree key-value store"
66
license = "MIT"

src/cache/block_cache.rs

Lines changed: 47 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
use std::collections::{HashMap, HashSet};
44
use std::sync::Arc;
55

6+
use moka::notification::RemovalCause;
67
use parking_lot::Mutex;
78

89
/// Cache key: (sst_file_number, block_offset).
@@ -11,33 +12,61 @@ type CacheKey = (u64, u64);
1112
/// Cached block data.
1213
type CacheValue = Arc<Vec<u8>>;
1314

15+
/// Reverse index: which block offsets belong to each file, used for bulk
16+
/// invalidation. Shared with the moka eviction listener so LRU-evicted offsets
17+
/// are pruned automatically (otherwise it would grow without bound).
18+
type FileOffsets = Arc<Mutex<HashMap<u64, HashSet<u64>>>>;
19+
1420
/// Thread-safe LRU cache for SST data blocks.
1521
/// Supports pinning entries that should never be evicted (e.g., L0 index/filter blocks).
1622
pub struct BlockCache {
1723
inner: moka::sync::Cache<CacheKey, CacheValue>,
1824
/// Pinned entries — never evicted by LRU. Protected by mutex.
1925
pinned: Mutex<HashMap<CacheKey, CacheValue>>,
2026
/// Track which offsets belong to each file for bulk invalidation.
21-
file_offsets: Mutex<HashMap<u64, HashSet<u64>>>,
27+
file_offsets: FileOffsets,
28+
/// When true (capacity 0), caching is disabled: inserts are no-ops and
29+
/// lookups always miss. Honors the documented "0 disables caching" option.
30+
disabled: bool,
2231
}
2332

2433
impl BlockCache {
2534
/// Create a new block cache with the given capacity in bytes.
35+
/// A capacity of 0 disables caching entirely.
2636
pub fn new(capacity_bytes: u64) -> Self {
37+
let file_offsets: FileOffsets = Arc::new(Mutex::new(HashMap::new()));
38+
let listener_offsets = file_offsets.clone();
39+
let inner = moka::sync::Cache::builder()
40+
.max_capacity(capacity_bytes)
41+
.weigher(|_key: &CacheKey, value: &CacheValue| -> u32 {
42+
value.len().min(u32::MAX as usize) as u32
43+
})
44+
.eviction_listener(
45+
move |key: Arc<CacheKey>, _value: CacheValue, _cause: RemovalCause| {
46+
let (file_number, block_offset) = *key;
47+
let mut map = listener_offsets.lock();
48+
if let Some(set) = map.get_mut(&file_number) {
49+
set.remove(&block_offset);
50+
if set.is_empty() {
51+
map.remove(&file_number);
52+
}
53+
}
54+
},
55+
)
56+
.build();
2757
Self {
28-
inner: moka::sync::Cache::builder()
29-
.max_capacity(capacity_bytes)
30-
.weigher(|_key: &CacheKey, value: &CacheValue| -> u32 {
31-
value.len().min(u32::MAX as usize) as u32
32-
})
33-
.build(),
58+
inner,
3459
pinned: Mutex::new(HashMap::new()),
35-
file_offsets: Mutex::new(HashMap::new()),
60+
file_offsets,
61+
disabled: capacity_bytes == 0,
3662
}
3763
}
3864

3965
/// Look up a cached block. Pinned entries are checked first.
4066
pub fn get(&self, file_number: u64, block_offset: u64) -> Option<Arc<Vec<u8>>> {
67+
if self.disabled {
68+
return None;
69+
}
4170
let key = (file_number, block_offset);
4271
if let Some(v) = self.pinned.lock().get(&key) {
4372
return Some(v.clone());
@@ -48,6 +77,9 @@ impl BlockCache {
4877
/// Insert a block into the cache.
4978
pub fn insert(&self, file_number: u64, block_offset: u64, data: Vec<u8>) -> Arc<Vec<u8>> {
5079
let arc = Arc::new(data);
80+
if self.disabled {
81+
return arc;
82+
}
5183
self.inner.insert((file_number, block_offset), arc.clone());
5284
self.file_offsets
5385
.lock()
@@ -65,8 +97,11 @@ impl BlockCache {
6597
block_offset: u64,
6698
data: Vec<u8>,
6799
) -> Arc<Vec<u8>> {
68-
let key = (file_number, block_offset);
69100
let arc = Arc::new(data);
101+
if self.disabled {
102+
return arc;
103+
}
104+
let key = (file_number, block_offset);
70105
self.pinned.lock().insert(key, arc.clone());
71106
self.file_offsets
72107
.lock()
@@ -96,7 +131,9 @@ impl BlockCache {
96131

97132
/// Invalidate a specific cached block.
98133
pub fn invalidate(&self, file_number: u64, block_offset: u64) {
99-
self.inner.invalidate(&(file_number, block_offset));
134+
let key = (file_number, block_offset);
135+
self.pinned.lock().remove(&key);
136+
self.inner.invalidate(&key);
100137
if let Some(offsets) = self.file_offsets.lock().get_mut(&file_number) {
101138
offsets.remove(&block_offset);
102139
}

0 commit comments

Comments
 (0)