You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix: correct doc-code misalignments from docs review; remove dead l0_cv
Documentation accuracy pass over review guides and user docs, fixing
every claim that did not match the code:
- unsafe-audit.md: db.rs unsafe is Send/Sync impls + flock + group-commit
WriteRequest pointer derefs (no transmute/ArcSwap/Arc-raw patterns exist);
table_reader.rs unsafe is a posix_fadvise call; node_kv/node_next0 live in
skiplist_impl.rs and are crate-internal
- concurrency.md: L0 backpressure uses inline do_compaction + progressive
sleep, not a condvar stall; only two condvar wait sites exist (write_cv,
compaction_notify); flush paths corrected (do_compaction never flushes
memtables)
- iterator.md: cross-level tombstone pruning invariant was stated backwards;
FragmentedRangeTombstoneList queries are O(log T) binary search, not
sweep-line O(1); removed fabricated "LevelIterator dirty cache" bullet
- docs/issues.md: flush() releases the DB mutex during SST writes (only the
do_compaction path holds it, blocking writers, not lock-free readers);
InternalKey asserts fire in release builds too and new() needs no guard
- README.md: range_del.rs documents both tombstone structures
Code changes:
- db.rs: remove l0_cv condvar — it was notify-only dead code (no wait site
anywhere); fix stale freeze_and_flush caller comment (close() only)
- Makefile: drop redundant second clippy run (--all-targets already covers
tests); CLAUDE.md lint annotation updated to match
Bump minor version 3.2.6 -> 3.3.0.
Copy file name to clipboardExpand all lines: .claude/commands/x-review.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -78,7 +78,7 @@ Check every change for:
78
78
Check changed files against project style rules:
79
79
80
80
1.**No lint suppression** — `#[allow(clippy::...)]`, `#[allow(unused_...)]`, `#[allow(dead_code)]` etc. are forbidden. All warnings must be fixed at the source, not silenced. Report every `allow(...)` attribute in changed code as a finding.
81
-
2.**No inline paths** — Types must be imported via `use` at the top of the file, not referenced inline as `std::collections::HashMap::new()`. The only exception is disambiguation in a single call site where two types share a name.
81
+
2.**Prefer imports over inline paths** — Avoid `std::foo::Bar::new()` inline in function bodies when the same path appears 3+ times in a file; add `use std::foo;` at file top (or `use std::foo::Bar;`) instead. Function-body `use` statements (scoped imports) are fine and don't count as inline paths. 1-2 inline uses of common `std::` items are acceptable. Report only when the same full-qualified path appears 3+ times inline across a file.
82
82
3.**Import grouping** — Imports with a common prefix must be merged using nested braces: `use std::sync::{Arc, Mutex};` not separate `use std::sync::Arc; use std::sync::Mutex;`.
83
83
4.**Doc-code alignment** — If the change modifies a public function signature, struct field, module structure, or adds/removes/renames a public type or module, verify docs still match. Specifically check:
84
84
-`CLAUDE.md` architecture table (subsystem paths, type names, dependency info)
@@ -214,7 +214,7 @@ After all agents complete:
214
214
```
215
215
## Full Audit Report
216
216
217
-
**Scope**: All src/ files (~17K LOC)
217
+
**Scope**: All src/ files (~19K LOC)
218
218
**Subsystems Audited**: <list>
219
219
**Total Findings**: N (X critical, Y high, Z medium, W low)
Copy file name to clipboardExpand all lines: .claude/docs/patterns/compaction.md
+18-1Lines changed: 18 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -20,6 +20,11 @@ L1+ output files must have non-overlapping key ranges. If two output files overl
20
20
Tombstones at non-bottommost levels MUST be written to output. At bottommost level, tombstones are dropped only if no live snapshot covers them.
21
21
**Check**: Verify `is_bottommost_level` accounts for ALL levels below, not just `level + 1`.
22
22
23
+
### INV-C2a: Range Tombstone Sub-Compaction Propagation
24
+
When a range tombstone spans sub-compaction boundaries, it must be propagated to all sub-compactions whose key ranges intersect it. The `RangeTombstoneTracker` (pre-populated with all input range deletions) is shared across sub-compactions so that each sub-compaction's `CompactionIter` sees the complete set of covering tombstones.
25
+
26
+
**Check**: Verify the `RangeTombstoneTracker` is fully constructed from all input files BEFORE sub-compactions are spawned. Sub-compaction boundaries must not truncate range tombstone coverage.
27
+
23
28
### INV-C3: Sequence Number Zeroing
24
29
Sequence numbers may only be zeroed at the bottommost level, and only for keys with no live snapshot dependency.
25
30
**Check**: Verify sequence zeroing logic checks both bottommost AND snapshot list.
@@ -32,6 +37,11 @@ When splitting work across threads, the boundary key must be assigned to exactly
32
37
Compaction input files may only be deleted after the output files are fully written AND the MANIFEST records the new version.
33
38
**Check**: Verify file deletion happens AFTER `VersionSet::log_and_apply()` succeeds.
34
39
40
+
### INV-C5a: Stale-Output Detection in `install_compaction`
41
+
After compaction I/O completes and the lock is re-acquired to install results, verify that the SuperVersion used to pick inputs has not been superseded by a concurrent flush or another compaction. If a newer version exists, the output files from this compaction may reference stale inputs and MUST be discarded rather than installed.
42
+
43
+
**Check**: Verify `install_compaction()` checks that its input files are still present in the current Version before applying the VersionEdit. Output files built from stale inputs must be deleted without being added to the MANIFEST.
44
+
35
45
### INV-C6: Compaction Filter Contract
36
46
User-provided `CompactionFilter` receives every key-value pair exactly once. Filter decisions (Keep/Remove/ChangeValue) must be applied atomically per entry.
37
47
**Check**: Verify filter is called inside the merge loop, not on intermediate state.
When computing whether a file overlaps with deeper levels (for `is_bottommost_level` and tombstone decisions), the overlap check must consider the aggregate key ranges of all input files, not individual file ranges. A single input file may have a narrow key range, but the compaction as a whole spans a wider range that overlaps files in deeper levels.
66
+
67
+
**Check**: Verify `add_file_extents()` collects the union of all input file user-key extents before checking for overlaps in deeper levels. Individual-file checks will miss overlaps covered by other files in the same compaction.
68
+
54
69
## Review Checklist
55
70
-[ ] Output files have monotonically increasing, non-overlapping key ranges
56
-
-[ ] Tombstones retained at non-bottommost levels
71
+
-[ ] Tombstones retained at non-bottommost levels; range tombstones propagated to all sub-compactions
57
72
-[ ] Sequence zeroing only at bottommost with no snapshot dependency
58
73
-[ ] Sub-compaction boundaries don't duplicate or skip keys
59
74
-[ ] MANIFEST updated before input file deletion
60
75
-[ ] CompactionFilter called once per logical key-value
61
76
-[ ] Trivial move conditions are sufficient (1 input file, no overlap, no compaction filter)
62
77
-[ ] Error handling: partial compaction failure leaves DB in consistent state
63
78
-[ ] Rate limiter doesn't cause unbounded stalls
79
+
-[ ]`install_compaction` checks for stale inputs before applying VersionEdit
80
+
-[ ]`is_bottommost_level` uses aggregate key range extents from all input files, not per-file ranges
**Check**: Verify I/O operations happen between mutex release and re-acquire.
65
+
Flush never runs on the background compaction threads (`do_compaction` only compacts SSTs; it never freezes/flushes a memtable). The real flush paths are:
66
+
-**Write-path auto-flush** (`write_batch_group`): freezes under `db_mutex`, drops the lock during the SST write, re-acquires to install
67
+
-**Explicit `flush()` / `compact_range()`**: same pattern — lock released during the SST write
68
+
-**`close()`** via `freeze_and_flush`: holds the lock throughout (shutdown path, not performance-critical)
69
+
70
+
`do_compaction` holds `db_mutex` for its entire run; it is invoked by explicit `flush()`/`compact()`/`compact_range()` calls AND inline from the write path when L0 reaches `l0_stop_trigger` (`maybe_throttle_writes`).
71
+
72
+
**Check**: Verify I/O operations happen between mutex release and re-acquire. For auto-flush in the write path, verify the lock is dropped before SST write and re-acquired after.
62
73
63
-
### INV-CC5: Write Stall Signaling
64
-
When L0 file count exceeds the slowdown/stop threshold:
65
-
- Slowdown: artificial delay on write path
66
-
- Stop: block writers until compaction reduces L0 count
67
-
The signal (condvar) must be sent AFTER compaction completes AND VersionEdit is applied.
- Stop (`l0_stop_trigger`): the writer runs **inline**`do_compaction()` synchronously to reduce L0 count; if that compaction fails, the DB fail-stops (background error is set) rather than admitting the write
68
78
69
-
**Check**: Verify condvar::notify is called after the new version (with fewer L0 files) is installed.
79
+
**Check**: Verify the atomic L0 counter is refreshed after every install that changes L0 (flush install, compaction install, inline compaction). Verify the stop path fail-stops on compaction error instead of silently admitting writes.
70
80
71
81
### INV-CC6: Snapshot Lifetime
72
82
A snapshot pins a sequence number. All data visible at that sequence must remain accessible until the snapshot is dropped. This means:
@@ -83,8 +93,13 @@ Thread B holds `version_lock`, then tries to acquire `db_mutex`.
83
93
**Check**: Map all lock acquisition paths and verify no cycles.
84
94
85
95
### Lost Wakeup
86
-
Writer blocks on condvar waiting for compaction. Compaction completes and calls notify, but the writer hasn't entered wait() yet — notification is lost.
87
-
**Check**: Verify condvar wait is in a loop that re-checks the condition: `while l0_count >= stop_threshold { condvar.wait(&mut guard); }`
96
+
A thread waits on a condvar under lock, but the notifier calls `notify_all()` before the waiter acquires the lock and enters `wait()`. The notification is lost, and the waiter blocks forever.
97
+
98
+
**Two condvar wait patterns in db.rs** — verify each is in a loop re-checking its condition:
99
+
100
+
1.**`write_cv`** (group commit queue): `while !req.done && wq.leader_active { write_cv.wait(&mut wq); }` — writers wait until their batch is processed or they should become leader
101
+
2.**`compaction_notify`** (background compaction wake): `while !*has_work && !shutdown { has_work = cvar.wait(has_work).unwrap(); }` — compaction threads wait for work, use `std::sync::Condvar` paired with `std::sync::Mutex<bool>`
102
+
**Check**: Each wait site uses `while condition { condvar.wait(...) }` loop (not `if`). Each notify happens after the condition is changed under the lock the waiter holds in the wait loop.
88
103
89
104
### ArcSwap Load + Modify Race
90
105
Two threads both load the current SuperVersion, modify it, and try to store back. One modification is lost.
-FragmentedRangeTombstoneList: tombstones pre-collected at iterator creation, fragmented into non-overlapping intervals; immutable, O(log T) binary search per key (range_del.rs also keeps the sweep-line `RangeTombstoneTracker` used by compaction)
16
16
17
17
## Critical Invariants
18
18
@@ -26,7 +26,7 @@ Every key in the valid range that is visible at the snapshot's sequence number m
26
26
27
27
### INV-I3: Range Tombstone Coverage
28
28
If a range tombstone `[start, end)` with seq S covers a key K with seq S' < S, the key must be filtered out.
29
-
**Check**: Verify RangeTombstoneTracker is consulted for every key yielded by MergingIterator, and the check uses the correct sequence comparison.
29
+
**Check**: Verify `FragmentedRangeTombstoneList` is consulted for every key yielded by MergingIterator, and the check uses the correct sequence comparison. For the DBIterator path, every yielded key is checked against the pre-fragmented tombstone list via O(log T) binary search.
30
30
31
31
### INV-I4: Direction Switch Correctness
32
32
Switching from forward to backward (or vice versa) must not skip or duplicate any keys.
@@ -51,20 +51,39 @@ Switching from `next()` to `prev()` yields the current key twice.
51
51
**Check**: Verify the direction switch logic accounts for whether the current key has been consumed.
52
52
53
53
### Range Tombstone Not Checked on Seek
54
-
`seek(target)` lands on a key covered by a range tombstone, but the tombstone tracker isn't updated for the new position.
55
-
**Check**: Verify seek() resets/repositions the RangeTombstoneTracker.
54
+
`seek(target)` lands on a key covered by a range tombstone, but the covering-tombstone check is skipped for the new position.
55
+
**Check**: Verify every key yielded after seek() is still checked against the FragmentedRangeTombstoneList (the list is immutable and stateless — the risk is a code path that bypasses the per-key query, not stale tracker state).
56
56
57
57
### LevelIterator File Boundary Skip
58
58
When crossing from one SST file to the next in a level, a key at the exact boundary is skipped.
59
59
**Check**: Verify LevelIterator's file transition logic — the first key of the new file must be yielded.
These patterns exist in the codebase and should be checked during review:
64
+
65
+
-**`next_into()` buffer reuse**: Copies entries directly into caller-provided buffers — avoids per-entry heap allocation. Verify no aliasing with live MemTable/SST data.
66
+
-**Prefetch-based init_heap I/O**: `MergingIterator` issues `posix_fadvise(WILLNEED)` prefetch hints for all seekable sources before draining them during heap initialization (overlaps I/O across sources).
67
+
-**SetBounds propagation**: `ReadOptions` bounds are propagated to `LevelIterator` and `TableIterator` sub-iterators for early termination.
68
+
-**SkipPoint callback**: `ReadOptions.skip_point` filter allows callers to skip entries without consulting the value.
69
+
-**LevelIterator file skip by bound**: Files whose smallest user key is `>= upper_bound` are skipped entirely; the bound is also pushed into each opened `TableIterator`.
70
+
-**Cross-level tombstone pruning**: A tombstone from level L may only delete keys from levels deeper than L (`level >= source_level` tombstones are ignored). `FragmentedRangeTombstoneList` tracks per-level tombstone origin to enforce this — deeper-or-same-level tombstones must never suppress shallower keys.
71
+
-**`posix_fadvise` sequential readahead**: Detects sequential block access patterns and issues `WILLNEED` hints to the OS page cache.
72
+
-**Deferred block read**: SST index stores `first_key` per block; seek positions the iterator without reading data blocks until `next()` is called.
73
+
-**L0 metadata pinning**: Index and bloom filter blocks for L0 files are pinned in cache via `insert_pinned()` and unpinned when the file is compacted.
74
+
-**Atomic L0 counter**: Write-throttle checks use an atomic counter for L0 file count, avoiding mutex contention on the hot write path.
75
+
61
76
## Review Checklist
62
77
-[ ] MergingIterator heap invariant maintained after every next()/prev()/seek()
0 commit comments