Skip to content

Commit 88afa1e

Browse files
committed
fix: correct doc-code misalignments from docs review; remove dead l0_cv
Documentation accuracy pass over review guides and user docs, fixing every claim that did not match the code: - unsafe-audit.md: db.rs unsafe is Send/Sync impls + flock + group-commit WriteRequest pointer derefs (no transmute/ArcSwap/Arc-raw patterns exist); table_reader.rs unsafe is a posix_fadvise call; node_kv/node_next0 live in skiplist_impl.rs and are crate-internal - concurrency.md: L0 backpressure uses inline do_compaction + progressive sleep, not a condvar stall; only two condvar wait sites exist (write_cv, compaction_notify); flush paths corrected (do_compaction never flushes memtables) - iterator.md: cross-level tombstone pruning invariant was stated backwards; FragmentedRangeTombstoneList queries are O(log T) binary search, not sweep-line O(1); removed fabricated "LevelIterator dirty cache" bullet - docs/issues.md: flush() releases the DB mutex during SST writes (only the do_compaction path holds it, blocking writers, not lock-free readers); InternalKey asserts fire in release builds too and new() needs no guard - README.md: range_del.rs documents both tombstone structures Code changes: - db.rs: remove l0_cv condvar — it was notify-only dead code (no wait site anywhere); fix stale freeze_and_flush caller comment (close() only) - Makefile: drop redundant second clippy run (--all-targets already covers tests); CLAUDE.md lint annotation updated to match Bump minor version 3.2.6 -> 3.3.0.
1 parent 3be5c49 commit 88afa1e

15 files changed

Lines changed: 165 additions & 64 deletions

.claude/commands/x-review.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ Check every change for:
7878
Check changed files against project style rules:
7979

8080
1. **No lint suppression**`#[allow(clippy::...)]`, `#[allow(unused_...)]`, `#[allow(dead_code)]` etc. are forbidden. All warnings must be fixed at the source, not silenced. Report every `allow(...)` attribute in changed code as a finding.
81-
2. **No inline paths**Types must be imported via `use` at the top of the file, not referenced inline as `std::collections::HashMap::new()`. The only exception is disambiguation in a single call site where two types share a name.
81+
2. **Prefer imports over inline paths**Avoid `std::foo::Bar::new()` inline in function bodies when the same path appears 3+ times in a file; add `use std::foo;` at file top (or `use std::foo::Bar;`) instead. Function-body `use` statements (scoped imports) are fine and don't count as inline paths. 1-2 inline uses of common `std::` items are acceptable. Report only when the same full-qualified path appears 3+ times inline across a file.
8282
3. **Import grouping** — Imports with a common prefix must be merged using nested braces: `use std::sync::{Arc, Mutex};` not separate `use std::sync::Arc; use std::sync::Mutex;`.
8383
4. **Doc-code alignment** — If the change modifies a public function signature, struct field, module structure, or adds/removes/renames a public type or module, verify docs still match. Specifically check:
8484
- `CLAUDE.md` architecture table (subsystem paths, type names, dependency info)
@@ -214,7 +214,7 @@ After all agents complete:
214214
```
215215
## Full Audit Report
216216
217-
**Scope**: All src/ files (~17K LOC)
217+
**Scope**: All src/ files (~19K LOC)
218218
**Subsystems Audited**: <list>
219219
**Total Findings**: N (X critical, Y high, Z medium, W low)
220220

.claude/docs/patterns/compaction.md

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,11 @@ L1+ output files must have non-overlapping key ranges. If two output files overl
2020
Tombstones at non-bottommost levels MUST be written to output. At bottommost level, tombstones are dropped only if no live snapshot covers them.
2121
**Check**: Verify `is_bottommost_level` accounts for ALL levels below, not just `level + 1`.
2222

23+
### INV-C2a: Range Tombstone Sub-Compaction Propagation
24+
When a range tombstone spans sub-compaction boundaries, it must be propagated to all sub-compactions whose key ranges intersect it. The `RangeTombstoneTracker` (pre-populated with all input range deletions) is shared across sub-compactions so that each sub-compaction's `CompactionIter` sees the complete set of covering tombstones.
25+
26+
**Check**: Verify the `RangeTombstoneTracker` is fully constructed from all input files BEFORE sub-compactions are spawned. Sub-compaction boundaries must not truncate range tombstone coverage.
27+
2328
### INV-C3: Sequence Number Zeroing
2429
Sequence numbers may only be zeroed at the bottommost level, and only for keys with no live snapshot dependency.
2530
**Check**: Verify sequence zeroing logic checks both bottommost AND snapshot list.
@@ -32,6 +37,11 @@ When splitting work across threads, the boundary key must be assigned to exactly
3237
Compaction input files may only be deleted after the output files are fully written AND the MANIFEST records the new version.
3338
**Check**: Verify file deletion happens AFTER `VersionSet::log_and_apply()` succeeds.
3439

40+
### INV-C5a: Stale-Output Detection in `install_compaction`
41+
After compaction I/O completes and the lock is re-acquired to install results, verify that the SuperVersion used to pick inputs has not been superseded by a concurrent flush or another compaction. If a newer version exists, the output files from this compaction may reference stale inputs and MUST be discarded rather than installed.
42+
43+
**Check**: Verify `install_compaction()` checks that its input files are still present in the current Version before applying the VersionEdit. Output files built from stale inputs must be deleted without being added to the MANIFEST.
44+
3545
### INV-C6: Compaction Filter Contract
3646
User-provided `CompactionFilter` receives every key-value pair exactly once. Filter decisions (Keep/Remove/ChangeValue) must be applied atomically per entry.
3747
**Check**: Verify filter is called inside the merge loop, not on intermediate state.
@@ -51,13 +61,20 @@ Background compaction thread deadlocks, causing write stall (L0 file count grows
5161
Token bucket rate limiter starves compaction during heavy writes, causing L0 buildup.
5262
**Check**: Verify rate limiter allows burst capacity and doesn't block indefinitely.
5363

64+
### Aggregate Key Range Overlap
65+
When computing whether a file overlaps with deeper levels (for `is_bottommost_level` and tombstone decisions), the overlap check must consider the aggregate key ranges of all input files, not individual file ranges. A single input file may have a narrow key range, but the compaction as a whole spans a wider range that overlaps files in deeper levels.
66+
67+
**Check**: Verify `add_file_extents()` collects the union of all input file user-key extents before checking for overlaps in deeper levels. Individual-file checks will miss overlaps covered by other files in the same compaction.
68+
5469
## Review Checklist
5570
- [ ] Output files have monotonically increasing, non-overlapping key ranges
56-
- [ ] Tombstones retained at non-bottommost levels
71+
- [ ] Tombstones retained at non-bottommost levels; range tombstones propagated to all sub-compactions
5772
- [ ] Sequence zeroing only at bottommost with no snapshot dependency
5873
- [ ] Sub-compaction boundaries don't duplicate or skip keys
5974
- [ ] MANIFEST updated before input file deletion
6075
- [ ] CompactionFilter called once per logical key-value
6176
- [ ] Trivial move conditions are sufficient (1 input file, no overlap, no compaction filter)
6277
- [ ] Error handling: partial compaction failure leaves DB in consistent state
6378
- [ ] Rate limiter doesn't cause unbounded stalls
79+
- [ ] `install_compaction` checks for stale inputs before applying VersionEdit
80+
- [ ] `is_bottommost_level` uses aggregate key range extents from all input files, not per-file ranges

.claude/docs/patterns/concurrency.md

Lines changed: 28 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ MMDB uses a single-writer / multi-reader concurrency model:
88
Writers: [User Thread] → group commit leader → WAL + MemTable (serialized)
99
Readers: [User Thread] → ArcSwap<SuperVersion> load (lock-free) → MemTable + SST
1010
Background: [Compaction Thread(s)] → read SSTs, write new SSTs, update VersionSet
11-
Flush: [Background Thread] → freeze MemTable → write SST → update VersionSet
11+
Flush: [Write-path auto-flush, explicit flush()/compact_range(), close()] → freeze MemTable → write SST → update VersionSet
1212
```
1313

1414
## Key Synchronization Primitives
@@ -19,9 +19,13 @@ Flush: [Background Thread] → freeze MemTable → write SST → update Versi
1919
| `ArcSwap<SuperVersion>` | db.rs | Lock-free read path |
2020
| `Arc<MemTable>` | db.rs | MemTable lifetime across readers |
2121
| `Atomic*` | db.rs, skiplist | Counters, flags, skiplist node pointers |
22-
| `Condvar` | db.rs | Write stall (L0 backpressure) |
22+
| `write_cv` (parking_lot Condvar) | db.rs | Group commit write queue — writers wait for leadership/notification |
23+
| `compaction_notify` (std Condvar) | db.rs | Background compaction thread wake-up — work available notification |
2324
| `std::thread::scope` | compaction | Sub-compaction parallelism |
2425

26+
> **Note**: L0 backpressure does NOT use a condvar. There are exactly two condvar
27+
> wait sites in db.rs: `write_cv` and `compaction_notify`.
28+
2529
## Critical Invariants
2630

2731
### INV-CC1: Lock Ordering
@@ -53,20 +57,26 @@ Steps 2-5 must happen atomically (under the write lock). Step 6 may happen after
5357
**Check**: Verify the write lock is held from step 2 through step 5.
5458

5559
### INV-CC4: Background Thread Safety
56-
Compaction and flush threads must not hold `db_mutex` while performing I/O (reading/writing SST files). They should:
60+
Compaction threads must not hold `db_mutex` while performing I/O (reading/writing SST files). The 3-phase lock-release pattern:
5761
1. Acquire mutex → pick compaction inputs → release mutex
58-
2. Perform compaction I/O (no mutex)
62+
2. Perform compaction/flush I/O (no mutex)
5963
3. Acquire mutex → apply VersionEdit → release mutex
6064

61-
**Check**: Verify I/O operations happen between mutex release and re-acquire.
65+
Flush never runs on the background compaction threads (`do_compaction` only compacts SSTs; it never freezes/flushes a memtable). The real flush paths are:
66+
- **Write-path auto-flush** (`write_batch_group`): freezes under `db_mutex`, drops the lock during the SST write, re-acquires to install
67+
- **Explicit `flush()` / `compact_range()`**: same pattern — lock released during the SST write
68+
- **`close()`** via `freeze_and_flush`: holds the lock throughout (shutdown path, not performance-critical)
69+
70+
`do_compaction` holds `db_mutex` for its entire run; it is invoked by explicit `flush()`/`compact()`/`compact_range()` calls AND inline from the write path when L0 reaches `l0_stop_trigger` (`maybe_throttle_writes`).
71+
72+
**Check**: Verify I/O operations happen between mutex release and re-acquire. For auto-flush in the write path, verify the lock is dropped before SST write and re-acquired after.
6273

63-
### INV-CC5: Write Stall Signaling
64-
When L0 file count exceeds the slowdown/stop threshold:
65-
- Slowdown: artificial delay on write path
66-
- Stop: block writers until compaction reduces L0 count
67-
The signal (condvar) must be sent AFTER compaction completes AND VersionEdit is applied.
74+
### INV-CC5: Write Stall Behavior (L0 Backpressure)
75+
When L0 file count (cached in an atomic, no lock) exceeds a threshold, the write path applies backpressure — without any condvar:
76+
- Slowdown (`l0_slowdown_trigger`): progressive `thread::sleep` delay, longer as L0 count grows
77+
- Stop (`l0_stop_trigger`): the writer runs **inline** `do_compaction()` synchronously to reduce L0 count; if that compaction fails, the DB fail-stops (background error is set) rather than admitting the write
6878

69-
**Check**: Verify condvar::notify is called after the new version (with fewer L0 files) is installed.
79+
**Check**: Verify the atomic L0 counter is refreshed after every install that changes L0 (flush install, compaction install, inline compaction). Verify the stop path fail-stops on compaction error instead of silently admitting writes.
7080

7181
### INV-CC6: Snapshot Lifetime
7282
A snapshot pins a sequence number. All data visible at that sequence must remain accessible until the snapshot is dropped. This means:
@@ -83,8 +93,13 @@ Thread B holds `version_lock`, then tries to acquire `db_mutex`.
8393
**Check**: Map all lock acquisition paths and verify no cycles.
8494

8595
### Lost Wakeup
86-
Writer blocks on condvar waiting for compaction. Compaction completes and calls notify, but the writer hasn't entered wait() yet — notification is lost.
87-
**Check**: Verify condvar wait is in a loop that re-checks the condition: `while l0_count >= stop_threshold { condvar.wait(&mut guard); }`
96+
A thread waits on a condvar under lock, but the notifier calls `notify_all()` before the waiter acquires the lock and enters `wait()`. The notification is lost, and the waiter blocks forever.
97+
98+
**Two condvar wait patterns in db.rs** — verify each is in a loop re-checking its condition:
99+
100+
1. **`write_cv`** (group commit queue): `while !req.done && wq.leader_active { write_cv.wait(&mut wq); }` — writers wait until their batch is processed or they should become leader
101+
2. **`compaction_notify`** (background compaction wake): `while !*has_work && !shutdown { has_work = cvar.wait(has_work).unwrap(); }` — compaction threads wait for work, use `std::sync::Condvar` paired with `std::sync::Mutex<bool>`
102+
**Check**: Each wait site uses `while condition { condvar.wait(...) }` loop (not `if`). Each notify happens after the condition is changed under the lock the waiter holds in the wait loop.
88103

89104
### ArcSwap Load + Modify Race
90105
Two threads both load the current SuperVersion, modify it, and try to store back. One modification is lost.

.claude/docs/patterns/iterator.md

Lines changed: 24 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
- MergingIterator: min-heap merge of K sources with single-source fast path
1313
- LevelIterator: deferred block reads, binary search on level
1414
- BidiIterator: direction switch with heap rebuild
15-
- RangeTombstoneTracker: fragmented tombstone list, sweep-line O(1) amortized
15+
- FragmentedRangeTombstoneList: tombstones pre-collected at iterator creation, fragmented into non-overlapping intervals; immutable, O(log T) binary search per key (range_del.rs also keeps the sweep-line `RangeTombstoneTracker` used by compaction)
1616

1717
## Critical Invariants
1818

@@ -26,7 +26,7 @@ Every key in the valid range that is visible at the snapshot's sequence number m
2626

2727
### INV-I3: Range Tombstone Coverage
2828
If a range tombstone `[start, end)` with seq S covers a key K with seq S' < S, the key must be filtered out.
29-
**Check**: Verify RangeTombstoneTracker is consulted for every key yielded by MergingIterator, and the check uses the correct sequence comparison.
29+
**Check**: Verify `FragmentedRangeTombstoneList` is consulted for every key yielded by MergingIterator, and the check uses the correct sequence comparison. For the DBIterator path, every yielded key is checked against the pre-fragmented tombstone list via O(log T) binary search.
3030

3131
### INV-I4: Direction Switch Correctness
3232
Switching from forward to backward (or vice versa) must not skip or duplicate any keys.
@@ -51,20 +51,39 @@ Switching from `next()` to `prev()` yields the current key twice.
5151
**Check**: Verify the direction switch logic accounts for whether the current key has been consumed.
5252

5353
### Range Tombstone Not Checked on Seek
54-
`seek(target)` lands on a key covered by a range tombstone, but the tombstone tracker isn't updated for the new position.
55-
**Check**: Verify seek() resets/repositions the RangeTombstoneTracker.
54+
`seek(target)` lands on a key covered by a range tombstone, but the covering-tombstone check is skipped for the new position.
55+
**Check**: Verify every key yielded after seek() is still checked against the FragmentedRangeTombstoneList (the list is immutable and stateless — the risk is a code path that bypasses the per-key query, not stale tracker state).
5656

5757
### LevelIterator File Boundary Skip
5858
When crossing from one SST file to the next in a level, a key at the exact boundary is skipped.
5959
**Check**: Verify LevelIterator's file transition logic — the first key of the new file must be yielded.
6060

61+
## Key Optimizations (not documented individually above)
62+
63+
These patterns exist in the codebase and should be checked during review:
64+
65+
- **`next_into()` buffer reuse**: Copies entries directly into caller-provided buffers — avoids per-entry heap allocation. Verify no aliasing with live MemTable/SST data.
66+
- **Prefetch-based init_heap I/O**: `MergingIterator` issues `posix_fadvise(WILLNEED)` prefetch hints for all seekable sources before draining them during heap initialization (overlaps I/O across sources).
67+
- **SetBounds propagation**: `ReadOptions` bounds are propagated to `LevelIterator` and `TableIterator` sub-iterators for early termination.
68+
- **SkipPoint callback**: `ReadOptions.skip_point` filter allows callers to skip entries without consulting the value.
69+
- **LevelIterator file skip by bound**: Files whose smallest user key is `>= upper_bound` are skipped entirely; the bound is also pushed into each opened `TableIterator`.
70+
- **Cross-level tombstone pruning**: A tombstone from level L may only delete keys from levels deeper than L (`level >= source_level` tombstones are ignored). `FragmentedRangeTombstoneList` tracks per-level tombstone origin to enforce this — deeper-or-same-level tombstones must never suppress shallower keys.
71+
- **`posix_fadvise` sequential readahead**: Detects sequential block access patterns and issues `WILLNEED` hints to the OS page cache.
72+
- **Deferred block read**: SST index stores `first_key` per block; seek positions the iterator without reading data blocks until `next()` is called.
73+
- **L0 metadata pinning**: Index and bloom filter blocks for L0 files are pinned in cache via `insert_pinned()` and unpinned when the file is compacted.
74+
- **Atomic L0 counter**: Write-throttle checks use an atomic counter for L0 file count, avoiding mutex contention on the hot write path.
75+
6176
## Review Checklist
6277
- [ ] MergingIterator heap invariant maintained after every next()/prev()/seek()
6378
- [ ] DBIterator skips same-user-key entries with higher sequence numbers correctly
6479
- [ ] All sources included: active MemTable + immutable MemTables + L0 files + L1+ levels
65-
- [ ] Range tombstone tracker consulted for every yielded key
80+
- [ ] FragmentedRangeTombstoneList consulted for every yielded key
6681
- [ ] Direction switch rebuilds heap and repositions all children
6782
- [ ] Prefix bound check is byte-exact, not lexicographic over-bound
6883
- [ ] Seek handles the case where target key is covered by a range tombstone
6984
- [ ] Iterator holds Arc references to prevent SuperVersion/MemTable/SST cleanup during iteration
7085
- [ ] `next_into()` buffer reuse doesn't alias with live data
86+
- [ ] SetBounds propagated correctly to LevelIterator and TableIterator sub-iterators
87+
- [ ] SkipPoint callback filter applied without advancing past matching entries
88+
- [ ] Prefetch hints (fadvise) issued at correct sequential-access detection points
89+
- [ ] L0 pinned blocks unpinned after compaction deletes the L0 file

0 commit comments

Comments
 (0)