This document describes the synchronization primitives and locking conventions used in RooDB.
RooDB uses a Raft-as-WAL architecture that naturally minimizes lock contention:
- Writes are buffered in session memory until COMMIT
- On COMMIT, changes are proposed to Raft
- Storage writes only occur in `apply()` after Raft consensus
- This means DML operations (INSERT/UPDATE/DELETE) don't hold database locks during execution (a sketch of this flow follows)
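A minimal sketch of this commit path, with hypothetical names (`WriteOp`, `SessionBuffer`, `RaftHandle` are illustrative stand-ins, not the actual RooDB API):

```rust
// Hypothetical sketch: DML is buffered per session; storage is only written
// in apply() after the proposed entry is committed by Raft consensus.
enum WriteOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

struct SessionBuffer {
    pending: Vec<WriteOp>, // INSERT/UPDATE/DELETE buffer here; no storage locks
}

struct RaftHandle; // stand-in for the Raft client

impl RaftHandle {
    async fn propose(&self, _ops: Vec<WriteOp>) -> Result<(), String> {
        Ok(()) // real implementation replicates the entry and waits for commit
    }
}

impl SessionBuffer {
    async fn commit(self, raft: &RaftHandle) -> Result<(), String> {
        // The whole batch becomes one Raft proposal; apply() on each node
        // performs the actual storage writes after consensus.
        raft.propose(self.pending).await
    }
}
```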
| Type | Library | Use Case |
|---|---|---|
| `RwLock` | `parking_lot` | Sync contexts (catalog, txn state, raft caches) |
| `Mutex` | `tokio::sync` | Async contexts where lock is held across await (compaction, Raft log) |
| `Mutex` | `parking_lot` | Short sync critical sections (I/O handles) |
| `Atomic*` | `std::sync::atomic` | Counters and flags |
- No lock poisoning (panics don't leave locks in invalid state)
- Better performance than std::sync
- Simpler API (no `.unwrap()` needed)
The compaction mutexes (`l0_compaction_mutex`, `level_compaction_mutex`) and the Raft `log_mutex` are held across await points. Using `parking_lot::Mutex` there would block the async runtime; `tokio::sync::Mutex` is designed for this use case.
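A hedged sketch of why this matters (`run_compaction` and `write_sstable` are hypothetical stand-ins):

```rust
use tokio::sync::Mutex;

async fn write_sstable() { /* expensive async I/O */ }

// tokio's lock() is async: a task waiting on the mutex yields to the runtime
// instead of blocking its worker thread, so holding the guard across an await
// is safe. A parking_lot guard here would pin the worker thread while parked.
async fn run_compaction(compaction_mutex: &Mutex<()>) {
    let _guard = compaction_mutex.lock().await;
    write_sstable().await; // guard deliberately held across this await
} // guard dropped here
```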
| Component | Type | Purpose |
|---|---|---|
| `next_conn_id` | `AtomicU32` | Connection ID allocation |
No locks held during connection spawn.
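A minimal sketch of lock-free ID allocation (the static and function names are hypothetical; see the atomic-ordering table below for why Relaxed suffices):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

static NEXT_CONN_ID: AtomicU32 = AtomicU32::new(1);

// Only uniqueness matters: no other memory is published through this counter,
// so Relaxed ordering is enough.
fn allocate_conn_id() -> u32 {
    NEXT_CONN_ID.fetch_add(1, Ordering::Relaxed)
}
```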
| Component | Type | Purpose |
|---|---|---|
| `catalog` | `parking_lot::RwLock` | Schema access for planning |
Critical pattern: Catalog lock is acquired in a closure and dropped before any await:
```rust
let plan_result = (|| {
    let catalog_guard = self.catalog.read();
    // ... planning ...
})(); // Guard dropped here, before await
self.execute_plan(physical).await
```

| Component | Type | Purpose |
|---|---|---|
| `Catalog` | `parking_lot::RwLock` | Tables/indexes HashMap |
Always acquired alone (never nested with other locks).
| Component | Type | Purpose |
|---|---|---|
| `next_txn_id` | `AtomicU64` | Transaction ID generation |
| `active_transactions` | `parking_lot::RwLock` | Active txn tracking |
| `committed_txns` | `parking_lot::RwLock` | Visibility checks |
| `timeout_config` | `parking_lot::RwLock` | Timeout settings |
| `is_leader` | `AtomicBool` | Raft leader status |
| `undo_log.records` | `parking_lot::RwLock` | Undo records |
| `undo_log.roll_ptr_index` | `parking_lot::RwLock` | Roll pointer lookup |
| `next_roll_ptr` | `AtomicU64` | Roll pointer generation |
Multiple RwLocks but NEVER nested - each is acquired and released independently, as in the sketch below.
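A hypothetical illustration of the rule, with simplified field types (the real structs differ):

```rust
use std::collections::{HashMap, HashSet};
use parking_lot::RwLock;

struct TxnManager {
    active_transactions: RwLock<HashMap<u64, ()>>, // simplified stand-in
    committed_txns: RwLock<HashSet<u64>>,          // simplified stand-in
}

impl TxnManager {
    // Never nested: each read guard lives only inside its own block
    // expression, so no two transaction-manager locks are held at once.
    fn snapshot_txn_state(&self, txn_id: u64) -> (bool, bool) {
        let is_active = { self.active_transactions.read().contains_key(&txn_id) };
        let is_committed = { self.committed_txns.read().contains(&txn_id) };
        (is_active, is_committed)
    }
}
```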
| Component | Type | Purpose |
|---|---|---|
| `memtable.data` | `parking_lot::RwLock` | In-memory KV store |
| `memtable.size` | `AtomicUsize` | Approximate size tracking |
| `imm_memtables` | `parking_lot::RwLock` | Pending flush memtables |
| `manifest` | `parking_lot::RwLock` | SSTable metadata (readers concurrent, writers exclusive) |
| `l0_compaction_mutex` | `tokio::sync::Mutex` | Serializes L0→L1 compactions |
| `level_compaction_mutex` | `tokio::sync::Mutex` | Serializes L1+→L2+ compactions |
| `reader_cache` | `parking_lot::RwLock` | Cached open SSTable readers (avoids re-opening per scan) |
| `block_cache.inner` | `parking_lot::Mutex` | LRU cache of SSTable data blocks (Mutex: all ops mutate LRU order) |
| `closed` | `AtomicBool` | Engine shutdown flag |
Critical pattern: Atomic memtable snapshots prevent rotation races:
Both `read_all()` and `scan()` must read the active memtable and the immutable memtables atomically, to prevent entries from becoming temporarily invisible during memtable rotation:

```rust
// CRITICAL: Both must be read while holding the memtable lock
let (memtable_snapshot, imm_snapshot) = {
    let mem_guard = self.memtable.read();
    let memtable_snapshot = Arc::clone(&*mem_guard);
    let imm_snapshot: Vec<Arc<Memtable>> = self.imm_memtables.read().clone();
    (memtable_snapshot, imm_snapshot)
};
// Use snapshots outside lock for actual reads
```

Without this, a concurrent `put()` could rotate the memtable between reading the two structures, causing an entry to be missed (moved from `memtable` to `imm_memtables` between our reads).
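A hypothetical continuation showing how a point read might use the snapshots with no lock held (`Memtable::get` and the newest-to-oldest ordering are assumptions):

```rust
// Reads proceed lock-free against the snapshots: check the active memtable
// first, then the immutable memtables from newest to oldest.
let mut result = memtable_snapshot.get(&key);
if result.is_none() {
    for imm in imm_snapshot.iter().rev() {
        result = imm.get(&key);
        if result.is_some() {
            break;
        }
    }
}
```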
Critical pattern: Flush removes from `imm_memtables` only AFTER disk write:

When flushing memtables to disk, entries must remain visible in `imm_memtables` until the disk write completes. Otherwise there's a visibility gap:

```rust
// WRONG: Entries invisible during flush
let imm_mems = std::mem::take(&mut *imm_memtables); // Entries gone!
for mem in imm_mems {
    flush_to_disk(mem).await; // Not on disk yet - entries invisible
}

// CORRECT: Entries visible until disk write completes
loop {
    let mem = match imm_memtables.first() {
        Some(m) => Arc::clone(m), // Peek, don't remove
        None => break,
    };
    flush_to_disk(mem).await; // Now on disk
    imm_memtables.remove(0); // Safe to remove
}
```

Critical pattern: Compaction uses split-phase locking:
```rust
// Phase 1: Brief write-lock to prepare the job
let job = {
    let mut manifest = self.manifest.write();
    prepare_l0_compaction(&mut manifest)
    // Lock released here
};

// Phase 2: I/O without the lock
let new_file = execute_l0_compaction(&*self.factory, &job).await?;

// Phase 3: Re-acquire write-lock to finalize
{
    let mut manifest = self.manifest.write();
    finalize_l0_compaction(&mut manifest, &job, new_file)?;
}
```

L0 and level compactions use separate tokio Mutexes, allowing concurrent L0→L1 and L1+→L2+ compaction. Within each path, the `manifest` RwLock is held only briefly (prepare and finalize phases), while the expensive I/O phase runs without any lock.
| Component | Type | Purpose |
|---|---|---|
| `UringIO::ring` | `parking_lot::Mutex` | io_uring queue |
| `PosixIO::file` | `parking_lot::Mutex` | POSIX file handle |
Single lock per I/O instance; never nested.
| Component | Type | Purpose |
|---|---|---|
| `IoScheduler::files` | `parking_lot::RwLock` | Open files registry |
| `IoEngine::rings` | `parking_lot::RwLock` | Hot/cold ring allocation |
| `AdaptiveLimiter::bucket` | `parking_lot::Mutex` | Token bucket for background I/O rate limiting |
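A hedged sketch of the token-bucket idea behind `AdaptiveLimiter::bucket` (field names and refill policy are assumptions, not the actual implementation):

```rust
use std::time::Instant;

// Background I/O spends tokens; tokens refill at a fixed rate up to a cap.
// The whole struct sits behind a parking_lot::Mutex, since every operation
// mutates bucket state.
struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn try_acquire(&mut self, cost: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true // caller may issue the background I/O
        } else {
            false // caller should back off
        }
    }
}
```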
| Component | Type | Purpose |
|---|---|---|
| `log_mutex` | `tokio::sync::Mutex` | Serializes log operations |
| `cached_vote` | `parking_lot::RwLock` | Vote state cache |
| `cached_last_applied` | `parking_lot::RwLock` | Last applied log ID |
| `cached_membership` | `parking_lot::RwLock` | Cluster membership |
| `cached_last_log_id` | `parking_lot::RwLock` | Last log entry |
| `cached_last_purged` | `parking_lot::RwLock` | Last purged log ID |
| `catalog` | `parking_lot::RwLock` | Schema updates in apply() |
| `nodes` | `parking_lot::RwLock` | Peer addresses |
Critical pattern: Log mutex serializes log operations to establish memory barrier ordering:
`LsmRaftStorage` is `Clone` with shared `Arc` fields. OpenRaft's `&mut self` methods imply serialized access, but clones can call methods concurrently. The `log_mutex` ensures that when `get_log_state()` reports log entry N exists, `try_get_log_entries()` will find it.
Without the mutex, the separate locks for storage (memtable) and cache (`cached_last_log_id`) don't establish happens-before ordering across operations:
```rust
// Thread A: append()
storage.put(entry_49).await; // Acquires memtable lock
cached_last_log_id = 49;     // Acquires cache lock

// Thread B: get_log_state() then try_get_log_entries()
// Without log_mutex, Thread B could see cache=49 but miss the memtable entry
```

Critical pattern: Only applied_state() acquires log_mutex:
Of the state machine methods, only `applied_state()` acquires `log_mutex`: it must see a consistent view of cached state alongside log operations. `get_current_snapshot()` and `build_snapshot()` do NOT hold `log_mutex`; they read only non-Raft-prefixed keys (user data), and because OpenRaft serializes all state machine operations, no concurrent `apply()` can modify user data. This avoids write stalls under snapshot load.
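A self-contained sketch of the `log_mutex` convention, with simplified stand-in types (a `Vec<u64>` for the memtable, an `Option<u64>` for the cache; not the actual `LsmRaftStorage` layout):

```rust
use std::sync::Arc;
use parking_lot::RwLock;
use tokio::sync::Mutex;

// Clones share the same Arcs, mirroring LsmRaftStorage being Clone.
#[derive(Clone)]
struct LogStore {
    log_mutex: Arc<Mutex<()>>,                    // serializes log operations
    entries: Arc<RwLock<Vec<u64>>>,               // stand-in for memtable storage
    cached_last_log_id: Arc<RwLock<Option<u64>>>, // stand-in for the cache
}

impl LogStore {
    async fn append(&self, id: u64) {
        let _log = self.log_mutex.lock().await; // both updates become one step
        self.entries.write().push(id);               // storage write first
        *self.cached_last_log_id.write() = Some(id); // then the cache
    }

    async fn last_log_id(&self) -> Option<u64> {
        let _log = self.log_mutex.lock().await;
        // Under the mutex, any id the cache reports is already in `entries`.
        *self.cached_last_log_id.read()
    }
}
```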
Critical pattern: Async work (storage scans) done before catalog lock:
```rust
// Scan storage first (no lock)
let table_defs = scan_system_tables().await;

// Then acquire the lock for the update (no await after this)
let mut catalog = self.catalog.write();
// ... update catalog ...
```

- Never hold sync locks across await (except `tokio::sync::Mutex`)
- Catalog acquired alone - no other locks held when catalog is locked
- Independent subsystem locks - txn, storage, raft don't hold each other's locks
- No circular dependencies - call graph naturally orders acquisitions
| Use Case | Ordering |
|---|---|
| Transaction/row ID allocation | SeqCst |
| Connection ID allocation | Relaxed (uniqueness only, no ordering requirement) |
| Approximate size tracking | Relaxed |
| Leader flag | SeqCst |
| Closed/shutdown flags | SeqCst |
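For instance, a hedged sketch of the approximate size counter (`memtable.size`); the function names are hypothetical, and Relaxed is enough because the value only feeds a flush heuristic:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// No other memory is published through this counter, so Relaxed suffices;
// readers tolerate a slightly stale value.
fn record_write(size: &AtomicUsize, entry_bytes: usize) {
    size.fetch_add(entry_bytes, Ordering::Relaxed);
}

fn should_flush(size: &AtomicUsize, threshold: usize) -> bool {
    size.load(Ordering::Relaxed) >= threshold
}
```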
Row IDs are cluster-unique via node-ID-in-high-bits encoding (`src/storage/row_id.rs`):
- Format: `[16-bit node_id][48-bit local counter]` - supports 65536 nodes, 281T rows/node
- `NODE_ID: AtomicU64` - static, set once at startup from the Raft node ID (SeqCst)
- `GLOBAL_ROW_ID_GEN: RowIdGenerator` - single global counter (SeqCst `fetch_add`)
- `next_row_id()` - allocates one ID, encodes with node_id
- `allocate_row_id_batch(count)` - batch allocation reducing atomic contention for bulk INSERTs (sketched after the allocator below)
```rust
pub fn next_row_id() -> u64 {
    let local = GLOBAL_ROW_ID_GEN.next();
    let node_id = NODE_ID.load(Ordering::SeqCst);
    (node_id << 48) | (local & 0x0000_FFFF_FFFF_FFFF)
}
```
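A hypothetical sketch of the batch path (`next_batch` is an assumed helper; the real `RowIdGenerator` internals may differ):

```rust
use std::ops::Range;
use std::sync::atomic::Ordering;

// One fetch_add reserves `count` consecutive local IDs, so a bulk INSERT
// touches the shared atomic once instead of once per row.
pub fn allocate_row_id_batch(count: u64) -> Range<u64> {
    let start = GLOBAL_ROW_ID_GEN.next_batch(count); // assumed: fetch_add(count, SeqCst)
    let node_id = NODE_ID.load(Ordering::SeqCst);
    let encode = |local: u64| (node_id << 48) | (local & 0x0000_FFFF_FFFF_FFFF);
    encode(start)..encode(start + count)
}
```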
When adding a new lock:
- Prefer `parking_lot` for sync contexts
- Use `tokio::sync::Mutex` only if the lock must be held across await
- Document the lock in this file
- Never nest with existing locks unless absolutely necessary
- Keep critical sections short - do work outside the lock when possible
- Follow the closure pattern for catalog access to ensure the guard is dropped