RooDB Locking Architecture

This document describes the synchronization primitives and locking conventions used in RooDB.

Design Philosophy

RooDB uses a Raft-as-WAL architecture that naturally minimizes lock contention:

Writes are buffered in session memory until COMMIT
On COMMIT, changes are proposed to Raft
Storage writes only occur in apply() after Raft consensus
This means DML operations (INSERT/UPDATE/DELETE) don't hold database locks during execution

Synchronization Primitives

Lock Types Used

Type	Library	Use Case
`RwLock`	parking_lot	Sync contexts (catalog, txn state, raft caches)
`Mutex`	tokio::sync	Async contexts where lock is held across await (compaction, Raft log)
`Mutex`	parking_lot	Short sync critical sections (I/O handles)
`Atomic*`	std::sync::atomic	Counters and flags

Why parking_lot?

No lock poisoning (panics don't leave locks in invalid state)
Better performance than std::sync
Simpler API (no .unwrap() needed)

Why tokio::sync::Mutex for compaction/log?

The compaction mutexes (l0_compaction_mutex, level_compaction_mutex) and Raft log_mutex are held across await points. Using parking_lot::Mutex would block the async runtime. tokio::sync::Mutex is designed for this use case.

Module Locking Inventory

Server (`src/server/`)

Component	Type	Purpose
`next_conn_id`	AtomicU32	Connection ID allocation

No locks held during connection spawn.

Protocol (`src/protocol/roodb/`)

Component	Type	Purpose
`catalog`	parking_lot::RwLock	Schema access for planning

Critical pattern: Catalog lock is acquired in a closure and dropped before any await:

let plan_result = (|| {
    let catalog_guard = self.catalog.read();
    // ... planning ...
})();  // Guard dropped here, before await
self.execute_plan(physical).await

Catalog (`src/catalog/`)

Component	Type	Purpose
`Catalog`	parking_lot::RwLock	Tables/indexes HashMap

Always acquired alone (never nested with other locks).

Transaction (`src/txn/`)

Component	Type	Purpose
`next_txn_id`	AtomicU64	Transaction ID generation
`active_transactions`	parking_lot::RwLock	Active txn tracking
`committed_txns`	parking_lot::RwLock	Visibility checks
`timeout_config`	parking_lot::RwLock	Timeout settings
`is_leader`	AtomicBool	Raft leader status
`undo_log.records`	parking_lot::RwLock	Undo records
`undo_log.roll_ptr_index`	parking_lot::RwLock	Roll pointer lookup
`next_roll_ptr`	AtomicU64	Roll pointer generation

Multiple RwLocks but NEVER nested - each acquired/released independently.

Storage (`src/storage/lsm/`)

Component	Type	Purpose
`memtable.data`	parking_lot::RwLock	In-memory KV store
`memtable.size`	AtomicUsize	Approximate size tracking
`imm_memtables`	parking_lot::RwLock	Pending flush memtables
`manifest`	parking_lot::RwLock	SSTable metadata (readers concurrent, writers exclusive)
`l0_compaction_mutex`	tokio::sync::Mutex	Serializes L0→L1 compactions
`level_compaction_mutex`	tokio::sync::Mutex	Serializes L1+→L2+ compactions
`reader_cache`	parking_lot::RwLock	Cached open SSTable readers (avoids re-opening per scan)
`block_cache.inner`	parking_lot::Mutex	LRU cache of SSTable data blocks (Mutex: all ops mutate LRU order)
`closed`	AtomicBool	Engine shutdown flag

Critical pattern: Atomic memtable snapshots prevent rotation races:

Both read_all() and scan() must read the active memtable and immutable memtables atomically to prevent entries from becoming temporarily invisible during memtable rotation:

// CRITICAL: Both must be read while holding memtable lock
let (memtable_snapshot, imm_snapshot) = {
    let mem_guard = self.memtable.read();
    let memtable_snapshot = Arc::clone(&*mem_guard);
    let imm_snapshot: Vec<Arc<Memtable>> = self.imm_memtables.read().clone();
    (memtable_snapshot, imm_snapshot)
};
// Use snapshots outside lock for actual reads

Without this, a concurrent put() could rotate the memtable between reading the two structures, causing an entry to be missed (moved from memtable to imm_memtables between our reads).

Critical pattern: Flush removes from imm_memtables only AFTER disk write:

When flushing memtables to disk, entries must remain visible in imm_memtables until the disk write completes. Otherwise there's a visibility gap:

// WRONG: Entries invisible during flush
let imm_mems = std::mem::take(&mut *imm_memtables);  // Entries gone!
for mem in imm_mems {
    flush_to_disk(mem).await;  // Not on disk yet - entries invisible
}

// CORRECT: Entries visible until disk write completes
loop {
    let mem = imm_memtables.first().cloned();  // Peek, don't remove
    flush_to_disk(mem).await;  // Now on disk
    imm_memtables.remove(0);  // Safe to remove
}

Critical pattern: Compaction uses split-phase locking:

// Phase 1: Brief write-lock to prepare job
let job = {
    let mut manifest = self.manifest.write();
    prepare_l0_compaction(&mut manifest)
    // Lock released here
};
// Phase 2: I/O without lock
let new_file = execute_l0_compaction(&*self.factory, &job).await?;
// Phase 3: Re-acquire write-lock to finalize
{
    let mut manifest = self.manifest.write();
    finalize_l0_compaction(&mut manifest, &job, new_file)?;
}

L0 and level compactions use separate tokio Mutexes, allowing concurrent L0→L1 and L1+→L2+ compaction. Within each path, the manifest RwLock is held only briefly (prepare and finalize phases), while the expensive I/O phase runs without any lock.

I/O (`src/io/`)

Component	Type	Purpose
`UringIO::ring`	parking_lot::Mutex	io_uring queue
`PosixIO::file`	parking_lot::Mutex	POSIX file handle

Single lock per I/O instance; never nested.

I/O Scheduler (`src/io/scheduler/`)

Component	Type	Purpose
`IoScheduler::files`	parking_lot::RwLock	Open files registry
`IoEngine::rings`	parking_lot::RwLock	Hot/cold ring allocation
`AdaptiveLimiter::bucket`	parking_lot::Mutex	Token bucket for background I/O rate limiting

Raft (`src/raft/`)

Component	Type	Purpose
`log_mutex`	tokio::sync::Mutex	Serializes log operations
`cached_vote`	parking_lot::RwLock	Vote state cache
`cached_last_applied`	parking_lot::RwLock	Last applied log ID
`cached_membership`	parking_lot::RwLock	Cluster membership
`cached_last_log_id`	parking_lot::RwLock	Last log entry
`cached_last_purged`	parking_lot::RwLock	Last purged log ID
`catalog`	parking_lot::RwLock	Schema updates in apply()
`nodes`	parking_lot::RwLock	Peer addresses

Critical pattern: Log mutex serializes log operations to establish memory barrier ordering:

LsmRaftStorage is Clone with shared Arc fields. OpenRaft's &mut self methods imply serialized access, but clones can call methods concurrently. The log_mutex ensures that when get_log_state() reports log entry N exists, try_get_log_entries() will find it.

Without the mutex, the separate locks for storage (memtable) and cache (cached_last_log_id) don't establish happens-before ordering across operations:

// Thread A: append()
storage.put(entry_49).await;    // Acquires memtable lock
cached_last_log_id = 49;        // Acquires cache lock

// Thread B: get_log_state() then try_get_log_entries()
// Without log_mutex, Thread B could see cache=49 but miss the memtable entry

Critical pattern: Only applied_state() acquires log_mutex:

Of the state machine methods, only applied_state() acquires log_mutex — it must see a consistent view of cached state alongside log operations. get_current_snapshot() and build_snapshot() do NOT hold log_mutex; they read only non-Raft-prefixed keys (user data) while OpenRaft serializes all state machine operations, so no concurrent apply() can modify user data. This avoids write stalls under snapshot load.

Critical pattern: Async work (storage scans) done before catalog lock:

// Scan storage first (no lock)
let table_defs = scan_system_tables().await;

// Then acquire lock for update (no await after this)
let mut catalog = self.catalog.write();
// ... update catalog ...

Lock Ordering Rules

Never hold sync locks across await (except tokio::sync::Mutex)
Catalog acquired alone - no other locks held when catalog is locked
Independent subsystem locks - txn, storage, raft don't hold each other's locks
No circular dependencies - call graph naturally orders acquisitions

Atomic Ordering Guidelines

Use Case	Ordering
Transaction/row ID allocation	SeqCst
Connection ID allocation	Relaxed (uniqueness only, no ordering requirement)
Approximate size tracking	Relaxed
Leader flag	SeqCst
Closed/shutdown flags	SeqCst

Row ID Generation

Row IDs are cluster-unique via node-ID-in-high-bits encoding (src/storage/row_id.rs):

Format: [16-bit node_id][48-bit local counter] — supports 65536 nodes, 281T rows/node
NODE_ID: AtomicU64 — static, set once at startup from the Raft node ID (SeqCst)
GLOBAL_ROW_ID_GEN: RowIdGenerator — single global counter (SeqCst fetch_add)
next_row_id() — allocates one ID, encodes with node_id
allocate_row_id_batch(count) — batch allocation reducing atomic contention for bulk INSERTs

pub fn next_row_id() -> u64 {
    let local = GLOBAL_ROW_ID_GEN.next();
    let node_id = NODE_ID.load(Ordering::SeqCst);
    (node_id << 48) | (local & 0x0000_FFFF_FFFF_FFFF)
}

Adding New Locks

When adding a new lock:

Prefer parking_lot for sync contexts
Use tokio::sync::Mutex only if lock must be held across await
Document the lock in this file
Never nest with existing locks unless absolutely necessary
Keep critical sections short - do work outside the lock when possible
Follow the closure pattern for catalog access to ensure guard is dropped

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RooDB Locking Architecture

Design Philosophy

Synchronization Primitives

Lock Types Used

Why parking_lot?

Why tokio::sync::Mutex for compaction/log?

Module Locking Inventory

Server (`src/server/`)

Protocol (`src/protocol/roodb/`)

Catalog (`src/catalog/`)

Transaction (`src/txn/`)

Storage (`src/storage/lsm/`)

I/O (`src/io/`)

I/O Scheduler (`src/io/scheduler/`)

Raft (`src/raft/`)

Lock Ordering Rules

Atomic Ordering Guidelines

Row ID Generation

Adding New Locks

FilesExpand file tree

LOCKING.md

Latest commit

History

LOCKING.md

File metadata and controls

RooDB Locking Architecture

Design Philosophy

Synchronization Primitives

Lock Types Used

Why parking_lot?

Why tokio::sync::Mutex for compaction/log?

Module Locking Inventory

Server (src/server/)

Protocol (src/protocol/roodb/)

Catalog (src/catalog/)

Transaction (src/txn/)

Storage (src/storage/lsm/)

I/O (src/io/)

I/O Scheduler (src/io/scheduler/)

Raft (src/raft/)

Lock Ordering Rules

Atomic Ordering Guidelines

Row ID Generation

Adding New Locks

Server (`src/server/`)

Protocol (`src/protocol/roodb/`)

Catalog (`src/catalog/`)

Transaction (`src/txn/`)

Storage (`src/storage/lsm/`)

I/O (`src/io/`)

I/O Scheduler (`src/io/scheduler/`)

Raft (`src/raft/`)