contextdb is a 10-crate Rust workspace. This document covers the crate structure, subsystem design, key traits, and extension points.
contextdb-core Types, executor traits, errors, Value enum, TableMeta
│
contextdb-tx MVCC transaction manager, WriteSet, WriteSetApplicator trait
│
├── contextdb-relational Row storage, scan, insert, upsert, delete
├── contextdb-graph Adjacency index, bounded BFS, DAG enforcement
└── contextdb-vector Cosine similarity, brute-force + HNSW auto-switch
│
contextdb-parser pest grammar → AST (SQL + GRAPH_TABLE + vector extensions)
│
contextdb-planner AST → PhysicalPlan (rule-based, no cost optimizer)
│
contextdb-engine Database struct — wires all subsystems, plugin API, subscriptions
│
├── contextdb-server SyncServer + SyncClient (NATS transport, conflict resolution)
└── contextdb-cli Interactive REPL binary
Dependencies flow downward. contextdb-engine owns the Database struct and is the crate applications depend on.
contextdb-relational is the canonical source of truth: all rows live here. Graph and vector indexes are secondary structures derived from relational data.
- In-memory row store with column-typed `Value` enum
- Point lookups by primary key, range scans with filter predicates
- Upsert via `INSERT ... ON CONFLICT DO UPDATE`
- DDL metadata stored alongside rows (columns, types, constraints)
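To make the upsert path concrete, here is a minimal sketch. The `execute` entry point on `Database` is an assumption (this document does not name the query method), and the table and UUID literal are made up:

```rust
// Hypothetical: `execute` stands in for whatever query entry point Database exposes.
db.execute("CREATE TABLE tasks (id UUID PRIMARY KEY, status TEXT NOT NULL)")?;

// Re-running with the same id updates the row instead of violating the PRIMARY KEY.
db.execute(
    "INSERT INTO tasks (id, status) \
     VALUES ('7f9c24e8-3b12-4d3a-9f10-1234567890ab', 'open') \
     ON CONFLICT (id) DO UPDATE SET status = 'open'",
)?;
```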
contextdb-graph maintains a dedicated adjacency index, updated incrementally as edges are inserted and deleted; traversal is not recursive SQL over edge tables.
- Bounded BFS with configurable max depth (engine limit: 10)
- Edge-type filtering per hop
- Direction control (outgoing, incoming, bidirectional)
- DAG cycle detection on insert (BFS from target back to source)
- Deduplication: `(source_id, target_id, edge_type)` is a natural key
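The DAG check above (BFS from target back to source) is easiest to see in isolation. This is an illustrative sketch of the idea over a plain adjacency map, not the contextdb-graph implementation:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type NodeId = u64;

/// Returns true if inserting `source -> target` would create a cycle,
/// i.e. if `source` is already reachable from `target`.
fn would_create_cycle(
    adjacency: &HashMap<NodeId, Vec<NodeId>>,
    source: NodeId,
    target: NodeId,
) -> bool {
    let mut queue = VecDeque::from([target]);
    let mut seen = HashSet::from([target]);
    while let Some(node) = queue.pop_front() {
        if node == source {
            return true; // target already reaches source: adding the edge closes a loop
        }
        for &next in adjacency.get(&node).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    false
}
```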
contextdb-vector is a secondary index over relational rows with `VECTOR(n)` columns. Index identity is the full `(table, column)` pair, so one table can carry separate text, image, audio, or policy embeddings with different dimensions and quantization choices.
- Cosine similarity via the `<=>` operator
- `VECTOR(N) WITH (quantization = 'F32'|'SQ8'|'SQ4')` per column
- SQ8/SQ4 columns keep quantized live payloads and quantized HNSW payloads; f32 is reconstructed only at API/materialization boundaries
- Below ~1000 vectors: brute-force exact scan
- At/above ~1000 vectors: HNSW (via `hnsw_rs`) with 10x overfetch + exact reranking
- Pre-filtered search: WHERE clause narrows candidates before scoring
- HNSW is built lazily per index; a search against one vector column does not build sibling indexes
- OOM during HNSW build falls back to brute-force via `catch_unwind`
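To make the auto-switch concrete, here is the exact side of the search in isolation: brute-force cosine scoring with a top-k cutoff. This is an illustration of the strategy, not the crate's code; above the threshold, the same exact score reranks the roughly 10x candidate set returned by HNSW.

```rust
/// Exact cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Brute-force top-k: score every candidate, sort descending, truncate.
/// Below the ~1000-vector threshold this is the whole search path.
fn top_k(query: &[f32], rows: &[(u64, Vec<f32>)], k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = rows
        .iter()
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(k);
    scored
}
```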
contextdb-tx provides MVCC with consistent read snapshots:
- Each read sees a consistent snapshot across relational, graph, and vector state
- Writers don't block readers; readers don't block writers
- Writes are serialized through a commit mutex (one writer at a time)
- `WriteSet` accumulates all mutations within a transaction
- On commit, the `WriteSet` is applied atomically to all subsystems
- Propagation (state machine transitions cascading along edges/FKs) happens within the same `WriteSet`
The `WriteSetApplicator` trait is the boundary between compute and storage:
```rust
pub trait WriteSetApplicator: Send + Sync {
    fn apply(&self, ws: WriteSet) -> Result<()>;
    fn new_row_id(&self) -> RowId;
}
```

Two implementations:
| Implementation | Used by | Behavior |
|---|---|---|
| `CompositeStore` (in-memory) | `Database::open_memory()` | Applies to in-memory stores directly |
| `PersistentCompositeStore` | `Database::open(path)` | Applies to in-memory stores + flushes to redb |
This trait is the extension point for additional backends if required. The engine owns compute state (in-memory stores, HNSW cache). The applicator owns durability.
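A sketch of what a custom applicator could look like. Only the trait shape comes from this document; the crate's `Result` alias, a `From<u64>` conversion for `RowId`, and everything inside the methods are assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// WriteSet, RowId, Result, and WriteSetApplicator come from the contextdb crates.
struct LoggingApplicator {
    next_row_id: AtomicU64,
}

impl WriteSetApplicator for LoggingApplicator {
    fn apply(&self, ws: WriteSet) -> Result<()> {
        // A real backend would durably apply every mutation in `ws` here, atomically;
        // returning Err fails the commit as a whole.
        let _ = ws;
        Ok(())
    }

    fn new_row_id(&self) -> RowId {
        // Assumes RowId can be constructed from a u64 counter.
        RowId::from(self.next_row_id.fetch_add(1, Ordering::SeqCst))
    }
}
```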
Single-file storage via redb:
- Flush-on-commit: every committed `WriteSet` is written to redb
- On open: all data loaded from redb into memory, HNSW rebuilt
- Crash-safe: redb provides atomic transactions
- Tables: rows, DDL metadata, graph edges, vector entries, counters
- Vector entries use one composite-key table keyed by `(table, column, row_id)`
- A `metadata` table stores `format_version = "1.0.0"`; missing markers are treated as legacy stores, while unreadable markers are reported as corrupt
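The flush-on-commit step maps onto a single redb write transaction. A standalone sketch of that pattern (the table name and value encoding here are illustrative; contextdb's actual redb layout is the set of tables listed above):

```rust
use redb::{Database as RedbDatabase, TableDefinition};

// Illustrative table definition; not contextdb's real schema.
const ROWS: TableDefinition<&str, &[u8]> = TableDefinition::new("rows");

fn flush_commit(path: &str, key: &str, encoded_row: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    let db = RedbDatabase::create(path)?; // opens the file, creating it if missing
    let txn = db.begin_write()?;
    {
        let mut table = txn.open_table(ROWS)?;
        table.insert(key, encoded_row)?;
    }
    // Atomic: either the whole write transaction is visible after this point, or none of it is.
    txn.commit()?;
    Ok(())
}
```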
```rust
pub trait DatabasePlugin: Send + Sync {
    fn pre_commit(&self, ws: &WriteSet, source: CommitSource) -> Result<()>;
    fn post_commit(&self, ws: &WriteSet, source: CommitSource);
    fn on_open(&self) -> Result<()>;
    fn on_close(&self) -> Result<()>;
    fn on_ddl(&self, change: &DdlChange) -> Result<()>;
    fn on_query(&self, sql: &str) -> Result<()>;
    fn post_query(&self, sql: &str, duration: Duration, outcome: &QueryOutcome);
    fn health(&self) -> PluginHealth;
    fn describe(&self) -> serde_json::Value;
    fn on_sync_push(&self, changeset: &mut ChangeSet) -> Result<()>;
    fn on_sync_pull(&self, changeset: &mut ChangeSet) -> Result<()>;
}
```

All methods have default no-op implementations. `CorePlugin` ships as the default and handles engine-internal concerns (subscriptions, retention pruning).
Inject a custom plugin:
```rust
let plugin = Arc::new(MyPlugin::new());
let db = Database::open_with_plugin(path, plugin)?;
// or: Database::open_memory_with_plugin(plugin)?
```

`pre_commit` can reject a transaction by returning `Err`. `post_commit` fires after the write is durable. Applications like cg and Vigil use contextdb as a library and accept `Database` via dependency injection — they are database users, not plugin authors.
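For completeness, a sketch of what `MyPlugin` might look like: it overrides only `post_commit` and leaves every other hook at its default no-op. The commit-counting behavior is invented for illustration; the trait items come from the definition above:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// DatabasePlugin, WriteSet, and CommitSource come from the contextdb crates.
struct MyPlugin {
    commits: AtomicU64,
}

impl MyPlugin {
    fn new() -> Self {
        Self { commits: AtomicU64::new(0) }
    }
}

impl DatabasePlugin for MyPlugin {
    fn post_commit(&self, _ws: &WriteSet, _source: CommitSource) {
        // Runs after the write is durable, so it can never block or abort the commit.
        self.commits.fetch_add(1, Ordering::Relaxed);
    }
}
```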
Reactive commit notifications via bounded broadcast channels:
```rust
let rx: Receiver<CommitEvent> = db.subscribe();
// or with custom capacity:
let rx = db.subscribe_with_capacity(256);
```

```rust
pub struct CommitEvent {
    pub source: CommitSource, // User or Autocommit
    pub lsn: u64,
    pub tables_changed: Vec<String>,
    pub row_count: usize,
}
```

Fan-out to multiple subscribers. Dead channels are cleaned up automatically. Graceful shutdown disconnects all subscribers.
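A sketch of a subscriber loop. Whether `Receiver` is a std or async broadcast type isn't stated here, so this assumes a blocking `recv()` that returns `Err` once the channel closes, plus `Debug` impls on the event fields:

```rust
let rx = db.subscribe();

// Drain commit notifications until shutdown disconnects the subscriber.
std::thread::spawn(move || {
    while let Ok(event) = rx.recv() {
        println!(
            "lsn={} source={:?} tables={:?} rows={}",
            event.lsn, event.source, event.tables_changed, event.row_count
        );
    }
});
```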
`SET MEMORY_LIMIT`, `SHOW MEMORY_LIMIT`, and the `CONTEXTDB_MEMORY_LIMIT` / `--memory-limit` startup option all feed the same global memory accountant. Vector operations attribute allocations with tags such as `vector_insert@evidence.vector_text` and `build_hnsw@evidence.vector_vision`, so operators can identify the offending index from error messages.
On a 2GB Jetson-class device, prefer SQ8 for high-dimensional evidence:
```sql
SET MEMORY_LIMIT '1536M';
CREATE TABLE evidence (
    id UUID PRIMARY KEY,
    vector_text VECTOR(768) WITH (quantization = 'SQ8'),
    vector_vision VECTOR(512) WITH (quantization = 'SQ8')
);
```

`SHOW VECTOR_INDEXES` gives structured per-index counts and live vector payload byte totals, including any materialized HNSW payload estimate; use it instead of parsing memory operation tags.
The wire protocol is currently PROTOCOL_VERSION = 2. The server prints the
supported protocol version in contextdb-server --version and logs it at
startup; mismatched envelopes are rejected instead of being partially applied.
contextdb uses a client-server sync model where every instance — client or server — runs the same database engine. There is no "replica" or "read-only copy." Each database is a full read-write contextdb that works independently offline.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ contextdb │ │ contextdb │ │ contextdb │
│ (laptop) │ │ (service) │ │ (device) │
│ SyncClient │ │ SyncClient │ │ SyncClient │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ ws:// │ ws:// │ ws://
│ │ │
└────────┬────────┴────────┬────────┘
│ NATS (WebSocket :9222) │
└────────┬────────────────┘
│
┌────────┴───────┐
│ contextdb │
│ (server) │
│ SyncServer │
└────────────────┘
Each client database accumulates knowledge independently — decisions, observations, corrections, embeddings. On sync, changesets flow bidirectionally: local changes push up, server changes pull down. This is collaborative sync, not WAL replication — logical changesets with per-table conflict resolution, so knowledge learned by any participant propagates to all others.
WebSocket transport means clients behind NAT (laptops, mobile, browser) connect outbound to the NATS server — no port forwarding, no VPN, no network configuration.
The server is just a contextdb instance running SyncServer. Self-host it alongside your own NATS, or point your client databases at a hosted server — the client binary and database files don't change, only the NATS connection string. Managed hosting is coming soon — join the waitlist.
- `SyncClient` — runs on each participant. Pushes local changes to the server, pulls remote changes.
- `SyncServer` — runs on the central server. Receives pushes, serves pulls.

Both communicate via NATS subjects: `sync.{tenant_id}.push` / `sync.{tenant_id}.pull`.
- Every committed row is assigned an LSN (Log Sequence Number)
- `SyncClient` tracks push and pull watermarks (the LSN of the last synced change)
- On push: sends all changes since the push watermark
- On pull: requests all changes since the pull watermark
- After restart: the `full_state_snapshot` fallback rebuilds from current state (the ephemeral change log is lost)
Per-table configurable policies:
- `LatestWins` — most recent write by logical timestamp (default)
- `ServerWins` — server version takes precedence
- `EdgeWins` — edge version takes precedence
- `InsertIfNotExists` — insert if absent, skip otherwise
NATS with automatic chunking for payloads exceeding the 1MB NATS message limit. Vector byte sizes are accounted for in batch estimation. WebSocket transport for edge clients (port 9222), native protocol for server-to-server (port 4222).
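The chunking rule is simple to picture: any payload over the 1MB limit is split into numbered parts and reassembled on the receiving side. An illustrative split (not the actual sync code):

```rust
const MAX_NATS_PAYLOAD: usize = 1024 * 1024; // the 1MB NATS message limit

/// Split a serialized changeset into (sequence, bytes) chunks that each fit in one message.
fn chunk_changeset(payload: &[u8]) -> Vec<(u32, Vec<u8>)> {
    payload
        .chunks(MAX_NATS_PAYLOAD)
        .enumerate()
        .map(|(seq, part)| (seq as u32, part.to_vec()))
        .collect()
}
```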
Schema changes (CREATE TABLE, ALTER TABLE, DROP TABLE) are synced alongside data. Constraints (PRIMARY KEY, NOT NULL, UNIQUE, STATE MACHINE, DAG) are preserved across sync.
SQL string
→ contextdb-parser (pest grammar → AST)
→ contextdb-planner (AST → PhysicalPlan)
→ contextdb-engine (dispatches to executors)
→ contextdb-relational (row operations)
→ contextdb-graph (BFS traversal)
→ contextdb-vector (ANN search)
→ QueryResult { columns, rows, rows_affected }
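End to end, application code sees only the `QueryResult` at the bottom of that pipeline. A sketch, assuming a `query` entry point on `Database` (the method name is not given in this document) and that `columns` and `rows` are `Vec`s:

```rust
// Hypothetical entry point; only the QueryResult field names are taken from this document.
let result = db.query("SELECT id, status FROM tasks WHERE status = 'open'")?;
println!(
    "{} columns, {} rows returned, {} rows affected",
    result.columns.len(),
    result.rows.len(),
    result.rows_affected
);
```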
The planner is rule-based (no cost optimizer). Key planning decisions:
- `GRAPH_TABLE` in FROM → `PhysicalPlan::GraphBfs`
- `ORDER BY ... <=> ...` → `PhysicalPlan::VectorSearch` (with candidate restriction from WHERE)
- CTE containing `GRAPH_TABLE` → recursive plan composition
- `IN (SELECT ...)` → subquery evaluation
MemoryAccountant tracks memory usage against a configurable budget. Set via --memory-limit in the CLI or MemoryAccountant::with_budget(bytes) in the API. All vector and row allocations are accounted. Budget exceeded → operations return MemoryBudgetExceeded.
File-backed databases also support a persisted disk budget:
- startup ceiling/default via `--disk-limit` or `CONTEXTDB_DISK_LIMIT`
- runtime control via `SET DISK_LIMIT` / `SHOW DISK_LIMIT`
- persisted live config in the redb file so reopen preserves the limit
Disk enforcement happens in the engine write paths before INSERT, UPDATE, and sync-apply work begins. Once the on-disk file is at or above the configured limit, further file-backed writes fail with DiskBudgetExceeded. In-memory databases accept the SQL but ignore disk budgeting because there is no backing file to measure.
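In practice this is driven from the same SQL surface; a short sketch, again assuming an `execute`-style entry point and that size literals follow the same format as the memory-limit example above:

```rust
// Only meaningful for a file-backed database; an in-memory database accepts but ignores it.
db.execute("SET DISK_LIMIT '4096M'")?;
db.execute("SHOW DISK_LIMIT")?;
// Once the redb file is at or over the limit, further file-backed writes fail with DiskBudgetExceeded.
```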