A distributed, transactional SQL + KV database engine, written from scratch in Rust.
No RocksDB wrapper. No SQLite fork. Every byte is ours.
Most companies run 5+ separate services for their data layer:
| Service | What they deploy | What OmniKV gives you |
|---|---|---|
| Database | PostgreSQL / MySQL | ✅ Full SQL engine with JOINs, aggregates, window functions |
| KV Store | Redis / etcd | ✅ Sub-millisecond KV with TTL, range scans, MVCC |
| Consensus | etcd / ZooKeeper | ✅ Built-in Raft consensus (58 cluster tests) |
| API Server | Express / Flask | ✅ REST API + QUIC + TCP, built in |
| Auth + Metrics | Auth0 + Prometheus | ✅ JWT auth + Prometheus /metrics, built in |
OmniKV collapses all of this into a single `cargo build` binary.
```bash
# Start OmniKV (4 protocols start automatically)
cargo run --release

# Connect with psql (yes, your regular PostgreSQL client)
psql -h localhost -p 5433
```

```sql
CREATE TABLE users (id INT, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@dev.io');
INSERT INTO users VALUES (2, 'Bob', 'bob@dev.io');

CREATE TABLE orders (id INT, user_id INT, amount FLOAT);
INSERT INTO orders VALUES (101, 1, 299.99);
INSERT INTO orders VALUES (102, 2, 149.50);

-- Cost-based optimizer picks Hash JOIN, smaller table as build side
EXPLAIN SELECT u.name, SUM(o.amount)
FROM users u INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.name;
```
```
┌────────────────────────────────────────────────────────────────────┐
│                            CLIENT LAYER                            │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐   ┌────────────────┐  │
│  │  PgWire  │   │ REST API │   │   QUIC    │   │  TCP Command   │  │
│  │    v3    │   │  HTTP/2  │   │  HTTP/3   │   │   Interface    │  │
│  │          │   │  + TLS   │   │  (Quinn)  │   │                │  │
│  └────┬─────┘   └────┬─────┘   └─────┬─────┘   └───────┬────────┘  │
├───────┴──────────────┴───────────────┴─────────────────┴───────────┤
│                             SQL ENGINE                             │
│  ┌──────────────┐   ┌───────────────────┐   ┌─────────────────┐    │
│  │  SQL Parser  │ → │    Cost-Based     │ → │     Volcano     │    │
│  │  (recursive  │   │     Optimizer     │   │    Iterator     │    │
│  │   descent)   │   │   (histograms,    │   │    Executor     │    │
│  │              │   │  predicate push,  │   │  (O(1) filter,  │    │
│  │              │   │   JOIN reorder,   │   │   hash join,    │    │
│  │              │   │   index select)   │   │   streaming)    │    │
│  └──────────────┘   └───────────────────┘   └─────────────────┘    │
├────────────────────────────────────────────────────────────────────┤
│                         TRANSACTION ENGINE                         │
│   Serializable Snapshot Isolation (SSI) · Savepoints               │
│   Write-write conflict detection · 2PC distributed txn             │
│   Transaction timeouts · RW-dependency tracking                    │
├────────────────────────────────────────────────────────────────────┤
│                           STORAGE ENGINE                           │
│   WAL (CRC32) → 16-shard SkipMap Memtable → SSTable (sorted)       │
│   Heap Store (CRC32/entry) · Bloom Filters · Block Cache (LRU)     │
│   ArcSwap Topology (zero-stall swap) · LZ4 Compression             │
│   L0 → L1 → L2 Tiered Compaction · MVCC Snapshots                  │
├────────────────────────────────────────────────────────────────────┤
│                           RAFT CONSENSUS                           │
│   OpenRaft 0.9.24 · Leader Election · Log Replication              │
│   Atomic Snapshot Install · Membership Changes · Rolling Upgrades  │
│   Network Partitions · 2PC Cross-Shard Replication                 │
└────────────────────────────────────────────────────────────────────┘
```
The storage engine is a full LSM-tree implementation, not a wrapper around RocksDB or LevelDB.
| Component | How it works | Why it matters |
|---|---|---|
| 3-Phase Pipelined Write | Phase 1: LZ4 compress + CRC32 (no lock). Phase 2: sequence + offset reservation (µs mutex). Phase 3: heap pwrite + WAL append + memtable insert. | Multiple concurrent batches overlap the CPU-intensive compression; they only serialize for sequence numbering, for microseconds. |
| 16-Shard SkipMap Memtable | `crossbeam_skiplist::SkipMap` × 16 shards, FNV-hashed. Lock-free concurrent inserts. | Eliminates write contention: 16 threads write to 16 independent shards simultaneously. |
| MVCC via Atomic Sequence Numbers | Every write gets a monotonic seq. Reads specify a `read_seq` and only see versions ≤ that seq. | True snapshot isolation. Readers never block writers. No read locks anywhere. |
| ArcSwap Topology | All read-visible state (memtable, SSTables, bloom filters, manifest) lives in a single `Arc<StorageRoots>`, swapped atomically via `arc_swap::ArcSwap`. | Compaction and snapshot install publish new topology in one atomic pointer swap. Readers holding the old Arc keep reading stale-but-consistent data. Zero stalls. |
| CRC32 on Every Heap Entry | `crc32fast::Hasher` computed at write time, verified at read time. | Silent data corruption doesn't go unnoticed: bit rot, torn writes, and disk errors are detected before data reaches the application. |
| WAL with Commit Markers | Each batch writes its records plus a `__COMMIT_MARKER__` record. Recovery replays only complete batches. | Partial batches from crashes are silently discarded. Proven by `test_torn_wal_record_is_rejected`. |
| Bloom Filters (per-SSTable) | Optimal bit count: -n·ln(p) / ln²(2). FNV double-hashing: h1 + i·h2 (see the sketch after this table). | Avoids reading SSTables that definitely don't contain the key. False-positive rate: 1%. |
| Positional I/O | Unix: `pwrite`/`pread`. Windows: `seek_write`/`seek_read` with a short-read loop. | Concurrent batches write to non-overlapping heap regions without seeking. Cross-platform. |
| LZ4 Compression | Values ≥ 64 bytes are compressed with `lz4_flex::compress_prepend_size`; a flag bit in the length field marks them. | Transparent compression. Small values (< 64 bytes) are stored raw to avoid overhead. |
| Block Cache | `moka::sync::Cache` (concurrent LRU, 100K entries), keyed by heap offset. | Hot data is served from memory: no heap I/O for repeated reads of the same key. |
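The bloom-filter row compresses two formulas into one cell. Here is a minimal, self-contained sketch of how they fit together; the types and names are illustrative, not OmniKV's actual internals:

```rust
// Illustrative sketch of the per-SSTable Bloom filter math from the table:
// optimal sizing, -n·ln(p)/ln²(2), plus FNV double hashing, h1 + i·h2.

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a over `data`, seeded so we can derive two independent hashes.
fn fnv1a(data: &[u8], seed: u64) -> u64 {
    data.iter().fold(seed, |h, &b| (h ^ b as u64).wrapping_mul(FNV_PRIME))
}

struct Bloom {
    bits: Vec<bool>, // a real implementation would pack these into u64 words
    k: usize,        // number of probes per key
}

impl Bloom {
    /// Size for `n` expected keys at false-positive rate `p`:
    /// m = -n·ln(p) / ln²(2) bits, k = (m/n)·ln(2) hash functions.
    fn new(n: usize, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as usize;
        let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as usize;
        Bloom { bits: vec![false; m.max(1)], k }
    }

    /// Double hashing: the i-th probe index is (h1 + i·h2) mod m.
    fn probes(&self, key: &[u8]) -> impl Iterator<Item = usize> {
        let (m, k) = (self.bits.len() as u64, self.k as u64);
        let h1 = fnv1a(key, FNV_OFFSET);
        let h2 = fnv1a(key, FNV_OFFSET ^ m) | 1; // force odd so probes spread
        (0..k).map(move |i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize)
    }

    fn insert(&mut self, key: &[u8]) {
        let idx: Vec<usize> = self.probes(key).collect();
        idx.into_iter().for_each(|i| self.bits[i] = true);
    }

    /// `false` means "definitely absent": safe to skip the SSTable entirely.
    fn may_contain(&self, key: &[u8]) -> bool {
        self.probes(key).all(|i| self.bits[i])
    }
}
```

At p = 1% this works out to roughly 9.6 bits and 7 probes per key, which is why a 1% false-positive rate is the usual default.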
| Component | Implementation | Competitive comparison |
|---|---|---|
| Parser | Hand-written recursive descent. Handles SELECT/INSERT/UPDATE/DELETE/CREATE/DROP/EXPLAIN/JOIN/GROUP BY/ORDER BY/LIMIT/LIKE/IN/IS NULL/window functions. | Same approach as PostgreSQL's parser. No parser-generator dependency. |
| Cost Model | SEQ_SCAN=1.0/row, INDEX_SCAN=0.25/row, PK_LOOKUP=1.0, HASH_BUILD=2.0/row, HASH_PROBE=0.1/row, SORT=N·log₂N·2.0. | Real cost constants, not arbitrary. Comparable to PostgreSQL's cost model (but simpler). |
| Statistics | `gather_stats()` scans actual table data. Per-column histograms with NDV (number of distinct values) and null fraction. Selectivity: equality = 1/NDV, range = 1/3, AND = multiplicative, OR = inclusion-exclusion. | More detailed than SQLite's optional ANALYZE statistics. Simpler than PostgreSQL (which has multi-column stats). |
| Predicate Pushdown | `pushdown_join_predicates()` splits AND-conjuncts and routes single-table predicates to the correct join side. | Standard optimization. Reduces rows entering the join operator. |
| JOIN Reordering | Smaller table always becomes the hash-build side. Cost = build_cost + probe_cost + build_rows·2.0 + probe_rows·0.1 (see the sketch after this table). | Correct for 2-table joins. A multi-table DP planner would be needed for complex queries. |
| Volcano Model | `RowIterator` trait with `next_row()`. Operators: SeqScan, PkLookup, Filter (O(1)), Project (O(1)), Limit (O(1)), Sort (O(N log N)), HashJoin (O(build)), Aggregate (O(N)). | Same architecture as PostgreSQL, Oracle, SQL Server. Pull-based streaming. |
| EXPLAIN ANALYZE | Collects actual_rows vs estimated_rows plus wall-clock actual_time_ms per operator. | Same output format as PostgreSQL's EXPLAIN ANALYZE. |
| Plan Cache | LRU cache keyed by query string; `invalidate()` on DDL changes. | Avoids re-optimizing identical queries. Similar to PostgreSQL's plan cache. |
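As a worked example of the join-reordering rule, here is a hedged sketch of the 2-table build-side decision. The constants are quoted from the table above; the function names and surrounding structure are illustrative, not OmniKV's optimizer:

```rust
// Cost constants quoted from the cost-model row above.
const HASH_BUILD: f64 = 2.0; // per build-side row
const HASH_PROBE: f64 = 0.1; // per probe-side row

/// Cost = build_cost + probe_cost + build_rows·2.0 + probe_rows·0.1
fn hash_join_cost(build_cost: f64, build_rows: f64, probe_cost: f64, probe_rows: f64) -> f64 {
    build_cost + probe_cost + build_rows * HASH_BUILD + probe_rows * HASH_PROBE
}

fn main() {
    // 1,000 users joined with 1,000,000 orders, both seq-scanned at 1.0/row.
    let (users, orders) = (1_000.0, 1_000_000.0);
    let users_build = hash_join_cost(users, users, orders, orders);
    let orders_build = hash_join_cost(orders, orders, users, users);
    // 1,103,000 vs 3,001,100: building the hash table on the smaller input
    // wins by ~3x, which is why the planner always hashes the smaller table.
    assert!(users_build < orders_build);
    println!("users build: {users_build}, orders build: {orders_build}");
}
```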
| Feature | Implementation | Comparison |
|---|---|---|
| SSI (Serializable Snapshot Isolation) | Read snapshot at BEGIN. Write-set buffered until COMMIT. At commit: acquire global lock → check whether any concurrent transaction wrote to our write-set keys after our snapshot → abort on conflict → commit atomically via WriteBatch (see the sketch after this table). | Same isolation level as PostgreSQL's SERIALIZABLE. Stronger than MySQL's default (REPEATABLE READ). |
| Savepoints | A Savepoint struct captures a write_set + read_set snapshot. ROLLBACK TO restores to that point without aborting the entire transaction. | Same semantics as PostgreSQL's SAVEPOINT. |
| Conflict Detection | Write-write conflict: checks whether a key was modified between our read_seq and the current global_seq. Read-write dependency tracking with bounded memory. | Correct SSI implementation. Detects dangerous structures (rw-antidependency cycles). |
| Timeouts | Configurable per-transaction timeout. Long-running transactions are automatically aborted. | Prevents resource leaks from abandoned transactions. |
| 2PC (Distributed) | Coordinator WAL persistence. Prepare → Vote → Commit/Abort across shards. Cross-shard Raft replication. | Proven by `test_2pc_cross_shard_with_raft_replication`. |
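The commit-time check in the SSI row fits in a few lines. Below is a hedged sketch of it under one global lock; the types are illustrative, not OmniKV's actual engine, and the read-write dependency tracking mentioned above is omitted for brevity:

```rust
// Illustrative sketch of commit-time write-write conflict detection.
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

struct Txn {
    read_seq: u64,              // snapshot sequence captured at BEGIN
    write_set: HashSet<String>, // writes buffered until COMMIT
}

#[derive(Default)]
struct CommitIndex {
    // Global lock held only for the validate-and-publish step.
    inner: Mutex<(u64, HashMap<String, u64>)>, // (global_seq, key -> last commit seq)
}

impl CommitIndex {
    fn try_commit(&self, txn: &Txn) -> Result<u64, &'static str> {
        let mut g = self.inner.lock().unwrap();
        // Abort if any of our keys was committed after our snapshot...
        if txn.write_set.iter().any(|k| g.1.get(k).copied().unwrap_or(0) > txn.read_seq) {
            return Err("write-write conflict: key committed after our snapshot");
        }
        // ...otherwise publish all writes atomically under one new sequence.
        g.0 += 1;
        let commit_seq = g.0;
        for k in &txn.write_set {
            g.1.insert(k.clone(), commit_seq);
        }
        Ok(commit_seq)
    }
}
```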
| Test Category | Tests | What's proven |
|---|---|---|
| Core Replication | `test_3_node_log_replication`, `test_state_machine_apply` | All entries identical across 3 nodes. State machine produces the same result on all replicas. |
| Leader Election | `test_leader_election_simulation`, `test_leader_election_under_load` | Exactly one leader emerges. Election works under concurrent write load. |
| Crash Recovery | `test_crash_recovery_persistence`, `test_log_consistency_after_crash` | Committed data survives node restart. Log is consistent after a crash. |
| Network Partitions | `test_symmetric_partition_majority_progresses`, `test_asymmetric_partition_isolated_node`, `test_cascading_partitions_no_data_loss` | Majority side continues. An isolated node doesn't corrupt the cluster. No data loss across cascading partitions. |
| Membership | `test_membership_add_node_catches_up`, `test_membership_remove_node`, `test_membership_scale_out_3_to_5` | Live scaling from 3 to 5 nodes. An added node catches up. A removed node leaves cleanly. |
| Rolling Upgrades | `test_rolling_restart_no_data_loss`, `test_rolling_upgrade_continuous_writes`, `test_rolling_upgrade_read_availability` | Zero data loss during rolling restarts. Reads stay available throughout an upgrade. |
| Distributed Transactions | `test_2pc_happy_path_commit`, `test_2pc_cross_shard_with_raft_replication` | 2PC commit/abort works correctly. Cross-shard transactions are replicated via Raft. |
| SSI Integration | `test_ssi_write_write_conflict`, `test_ssi_read_write_conflict`, `test_ssi_dangerous_structure_chain` | Serializable isolation holds across replicated nodes. |
| MVCC Consistency | `test_mvcc_logical_clock_ordering`, `test_ttl_consistency_across_replicas` | Logical clocks maintain ordering across nodes. TTL is consistent across replicas. |
```
$ cargo test
test result: ok. 290 passed; 0 failed
```
| Suite | Tests | Focus |
|---|---|---|
| Storage Engine | 76 | WAL, crash recovery, CRC32, compaction |
| Raft Cluster | 58 | Replication, elections, partitions, upgrades |
| Operations | 25 | Health, metrics, backup, admin |
| Ops Maturity | 24 | Config, diagnostics, shutdown |
| Storage Perf | 21 | Throughput across storage paths |
| SQL Layer | 18 | Parser, JOINs, aggregates |
| Storage Correctness | 14 | Crash safety, MVCC, atomicity |
| Query + Optimizer | 16 | Planning, execution, cost estimation |
| Concurrent Stress | 6 | Multi-threaded contention |
| Anomaly Demos | 4 | Isolation level proofs |
| Other | 28 | Benchmarks, debugging |
```bash
git clone https://github.com/SBALAVIGNESH123/OmniKV.git
cd omni_engine
cargo build --release
cargo run --release
```

```
┌──────────────────────────────────────────────────────┐
│  ⚡ OmniKV v0.1.0                                     │
│  Embeddable · Distributed · Transactional KV         │
├──────────────────────────────────────────────────────┤
│  HTTP/1.1 + HTTP/2 (TLS)   →  0.0.0.0:8443           │
│  QUIC/HTTP3 (binary)       →  0.0.0.0:4433           │
│  PostgreSQL Wire Protocol  →  0.0.0.0:5433           │
│  TCP Command Interface     →  0.0.0.0:8080           │
└──────────────────────────────────────────────────────┘
```
```bash
psql -h localhost -p 5433
```

```sql
CREATE TABLE users (id INTEGER, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
SELECT * FROM users WHERE id = 1;
```

```bash
curl -k https://localhost:8443/health
curl -k -X POST https://localhost:8443/kv -d '{"key":"hello","value":"world"}'
curl -k https://localhost:8443/kv/hello
```

```rust
use omni_engine::{OmniKV, WriteBatch};

let db = OmniKV::open("manifest.json", "data.wal").unwrap();

let mut batch = WriteBatch::new();
batch.set("user:1", r#"{"name":"Alice"}"#.into()).unwrap();
db.commit_batch(&batch).unwrap();

let val = db.find("user:1", db.get_seq()).unwrap();
```
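Because `find` takes an explicit sequence number, a snapshot read is just a remembered `get_seq()` value. A hedged continuation of the example above, using only the calls it already shows and assuming MVCC reads see versions ≤ the given sequence (as described in the storage table):

```rust
// Snapshot reads: a read at a captured sequence keeps seeing the old
// version even after a later commit (continues with `db` from above).
let snapshot = db.get_seq(); // remember the current sequence

let mut batch = WriteBatch::new();
batch.set("user:1", r#"{"name":"Alice v2"}"#.into()).unwrap();
db.commit_batch(&batch).unwrap();

let old = db.find("user:1", snapshot).unwrap();     // pre-update value
let new = db.find("user:1", db.get_seq()).unwrap(); // sees "Alice v2"
```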
```
src/                                               ~12,500 lines
├── lib.rs                 Storage engine core          2,038
├── sql.rs                 SQL parser                     869
├── optimizer.rs           Cost-based optimizer           840
├── sql_exec.rs            SQL execution                  729
├── prepared.rs            Prepared statements            662
├── transaction.rs         SSI transaction engine         648
├── dist_txn.rs            2PC distributed txn            592
├── volcano.rs             Volcano iterator executor      582
├── secondary_index.rs     Secondary index engine         580
├── raft_storage.rs        Raft storage trait             536
├── schema.rs              DDL engine                     471
├── chaos.rs               Chaos testing framework        468
├── plan_exec.rs           Plan-driven executor           463
├── pgwire.rs              PostgreSQL wire protocol       430
├── api.rs                 REST API (Axum)                300
└── ...                    hardening, QUIC, WAL, auth

tests/                                              ~7,800 lines
├── raft_cluster.rs        Raft cluster tests           3,660  (58 tests)
├── storage_tests.rs       Storage engine               1,547  (76 tests)
├── storage_perf.rs        Performance                    429  (21 tests)
├── operations.rs          Operational                    446  (25 tests)
├── storage_correctness.rs Crash safety                   389  (14 tests)
├── sql_layer.rs           SQL integration                282  (18 tests)
└── ...                    stress, benchmarks
```

Total: ~20,000 lines of Rust · 290 tests
| Stage | Status |
|---|---|
| ✅ Storage Correctness | ████████████ 100% |
| ✅ Internal Storage APIs | ████████████ 100% |
| ✅ Raft Hardening | ████████████ 100% |
| ✅ Transaction Engine | ████████████ 100% |
| ✅ Query Engine | ████████████ 100% |
| ✅ Operational Maturity | ████████████ 100% |
| 🚧 Ecosystem | ████████░░░░ 65% |
```bash
docker build -t omnikv .
docker run -p 8443:8443 -p 5433:5433 -p 4433:4433/udp omnikv
docker compose up   # 3-node cluster
```

License: MIT
Built from scratch in Rust. Every byte is ours.
By Balavignesh
