Skip to content

SBALAVIGNESH123/OmniKV

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

25 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

OmniKV Logo

OmniKV

The database engine that replaces 5 services with 1 binary.

A distributed, transactional SQL + KV database engine β€” written from scratch in Rust.
No RocksDB wrapper. No SQLite fork. Every byte is ours.

Rust Tests Lines License PgWire

Stars


🀯 Why OmniKV?

Most companies run 5+ separate services for their data layer:

Service What they deploy What OmniKV gives you
Database PostgreSQL / MySQL βœ… Full SQL engine with JOINs, aggregates, window functions
KV Store Redis / etcd βœ… Sub-millisecond KV with TTL, range scans, MVCC
Consensus etcd / ZooKeeper βœ… Built-in Raft consensus β€” 58 cluster tests
API Server Express / Flask βœ… REST API + QUIC + TCP β€” built in
Auth + Metrics Auth0 + Prometheus βœ… JWT auth + Prometheus /metrics β€” built in

OmniKV collapses all of this into a single cargo build binary.


⚑ 30-Second Demo

# Start OmniKV (4 protocols start automatically)
cargo run --release

# Connect with psql β€” yes, your regular PostgreSQL client
psql -h localhost -p 5433

CREATE TABLE users (id INT, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@dev.io');
INSERT INTO users VALUES (2, 'Bob', 'bob@dev.io');

CREATE TABLE orders (id INT, user_id INT, amount FLOAT);
INSERT INTO orders VALUES (101, 1, 299.99);
INSERT INTO orders VALUES (102, 2, 149.50);

-- Cost-based optimizer picks Hash JOIN, smaller table as build side
EXPLAIN SELECT u.name, SUM(o.amount)
FROM users u INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.name;

πŸ—οΈ Architecture β€” Everything is Custom

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        CLIENT LAYER                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ PgWire   β”‚  β”‚ REST API β”‚  β”‚   QUIC    β”‚  β”‚ TCP Command  β”‚   β”‚
β”‚  β”‚ v3       β”‚  β”‚ HTTP/2   β”‚  β”‚  HTTP/3   β”‚  β”‚  Interface   β”‚   β”‚
β”‚  β”‚          β”‚  β”‚ + TLS    β”‚  β”‚  (Quinn)  β”‚  β”‚              β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                        SQL ENGINE                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ SQL Parser   β”‚β†’ β”‚ Cost-Based        β”‚β†’ β”‚ Volcano        β”‚   β”‚
β”‚  β”‚ (recursive   β”‚  β”‚ Optimizer         β”‚  β”‚ Iterator       β”‚   β”‚
β”‚  β”‚  descent)    β”‚  β”‚ (histograms,      β”‚  β”‚ Executor       β”‚   β”‚
β”‚  β”‚              β”‚  β”‚  predicate push,  β”‚  β”‚ (O(1) filter,  β”‚   β”‚
β”‚  β”‚              β”‚  β”‚  JOIN reorder,    β”‚  β”‚  hash join,    β”‚   β”‚
β”‚  β”‚              β”‚  β”‚  index select)    β”‚  β”‚  streaming)    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                   TRANSACTION ENGINE                             β”‚
β”‚   Serializable Snapshot Isolation (SSI) Β· Savepoints             β”‚
β”‚   Write-write conflict detection Β· 2PC distributed txn           β”‚
β”‚   Transaction timeouts Β· RW-dependency tracking                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                    STORAGE ENGINE                                β”‚
β”‚  WAL (CRC32) β†’ 16-shard SkipMap Memtable β†’ SSTable (sorted)    β”‚
β”‚  Heap Store (CRC32/entry) Β· Bloom Filters Β· Block Cache (LRU)  β”‚
β”‚  ArcSwap Topology (zero-stall swap) Β· LZ4 Compression           β”‚
β”‚  L0 β†’ L1 β†’ L2 Tiered Compaction Β· MVCC Snapshots               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                    RAFT CONSENSUS                                β”‚
β”‚  OpenRaft 0.9.24 Β· Leader Election Β· Log Replication             β”‚
β”‚  Atomic Snapshot Install Β· Membership Changes Β· Rolling Upgrades β”‚
β”‚  Network Partitions Β· 2PC Cross-Shard Replication                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”¬ Deep Feature Analysis

Storage Engine β€” lib.rs (2,038 lines)

The storage engine is a full LSM-tree implementation, not a wrapper around RocksDB or LevelDB.

Component How it works Why it matters
3-Phase Pipelined Write Phase 1: LZ4 compress + CRC32 (no lock). Phase 2: sequence + offset reservation (Β΅s mutex). Phase 3: heap pwrite + WAL append + memtable insert. Multiple concurrent batches overlap CPU-intensive compression. Only serializes for sequence numbering β€” microseconds.
16-Shard SkipMap Memtable crossbeam-skiplist::SkipMap Γ— 16 shards, FNV-hashed. Lock-free concurrent inserts. Eliminates write contention. 16 threads write to 16 independent shards simultaneously.
MVCC via Atomic Sequence Numbers Every write gets a monotonic seq. Reads specify read_seq β€” only see versions ≀ that seq. True snapshot isolation. Readers never block writers. No read locks anywhere.
ArcSwap Topology All read-visible state (memtable, SSTables, bloom filters, manifest) lives in a single Arc<StorageRoots>. Swapped atomically via arc_swap::ArcSwap. Compaction and snapshot install publish new topology in one atomic pointer swap. Readers holding old Arc continue reading stale-but-consistent data. Zero stalls.
CRC32 on Every Heap Entry crc32fast::Hasher computed at write time, verified at read time. Silent data corruption is impossible. Bit-rot, torn writes, and disk errors are detected before data reaches the application.
WAL with Commit Markers Each batch writes records + a __COMMIT_MARKER__ record. Recovery replays only complete batches. Partial batches from crashes are silently discarded. Proven by test_torn_wal_record_is_rejected.
Bloom Filters (per-SSTable) Optimal bit count: -nΒ·ln(p) / lnΒ²(2). FNV double-hashing: h1 + iΒ·h2. Avoids reading SSTables that definitely don't contain the key. False positive rate: 1%.
Positional I/O Unix: pwrite/pread. Windows: seek_write/seek_read with short-read loop. Concurrent batches write to non-overlapping heap regions without seeking. Cross-platform.
LZ4 Compression Values β‰₯ 64 bytes compressed with lz4_flex::compress_prepend_size. Flag bit in length field. Transparent compression. Small values (< 64 bytes) stored raw to avoid overhead.
Block Cache moka::sync::Cache (concurrent LRU, 100K entries). Keyed by heap offset. Hot data served from memory. No heap I/O for repeated reads of the same key.

SQL Engine β€” Parser (869) + Optimizer (840) + Volcano (582) + Executor (463)

Component Implementation Competitive comparison
Parser Hand-written recursive descent. Handles SELECT/INSERT/UPDATE/DELETE/CREATE/DROP/EXPLAIN/JOIN/GROUP BY/ORDER BY/LIMIT/LIKE/IN/IS NULL/window functions. Same approach as PostgreSQL's parser. No parser generator dependency.
Cost Model SEQ_SCAN=1.0/row, INDEX_SCAN=0.25/row, PK_LOOKUP=1.0, HASH_BUILD=2.0/row, HASH_PROBE=0.1/row, SORT=NΒ·logβ‚‚NΒ·2.0. Real cost constants, not arbitrary. Comparable to PostgreSQL's cost model (but simpler).
Statistics gather_stats() scans actual table data. Per-column histograms with NDV (number of distinct values) and null fraction. Selectivity: equality=1/NDV, range=1/3, AND=multiplicative, OR=inclusion-exclusion. More sophisticated than SQLite (which has no statistics). Simpler than PostgreSQL (which has multi-column stats).
Predicate Pushdown pushdown_join_predicates() splits AND-conjuncts and routes single-table predicates to the correct join side. Standard optimization. Reduces rows entering the join operator.
JOIN Reordering Smaller table always becomes hash-build side. Cost = build_cost + probe_cost + build_rowsΒ·2.0 + probe_rowsΒ·0.1. Correct for 2-table joins. Multi-table DP planner would be needed for complex queries.
Volcano Model RowIterator trait with next_row(). Operators: SeqScan, PkLookup, Filter (O(1)), Project (O(1)), Limit (O(1)), Sort (O(N)), HashJoin (O(build)), Aggregate (O(N)). Same architecture as PostgreSQL, Oracle, SQL Server. Pull-based streaming.
EXPLAIN ANALYZE Collects actual_rows vs estimated_rows + wall-clock actual_time_ms per operator. Same output format as PostgreSQL's EXPLAIN ANALYZE.
Plan Cache LRU cache keyed by query string. invalidate() on DDL changes. Avoids re-optimizing identical queries. Similar to PostgreSQL's plan cache.

Transaction Engine β€” transaction.rs (648 lines)

Feature Implementation Comparison
SSI (Serializable Snapshot Isolation) Read snapshot at BEGIN. Write-set buffered until COMMIT. At commit: acquire global lock β†’ check if any concurrent transaction wrote to our write-set keys after our snapshot β†’ abort if conflict β†’ commit atomically via WriteBatch. Same isolation level as PostgreSQL's SERIALIZABLE. Stronger than MySQL's default (REPEATABLE READ).
Savepoints Savepoint struct captures write_set + read_set snapshot. ROLLBACK TO restores to that point without aborting the entire transaction. Same semantics as PostgreSQL's SAVEPOINT.
Conflict Detection Write-write conflict: checks if key was modified between our read_seq and current global_seq. Read-write dependency tracking with bounded memory. Correct SSI implementation. Detects dangerous structures (rw-antidependency cycles).
Timeouts Configurable per-transaction timeout. Long-running transactions automatically aborted. Prevents resource leaks from abandoned transactions.
2PC (Distributed) Coordinator WAL persistence. Prepare β†’ Vote β†’ Commit/Abort across shards. Cross-shard Raft replication. Proven by test_2pc_cross_shard_with_raft_replication.

Raft Consensus β€” 58 Tests, 3,660 Lines of Test Code

Test Category Tests What's proven
Core Replication test_3_node_log_replication, test_state_machine_apply All entries identical across 3 nodes. State machine produces same result on all replicas.
Leader Election test_leader_election_simulation, test_leader_election_under_load Exactly one leader emerges. Election works under concurrent write load.
Crash Recovery test_crash_recovery_persistence, test_log_consistency_after_crash Committed data survives node restart. Log is consistent after crash.
Network Partitions test_symmetric_partition_majority_progresses, test_asymmetric_partition_isolated_node, test_cascading_partitions_no_data_loss Majority side continues. Isolated node doesn't corrupt cluster. No data loss across cascading partitions.
Membership test_membership_add_node_catches_up, test_membership_remove_node, test_membership_scale_out_3_to_5 Live scaling 3β†’5 nodes. Added node catches up. Removed node cleanly leaves.
Rolling Upgrades test_rolling_restart_no_data_loss, test_rolling_upgrade_continuous_writes, test_rolling_upgrade_read_availability Zero data loss during rolling restarts. Reads available throughout upgrade.
Distributed Transactions test_2pc_happy_path_commit, test_2pc_cross_shard_with_raft_replication 2PC commit/abort works correctly. Cross-shard transactions replicated via Raft.
SSI Integration test_ssi_write_write_conflict, test_ssi_read_write_conflict, test_ssi_dangerous_structure_chain Serializable isolation across replicated nodes.
MVCC Consistency test_mvcc_logical_clock_ordering, test_ttl_consistency_across_replicas Logical clocks maintain ordering across nodes. TTL consistent across replicas.

πŸ“Š Test Suite β€” 290 Tests

$ cargo test
test result: ok. 290 passed; 0 failed
Suite Tests Focus
Storage Engine 76 WAL, crash recovery, CRC32, compaction
Raft Cluster 58 Replication, elections, partitions, upgrades
Operations 25 Health, metrics, backup, admin
Ops Maturity 24 Config, diagnostics, shutdown
Storage Perf 21 Throughput across storage paths
SQL Layer 18 Parser, JOINs, aggregates
Storage Correctness 14 Crash safety, MVCC, atomicity
Query + Optimizer 16 Planning, execution, cost estimation
Concurrent Stress 6 Multi-threaded contention
Anomaly Demos 4 Isolation level proofs
Other 28 Benchmarks, debugging

πŸ”§ Quick Start

git clone https://github.com/SBALAVIGNESH123/OmniKV.git
cd omni_engine
cargo build --release
cargo run --release
  ╔════════════════════════════════════════════════════╗
  β•‘        ⚑ OmniKV v0.1.0                           β•‘
  β•‘  Embeddable Β· Distributed Β· Transactional KV      β•‘
  ╠════════════════════════════════════════════════════╣
  β•‘  HTTP/1.1 + HTTP/2 (TLS)  β†’ 0.0.0.0:8443         β•‘
  β•‘  QUIC/HTTP3 (binary)      β†’ 0.0.0.0:4433         β•‘
  β•‘  PostgreSQL Wire Protocol β†’ 0.0.0.0:5433         β•‘
  β•‘  TCP Command Interface    β†’ 0.0.0.0:8080         β•‘
  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Connect with psql

psql -h localhost -p 5433
CREATE TABLE users (id INTEGER, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
SELECT * FROM users WHERE id = 1;

REST API

curl -k https://localhost:8443/health
curl -k -X POST https://localhost:8443/kv -d '{"key":"hello","value":"world"}'
curl -k https://localhost:8443/kv/hello

Embedded library

use omni_engine::{OmniKV, WriteBatch};

let db = OmniKV::open("manifest.json", "data.wal").unwrap();
let mut batch = WriteBatch::new();
batch.set("user:1", r#"{"name":"Alice"}"#.into()).unwrap();
db.commit_batch(&batch).unwrap();

let val = db.find("user:1", db.get_seq()).unwrap();

πŸ“ Project Structure

src/                                    ~12,500 lines
β”œβ”€β”€ lib.rs              Storage engine core         2,038
β”œβ”€β”€ sql.rs              SQL parser                    869
β”œβ”€β”€ optimizer.rs        Cost-based optimizer           840
β”œβ”€β”€ sql_exec.rs         SQL execution                  729
β”œβ”€β”€ prepared.rs         Prepared statements            662
β”œβ”€β”€ transaction.rs      SSI transaction engine         648
β”œβ”€β”€ dist_txn.rs         2PC distributed txn            592
β”œβ”€β”€ volcano.rs          Volcano iterator executor      582
β”œβ”€β”€ secondary_index.rs  Secondary index engine         580
β”œβ”€β”€ raft_storage.rs     Raft storage trait              536
β”œβ”€β”€ schema.rs           DDL engine                     471
β”œβ”€β”€ chaos.rs            Chaos testing framework        468
β”œβ”€β”€ plan_exec.rs        Plan-driven executor           463
β”œβ”€β”€ pgwire.rs           PostgreSQL wire protocol       430
β”œβ”€β”€ api.rs              REST API (Axum)                300
└── ...                 hardening, QUIC, WAL, auth

tests/                                  ~7,800 lines
β”œβ”€β”€ raft_cluster.rs     Raft cluster tests          3,660 (58 tests)
β”œβ”€β”€ storage_tests.rs    Storage engine              1,547 (76 tests)
β”œβ”€β”€ storage_perf.rs     Performance                   429 (21 tests)
β”œβ”€β”€ operations.rs       Operational                   446 (25 tests)
β”œβ”€β”€ storage_correctness Crash safety                  389 (14 tests)
β”œβ”€β”€ sql_layer.rs        SQL integration               282 (18 tests)
└── ...                 stress, benchmarks

Total: ~20,000 lines of Rust Β· 290 tests


πŸ—ΊοΈ Maturity

Stage Status
βœ… Storage Correctness β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
βœ… Internal Storage APIs β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
βœ… Raft Hardening β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
βœ… Transaction Engine β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
βœ… Query Engine β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
βœ… Operational Maturity β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
πŸ”¨ Ecosystem β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ 65%

🐳 Docker

docker build -t omnikv .
docker run -p 8443:8443 -p 5433:5433 -p 4433:4433/udp omnikv
docker compose up  # 3-node cluster

πŸ“„ License

MIT


Built from scratch in Rust. Every byte is ours.
By Balavignesh

About

Distributed transactional SQL + KV engine. PostgreSQL wire protocol. Raft consensus. Cost-based optimizer. 290 tests. 20K lines of Rust. From scratch.

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors