A high-performance, persistent, and observable in-memory key-value store built with Go. This project is a deep dive into the internal mechanisms of modern databases, demonstrating a practical exploration of concepts like data persistence, concurrency control, and system observability from the ground up.
- Dual HTTP & gRPC APIs: Interact via a simple RESTful interface or a high-performance gRPC API for `PUT`, `GET`, and `DELETE` operations.
- Concurrent & Performant: Utilizes a sharded map to minimize lock contention, allowing it to handle high-throughput workloads efficiently.
- Durable Persistence: Implements a Write-Ahead Log (WAL) encoded with Protocol Buffers, persisting every write in a compact, efficient binary format so that no data is lost.
- Fast Recovery: Periodically creates snapshots of the data to compact the WAL, ensuring quick restarts. Snapshots also use Protocol Buffers for efficient storage.
- Built-in Observability: Comes with a pre-configured Grafana dashboard for monitoring key performance metrics via Prometheus.
- Automatic Memory Reclamation: Periodically rebuilds storage shards to reclaim memory from deleted items, preventing memory bloat in write-heavy workloads.
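Because Go maps do not release bucket memory when entries are deleted, the memory-reclamation feature above depends on periodically rebuilding each shard. The sketch below shows the general idea; the `Shard` type and its field names are illustrative, not the project's actual ones.

```go
// Sketch of periodic shard rebuilding under a per-shard mutex. Copying the
// live entries into a fresh map lets the garbage collector reclaim the
// memory still held by the old map's deleted buckets.
package store

import "sync"

type Shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// Rebuild copies live entries into a new map so the old one can be GC'd.
func (s *Shard) Rebuild() {
	s.mu.Lock()
	defer s.mu.Unlock()

	fresh := make(map[string][]byte, len(s.data))
	for k, v := range s.data {
		fresh[k] = v
	}
	s.data = fresh
}
```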
This project was built with a focus on exploring the core principles behind modern data systems. The following are key architectural decisions made to address common challenges in database design:
- Problem: A single, global lock on a central data map creates a bottleneck under concurrent loads.
- Solution: A sharded map partitions the key space across many maps, each with its own lock. This distributes write contention, significantly improving parallelism and throughput.
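A minimal sketch of the sharded-map idea, assuming FNV-1a hashing to choose a shard; the shard count and type names are illustrative rather than taken from this codebase:

```go
// Sharded map: the key space is partitioned across many small maps, each
// protected by its own RWMutex, so writes to different keys rarely contend.
package store

import (
	"hash/fnv"
	"sync"
)

const shardCount = 256

type shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

type ShardedMap struct {
	shards [shardCount]*shard
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i] = &shard{data: make(map[string][]byte)}
	}
	return m
}

// shardFor hashes the key to pick the shard that owns it.
func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return m.shards[h.Sum32()%shardCount]
}

func (m *ShardedMap) Put(key string, value []byte) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
}

func (m *ShardedMap) Get(key string) ([]byte, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.data[key]
	s.mu.RUnlock()
	return v, ok
}
```

Because each `Put` locks only one of the 256 shards, concurrent writes to different keys almost never block each other.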
- Problem: In-memory data is lost on server crash or restart.
- Solution: A Write-Ahead Log (WAL) persists all write operations to disk before they are applied in memory. This ensures that the store can be fully recovered by replaying the log after a crash.
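The sketch below illustrates the append-before-apply contract of a WAL. The real log encodes entries with Protocol Buffers; this version uses a bare length-prefixed payload so the example stays self-contained.

```go
// Simplified write-ahead log: each operation is appended and fsynced to disk
// before the in-memory map is updated.
package store

import (
	"encoding/binary"
	"os"
)

type WAL struct {
	f *os.File
}

func OpenWAL(path string) (*WAL, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &WAL{f: f}, nil
}

// Append writes a length-prefixed record and forces it to stable storage.
func (w *WAL) Append(record []byte) error {
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(record)))
	if _, err := w.f.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := w.f.Write(record); err != nil {
		return err
	}
	// Durability point: only after Sync returns may the write be applied in memory.
	return w.f.Sync()
}
```

Since the in-memory shard is mutated only after `Append` returns, any value a reader can observe is already on stable storage.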
- Problem: Replaying a large WAL file on startup can lead to slow recovery times.
- Solution: The system periodically creates snapshots of the in-memory state. On restart, the service loads the latest snapshot and replays only the WAL entries created since, dramatically reducing startup time.
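Recovery then becomes a two-step process: decode the latest snapshot into the in-memory map, then replay the WAL records written after it. The replay half is sketched below against the length-prefixed format from the WAL sketch above; the real project decodes Protocol Buffers entries, and the `apply` callback here is an illustrative stand-in for re-executing each logged operation.

```go
// Startup replay: read every record appended since the last snapshot and
// hand it to apply, which re-executes the operation against the store.
package store

import (
	"encoding/binary"
	"errors"
	"io"
	"os"
)

func ReplayWAL(path string, apply func(record []byte) error) error {
	f, err := os.Open(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil // fresh start: nothing to replay
	}
	if err != nil {
		return err
	}
	defer f.Close()

	var lenBuf [4]byte
	for {
		if _, err := io.ReadFull(f, lenBuf[:]); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // clean end of log
			}
			return err
		}
		record := make([]byte, binary.BigEndian.Uint32(lenBuf[:]))
		if _, err := io.ReadFull(f, record); err != nil {
			return err // truncated tail, e.g. crash mid-write
		}
		if err := apply(record); err != nil {
			return err
		}
	}
}
```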
Before running the application, you need to create a `config.yml` file. A commented example is provided in `config/config.example.yml`; copy and modify it to get started:
cp config/config.example.yml config/config.yml
You can run the application using either Docker Compose or native Go commands.
The easiest way to get started is with Docker Compose, which runs the KV store, Prometheus, and Grafana.
# Build and start all services in detached mode
docker-compose up -d --build
- The KV store will be available at http://localhost:16700.
- Prometheus will be available at http://localhost:9090.
- Grafana will be available at http://localhost:3000.
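As a quick smoke test once the containers are up, you could exercise the HTTP API from a small Go program. The route shape used below (`/v1/keys/<key>`) is an assumption made for illustration; check the repository's handler definitions for the actual paths.

```go
// Hypothetical client calls against the HTTP API exposed on port 16700.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	base := "http://localhost:16700"

	// PUT a value (the route is assumed, not confirmed by this README).
	req, err := http.NewRequest(http.MethodPut, base+"/v1/keys/greeting", strings.NewReader("hello"))
	if err != nil {
		panic(err)
	}
	if _, err := http.DefaultClient.Do(req); err != nil {
		panic(err)
	}

	// GET it back and print the status and body.
	resp, err := http.Get(base + "/v1/keys/greeting")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(body))
}
```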
You must have Go installed on your system.
# Build the binary
make build
# Run the application
# (Requires a config file, one is provided in the `config` directory)
make run
# Run all unit tests
make test
# Run unit tests with coverage
make test-cover
# Run linters
make lint
The project includes a pre-configured Grafana dashboard for visualizing performance metrics:
- HTTP Performance: Request rates, latency distributions (heatmap), and latency percentiles (P50, P90, P99).
- Key-Value Operations: P99 latency for PUT, GET, and DELETE operations.
- Error Rates: HTTP 4xx and 5xx error rates.
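For context, latency panels like these are typically driven by Prometheus histograms recorded inside the server. The sketch below shows one way that instrumentation might look using the official Go client; the metric name and label are assumptions, not the project's actual ones.

```go
// Illustrative Prometheus instrumentation for per-operation latency.
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var opLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kv_operation_duration_seconds", // assumed name for illustration
		Help:    "Latency of key-value operations.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"op"},
)

func init() {
	prometheus.MustRegister(opLatency)
}

// Observe records one operation's duration, e.g. Observe("put", time.Since(start)).
func Observe(op string, d time.Duration) {
	opLatency.WithLabelValues(op).Observe(d.Seconds())
}

// Handler exposes /metrics for Prometheus to scrape.
func Handler() http.Handler {
	return promhttp.Handler()
}
```

Grafana can then derive the P50/P90/P99 panels with `histogram_quantile` over the exported buckets.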
The store was benchmarked using k6 with an open model (arrival-rate) test to determine its limits under heavy, real-world-style load.
| Metric | Result |
|---|---|
| Achieved RPS | ~40,000 req/s |
| Target RPS | 50,000 req/s |
| Failure Rate | 0.00% |
| p95 Latency (PUT) | 15.55 ms |
| p95 Latency (DELETE) | 7.38 ms |
- Performance Tuning: Profile the application under load to identify the causes of the current throughput ceiling and the observed "long-tail" latency (the significant gap between p95 and max response times), then improve performance consistency.
- Replication: To improve availability, a replication mechanism could be added to synchronize data to one or more follower nodes.
- Clustering: For horizontal scalability, the store could be extended into a distributed system where data is sharded across multiple nodes in a cluster.
This project is licensed under the MIT License. See the LICENSE file for details.