A simplified implementation of Google's Spanner
This explores how large-scale distributed databases achieve fault tolerance, consistency, and scalability using concepts like tablet-based partitioning, Raft replication, directory-based request routing, two-phase commit (2PC) for distributed transactions and Timestamp Oracle (TSO) for globally consistent ordering.
This version, v1.0.0, has the following features
- ReplicatedTablet integrating Raft consensus for consistent writes.
- Bigtable Tablet acting as the underlying persistent key-value store.
- SpanServer
- Hosts multiple
ReplicatedTablet. - Exposes
/putand/getAPIs to set and get data.
- Hosts multiple
- DirectoryService for key-based routing
- Contains the key range to SpanServer mapping
— Returns the SpanServer responsible for a given key via
/lookupAPI
- Contains the key range to SpanServer mapping
— Returns the SpanServer responsible for a given key via
- Distributed Transaction Manager using Two Phase Commit (2PC) with MVCC
- Ensures atomicity for multi-key, multi-tablet transactions.
- Implements Prepare (lock) and Commit phases across all participant SpanServers.
- Integrates with Timestamp Oracle (TSO) to assign
StartTimestampandCommitTimestampfor each transaction. - Reads are non-blocking — they use MVCC snapshots based on
StartTimestamp. - Writes remain isolated — only
SEToperations acquire locks during Prepare. - All writes in a transaction commit under the same
CommitTimestamp, achieving Snapshot Isolation. - Write Conflict Detection (Commit-Time Validation)
- During the commit phase, each participant SpanServer verifies that the row being written has not been modified by another transaction since this transaction started.
- This is achieved by comparing the latest committed version’s timestamp with the transaction’s
StartTimestamp. - If a newer version exists (
latestTimestamp > StartTimestamp), the transaction is aborted to maintain snapshot isolation.
- Timestamp Oracle (TSO)
- A centralized service providing monotonically increasing timestamps (non-decreasing sequence).
- Acts as the global clock to order transactions system-wide, ensuring linearizability and global commit order.
- Exposes
/timestampAPI to return a JSON timestamp value.
- Versioned Data Storage using Timestamp Oracle
- Each write operation is assigned a commit timestamp from the Timestamp Oracle (TSO).
- The key written to the Bigtable Tablet is stored in the pattern
<rowKey>#<timestamp>. - Reads always return the latest version of the key (highest commit timestamp).
- 3-node simulation cluster — each SpanServer runs 2 replicated tablets belonging to 2 distinct Raft groups.
- Sends
/lookup?key=xto the DirectoryService. - Receives the responsible SpanServer address.
- Sends
POST /putwith{key, value}to that SpanServer.
- Determines which
ReplicatedTabletowns the key. - If it’s the Raft leader, replicates the log entry to followers.
- Once the entry is committed, it’s applied to the Bigtable Tablet.
- When a
/putrequest is received, the SpanServer fetches a commit timestamp from the TSO. - The
ReplicatedTabletcombines rowKey and timestamp into<rowKey>#<timestamp>. - This versioned key is written to the underlying Bigtable Tablet.
- As a result, the system stores multiple versions of each key, preserving historical writes.
- Looks up the SpanServer again via
/lookup?key=x. - Sends
GET /get?key=xto that SpanServer. - SpanServer calls
ReplicatedTablet.Get(), which:- Scans all versions of
<key>#<timestamp>in memory and SSTables. - Selects the version with the largest timestamp.
- Scans all versions of
- Returns the latest committed value for that key.
- Client sends a request to
/transactionwith a list of operations - SpanServer (acting as the coordinator) creates a new transaction.
-
- Start Phase
- The Transaction Manager fetches a StartTimestamp from the Timestamp Oracle Service.
- This timestamp defines the snapshot boundary for all reads in the transaction.
- Reads executed during this transaction will use
/get?key=<k>&readTs=<StartTimestamp>to ensure they return data consistent with this snapshot.
-
PreparePhase
- Each participant SpanServer is contacted to
lockits respective rows via /lockRowKey - only for SET operations - Read operations (
GET) are non-blocking and do not require locks — they rely on the consistent snapshot defined by theStartTimestamp. - If any lock fails, the transaction is aborted.
-
- Commit Phase
- Once all participants have successfully prepared, the Transaction Manager requests a CommitTimestamp from the Timestamp Oracle Service.
- The
Commitphase begins — each SpanServer executes its respective operations (/put) along with the CommitTimestamp- All writes within the transaction share the same commit version
- Before applying each write, the SpanServer performs commit-time conflict detection:
- Retrieves the latest committed version of the row from its
ReplicatedTablet. - Extracts the timestamp of that version (from
<rowKey>#<timestamp>). - If this
latestTimestampis greater than the transaction’sStartTimestamp, a write-write conflict is detected. - The SpanServer aborts the operation and the Transaction Manager rolls back the transaction.
- This ensures no transaction overwrites data written by a concurrent transaction that committed later, preserving snapshot isolation.
- Retrieves the latest committed version of the row from its
- Finally, all
locks are releasedvia /unlockRowKey.
- The transaction ensures atomicity — either all writes succeed or none do.
- Each call to /timestamp returns a JSON object
{ "timestamp": 1739772012345678 }
- Guarantees strictly monotonic timestamps.
| SpanServer | Port | RaftGroup1 Replica | RaftGroup2 Replica |
|---|---|---|---|
| spanserver1 | :8301 |
replicatedtablet1 | replicatedtablet4 |
| spanserver2 | :8302 |
replicatedtablet2 | replicatedtablet5 |
| spanserver3 | :8303 |
replicatedtablet3 | replicatedtablet6 |
Each Raft group spans all 3 SpanServers:
- raftgroup1 handles keys
< k - raftgroup2 handles keys
≥ k
go run cmd/directory/main.goIn three separate terminals:
go run cmd/spanserver1/main.go
go run cmd/spanserver2/main.go
go run cmd/spanserver3/main.goLookup SpanServer responsible for a key:
curl "http://localhost:8050/lookup?key=a"Write a key:
curl -X POST -d '{"key":"a","value":"apple"}' localhost:8301/putRead a key:
curl "http://localhost:8301/get?key=a"Multi-key Distributed Transaction
curl -X POST localhost:8301/transaction \
-H "Content-Type: application/json" \
-d '{
"operations": [
{"command": "SET", "rowKey": "Apple", "value": "1001"},
{"command": "SET", "rowKey": "Orange", "value": "1002"}
]
}'Timestamp from Timestamp Oracle Service
curl "http://localhost:8051/timestamp"