Skip to content

shrirambalakrishnan/spanner-db

Repository files navigation

Spanner DB

A simplified implementation of Google's Spanner

This explores how large-scale distributed databases achieve fault tolerance, consistency, and scalability using concepts like tablet-based partitioning, Raft replication, directory-based request routing, two-phase commit (2PC) for distributed transactions and Timestamp Oracle (TSO) for globally consistent ordering.

Features

This version, v1.0.0, has the following features

  • ReplicatedTablet integrating Raft consensus for consistent writes.
    • Bigtable Tablet acting as the underlying persistent key-value store.
  • SpanServer
    • Hosts multiple ReplicatedTablet.
    • Exposes /put and /get APIs to set and get data.
  • DirectoryService for key-based routing
    • Contains the key range to SpanServer mapping — Returns the SpanServer responsible for a given key via /lookup API
  • Distributed Transaction Manager using Two Phase Commit (2PC) with MVCC
    • Ensures atomicity for multi-key, multi-tablet transactions.
    • Implements Prepare (lock) and Commit phases across all participant SpanServers.
    • Integrates with Timestamp Oracle (TSO) to assign StartTimestamp and CommitTimestamp for each transaction.
    • Reads are non-blocking — they use MVCC snapshots based on StartTimestamp.
    • Writes remain isolated — only SET operations acquire locks during Prepare.
    • All writes in a transaction commit under the same CommitTimestamp, achieving Snapshot Isolation.
    • Write Conflict Detection (Commit-Time Validation)
      • During the commit phase, each participant SpanServer verifies that the row being written has not been modified by another transaction since this transaction started.
      • This is achieved by comparing the latest committed version’s timestamp with the transaction’s StartTimestamp.
      • If a newer version exists (latestTimestamp > StartTimestamp), the transaction is aborted to maintain snapshot isolation.
  • Timestamp Oracle (TSO)
    • A centralized service providing monotonically increasing timestamps (non-decreasing sequence).
    • Acts as the global clock to order transactions system-wide, ensuring linearizability and global commit order.
    • Exposes /timestamp API to return a JSON timestamp value.
  • Versioned Data Storage using Timestamp Oracle
    • Each write operation is assigned a commit timestamp from the Timestamp Oracle (TSO).
    • The key written to the Bigtable Tablet is stored in the pattern <rowKey>#<timestamp>.
    • Reads always return the latest version of the key (highest commit timestamp).
  • 3-node simulation cluster — each SpanServer runs 2 replicated tablets belonging to 2 distinct Raft groups.

How it works

Step-by-Step Flow

1. Client writes a key

  • Sends /lookup?key=x to the DirectoryService.
  • Receives the responsible SpanServer address.
  • Sends POST /put with {key, value} to that SpanServer.

2. SpanServer

  • Determines which ReplicatedTablet owns the key.
  • If it’s the Raft leader, replicates the log entry to followers.
  • Once the entry is committed, it’s applied to the Bigtable Tablet.

2.1 Versioned Write (TSO Integration)

  • When a /put request is received, the SpanServer fetches a commit timestamp from the TSO.
  • The ReplicatedTablet combines rowKey and timestamp into <rowKey>#<timestamp>.
  • This versioned key is written to the underlying Bigtable Tablet.
  • As a result, the system stores multiple versions of each key, preserving historical writes.

3. Client reads a key

  • Looks up the SpanServer again via /lookup?key=x.
  • Sends GET /get?key=x to that SpanServer.
  • SpanServer calls ReplicatedTablet.Get(), which:
    • Scans all versions of <key>#<timestamp> in memory and SSTables.
    • Selects the version with the largest timestamp.
  • Returns the latest committed value for that key.

4. Multi-key Distributed Transaction using Multi Version Concurrency Control (MVCC)

  • Client sends a request to /transaction with a list of operations
  • SpanServer (acting as the coordinator) creates a new transaction.
    1. Start Phase
    • The Transaction Manager fetches a StartTimestamp from the Timestamp Oracle Service.
    • This timestamp defines the snapshot boundary for all reads in the transaction.
    • Reads executed during this transaction will use /get?key=<k>&readTs=<StartTimestamp> to ensure they return data consistent with this snapshot.
    1. Prepare Phase
    • Each participant SpanServer is contacted to lock its respective rows via /lockRowKey - only for SET operations
    • Read operations (GET) are non-blocking and do not require locks — they rely on the consistent snapshot defined by the StartTimestamp.
    • If any lock fails, the transaction is aborted.
    1. Commit Phase
    • Once all participants have successfully prepared, the Transaction Manager requests a CommitTimestamp from the Timestamp Oracle Service.
    • The Commit phase begins — each SpanServer executes its respective operations (/put) along with the CommitTimestamp
      • All writes within the transaction share the same commit version
    • Before applying each write, the SpanServer performs commit-time conflict detection:
      • Retrieves the latest committed version of the row from its ReplicatedTablet.
      • Extracts the timestamp of that version (from <rowKey>#<timestamp>).
      • If this latestTimestamp is greater than the transaction’s StartTimestamp, a write-write conflict is detected.
      • The SpanServer aborts the operation and the Transaction Manager rolls back the transaction.
      • This ensures no transaction overwrites data written by a concurrent transaction that committed later, preserving snapshot isolation.
    • Finally, all locks are released via /unlockRowKey.
  • The transaction ensures atomicity — either all writes succeed or none do.

5. Timestamp Oracle Service

  • Each call to /timestamp returns a JSON object
    • { "timestamp": 1739772012345678 }
  • Guarantees strictly monotonic timestamps.

Cluster Setup

SpanServer Port RaftGroup1 Replica RaftGroup2 Replica
spanserver1 :8301 replicatedtablet1 replicatedtablet4
spanserver2 :8302 replicatedtablet2 replicatedtablet5
spanserver3 :8303 replicatedtablet3 replicatedtablet6

Each Raft group spans all 3 SpanServers:

  • raftgroup1 handles keys < k
  • raftgroup2 handles keys ≥ k

Getting Started

1. Run Directory Service

go run cmd/directory/main.go

2. Start SpanServers

In three separate terminals:

go run cmd/spanserver1/main.go
go run cmd/spanserver2/main.go
go run cmd/spanserver3/main.go

3. Sample Requests

Lookup SpanServer responsible for a key:

curl "http://localhost:8050/lookup?key=a"

Write a key:

curl -X POST -d '{"key":"a","value":"apple"}' localhost:8301/put

Read a key:

curl "http://localhost:8301/get?key=a"

Multi-key Distributed Transaction

curl -X POST localhost:8301/transaction \
  -H "Content-Type: application/json" \
  -d '{
        "operations": [
          {"command": "SET", "rowKey": "Apple", "value": "1001"},
          {"command": "SET", "rowKey": "Orange", "value": "1002"}
        ]
      }'

Timestamp from Timestamp Oracle Service

curl "http://localhost:8051/timestamp"

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages