An experimental key-value database using multiversion concurrency control (MVCC)
MVCC Key-Value Database

Store byte vector values in an experimental key-value database that uses multiversion concurrency control (MVCC), allowing many readers to proceed without waiting for any writers to finish with their changes. HTTP clients interact with the database by way of its HTTP server.


The database stores its records only in memory, organized in a manner intended to reduce the coordination delay imposed on many clients vying to read and write at the same time. The top-level storage—represented by the Go type db.ShardedStore—uses an array of 512 Go map values to track groups of records in separate shards. Callers can specify a particular projection function to determine into which shard a given record key will fall. The default projection function comes from the Go standard library's hash/maphash package, but others from third-party libraries would likely serve even better. Splitting the records into these shards reduces the likelihood that readers and writers of two different records will need to coordinate to avoid interfering with one another.
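The sharding scheme can be sketched as follows. This is a minimal illustration, not the repository's actual db.ShardedStore: the type and function names here are assumptions, and the shard values are bare maps rather than the richer db.recordMap.

```go
package main

import (
	"fmt"
	"hash/maphash"
)

const shardCount = 512

// shardedStore splits records across a fixed number of shards so that
// readers and writers of unrelated keys rarely contend on the same map.
type shardedStore struct {
	project func(key string) uint64 // projection function choosing a shard
	shards  [shardCount]map[string][]byte
}

func newShardedStore() *shardedStore {
	seed := maphash.MakeSeed()
	s := &shardedStore{
		// Default projection: the standard library's hash/maphash.
		project: func(key string) uint64 { return maphash.String(seed, key) },
	}
	for i := range s.shards {
		s.shards[i] = make(map[string][]byte)
	}
	return s
}

// shardFor maps a record key onto exactly one of the 512 shards.
func (s *shardedStore) shardFor(key string) map[string][]byte {
	return s.shards[s.project(key)%shardCount]
}

func main() {
	s := newShardedStore()
	s.shardFor("k1")["k1"] = []byte("abcd")
	fmt.Printf("%s\n", s.shardFor("k1")["k1"])
}
```

Because the projection is a plain function value, swapping in a third-party hash requires no change to the lookup path.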

Once a given record key lands in a db.recordMap, we need to accommodate multiple readers and writers digging in deeper:

  • Look up a record key in the recordsByKey field's map to find an existing versionedRecord, which may or may not contain any valid versions for the record.
  • When trying to insert a record, if no versionedRecord exists in the recordsByKey field's map, add a new one to start storing new versions.

These two needs conflict: We can't allow simultaneous reading and writing of these maps. We must employ some form of locking to make these concurrent operations mutually exclusive.

Reader/Writer Locks

Though the Go standard library offers the sync.RWMutex type to accommodate groups of readers vying for access against a presumably smaller set of writers, it is difficult to honor a context.Context's "done" status to govern how long a caller will wait trying to acquire a read or write lock. The db.Transaction type's interface uses context.Context parameters throughout, indicating its conscientious effort to interrupt long-running operations upon request. To that end, I introduced the db.rwMutex type for use in guarding our maps.

This db.rwMutex type uses a pair of channels to model requests and claims by calling readers and writers. It can then offer its TryLockUntil and TryRLockUntil methods, each accepting a context.Context parameter that limits how long the call will block waiting to acquire the respective kind of lock. Unlike sync.RWMutex, though, db.rwMutex does not ensure that a waiting writer will forestall admitting any more readers. Instead, it's possible that waiting writers will be starved as newly arriving readers continue jumping in line ahead of them through the already open gate.

I have not yet tested the performance difference between this db.rwMutex type and sync.RWMutex. I expect that the former type imposes a longer delay. The db.shardedStoreTransaction type's methods surrender both the reading and writing locks as quickly as possible after acquiring them:

  • lock for reading, read an entry from a map, then unlock,
  • and only if there was no such entry and we're trying to insert a new record, lock for writing, insert the map entry, then unlock.
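The two-step pattern above can be sketched as follows, using the standard sync.RWMutex for brevity where the repository uses its own context-aware db.rwMutex. The names recordMap, versionedRecord, and getOrInsert mirror the text but are illustrative; the re-check under the write lock is a safety step any such implementation needs, since another writer may insert between the read unlock and the write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// versionedRecord stands in for the repository's record type; its
// version chain is omitted here.
type versionedRecord struct {
	key string
}

type recordMap struct {
	mu           sync.RWMutex // stand-in for the context-aware db.rwMutex
	recordsByKey map[string]*versionedRecord
}

// getOrInsert holds each lock only as briefly as possible.
func (m *recordMap) getOrInsert(key string) *versionedRecord {
	// Lock for reading, read the map entry, then unlock.
	m.mu.RLock()
	rec := m.recordsByKey[key]
	m.mu.RUnlock()
	if rec != nil {
		return rec
	}
	// Only if there was no entry: lock for writing, re-check in case a
	// concurrent writer won the race, insert if still absent, unlock.
	m.mu.Lock()
	defer m.mu.Unlock()
	if rec := m.recordsByKey[key]; rec != nil {
		return rec
	}
	rec = &versionedRecord{key: key}
	m.recordsByKey[key] = rec
	return rec
}

func main() {
	m := &recordMap{recordsByKey: make(map[string]*versionedRecord)}
	fmt.Println(m.getOrInsert("k1") == m.getOrInsert("k1")) // same record both times
}
```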

Allowing More Simultaneous Access

Once any two callers have a versionedRecord value in hand—now being past the need for locking—they can proceed with their reading and writing attempts with no further locking. The database stores multiple versions for each record, with each version indicating the first or earliest transaction in which it's valid as well as an optional latest transaction in which it's no longer valid, establishing a half-open range as its validity period. Retaining these multiple versions allows both more freedom for concurrent reading and writing and better detection and control of concurrent attempts at writing that would conflict. This is a simple implementation of multiversion concurrency control, or MVCC.

When attempting to record new versions within a transaction, the mutation procedure adds a latest version that lacks an earliest valid transaction. This lack of a start to the validity period acts as a sentinel value indicating that the record version is merely pending and not yet committed. A pending record version may already include a latest bound on its validity period to indicate that it's a deletion; attempts to read such a pending deletion within the same transaction would find the record effectively absent.

These pending records are effectively invisible to other concurrent readers. Those readers continue to see the database as it was when their transaction began, offering a form of snapshot isolation. Only if they then attempt to write to a record for which another writer has already created a pending version will they collide; the later attempt fails with the Go error type db.ErrTransactionInConflict.

There are also cases where a reader finds a record version with a validity period that includes the current transaction, but that ends in a later transaction: The record had been mutated or deleted "in the future" from the current transaction's perspective. In that situation, the current transaction is not allowed to propose changes to this record, because it may be acting on observations that were true and conclusions that were justified in the past, but would no longer be appropriate for the later committed state of the database.
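The visibility and conflict rules from the preceding paragraphs can be condensed into a short sketch. The field and function names here are assumptions rather than the repository's actual definitions; transaction ID zero serves as the "absent" sentinel for both bounds of the validity period.

```go
package main

import (
	"errors"
	"fmt"
)

type recordVersion struct {
	validFrom  uint64 // earliest transaction in which valid; 0 means pending (uncommitted)
	validUntil uint64 // first transaction in which no longer valid; 0 means still current
	value      []byte
}

// visibleTo reports whether a version belongs to the snapshot seen by
// transaction txID, per the half-open validity range [validFrom, validUntil).
func (v *recordVersion) visibleTo(txID uint64) bool {
	if v.validFrom == 0 {
		return false // pending versions are invisible to other transactions
	}
	return v.validFrom <= txID && (v.validUntil == 0 || txID < v.validUntil)
}

var errTransactionInConflict = errors.New("transaction in conflict")

// checkWrite applies both conflict rules to the newest version of a
// record. (A real implementation would exempt this transaction's own
// pending version, which requires knowing which transaction created it.)
func checkWrite(newest *recordVersion, txID uint64) error {
	if newest == nil {
		return nil // no versions yet: free to insert
	}
	if newest.validFrom == 0 {
		return errTransactionInConflict // another writer's pending version
	}
	if newest.validUntil != 0 && newest.validUntil > txID {
		return errTransactionInConflict // changed "in the future" of our snapshot
	}
	return nil
}

func main() {
	v := &recordVersion{validFrom: 3, validUntil: 7, value: []byte("abcd")}
	fmt.Println(v.visibleTo(5), v.visibleTo(7)) // valid at 5, expired by 7
}
```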

The database uses atomic, lock-free operations to inspect and mutate these in-memory structures. Doing so reduces the delay that concurrent callers would likely suffer with other lock-based techniques. In trade, though, this lock-free technique makes some required coordination more difficult to accomplish. Removing all the traffic lights makes it harder to stop traffic.
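As one example of the technique, prepending a new version to a record's version chain can be done without locks via compare-and-swap: read the current head, link the new version to it, and publish the swap, retrying if another writer got there first. This sketch uses assumed names, not the repository's actual version-chain types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type version struct {
	value []byte
	next  *version // the next-older version
}

type versionChain struct {
	head atomic.Pointer[version]
}

// push prepends a new version with no locking. If a concurrent writer
// changed the head between our Load and CompareAndSwap, we retry.
func (c *versionChain) push(value []byte) {
	v := &version{value: value}
	for {
		old := c.head.Load()
		v.next = old
		if c.head.CompareAndSwap(old, v) {
			return
		}
	}
}

func main() {
	c := &versionChain{}
	c.push([]byte("v1"))
	c.push([]byte("v2"))
	head := c.head.Load()
	fmt.Printf("%s -> %s\n", head.value, head.next.value)
}
```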

The HTTP server uses a simple interface in order to allow easy use of most HTTP clients—especially readily available tools like curl and wget. To that end, it uses text strings for record keys and values, even though the database's Go programming interface can accommodate arbitrary byte vectors for each.

The server accepts the following operations:

  • /record/{key}

    • DELETE
      Delete an existing record with the given key.
      Form parameters:
      • if-absent (optional: abort (default) or ignore)
    • GET
      Retrieve an existing record with the given key.
    • POST
      Create a new record with the given key and value.
      Form parameters:
      • value
    • PUT
      Update an existing record with the given key and new value.
      Form parameters:
      • if-absent (optional: abort (default), insert, or ignore)
      • value
  • /records/batch

    • POST
      Ensure afterward that any number of records with the given keys are either present with the given value or absent, effected by inserting, updating, or deleting records as necessary. Either all the required changes commit successfully or none of them do.
      Form parameters:
      • absent (optional: keys of records to ensure are absent)
      • bound (optional: keys and values to which to ensure records are bound, written with the key surrounded by a delimiter character, e.g. :k1:abcd or |k1|abcd)

As with the comparable etcd server, using an application protocol like gRPC—as opposed to etcd's earlier v2 API—would allow for more direct use of byte vectors. Using JSON to convey response messages and possibly request parameters might also be more fruitful for clients.

The HTTP server program is written in the Go programming language, so you can build it using either the go tool directly or Bazel to integrate it with other generation and compilation needs.

From this repository's root directory, invoke the following command to build the server program:

go build ./cmd/server

When successful, that command will produce an executable file named server in your working directory.

From any directory within this repository, invoke the following command to build the server program using the Bazel configuration:

bazel build //cmd/server

To locate the executable file produced by a successful invocation, invoke the following command:

bazel cquery --output=files //cmd/server

If you've modified the source code, such as by adding or removing Go source files or adding or removing Go import declarations, be sure to use Bazel's Gazelle tool to update the configuration files accordingly:

bazel run //:gazelle

When you want to run the program, you can build it first and invoke the resulting executable file:

./server

Invoked like that with no command-line flags, the server listens on all network interfaces on port 80, accepting connections and serving requests over unencrypted HTTP. You can specify a different network address and port on which to listen with the --server-address and --server-port command-line flags, respectively.

./server \
  --server-address=127.0.0.1 \
  --server-port=8080

If you'd like to serve requests over HTTPS instead, specify files containing an X.509 serving certificate and its accompanying private key:

./server \
  --tls-cert-file=/public/server.crt \
  --tls-private-key-file=/private/server.key

When serving over HTTPS like this, the server listens on port 443 by default rather than port 80.

Note that it is also possible to have Bazel ensure that the program is built per the latest source code changes, and then run it:

bazel run -- //cmd/server

That can be useful during active development, to avoid accidentally running a previously built program that doesn't take recent changes into account.

I wrote this program in haste, and though I dedicated more time than prescribed, I still had to leave it with the several apparent deficiencies that follow.

  • The "vacuum" procedure is not implemented, so the database can only grow, even as callers delete the records it stores. There are two levels at which this procedure could collect garbage that accumulates within the database:

    • Record versions (of type db.recordVersion) that are valid only in transactions older than the oldest live transaction.
    • Versioned records (of type db.versionedRecord) referenced by entries in each db.recordMap's recordsByKey field that represent absent records (these either having been deleted or orphaned during insertion by aborted transactions).

    In order to collect the second kind of garbage, this procedure would need to delete entries from a Go map value while other writers may be manipulating record versions linked from that same condemned db.versionedRecord. We will likely need another field or two for bookkeeping in each db.versionedRecord to allow the "vacuum" procedure to delay such concurrent mutation or detect its occurrence and transplant the revived versioned record chain into a replacement db.versionedRecord value. Coordinating those competing efforts without imposing too much delay on writers will be challenging.

    There's some attempt at tracking the oldest extant transaction ID in the db.transactionState type's recordFinished method, but it's doomed. Using a probabilistic structure like a Bloom filter to track which IDs have completed—discarding and rebuilding it occasionally—may help here.

  • The ownership policy for byte vectors returned for record values by the (*db.shardedStoreTransaction).Get method is vague. Within a transaction in which the target record was newly inserted or updated, subsequent calls to the (*db.shardedStoreTransaction).Update method may replace the record value in place, which would still be visible through the byte slice returned by an earlier call to (*db.shardedStoreTransaction).Get. Returning a copy of the byte vector—or demanding a destination slice into which to copy the value—would be safer, but would impose a slight performance tax on callers that don't need that protection.

  • There are many cases in which concurrent transactions attempting to change the same record will see calls to the (*db.shardedStoreTransaction).Delete, (*db.shardedStoreTransaction).Insert, (*db.shardedStoreTransaction).Update, and (*db.shardedStoreTransaction).Upsert methods fail with ErrTransactionInConflict, even though those calls might succeed without interference if attempted again immediately afterward in a later transaction. We could introduce automatic retry policies that would give up after either some maximum number of failed attempts or a context.Context reporting that it's done.

  • Transactions each have an ID, and we assume that these IDs increase monotonically over time. Given that we represent transaction IDs as 64-bit-wide unsigned integers, at some point we'll saturate those values and overflow back down to zero, appearing to zoom back in time. As written, the program detects this situation and panics, but there may be a more graceful way to interrupt the program and either adjust the transaction IDs on the live record versions or wait until all extant transactions complete before resuming doling out these much lower IDs.
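    The detect-and-panic behavior can be sketched as follows. The idAllocator type and next method are illustrative names, not the repository's; the sketch relies on Go's unsigned integers wrapping to zero on overflow.

    ```go
    package main

    import (
    	"fmt"
    	"math"
    	"sync/atomic"
    )

    // idAllocator hands out monotonically increasing transaction IDs.
    type idAllocator struct {
    	last atomic.Uint64
    }

    func (a *idAllocator) next() uint64 {
    	id := a.last.Add(1)
    	if id == 0 {
    		// The 64-bit counter wrapped: freshly issued IDs would appear
    		// older than live ones, so we cannot safely continue.
    		panic("transaction ID overflow")
    	}
    	return id
    }

    func main() {
    	var a idAllocator
    	fmt.Println(a.next(), a.next()) // IDs increase monotonically
    	a.last.Store(math.MaxUint64)    // simulate approaching saturation
    	defer func() { fmt.Println("recovered:", recover()) }()
    	a.next() // the next allocation wraps to zero and panics
    }
    ```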

  • The HTTP server does not watch for changes to the file storing the X.509 serving certificate and reload it when it changes. If the certificate is due to expire and we issue a replacement, we have to stop and restart the server program to allow it to use the new certificate. We could integrate the controller-runtime library's certwatcher package to address this need.
