prakash-p-3121/data-structures-with-spatial-and-temporal-locality-and-lock-free-concurrency-to-build-databases
Data Structures with Spatial Locality, Temporal Locality And Lock Free Concurrency to Build Fast Caches and Databases

I'm going to discuss in detail how we can build very fast distributed caches and distributed databases using three key principles. The following are the principles -

 1. Lock Free Concurrent Data Structures
 2. Building Spatial Locality Based Data Structures
    Section 2.1 - Spatial Locality Driven Data Structures - I: HashMap Based Spatial Locality
    Section 2.2 - Spatial Locality Driven Data Structures - II: User Defined Memory Pool and User Defined Cache Line Based Data Structures - with Linked List Example
 3. Building Temporal Locality Based Data Structures

Section 1 - Lock Free Concurrent Data Structures

Lock-based data structures incur the overhead of acquiring and releasing locks, and threads must wait on each other to complete their tasks. This adds to latency and forces caches and databases to deal with thread synchronization issues and deadlocks.

The core building block for creating lock-free data structures is the atomic Compare-and-Swap (CAS) instruction, which the hardware executes atomically without locks.

Atomic Compare and Swap involves the following steps -

1. Read the current value at the memory location the thread is going to access
2. Compute the new value to be written to the same memory location
3. Attempt to atomically swap in the new value if the original value has not changed

I will explain this with Go code -

type AtomicCounter struct {
   value int64
}

func (c *AtomicCounter) Increment() *int64 {
   oldVal := atomic.LoadInt64(&c.value)
   newVal := oldVal + 1
   writeSucceeded := atomic.CompareAndSwapInt64(&c.value, oldVal, newVal)
   if writeSucceeded {
      return &newVal
   }
   pushToQueue(c)
   return nil
}

If the write does not succeed in the above algorithm, we push the write request into a queue, and another thread polls the queue and executes the failed write requests.

The ABA Issue in Atomic Compare and Swap

The ABA issue is a frequent and subtle source of errors in atomic compare-and-swap.

I will explain it with the following example.

type Row struct {
   key uint64
   field1 string
   field2 int64
}

func (v *Row) setRow(newRow *Row) bool {
   oldRow := v
   // Illustrative pseudocode: Go's sync/atomic has no CompareAndSwapStruct;
   // assume a CAS that atomically compares the struct's contents and swaps.
   writeSucceeded := atomic.CompareAndSwapStruct(v, oldRow, newRow)
   if writeSucceeded {
      return true
   }
   pushToQueue(newRow)
   return false
}

Consider the following flow of control

Step 1 -> thread 1 reads the key value as 25
Step 2 -> thread 1 is interrupted
Step 3 -> thread 2 updates the key as 26 along with updates to field1 and
field2
Step 4 -> thread 3 updates the key back to 25 with new updates to field1 and
field2
Step 5-> thread 1 resumes and invokes atomic.CompareAndSwap()

It will succeed now because the latest key is 25.

But thread 1 has missed the intermediate updates made by threads 2 and 3, so its atomic compare-and-swap wrongly succeeds.

Solving the ABA Issue by Versioning Every Update

type Version struct {
   versionCount uint64
   versionTimestamp uint64
}
type Row struct {
   key uint64
   field1 string
   field2 int64
   version *Version
}

Every set function updates the version counter and sets the version timestamp to the current time whenever it updates the key and fields of the row. In this algorithm the atomic compare-and-swap succeeds if and only if the key and both version fields, versionCount and versionTimestamp, are equal.

For example

Step 1 - Thread 1 reads
{
   key: 25,
   field1: "1",
   field2: 1,
   version: { versionCount: 1, versionTimestamp: 20 }
}
Step 2 - Thread 1 is interrupted
Step 3 - Thread 2 updates
{
   key: 26,
   field1: "2",
   field2: 2,
   version: { versionCount: 2, versionTimestamp: 21 }
}
Step 4 - Thread 3 updates
{
   key: 25,
   field1: "3",
   field2: 3,
   version: { versionCount: 3, versionTimestamp: 22 }
}
Step 5 - Thread 1 resumes and executes the atomic compare-and-swap, but it does not succeed, because in this algorithm we also check versionCount and versionTimestamp. The key 25 is the same in Step 1 and Step 4, but Step 1 read version { versionCount: 1, versionTimestamp: 20 } whereas the row now has version { versionCount: 3, versionTimestamp: 22 }, so the atomic compare-and-swap fails.
Now, the request in thread 1 is pushed to a queue and retried. That covers atomic compare-and-swap and how we can solve the ABA issue.
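The versioning scheme can be sketched in Go. Since sync/atomic cannot compare-and-swap a whole struct, this sketch CASes an atomic.Pointer to an immutable Row and bumps the version on every update; `rowRef` and `setRow` here are assumptions for illustration, not part of the original text:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type Version struct {
	versionCount     uint64
	versionTimestamp uint64
}

type Row struct {
	key     uint64
	field1  string
	field2  int64
	version Version
}

// rowRef holds the current row; atomic.Pointer gives us a word-sized CAS target.
var rowRef atomic.Pointer[Row]

// setRow succeeds only if nobody replaced the row (and hence bumped the
// version) since `old` was read. Each update allocates a fresh Row, so the
// pointer CAS also catches A->B->A key changes.
func setRow(old *Row, key uint64, f1 string, f2 int64) bool {
	next := &Row{
		key:    key,
		field1: f1,
		field2: f2,
		version: Version{
			versionCount:     old.version.versionCount + 1,
			versionTimestamp: uint64(time.Now().UnixNano()),
		},
	}
	return rowRef.CompareAndSwap(old, next)
}

func main() {
	rowRef.Store(&Row{key: 25, field1: "1", field2: 1,
		version: Version{versionCount: 1, versionTimestamp: 20}})

	stale := rowRef.Load()            // thread 1 reads
	setRow(rowRef.Load(), 26, "2", 2) // thread 2 updates
	setRow(rowRef.Load(), 25, "3", 3) // thread 3 sets the key back to 25
	ok := setRow(stale, 27, "x", 9)   // thread 1 resumes with its stale read
	fmt.Println(ok)                   // prints false: the ABA update is caught
}
```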

Section 2.1 - Spatial Locality Driven Data Structures - I: HashMap Based Spatial Locality

In this section I am going to discuss in detail how to enhance the spatial locality of various data structures.

Spatial Locality

Spatial locality is a principle in computer science that describes how programs tend to
access memory locations that are close to each other in address space within a relatively
short interval of time.

If you access memory address Y, you're likely to soon access addresses near Y (like Y+1, Y+2, etc)

Consider the following example -

var array [1000]int
sum := 0
for i := 0; i < 1000; i++ {
   sum += array[i]
}

In this example, the first access is array[0] at memory address (let's say) 2000, the next is array[1] at address 2008, the next array[2] at address 2016, and so on (Go's int is 8 bytes). Each access is to a memory location very close to the previous one. This creates excellent spatial locality.

Why This Matters - The Cache Line

When the CPU first accesses array[0], the memory system doesn't just load that single
integer; it loads the entire cache line (for example 64 bytes - the exact length
varies) containing array[0] through array[7] (eight 8-byte integers).

Now when the loop accesses array[1], array[2], and so on, they are already in the CPU
cache, making the code execute very fast.
A cache line is the smallest unit of data that can be transferred between main memory
and the CPU cache. It's a fixed-size chunk of contiguous memory that the cache system
handles as a single unit.

The CPU has three levels of caches: the L1 cache (fastest), L2 cache (second fastest), and L3 cache (third fastest), with main memory fourth. By building cache-line-friendly data structures we keep elements in the L1/L2/L3 caches for fast data access.

In the following example -

var array [1000]int
x := array[10]

When the CPU accesses array[10] and it's not in cache, the memory system locates the cache line containing array[10] and loads that entire line, bringing in array[10]'s surrounding elements and not just array[10].

The cache line is the fundamental unit that determines whether your memory access will be fast (cache hits) or slow (cache misses).

Example with bad spatial locality -

sum := int(0)
for i:=0; i<1000; i+=100 {
   sum += array[i]
}

The above code has bad spatial locality because each access jumps far away in memory, across multiple cache lines, missing the cache and requiring a slow memory access every time.

Arrays - the data structure with the best spatial locality

Arrays are the data structure with the best spatial locality. As array elements are stored sequentially, they can be loaded into cache lines together, and accessing nearby elements is very fast.

But if you take pointer-based dynamic data structures like linked lists, trees, tries, and graphs, their elements are not stored sequentially; they are scattered across main memory.

So adjacent elements cannot be loaded into the CPU cache together, as they are not in the same cache line. This results in cache misses and consequently slower code.

Core Idea Behind Spatial Locality Driven Data Structures -

Build Every Data Structure Using Arrays Instead Of Using Pointers To Next Element/Child Elements

So, what if we build pointer-based dynamic data structures like linked lists, trees, graphs, and tries using arrays, to enhance their spatial locality and their cache-line hits?

Yes. It will definitely reduce the request latencies.

But there is an issue with arrays: they do not support efficient dynamic inserts and deletes, whereas trees, linked lists, graphs, and other dynamic data structures do.

For example, if we delete elements in the middle of an array, we cannot free that memory and allow it to be reused by other data structures.

There are two ways to build arrays with dynamic inserts and dynamic deletes -

1. Multi Level Open Addressing and Consistent Hashing Based HashMaps

2. User Defined Memory Pool and User Defined Cache Line Based Data Structures

We will discuss both of these in detail.

1. Multi Level Open Addressing and Consistent Hashing Based HashMaps

Solving Dynamic Inserts and Dynamic Deletes In Arrays -

The answer is to build arrays on top of hash maps to support dynamic deletes and dynamic inserts.

Hash maps with separate-chaining collision resolution are bad at spatial locality, as they use linked lists to chain colliding keys.

Hash maps with open addressing techniques such as linear probing, quadratic probing, and double hashing are great at spatial locality. So I have decided to use an open-addressing-based algorithm.

Solving Rehashing in Hashmaps

The problem with hash maps is rehashing. As new elements are added to the hash map we have to rehash to maintain read latency, but rehashing is inefficient and costs both memory and CPU time.

Rehashing can be greatly reduced using consistent-hashing-based hash maps. Each element in the consistent hash ring is an open-addressing hash map. When a hash map's load factor grows beyond the threshold, we add a new hash map as its adjacent clockwise element in the ring and move the keys in the new hash map's key range from the overloaded hash map to the new one.

Finalizing how we are going to build arrays using hashmaps -

We will create two levels of hashmaps -

Level1 - Open Addressing based HashMap
Level2 - consistent hashing ring with each element in the ring
being an open addressing based hashmap

Algorithm to look up a key in the consistent hashing based open addressing hashmap

hashVal := hash(key)
// start range of each element in the consistent hash ring
startRangeList := [0, 100, 200, 300, 400, 500, 600]
nextClockwiseElementIndex := binarySearch(startRangeList, hashVal)
hashMap := ring[nextClockwiseElementIndex] // the open-addressing hash map at that index
nextHashVal := secondHash(key)
return hashMap[nextHashVal]

To look up an element we do a binary search in the start range list, costing O(log N) where N is the number of hash maps in the consistent hash ring, followed by a second hashing to find the element in the selected hash map.
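A hedged Go sketch of this lookup, using sort.Search for the binary search over the start ranges; plain Go maps stand in for the open-addressing hash maps, and the struct and field names are assumptions:

```go
package main

import (
	"fmt"
	"sort"
)

// ring models a consistent hash ring where the element at index i owns the
// hash range starting at startRanges[i]; buckets[i] is that element's hash
// map (a plain map standing in for an open-addressing table).
type ring struct {
	startRanges []int
	buckets     []map[string]int
}

// lookup finds the ring element whose range covers hashVal, then probes
// that element's hash map for the key.
func (r *ring) lookup(key string, hashVal int) (int, bool) {
	// sort.Search returns the first start range strictly greater than
	// hashVal; the owning element is the one just before it.
	i := sort.Search(len(r.startRanges), func(i int) bool {
		return r.startRanges[i] > hashVal
	})
	if i == 0 {
		i = len(r.startRanges) // wrap around the ring
	}
	v, ok := r.buckets[i-1][key]
	return v, ok
}

func main() {
	r := &ring{
		startRanges: []int{0, 100, 200, 300, 400, 500, 600},
		buckets:     make([]map[string]int, 7),
	}
	for i := range r.buckets {
		r.buckets[i] = map[string]int{}
	}
	r.buckets[2]["user:42"] = 7 // pretend hash("user:42") == 250
	v, ok := r.lookup("user:42", 250)
	fmt.Println(v, ok) // prints 7 true
}
```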

To avoid the binary search, we can have two levels of hashmaps

Level 1 - a single open-addressing hash map storing only a small number of the most
frequently accessed keys and their values.
Level 2 - the consistent hash ring of open-addressing hash maps, storing the less
frequently accessed keys; it is slower than the Level 1 hash map because of the
binary search we do over the startRangeList.

Whenever a key is read, we first check the Level1 HashMap.

If it exists in Level1 HashMap, we return it from there. Else, we retrieve it from Level2 hashmap (if the key does not exist in Level2 HashMap also, we return false/null)

Whenever we read a key from the Level 2 hash map, it is written to the Level 1 hash map; during that write, the new key replaces the existing key that was least recently used (LRU).

The evicted key moves down to the Level 2 hash map.
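The two-level read path just described can be sketched as follows. The scan-based LRU eviction here is a deliberate simplification standing in for a real LRU list, and plain Go maps stand in for both the open-addressing hash map and the consistent hash ring:

```go
package main

import "fmt"

// twoLevel sketches the read path: a small "hot" level-1 map consulted first,
// with a larger level-2 store behind it. All names are assumptions.
type twoLevel struct {
	level1   map[string]int
	lastUsed map[string]int64 // naive recency tracking for eviction
	level2   map[string]int   // stands in for the consistent-hash ring
	l1Cap    int
	clock    int64
}

func (t *twoLevel) Get(key string) (int, bool) {
	t.clock++
	if v, ok := t.level1[key]; ok { // hot path: level-1 hit
		t.lastUsed[key] = t.clock
		return v, true
	}
	v, ok := t.level2[key]
	if !ok {
		return 0, false // in neither level
	}
	t.promote(key, v) // level-2 hit: move the key up to level 1
	return v, true
}

// promote inserts key into level 1, demoting the least recently used
// resident back to level 2 if level 1 is full.
func (t *twoLevel) promote(key string, v int) {
	if len(t.level1) >= t.l1Cap {
		victim, oldest := "", int64(1<<62)
		for k := range t.level1 {
			if t.lastUsed[k] < oldest {
				victim, oldest = k, t.lastUsed[k]
			}
		}
		t.level2[victim] = t.level1[victim]
		delete(t.level1, victim)
		delete(t.lastUsed, victim)
	}
	t.level1[key] = v
	t.lastUsed[key] = t.clock
	delete(t.level2, key)
}

func main() {
	t := &twoLevel{level1: map[string]int{}, lastUsed: map[string]int64{},
		level2: map[string]int{"a": 1, "b": 2}, l1Cap: 1}
	v, _ := t.Get("a") // miss in level 1, promoted from level 2
	fmt.Println(v)     // prints 1
	t.Get("b")         // promotes b, demotes a back to level 2
	_, inL2 := t.level2["a"]
	fmt.Println(inL2) // prints true
}
```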

Conclusion -

Core Building Block of Arrays With Dynamic Inserts and Dynamic Deletes

We are going to build arrays with dynamic inserts and dynamic deletes
using two levels of HashMaps -

Level1 (Open Addressing Based HashMaps) and
Level2 (Consistent HashRing With Open Addressing Based HashMaps)

We will discuss User Defined Memory Pool and User Defined Cache Line Based Data Structures next

Section 2.2 - Spatial Locality Driven Data Structures - II: User Defined Memory Pool and User Defined Cache Line Based Data Structures - with Linked List Example

I will now discuss how to build a linked list with spatial locality using arrays. A linked list is a dynamic data structure that supports dynamic inserts and deletes. I am going to describe how we can build it using arrays while enhancing cache-line hits. Assume that I'm going to build an integer linked list.

Core Idea -


We are going to allocate a memory pool for each data structure
we are going to build and manually manage availability and
unavailability of memory in that memory pool

Memory pool data structure is as follows -

type CacheLine struct {
  elements [8]int // eight 8-byte ints fill one 64-byte cache line
}

type MemoryPool struct {
  cacheLineList *List
}
type Memory struct {
  memoryPoolList *List
}

Non-Spatial-Locality LinkedList for storing the LinkedList of Memory Pools and the LinkedList of CacheLines

type ListNode struct {
  data interface{}
  next *ListNode
}
type List struct {
  head *ListNode
  tail *ListNode
}

SpatialLocality Based LinkedList for building an array based cache line efficient linked list

type SpatialLinkedListNode struct {
  cacheLineAlignedIntegerList *CacheLine
  next *SpatialLinkedListNode
} 

type SpatialLinkedList struct {
  head *SpatialLinkedListNode
  tail *SpatialLinkedListNode
  deletedMemoryLocations map[unsafe.Pointer]struct{}
}

Challenge

In the above spatial-locality-based, cache-line-efficient linked list, how can we add new elements in the middle of the list?

Whenever we want to add a new element in the middle of a CacheLine struct, we split that cache line into two and either add the new element at the end of the first line, at the beginning of the second line, or place it in a brand-new cache line that then also receives subsequent incoming elements.
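The splitting strategy above can be sketched with a simplified node that tracks how many slots of its cache-line-sized array are in use; the `node` type, `used` field, and `insertAt` helper are assumptions for illustration:

```go
package main

import "fmt"

const lineCap = 8 // eight 8-byte Go ints fill one 64-byte cache line

// node packs its elements into one cache-line-sized array; used counts the
// occupied slots.
type node struct {
	elements [lineCap]int
	used     int
	next     *node
}

// insertAt inserts v at position pos inside n. If the line is full, n is
// split at pos into two lines and v goes at the end of the first one,
// matching the splitting strategy described above.
func insertAt(n *node, pos, v int) {
	if n.used < lineCap { // room left: shift the tail right by one
		copy(n.elements[pos+1:n.used+1], n.elements[pos:n.used])
		n.elements[pos] = v
		n.used++
		return
	}
	// full: move elements[pos:] into a fresh line, then append v to n
	second := &node{next: n.next}
	second.used = copy(second.elements[:], n.elements[pos:])
	n.used = pos
	n.elements[n.used] = v
	n.used++
	n.next = second
}

func main() {
	n := &node{used: lineCap}
	for i := 0; i < lineCap; i++ {
		n.elements[i] = i * 10
	}
	insertAt(n, 4, 99) // split: n keeps [0..30]+99, the new node holds [40..70]
	fmt.Println(n.elements[:n.used])           // prints [0 10 20 30 99]
	fmt.Println(n.next.elements[:n.next.used]) // prints [40 50 60 70]
}
```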

Building a Generic Cache Line Efficient Linked List vs Usecase Specific Cache Line Efficient Linked List

In the above example, I described the use case of adding a new element in the middle of the linked list. This requires dedicated handling and dedicated algorithms to keep the data structure cache-line efficient.

But in cases that don't need this dedicated handling, the request should not flow through the code that handles these special cases. This reduces the number of instructions executed per incoming request to the cache/database server and reduces latency.

We will have to build different linked list for different use cases, but the core data structures

type CacheLine struct,
type MemoryPool struct,
type Memory struct,
type SpatialLinkedListNode struct,
type SpatialLinkedList struct

will remain the same. We will augment the SpatialLinkedListNode/SpatialLinkedList data structures with new fields to handle those use cases, as we did by introducing deletedMemoryLocations map[unsafe.Pointer]struct{} in the SpatialLinkedList struct.

We will never use a single linked list implementation for all use cases unless they all have the same behavior.

While we develop different linked list data structures by augmenting the existing structs, and while we develop new algorithms to build and query those new structs, we have to ensure that code is not duplicated.

Section 3 - Temporal Locality Driven Data Structures

In this section we will discuss how to enhance temporal locality in data structures.

Temporal Locality

Temporal Locality is a principle in computer systems that refers to the tendency of a
program to access the same memory locations or data repeatedly within a short interval
of time.

Least Recently Used (LRU) Cache Based Algorithm to Enhance Temporal Locality

Take the case of a balanced binary search tree: its search function has O(log N) time complexity. To improve temporal locality, we can maintain an LRU cache alongside the binary search tree to store its frequently accessed items.

The LRU cache will have a fixed number of elements, let’s say 10.

If the current size is 0 and a new element comes in, it is added to the cache. The cache size becomes 1, and the next element is also added to the cache.

This continues until the cache holds 10 elements. When the next element comes in, the least recently used element in the LRU cache is deleted and the new element is added.

An LRU cache can be implemented using a doubly linked list and a hash map. Caching frequently accessed data in the hash map reduces the search latency from O(log N) in the binary search tree to O(1).
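A minimal sketch of such an LRU cache, using Go's container/list for the recency order and a map for O(1) lookup; the capacity and integer key/value types are assumptions:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key, val int
}

// LRU pairs a hash map (direct access) with a doubly linked list ordering
// entries from most to least recently used.
type LRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[int]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: map[int]*list.Element{}}
}

func (c *LRU) Get(key int) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el) // touched: now most recently used
	return el.Value.(*entry).val, true
}

func (c *LRU) Put(key, val int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity { // full: evict the least recently used
		back := c.order.Back()
		delete(c.items, back.Value.(*entry).key)
		c.order.Remove(back)
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
}

func main() {
	c := NewLRU(2)
	c.Put(1, 10)
	c.Put(2, 20)
	c.Get(1)     // touch 1 so 2 becomes least recently used
	c.Put(3, 30) // evicts 2
	_, ok := c.Get(2)
	fmt.Println(ok) // prints false
	v, _ := c.Get(1)
	fmt.Println(v) // prints 10
}
```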

We can implement LRU caches for different data structures to enhance temporal locality based on the algorithm we are implementing.

Section 4 - Generic Ideas

1. Sharding data structures within a single database/cache server

   Shard data structures within a single database/cache server using range
   sharding, consistent hashing, or other sharding algorithms to reduce the search
   space, i.e. the number of elements the algorithm has to traverse to return the result.
2. Write synchronously to the memtable / in-memory data structures plus a Write-Ahead Log / binlog, and
   write asynchronously to disk-based data structures, with reconciliation algorithms to sync the data
   across memory, WAL/binlogs, and disk.

3. Batch writes to speed them up, using the memtable-and-WAL-based algorithm.
4. Use append-only logs / append-only files to write data. Shard each file into pages/shards.
5. Use an I/O processor to reduce the load on the CPU. Disk I/O and network I/O are handled by I/O processors,
   leaving the CPU free to handle the remaining tasks.
6. Use a trie-based memtable and SSTable for writing data structures to files and storing them in memory.
7. Zero-Copy Networking: 
   This technique minimizes data copying between the kernel and user space during network transfers, 
   allowing applications to directly access data in network buffers.
8. RDMA (Remote Direct Memory Access): RDMA-enabled network protocols like InfiniBand,
   iWARP, and RoCE allow direct data transfer between the memory of connected machines, 
   bypassing the CPU and TCP/IP stack.
9. Multi Version Concurrency Control (MVCC): Databases use MVCC, but new versions and optimizations further enhance its efficiency.
   By keeping multiple versions of data, readers don't block writers, and vice-versa,
   improving read/write concurrency and reducing latency
10. Using GPUs to reduce latency of database queries
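Idea 1 above, sharding within a single server, can be sketched as a hash-sharded map: each lookup only touches one shard, shrinking the search space. The shard count and the FNV hash choice are assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardedMap splits one logical map into n independent shards inside a
// single server, so each operation only touches 1/n of the keyspace.
type shardedMap struct {
	shards []map[string]int
}

func newShardedMap(n int) *shardedMap {
	s := &shardedMap{shards: make([]map[string]int, n)}
	for i := range s.shards {
		s.shards[i] = map[string]int{}
	}
	return s
}

// shardFor hashes the key to pick its shard deterministically.
func (s *shardedMap) shardFor(key string) map[string]int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

func (s *shardedMap) Put(key string, v int) { s.shardFor(key)[key] = v }

func (s *shardedMap) Get(key string) (int, bool) {
	v, ok := s.shardFor(key)[key]
	return v, ok
}

func main() {
	s := newShardedMap(8)
	s.Put("user:1", 100)
	v, ok := s.Get("user:1")
	fmt.Println(v, ok) // prints 100 true
}
```

In a real server each shard would also carry its own lock or lock-free structure, so contention shrinks along with the search space.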
