Data Structures with Spatial Locality, Temporal Locality And Lock Free Concurrency to Build Fast Caches and Databases
I’m going to discuss in detail how we can build very fast distributed caches and distributed databases using three key principles. The following are the principles -
1. Lock Free Concurrent Data Structures
2. Building Spatial Locality Based Data Structures
Section 2.1 - Spatial Locality Driven Data Structures - I: HashMap Based Spatial Locality
Section 2.2 - Spatial Locality Driven Data Structures - II: User Defined Memory Pool and User Defined Cache Line Based Data Structures - with Linked List Example
3. Building Temporal Locality Based Data Structures
Building lock-based data structures results in the overhead of acquiring and releasing locks, and of threads waiting on each other to complete their tasks. This adds to the latency, and we also have to deal with thread synchronization issues and deadlocks in caches and databases.
The core building block of creating lock free data structures is the Atomic Compare and Swap functionality, which is implemented as a hardware-level atomic instruction.
Atomic Compare and Swap involves the following steps -
1. Read the current value in the memory the thread is going to access
2. Compute the new value that has to be updated to the same memory location
3. Attempt to atomically swap if the original value has not changed
I will explain this with Go code -

type AtomicCounter struct {
	value int64
}

func (c *AtomicCounter) Increment() *int64 {
	oldVal := atomic.LoadInt64(&c.value)
	newVal := oldVal + 1
	writeSucceeded := atomic.CompareAndSwapInt64(&c.value, oldVal, newVal)
	if writeSucceeded {
		return &newVal
	}
	pushToQueue(c)
	return nil
}
If the write does not succeed in the above algorithm, we push the write request into a queue, and another thread will poll the queue and execute the failed write requests.

ABA Issue in Atomic Compare and Swap
The ABA issue is a frequent and error-prone problem in atomic compare and swap.
I will explain it with the following example.
type Row struct {
	key    uint64
	field1 string
	field2 int64
}

func (v *Row) setRow(newRow *Row) bool {
	oldRow := v
	// atomic.CompareAndSwapStruct is illustrative pseudocode - Go's
	// sync/atomic cannot swap a whole struct; real code would use
	// atomic.CompareAndSwapPointer or atomic.Pointer[Row]
	writeSucceeded := atomic.CompareAndSwapStruct(v, oldRow, newRow)
	if writeSucceeded {
		return true
	}
	pushToQueue(newRow)
	return false
}
Consider the following flow of control
Step 1 -> thread 1 reads the key value as 25
Step 2 -> thread 1 is interrupted
Step 3 -> thread 2 updates the key to 26 along with updates to field1 and field2
Step 4 -> thread 3 updates the key back to 25 with new updates to field1 and field2
Step 5 -> thread 1 resumes and invokes atomic.CompareAndSwap()
It will succeed now because the latest key is 25.
But thread 1 has missed the intermediate updates made by thread 2 and thread 3, and so it wrongly executes the atomic compare and swap.

ABA issue can be solved by versioning every update
type Version struct {
	versionCount     uint64
	versionTimestamp uint64
}

type Row struct {
	key     uint64
	field1  string
	field2  int64
	version *Version
}
Every set function increments the version counter and sets the version timestamp to the current timestamp whenever it updates the key and fields of the row. In this algorithm, the atomic compare and swap succeeds if and only if the key and both version fields (version count and version timestamp) are equal.
For example -
Step 1 - Thread 1 reads
{
key: 25,
field1: "1",
field2: 1,
version: { versionCount: 1, versionTimestamp: 20 }
}
Step 2 - Thread 1 is interrupted
Step 3 - Thread 2 updates
{
key: 26,
field1: "2",
field2: 2,
version: { versionCount: 2, versionTimestamp: 21 }
}
Step 4 - Thread 3 updates
{
key: 25,
field1: "3",
field2: 3,
version: { versionCount: 3, versionTimestamp: 22 }
}
Step 5 - Thread 1 resumes and executes the atomic compare and swap function. But it will not succeed, because in this algorithm we also check versionCount and versionTimestamp.
The key 25 is the same in both Step 1 and Step 4. But Step 1 has version { versionCount: 1, versionTimestamp: 20 } whereas Step 4 has version { versionCount: 3, versionTimestamp: 22 }, so the atomic compare and swap will not succeed.
Now, the request in thread 1 will be pushed to a queue and retried. So, this is Atomic Compare and Swap and how we can solve the ABA issue.
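The versioned compare-and-swap above can be sketched in Go using atomic.Pointer (Go 1.19+), which swaps the whole row snapshot through a pointer. The VersionedRow wrapper and SetRow signature are assumptions of this sketch, not part of the original structs:

```go
package main

import "sync/atomic"

// Version and Row follow the structs in the text; VersionedRow wraps a
// row behind an atomic pointer so the whole snapshot (key, fields,
// version) is swapped in one CAS.
type Version struct {
	versionCount     uint64
	versionTimestamp uint64
}

type Row struct {
	key     uint64
	field1  string
	field2  int64
	version Version
}

type VersionedRow struct {
	ptr atomic.Pointer[Row]
}

// SetRow attempts the swap only if the key and both version fields are
// unchanged since `old` was read, as the algorithm prescribes;
// otherwise the caller must retry or queue the write.
func (v *VersionedRow) SetRow(old *Row, newRow *Row) bool {
	cur := v.ptr.Load()
	if cur.key != old.key ||
		cur.version.versionCount != old.version.versionCount ||
		cur.version.versionTimestamp != old.version.versionTimestamp {
		return false // an intermediate update happened (ABA detected)
	}
	return v.ptr.CompareAndSwap(cur, newRow)
}
```

Because every update installs a freshly allocated *Row with a new version, a stale snapshot can never pass the check even when the key value has cycled back.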
In this section I am going to discuss in detail how to enhance the spatial locality of various data structures.
Spatial locality is a principle in computer science that describes how programs tend to
access memory locations that are close to each other in address space within a relatively
short interval of time.
If you access memory address Y, you're likely to soon access addresses near Y (like Y+1, Y+2, etc)
Consider the following example -
var array [1000]int
sum := 0
for i := 0; i < 1000; i++ {
	sum += array[i]
}
In this example (assuming 4-byte integers) -
First access: array[0] at memory address (let's say) 2000
Next access: array[1] at memory address 2004
Next access: array[2] at memory address 2008, and so on.
Each access is to a memory location very close to the previous one. This creates excellent spatial locality.
When the CPU first accesses array[0], the memory system doesn't just load that single integer; it loads the entire cache line (for example, 64 bytes - the exact length varies by CPU) containing array[0] through array[15].
Now when the loop accesses array[1], array[2], etc., they are already in the CPU cache, making the code execute very fast.
A cache line is the smallest unit of data that can be transferred between main memory and the CPU cache. It's a fixed-size chunk of contiguous memory that the cache system handles as a single unit.
The CPU has three levels of caches: L1 (fastest), L2 (second fastest), and L3 (third fastest), followed by main memory (slowest). By building cache-line-friendly data structures, we load elements into the L1/L2/L3 caches for fast data access.
In the following example -
var array [1000]int
x := array[10]
When the CPU accesses array[10] and it's not in the cache, the memory system finds the cache line containing array[10] and loads that entire cache line - array[10] together with its surrounding elements, not just array[10].
The cache line is the fundamental unit that determines whether your memory access will be fast (cache hits) or slow (cache misses).
sum := 0
for i := 0; i < 1000; i += 100 {
	sum += array[i]
}
The above code has bad spatial locality because each access jumps far away in memory, across multiple cache lines, missing the cache and requiring a slow main memory access every time.
Arrays are the data structure with the best spatial locality. As array elements are stored sequentially, they can be loaded into cache lines, and accessing nearby elements is very fast.
But if you take pointer-based dynamic data structures like linked lists, trees, tries, and graphs, their elements are not stored sequentially but are scattered across main memory. So, adjacent elements cannot be loaded into the CPU cache together, as they are not in the same cache line. This results in cache misses and subsequently slower code.
So, what if we build pointer based dynamic data structures like linked lists, trees, graphs, tries etc.. using array to enhance their spatial locality and enhance their cache line hits ?
Yes. It will definitely reduce the request latencies.
But, there is an issue with arrays.
Arrays do not support efficient dynamic inserts and efficient dynamic deletes, whereas trees, linked lists, graphs, and other dynamic data structures do support dynamic inserts and dynamic deletes.
For example, if we delete an element in the middle of an array, we will not be able to free that memory and allow it to be used to build other data structures.
There are two ways to build arrays with dynamic inserts and dynamic deletes - hashmap based arrays (Section 2.1) and user defined memory pool and cache line based data structures (Section 2.2).
We will discuss both of these in detail.
Solving Dynamic Inserts and Dynamic Deletes In Arrays -
The answer is to build arrays using hash maps to support dynamic deletes and dynamic inserts.
Hash maps with separate chaining based collision resolution are bad at spatial locality, as they use linked lists for chaining the colliding keys.
Hashmaps with open addressing techniques like linear probing, quadratic probing, and double hashing are great at spatial locality. So I have decided to select an open addressing based algorithm.
The problem with hashmaps is rehashing. When new elements get added to the hashmap, we have to rehash to maintain the read latency. But rehashing is inefficient and costs both memory and CPU time.
Rehashing can be reduced to a great extent using consistent hashing based hash maps. Each element in the consistent hash ring is an open addressing based hash map. When the load factor of a hash map increases beyond the threshold, we add a new hashmap as its adjacent clockwise element in the consistent hashing ring and move the keys in the new hashmap's key range from the overloaded hashmap to the new hashmap.
Finalizing how we are going to build arrays using hashmaps -
We will create two levels of hashmaps -
Level1 - Open Addressing based HashMap
Level2 - consistent hashing ring with each element in the ring
being an open addressing based hashmap
hashVal := hash(key)
// startRangeList holds the start range of each element in the consistent hash ring
startRangeList := []int{0, 100, 200, 300, 400, 500, 600}
nextClockwiseElementIndex := binarySearch(startRangeList, hashVal)
// select the open addressing based hashmap at nextClockwiseElementIndex
nextHashVal := secondHash(key)
return hashTable[nextHashVal]
To look up an element in the array, we do a binary search in the start range list, which takes O(log N) time where N is the number of hash maps in the consistent hash ring, followed by a second hashing to find the element in the selected hashmap.
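The binary search over the start ranges can be sketched with the standard library's sort.SearchInts. The function name and the example ranges are illustrative, not part of the original pseudocode:

```go
package main

import "sort"

// pickRingIndex locates the ring element (an open addressing hashmap)
// responsible for hashVal by binary searching the sorted start ranges
// of the consistent hash ring.
func pickRingIndex(startRangeList []int, hashVal int) int {
	// sort.SearchInts returns the first index whose start range
	// exceeds hashVal; the element just before it owns the key.
	i := sort.SearchInts(startRangeList, hashVal+1)
	return i - 1
}
```

For example, with start ranges [0, 100, 200, ...], a hash value of 150 falls in the hashmap whose range starts at 100.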
To avoid the binary search, we can have two levels of hashmaps
Level 1 - Just a single open addressing based hashmap where the most frequently accessed keys and their values are stored. Only a small number of keys are stored in the Level 1 hashmap.
Level 2 - Consistent hash ring based open addressing hashmap, where the less frequently accessed keys are stored, as this level is slower than the Level 1 hashmap due to the binary search we do in the startRangeList.
Whenever a key is read, we first check the Level 1 hashmap.
If it exists in the Level 1 hashmap, we return it from there. Else, we retrieve it from the Level 2 hashmap (if the key does not exist in the Level 2 hashmap either, we return false/null).
Whenever we read a key from the Level 2 hashmap, it will be written to the Level 1 hashmap, and during that write the new key will replace the existing key that was least recently used (LRU).
The evicted key will move to the Level 2 hashmap.
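The promote-on-read path described above can be sketched as follows. This is a minimal sketch under stated assumptions: both levels are plain Go maps (the Level 2 map stands in for the consistent hash ring), and the LRU eviction is simplified to evicting an arbitrary Level 1 entry rather than tracking recency:

```go
package main

// TwoLevelMap sketches the two-level read path: a small Level 1 map
// for hot keys and a Level 2 map standing in for the consistent hash
// ring of open addressing hashmaps.
type TwoLevelMap struct {
	level1   map[uint64]string
	level2   map[uint64]string
	l1MaxLen int
}

func NewTwoLevelMap(l1MaxLen int) *TwoLevelMap {
	return &TwoLevelMap{
		level1:   make(map[uint64]string),
		level2:   make(map[uint64]string),
		l1MaxLen: l1MaxLen,
	}
}

// Get checks Level 1 first; on a Level 2 hit the key is promoted to
// Level 1 and, if Level 1 is full, a victim is demoted to Level 2.
func (m *TwoLevelMap) Get(key uint64) (string, bool) {
	if v, ok := m.level1[key]; ok {
		return v, true
	}
	v, ok := m.level2[key]
	if !ok {
		return "", false // key exists in neither level
	}
	if len(m.level1) >= m.l1MaxLen {
		for victim, victimVal := range m.level1 { // demote one entry
			delete(m.level1, victim)
			m.level2[victim] = victimVal
			break
		}
	}
	delete(m.level2, key)
	m.level1[key] = v // promote the hot key to Level 1
	return v, true
}
```

A real implementation would replace the arbitrary victim selection with proper LRU bookkeeping, as described in the temporal locality section.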
Conclusion -
We are going to build arrays with dynamic inserts and dynamic deletes
using two levels of HashMaps -
Level1 (Open Addressing Based HashMaps) and
Level2 (Consistent HashRing With Open Addressing Based HashMaps)
We will discuss User Defined Memory Pool and User Defined Cache Line Based Data Structures next
Section 2.2 - Spatial Locality Driven Data Structures - II: User Defined Memory Pool and User Defined Cache Line Based Data Structures - with Linked List Example
I will now discuss how to build a linked list with spatial locality using arrays. A linked list is a dynamic data structure that supports dynamic inserts and deletes. I am going to describe how we can build it using arrays, enhancing cache line hits. Assume that I’m going to build an integer linked list.
We are going to allocate a memory pool for each data structure
we are going to build and manually manage availability and
unavailability of memory in that memory pool
The memory pool data structures are as follows -

type CacheLine struct {
	elements [16]int // assuming a 64-byte cache line (16 four-byte integers)
}

type MemoryPool struct {
	cacheLineList *List
}

type Memory struct {
	memoryPoolList *List
}
A non-spatial-locality-based linked list for storing the list of memory pools and the list of cache lines -
type ListNode struct {
	data interface{}
	next *ListNode
}

type List struct {
	head *ListNode
	tail *ListNode
}
The spatial locality based linked list -

type SpatialLinkedListNode struct {
	cacheLineAlignedIntegerList *CacheLine
	next *SpatialLinkedListNode
}

type SpatialLinkedList struct {
	head *SpatialLinkedListNode
	tail *SpatialLinkedListNode
	deletedMemoryLocations map[unsafe.Pointer]struct{}
}
In the above spatial-locality-based, cache-line-efficient linked list, how can we add new elements in the middle of the list?
Whenever we want to add a new element in the middle of a CacheLine struct, we split the cache line into two cache lines and either add the new element at the end of the first cache line, add it at the beginning of the second cache line, or create a new cache line holding the new element and allow new incoming elements to be added to that new cache line.
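The split-on-middle-insert can be sketched as follows. This is a simplified sketch: the `used` occupancy counter and the `InsertAt` method are assumptions (the structs in the text do not track occupancy explicitly), and of the three options described above it implements the "new element ends the first half" variant:

```go
package main

// CacheLineNode is a linked list node whose elements live in one
// cache-line-sized array, as in the SpatialLinkedListNode above.
type CacheLineNode struct {
	elements [16]int // assuming a 64-byte cache line of 4-byte ints
	used     int     // number of occupied slots
	next     *CacheLineNode
}

// InsertAt places value at position pos. If the cache line is full, it
// is split into two nodes at pos and the new element is appended to
// the end of the first half.
func (n *CacheLineNode) InsertAt(pos, value int) {
	if n.used < len(n.elements) {
		// room in this cache line: shift the tail right and insert
		copy(n.elements[pos+1:n.used+1], n.elements[pos:n.used])
		n.elements[pos] = value
		n.used++
		return
	}
	// full cache line: split into two nodes at pos
	second := &CacheLineNode{next: n.next}
	second.used = copy(second.elements[:], n.elements[pos:n.used])
	n.used = pos
	n.next = second
	n.elements[n.used] = value // new element ends the first half
	n.used++
}
```

Splitting instead of shifting across nodes keeps each insert O(cache line size) while both halves remain contiguous in memory.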
Building a Generic Cache Line Efficient Linked List vs a Use Case Specific Cache Line Efficient Linked List
In the above example, I described the use case of adding a new element in the middle of the linked list. This requires dedicated handling and dedicated algorithms to ensure that the data structure stays cache line efficient.
But in cases where we don’t need dedicated handling, we should not make that flow go through the code that handles these dedicated cases. This reduces the number of instructions that have to be executed for an incoming request to the cache/database server and reduces the latency.
We will have to build different linked list for different use cases, but the core data structures
type CacheLine struct,
type MemoryPool struct,
type Memory struct,
type SpatialLinkedListNode struct,
type SpatialLinkedList struct
will remain the same. But we will augment the SpatialLinkedListNode/SpatialLinkedList data structures with new fields to handle those use cases, as we did by introducing deletedMemoryLocations map[unsafe.Pointer]struct{} in the SpatialLinkedList struct.
We will never use a single linked list implementation for all use cases unless all the use cases have the same behavior.
While we develop different linked list data structures by augmenting existing structs, and while we develop new algorithms to build and query those new structs, we have to ensure that code is not duplicated.
In this section we will discuss how to enhance temporal locality in data structures.
Temporal Locality is a principle in computer systems that refers to the tendency of a
program to access the same memory locations or data repeatedly within a short interval
of time.
Least Recently Used (LRU) Cache Based Algorithm to Enhance Temporal Locality
Take the case of a balanced binary search tree: its search function has a time complexity of O(log N). To improve the temporal locality of binary search trees, we can build an LRU cache alongside the tree to store its frequently accessed items.
The LRU cache will have a fixed number of elements, let’s say 10.
If the current size is 0 and a new element comes in, it is added to the cache, making the size 1; the next element is added the same way.
This continues until the cache holds 10 elements. When the next element comes in, the least recently used element in the LRU cache is deleted and the new element is added to the cache.
An LRU cache can be implemented using a doubly linked list and a hashmap. Caching frequently accessed data in the hashmap reduces the search latency from the binary search tree's O(log N) to O(1).
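The doubly-linked-list-plus-hashmap design can be sketched with the standard library's container/list, whose List is a doubly linked list. The uint64/string key and value types are illustrative choices:

```go
package main

import "container/list"

// lruEntry is the payload stored in each list node.
type lruEntry struct {
	key   uint64
	value string
}

// LRUCache pairs a doubly linked list (recency order) with a hashmap
// (O(1) key lookup), as described above.
type LRUCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> node in `order`
}

func NewLRUCache(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[uint64]*list.Element),
	}
}

// Get returns a cached value in O(1) and marks it most recently used.
func (c *LRUCache) Get(key uint64) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*lruEntry).value, true
}

// Put inserts or updates a key, evicting the least recently used entry
// (the back of the list) when the cache is full.
func (c *LRUCache) Put(key uint64, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, value: value})
}
```

Both Get and Put are O(1): the hashmap finds the node, and moving a node to the front of a doubly linked list is constant time.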
We can implement LRU caches for different data structures to enhance temporal locality based on the algorithm we are implementing.
Other techniques to build fast caches and databases -
1. Sharding data structures within a single database/cache server
Sharding data structures within a single database/cache server based on range sharding / consistent hashing / other sharding algorithms, to reduce the search space (the number of elements the algorithm has to traverse to return the result).
2. Synchronous writes to the memtable / in-memory data structures plus the Write Ahead Log (WAL) / bin log, and async writes to disk-based data structures, with reconciliation algorithms to sync the data across in-memory structures, WAL/bin logs, and disk
3. Use batching of writes to speed up writes, using the memtable and WAL based algorithm.
4. Use append only logs / append only files to write data. Shard the file into pages/shards
5. Use I/O processors to reduce the load on the CPU. Disk I/O and network I/O will be handled by I/O processors, allowing the CPU to handle the remaining tasks
6. Trie based memtable and sstable for writing data structures to files and storing in-memory
7. Zero-Copy Networking:
This technique minimizes data copying between the kernel and user space during network transfers,
allowing applications to directly access data in network buffers.
8. RDMA (Remote Direct Memory Access): RDMA-enabled network protocols like InfiniBand,
iWARP, and RoCE allow direct data transfer between the memory of connected machines,
bypassing the CPU and TCP/IP stack.
9. Multi Version Concurrency Control (MVCC): Databases use MVCC, but new versions and optimizations further enhance its efficiency.
By keeping multiple versions of data, readers don't block writers, and vice-versa,
improving read/write concurrency and reducing latency
10. Using GPUs to reduce latency of database queries