Commit a9dfe15

Add first draft
1 parent b75c360 commit a9dfe15

8 files changed

+796
-0
lines changed

01-introduction/README.md

+76
@@ -0,0 +1,76 @@
1+
# Introduction
2+
3+
A hash table is a data structure which offers a fast implementation of the
4+
associative array [API](#api). As the terminology around hash tables can be
5+
confusing, I've added a summary [below](#terminology).
6+
7+
A hash table consists of an array of 'buckets', each of which stores a key-value
8+
pair. In order to locate the bucket where a key-value pair should be stored, the
9+
key is passed through a hashing function. This function returns an integer which
10+
is used as the pair's index in the array of buckets. When we want to retrieve a
11+
key-value pair, we supply the key to the same hashing function, receive its
12+
index, and use the index to find it in the array.
13+
14+
Array indexing has algorithmic complexity `O(1)`, making hash tables fast at
15+
storing and retrieving data.
16+
17+
Our hash table will map string keys to string values, but the principles
18+
given here are applicable to hash tables which map arbitrary key types to
19+
arbitrary value types. Only ASCII strings will be supported, as supporting
20+
Unicode is non-trivial and out of scope for this tutorial.
21+
22+
## API
23+
24+
Associative arrays are a collection of unordered key-value pairs. Duplicate keys
25+
are not permitted. The following operations are supported:
26+
27+
- `search(a, k)`: return the value `v` associated with key `k` from the
28+
associative array `a`, or `NULL` if the key does not exist.
29+
- `insert(a, k, v)`: store the pair `k:v` in the associative array `a`.
30+
- `delete(a, k)`: delete the `k:v` pair associated with `k`, or do nothing if
31+
`k` does not exist.
32+
33+
## Setup
34+
35+
To set up C on your computer, please consult [Daniel Holden's](@orangeduck)
36+
guide in the [Build Your Own
37+
Lisp](http://www.buildyourownlisp.com/chapter2_installation) book. Build Your
38+
Own Lisp is a great book, and I recommend working through it.
39+
40+
## Code structure
41+
42+
Code should be laid out in the following directory structure.
43+
44+
```
.
├── build
└── src
    ├── hash_table.c
    ├── hash_table.h
    ├── prime.c
    └── prime.h
```
53+
54+
`src` will contain our code, and `build` will contain our compiled binaries.
55+
56+
## Terminology
57+
58+
There are lots of names which are used interchangeably. In this article, we'll
59+
use the following:
60+
61+
- Associative array: an abstract data structure which implements the
62+
[API](#api) described above. Also called a map, symbol table or
63+
dictionary.
64+
65+
- Hash table: a fast implementation of the associative array API which makes
66+
use of a hash function. Also called a hash map, map, hash or
67+
dictionary.
68+
69+
Associative arrays can be implemented with many different underlying data
70+
structures. A (non-performant) one can be implemented by simply storing items in
71+
an array, and iterating through the array when searching. Associative arrays and
72+
hash tables are often confused because associative arrays are so often
73+
implemented as hash tables.
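
To make the distinction concrete, here is a minimal sketch of the non-performant array-backed version described above. The `aa_` names and the fixed `AA_CAPACITY` are assumptions made for illustration and are not part of the hash table we build later. The sketch stores the caller's pointers rather than copying the strings.

```c
#include <stddef.h>
#include <string.h>

#define AA_CAPACITY 16  /* hypothetical fixed capacity for this sketch */

typedef struct {
    const char* keys[AA_CAPACITY];
    const char* values[AA_CAPACITY];
    size_t count;
} aa_array;

/* Linear scan: returns the slot holding k, or -1 if k is absent. O(n). */
int aa_find(const aa_array* a, const char* k) {
    for (size_t i = 0; i < a->count; i++) {
        if (strcmp(a->keys[i], k) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* search(a, k): the value stored for k, or NULL if k does not exist. */
const char* aa_search(const aa_array* a, const char* k) {
    const int i = aa_find(a, k);
    return i < 0 ? NULL : a->values[i];
}

/* insert(a, k, v): overwrite an existing key, otherwise append.
 * Returns 0 on success, -1 if the array is full. */
int aa_insert(aa_array* a, const char* k, const char* v) {
    const int i = aa_find(a, k);
    if (i >= 0) {
        a->values[i] = v;
        return 0;
    }
    if (a->count == AA_CAPACITY) {
        return -1;
    }
    a->keys[a->count] = k;
    a->values[a->count] = v;
    a->count++;
    return 0;
}

/* delete(a, k): remove k by moving the last pair into its slot. */
void aa_delete(aa_array* a, const char* k) {
    const int i = aa_find(a, k);
    if (i < 0) {
        return;
    }
    a->count--;
    a->keys[i] = a->keys[a->count];
    a->values[i] = a->values[a->count];
}
```

Every operation starts with a linear scan, so each one costs `O(n)`. A hash table keeps the same three operations but replaces the scan with an index computed from the key.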
74+
75+
Next section: [Hash table structure](/hash-table)
76+
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

02-hash-table/README.md

+104
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
# Hash table structure
2+
3+
Our key-value pairs (items) will each be stored in a `struct`:
4+
5+
```c
// hash_table.h
typedef struct ht_item {
    char* key;
    char* value;
} ht_item;
```
12+
13+
Our hash table stores an array of pointers to items, and some details about its
14+
size and how full it is:
15+
16+
```c
// hash_table.h
typedef struct {
    int size;
    int count;
    ht_item** items;
} ht_hash_table;
```
24+
25+
## Initialising and deleting
26+
27+
We need to define an initialisation function for `ht_item`s. This function
28+
allocates a chunk of memory the size of an `ht_item`, and saves a copy of the
29+
strings `k` and `v` in the new chunk of memory. The function is marked as
30+
`static` because it will only ever be called by code internal to the hash table.
31+
32+
```c
// hash_table.c
#include <stdlib.h>
#include <string.h>

#include "hash_table.h"

// Allocate a new item holding private copies of k and v.
static ht_item* ht_new_item(const char* k, const char* v) {
    ht_item* i = malloc(sizeof(ht_item));
    i->key = strdup(k);
    i->value = strdup(v);
    return i;
}
```
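
The function above assumes that `malloc` and `strdup` always succeed. As a hedged sketch of a more defensive variant, the version below returns `NULL` when any allocation fails. The names `ht_copy_string` and `ht_new_item_checked` are hypothetical, and the copy helper is included only because `strdup` comes from POSIX and was not part of ISO C before C23.

```c
#include <stdlib.h>
#include <string.h>

typedef struct ht_item {
    char* key;
    char* value;
} ht_item;

/* Hypothetical helper: a portable stand-in for strdup.
 * Returns NULL on allocation failure. */
char* ht_copy_string(const char* s) {
    const size_t n = strlen(s) + 1;  /* include the terminating '\0' */
    char* copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Variant of ht_new_item that reports failure instead of dereferencing
 * a failed allocation, and frees anything already allocated. */
ht_item* ht_new_item_checked(const char* k, const char* v) {
    ht_item* i = malloc(sizeof(ht_item));
    if (i == NULL) {
        return NULL;
    }
    i->key = ht_copy_string(k);
    i->value = ht_copy_string(v);
    if (i->key == NULL || i->value == NULL) {
        free(i->key);    /* free(NULL) is a no-op */
        free(i->value);
        free(i);
        return NULL;
    }
    return i;
}
```

Callers of this variant would need to check for a `NULL` return. The rest of the tutorial keeps the shorter, unchecked form.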
46+
47+
`ht_new` initialises a new hash table. `size` defines how many items we can
48+
store. This is fixed at 53 for now. We'll expand this in the section on
49+
[resizing](/resizing). We initialise the array of items with `calloc`, which
50+
fills the allocated memory with zero bytes, which mainstream platforms read back as `NULL` pointers. A `NULL` entry in the array
51+
indicates that the bucket is empty.
52+
53+
```c
// hash_table.c
ht_hash_table* ht_new() {
    ht_hash_table* ht = malloc(sizeof(ht_hash_table));

    ht->size = 53;
    ht->count = 0;
    ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
    return ht;
}
```
64+
65+
We also need functions for deleting `ht_item`s and `ht_hash_table`s, which
66+
`free` the memory we've allocated, so we don't cause [memory
67+
leaks](https://en.wikipedia.org/wiki/Memory_leak).
68+
69+
```c
// hash_table.c
static void ht_del_item(ht_item* i) {
    free(i->key);
    free(i->value);
    free(i);
}

void ht_del_hash_table(ht_hash_table* ht) {
    // Free every occupied bucket before freeing the array itself.
    for (int i = 0; i < ht->size; i++) {
        ht_item* item = ht->items[i];
        if (item != NULL) {
            ht_del_item(item);
        }
    }
    free(ht->items);
    free(ht);
}
```
89+
90+
We have written code which defines a hash table, and lets us create and destroy
91+
one. Although it doesn't do much at this point, we can still try it out.
92+
93+
```c
// main.c
#include "hash_table.h"

int main() {
    ht_hash_table* ht = ht_new();
    ht_del_hash_table(ht);
}
```
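
For `main.c` to compile, `hash_table.h` also has to declare the two public functions. The text so far only shows the type definitions, so the layout below is an assumption: the include guard name `HASH_TABLE_H` and the `(void)` parameter list are illustrative choices, not taken from the original header.

```c
// hash_table.h (sketch)
#ifndef HASH_TABLE_H
#define HASH_TABLE_H

typedef struct ht_item {
    char* key;
    char* value;
} ht_item;

typedef struct {
    int size;
    int count;
    ht_item** items;
} ht_hash_table;

/* Public functions defined in hash_table.c. The static helpers
 * (ht_new_item, ht_del_item) are deliberately not declared here,
 * since they are internal to hash_table.c. */
ht_hash_table* ht_new(void);
void ht_del_hash_table(ht_hash_table* ht);

#endif /* HASH_TABLE_H */
```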
102+
103+
Next section: [Hash functions](/hashing)
104+
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

03-hashing/README.md

+93
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,93 @@
1+
# Hash function
2+
3+
In this section, we'll write our hash function.
4+
5+
The hash function we choose should:
6+
7+
- Take a string as its input and return a number between `0` and `m - 1`, where `m` is our
8+
desired bucket array length.
9+
- Return an even distribution of bucket indexes for an average set of inputs. If
10+
our hash function is unevenly distributed, it will put more items in some
11+
buckets than others. This will lead to a higher rate of
12+
[collisions](#collisions). Collisions reduce the efficiency of our hash table.
13+
14+
## Algorithm
15+
16+
We'll make use of a generic string hashing function, expressed below in
17+
pseudocode.
18+
19+
```
function hash(string, a, num_buckets):
    hash = 0
    string_len = length(string)
    for i = 0, 1, ..., string_len - 1:
        hash += (a ** (string_len - (i+1))) * char_code(string[i])
    hash = hash % num_buckets
    return hash
```
28+
29+
This hash function has two steps:
30+
31+
1. Convert the string to a large integer
32+
2. Reduce the size of the integer to a fixed range by taking its remainder `mod`
33+
`m`
34+
35+
The variable `a` should be a prime number larger than the size of the alphabet.
36+
We're hashing ASCII strings, which have an alphabet size of 128, so we should
37+
choose a prime larger than that.
38+
39+
`char_code` is a function which returns an integer which represents the
40+
character. We'll use ASCII character codes for this.
41+
42+
Let's try the hash function out:
43+
44+
```
hash("cat", 151, 53)

hash = (151**2 * 99 + 151**1 * 97 + 151**0 * 116) % 53
hash = (2257299 + 14647 + 116) % 53
hash = 2272062 % 53
hash = 5
```
52+
53+
Changing the value of `a` gives us a different hash function.
54+
55+
```
56+
hash("cat", 163, 53) = 21
57+
```
58+
59+
## Implementation
60+
61+
```c
// hash_table.c
#include <math.h>  // pow; many Unix toolchains also need -lm when linking

static int ht_hash(const char* s, const int a, const int m) {
    long hash = 0;
    const int len_s = (int)strlen(s);
    for (int i = 0; i < len_s; i++) {
        // Weight each character by a^(len_s - (i+1)), then reduce mod m.
        hash += (long)pow(a, len_s - (i+1)) * s[i];
        hash = hash % m;
    }
    return (int)hash;
}
```
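
One caveat with the `pow`-based version: for longer keys, `a` raised to the string length no longer fits in a `long`, and converting an out-of-range `double` to `long` is undefined behaviour in C. As a hedged alternative, not the method used in this tutorial, the same polynomial can be evaluated with Horner's rule, which keeps every intermediate value small. The name `ht_hash_horner` is hypothetical.

```c
#include <string.h>

/* Same polynomial as ht_hash, evaluated with Horner's rule:
 *   ((s[0] * a + s[1]) * a + s[2]) ...
 * reduced mod m after every step. Assumes ASCII input (0..127) and
 * positive a and m small enough that m * a fits in a long. */
int ht_hash_horner(const char* s, const int a, const int m) {
    long hash = 0;
    const size_t len_s = strlen(s);
    for (size_t i = 0; i < len_s; i++) {
        hash = (hash * a + s[i]) % m;
    }
    return (int)hash;
}
```

For the worked example, `ht_hash_horner("cat", 151, 53)` returns `5`, matching the calculation above, and no call to `pow` or `<math.h>` is needed.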
73+
74+
## Pathological data
75+
76+
An ideal hash function would always return an even distribution. However, for
77+
any hash function, there is a 'pathological' set of inputs, which all hash to
78+
the same value. To find this set of inputs, run a large set of inputs through
79+
the function. All inputs which hash to a particular bucket form a pathological
80+
set.
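
The search described above can be sketched in a few lines. This is an illustration under stated assumptions: `demo_hash` is a stand-in for our hash function (written in Horner form, which gives the same result for ASCII input), and the generated keys `key0`, `key1`, ... and the function name `find_colliding_keys` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the hash function above. */
int demo_hash(const char* s, const int a, const int m) {
    long hash = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        hash = (hash * a + s[i]) % m;
    }
    return (int)hash;
}

/* Generate the keys "key0", "key1", ... and record the numeric suffix
 * of each one that hashes to `target`. Stops after `max_found` matches
 * or `limit` candidates. Returns the number of suffixes written to
 * `found`. */
int find_colliding_keys(
    const int a, const int m, const int target,
    const int limit, int* found, const int max_found
) {
    int n = 0;
    char key[32];
    for (int i = 0; i < limit && n < max_found; i++) {
        snprintf(key, sizeof key, "key%d", i);
        if (demo_hash(key, a, m) == target) {
            found[n++] = i;
        }
    }
    return n;
}
```

Calling `find_colliding_keys(151, 53, 0, 100000, found, 8)` fills `found` with the suffixes of eight keys that all land in bucket `0`. Inserting only those keys would turn every lookup into a linear scan.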
81+
82+
The existence of pathological input sets means there are no perfect hash
83+
functions for all inputs. The best we can do is to create a function which
84+
performs well for the expected data set.
85+
86+
Pathological inputs also pose a security issue. If a hash table is fed a set of
87+
colliding keys by some malicious user, then searches for those keys will take
88+
much longer (`O(n)`) than normal (`O(1)`). This can be used as a denial of
89+
service attack against systems which are underpinned by hash tables, such as DNS
90+
and certain web services.
91+
92+
Next section: [Handling collisions](/collisions)
93+
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

04-collisions/README.md

+50
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
# Handling collisions
2+
3+
Hash functions map an infinitely large number of inputs to a finite number of
4+
outputs. Different input keys will map to the same array index, causing
5+
bucket collisions. Hash tables must implement some method of dealing with
6+
collisions.
7+
8+
Our hash table will handle collisions using a technique called open addressing
9+
with double hashing. Double hashing makes use of two hash functions to
10+
calculate the index an item should be stored at after `i` collisions.
11+
12+
For an overview of other types of collision resolution, see the
13+
[appendix](/07-appendix).
14+
15+
## Double hashing
16+
17+
The index that should be used after `i` collisions is given by:
18+
19+
```
20+
index = (hash_a(string) + i * hash_b(string)) % num_buckets
21+
```
22+
23+
We see that if no collisions have occurred, `i = 0`, so the index is just
24+
`hash_a` of the string. If a collision happens, the index is offset by `i` times
`hash_b`.
26+
27+
It is possible that `hash_b` will return 0, reducing the second term to 0. This
28+
will cause the hash table to try to insert the item into the same bucket over
29+
and over. We can mitigate this by adding 1 to the result of the second hash,
30+
making sure it's never 0.
31+
32+
```
33+
index = (hash_a(string) + i * (hash_b(string) + 1)) % num_buckets
34+
```
35+
36+
## Implementation
37+
38+
```c
// hash_table.c
// HT_PRIME_1 and HT_PRIME_2 must be defined as two different primes
// larger than 128, as described in the hash function section.
static int ht_get_hash(
    const char* s, const int num_buckets, const int attempt
) {
    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
    const int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);
    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}
```
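
To see the probe sequence this produces, here is a self-contained sketch. The values `151` and `163` for the two primes are assumptions borrowed from the earlier worked examples, and `probe_hash` and `probe_index` are hypothetical stand-ins for the functions above.

```c
#include <string.h>

/* Assumed values: the text only requires primes larger than 128. */
#define DEMO_PRIME_1 151
#define DEMO_PRIME_2 163

/* Stand-in for the string hash (Horner form, ASCII input). */
int probe_hash(const char* s, const int a, const int m) {
    long hash = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        hash = (hash * a + s[i]) % m;
    }
    return (int)hash;
}

/* Bucket to try on the given attempt (0 = first try). */
int probe_index(const char* s, const int num_buckets, const int attempt) {
    const int hash_a = probe_hash(s, DEMO_PRIME_1, num_buckets);
    const int hash_b = probe_hash(s, DEMO_PRIME_2, num_buckets);
    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}
```

For the key `"cat"` with 53 buckets, `hash_a` is `5` and `hash_b` is `21`, so attempts `0` to `3` probe buckets `5`, `27`, `49` and `18`. One edge case to be aware of: if `hash_b` happens to equal `num_buckets - 1`, the step becomes `num_buckets`, which is `0` modulo `num_buckets`, so such a key would still probe the same bucket repeatedly.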
48+
49+
Next section: [Hash table methods](/methods)
50+
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
