README.rst: 32 additions & 40 deletions
Original file line number
Diff line number
Diff line change
@@ -1,71 +1,63 @@
1
1
BorgHash
2
-
=========
2
+
========
3
3
4
-
Memory-efficient hashtable implementations as a Python library,
5
-
implemented in Cython.
4
+
Memory-efficient hashtable implementations as a Python library implemented in Cython.
6
5
7
6
HashTable
8
7
---------
9
8
10
-
``HashTable`` is a rather low-level implementation, usually one rather wants to
11
-
use the ``HashTableNT`` wrapper. But read on to get the basics...
9
+
``HashTable`` is a fairly low-level implementation; usually one will want to use the ``HashTableNT`` wrapper. Read on for the basics...
12
10
13
11
Keys and Values
14
12
~~~~~~~~~~~~~~~
15
13
16
-
The keys MUST be perfectly random ``bytes`` of arbitrary, but constant length,
17
-
like from a cryptographic hash (sha256, hmac-sha256, ...).
18
-
The implementation relies on this "perfectly random" property and does not
19
-
implement an own hash function, but just takes 32 bits from the given key.
14
+
The keys MUST be perfectly random ``bytes`` of arbitrary but fixed length, like from a cryptographic hash (SHA-256, HMAC-SHA-256, ...).
15
+
The implementation relies on this "perfectly random" property and does not implement its own hash function; it just takes 32 bits from the given key.
20
16
21
-
The values are binary ``bytes`` of arbitrary, but constant length.
17
+
The values are ``bytes`` of arbitrary but fixed length.
22
18
23
-
The length of the keys and values is defined when creating a ``HashTable``
24
-
instance (after that, the length must always match that defined length).
19
+
The lengths of the keys and values are defined when creating a ``HashTable`` instance; thereafter, the lengths must always match the defined size.
25
20
26
21
Implementation details
27
22
~~~~~~~~~~~~~~~~~~~~~~
28
23
29
-
To have little memory overhead overall, the hashtable only stores uint32_t
30
-
indexes into separate keys and values arrays (short: kv arrays).
24
+
To have little memory overhead overall, the hashtable only stores ``uint32_t``
25
+
indices into separate keys and values arrays (short: kv arrays).
31
26
32
-
A new key just gets appended to the keys array. The corresponding value gets
33
-
appended to the values array. After that, the key and value do not change their
27
+
A new key is appended to the keys array. The corresponding value is appended to the values array. After that, the key and value do not change their
34
28
index as long as they exist in the hashtable and the ht and kv arrays are in
35
29
memory. Even when kv pairs are deleted from ``HashTable``, the kv arrays never
36
-
shrink and the indexes of other kv pairs don't change.
30
+
shrink and the indices of other kv pairs don't change.
37
31
38
-
This is because we want to have stable array indexes for the keys/values so the
39
-
indexes can be used outside of ``HashTable`` as memory-efficient references.
32
+
This is because we want to have stable array indices for the keys/values, so the
33
+
indices can be used outside of ``HashTable`` as memory-efficient references.
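The stable-index behavior described in this README section can be illustrated with a toy model. This is only a sketch, not the library's implementation: the ``ToyTable`` class and its methods are hypothetical, a plain ``dict`` stands in for the table of ``uint32_t`` indices, and deleted slots are simply blanked rather than reused, which is an assumption.

```python
class ToyTable:
    """Toy model of append-only kv arrays (hypothetical, for illustration only)."""

    def __init__(self):
        self.keys = []    # keys array: only ever grows
        self.values = []  # values array: only ever grows
        self.index = {}   # stand-in for the hashtable of uint32_t indices

    def set(self, key, value):
        idx = self.index.get(key)
        if idx is None:
            # A new key is appended; its index is the current array length.
            idx = len(self.keys)
            self.keys.append(key)
            self.values.append(value)
            self.index[key] = idx
        else:
            # An existing key keeps its index; only the value is replaced.
            self.values[idx] = value
        return idx

    def delete(self, key):
        # The slot is blanked, but the arrays do not shrink, so the
        # indices of all other kv pairs stay the same.
        idx = self.index.pop(key)
        self.keys[idx] = None
        self.values[idx] = None


table = ToyTable()
idx_a = table.set(b"a" * 32, b"value-a1")
idx_b = table.set(b"b" * 32, b"value-b1")
table.delete(b"a" * 32)
```

After the delete, ``idx_b`` still refers to the same kv pair, which is what makes such an index usable as a compact external reference.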
40
34
41
35
Memory allocated
42
36
~~~~~~~~~~~~~~~~
43
37
44
-
For a hashtable load factor of 0.1 - 0.5, a kv array grow factor of 1.3 and
38
+
For a hashtable load factor of 0.1 – 0.5, a kv array growth factor of 1.3, and
45
39
N kv pairs, memory usage in bytes is approximately:
46
40
47
41
- Hashtable: from ``N * 4 / 0.5`` to ``N * 4 / 0.1``
48
-
- Keys/Values: from ``N * len(key+value) * 1.0`` to ``N * len(key+value) * 1.3``
49
-
- Overall: from ``N * (8 + len(key+value))`` to ``N * (40 + len(key+value) * 1.3)``
42
+
- Keys/Values: from ``N * len(key + value) * 1.0`` to ``N * len(key + value) * 1.3``
43
+
- Overall: from ``N * (8 + len(key + value))`` to ``N * (40 + len(key + value) * 1.3)``
50
44
51
-
When the hashtable or the kv arrays are resized, there will be short memory
52
-
usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying of
53
-
data and memory usage spikes, if possible.
45
+
When the hashtable or the kv arrays are resized, there will be brief memory-usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying data and to minimize memory-usage spikes, if possible.
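As a rough worked example of the formulas in this section, the helper below evaluates the lower and upper bounds for a given number of kv pairs. The function name ``estimate_memory`` and the sample sizes (32-byte keys, 12-byte values) are assumptions made for illustration; resize spikes are not modeled.

```python
def estimate_memory(n, key_len, value_len):
    """Return (low, high) approximate memory usage in bytes for n kv pairs."""
    kv_len = key_len + value_len
    # Best case: load factor 0.5, kv arrays exactly full (factor 1.0).
    low = n * (4 / 0.5 + kv_len * 1.0)
    # Worst case: load factor 0.1, kv arrays just grown (factor 1.3).
    high = n * (4 / 0.1 + kv_len * 1.3)
    return round(low), round(high)


# One million entries with 32-byte keys and 12-byte values (assumed sizes).
low, high = estimate_memory(1_000_000, 32, 12)
```

For these assumed sizes the estimate ranges from 52,000,000 to 97,200,000 bytes.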
54
46
55
47
HashTableNT
56
48
-----------
57
49
58
50
``HashTableNT`` is a convenience wrapper around ``HashTable``:
59
51
60
-
- accepts and returns ``namedtuple`` values
61
-
- implements persistence: can read (write) the hashtable from (to) a file.
52
+
- Accepts and returns ``namedtuple`` values.
53
+
- Implements persistence: can read the hashtable from a file and write it to a file.
62
54
63
55
Keys and Values
64
56
~~~~~~~~~~~~~~~
65
57
66
58
Keys: ``bytes``, see ``HashTable``.
67
59
68
-
Values: any fixed type of ``namedtuple`` that can be serialized to ``bytes``
60
+
Values: any fixed ``namedtuple`` type that can be serialized to ``bytes``
69
61
by Python's ``struct`` module using a given format string.
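The kind of conversion described here can be sketched with the standard library alone. The ``Chunk`` fields, the format string ``"<II"``, and the helper names below are assumptions chosen for illustration, not the wrapper's actual API.

```python
import struct
from collections import namedtuple

# Hypothetical value type: two unsigned 32-bit integers.
Chunk = namedtuple("Chunk", ["refcount", "size"])
FMT = "<II"  # little-endian, 2 x uint32 -> fixed length of 8 bytes


def pack_value(value):
    """Serialize a namedtuple value to fixed-length bytes."""
    return struct.pack(FMT, *value)


def unpack_value(raw):
    """Deserialize fixed-length bytes back into a namedtuple value."""
    return Chunk(*struct.unpack(FMT, raw))


raw = pack_value(Chunk(refcount=1, size=4096))
restored = unpack_value(raw)
```

Because ``struct.calcsize(FMT)`` is constant for a given format string, every serialized value has the same length, which matches the fixed value length that ``HashTable`` requires.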
70
62
71
63
When setting a value, it is automatically serialized. When a value is returned,
@@ -75,11 +67,11 @@ Persistence
75
67
~~~~~~~~~~~
76
68
77
69
``HashTableNT`` has ``.write()`` and ``.read()`` methods to save/load its
78
-
content to/from a file, using an efficient binary format.
70
+
contents to/from a file, using an efficient binary format.
79
71
80
72
When a ``HashTableNT`` is saved to disk, only the non-deleted entries are
81
-
persisted and when it is loaded from disk, a new hashtable and new, dense
82
-
kv arrays are built - thus, kv indexes will be different!
73
+
persisted. When it is loaded from disk, a new hashtable and new, dense
74
+
kv arrays are built; thus, kv indices will be different!
83
75
84
76
API
85
77
---
@@ -96,15 +88,15 @@ Example code
96
88
97
89
::
98
90
99
-
# HashTableNT mapping 256bit key [bytes] --> Chunk value [namedtuple]
91
+
# HashTableNT mapping 256-bit key [bytes] --> Chunk value [namedtuple]
# realloc is already highly optimized (in Linux). By using mremap internally only the peak address space usage is "old size" + "new size", while the peak memory usage is only "new size".
@@ -270,8 +270,8 @@ cdef class HashTable:
270
270
271
271
def k_to_idx(self, key: bytes) -> int:
272
272
"""
273
-
return the key's index in the keys array (index is stable while in memory).
274
-
this can be used to "abbreviate" a known key (e.g. 256bit key -> 32bit index).
273
+
Return the key's index in the keys array (index is stable while in memory).
274
+
This can be used to "abbreviate" a known key (e.g., 256-bit key -> 32-bit index).
275
275
"""
276
276
if len(key) != self.ksize:
277
277
raise ValueError("Key size does not match the defined size")
@@ -283,16 +283,16 @@ cdef class HashTable:
283
283
284
284
def idx_to_k(self, idx: int) -> bytes:
285
285
"""
286
-
for a given index, return the key stored at that index in the keys array.
287
-
this is the reverse of k_to_idx (e.g. 32bit index -> 256bit key).
286
+
For a given index, return the key stored at that index in the keys array.
287
+
This is the reverse of k_to_idx (e.g., 32-bit index -> 256-bit key).