Commit edd9665

Merge pull request #38 from borgbackup/fix-typos-grammar
Fix typos and grammar
2 parents 02b9563 + 906da7d commit edd9665

11 files changed, +91 -99 lines changed

CHANGES.rst

Lines changed: 6 additions & 6 deletions
@@ -1,18 +1,18 @@
-ChangeLog
+Changelog
 =========

 Version 0.1.0 2024-11-18
 ------------------------

-- HashTableNT: deal with byte_order separately
-- HashTableNT: give separate formats in value_format namedtuple
+- HashTableNT: handle ``byte_order`` separately.
+- HashTableNT: provide separate formats in the ``value_format`` namedtuple.

 Version 0.0.2 2024-11-10
 ------------------------

-- Fixed "KV array is full" crash on 32bit platforms (and maybe also some other
-  int-size related issues), #27.
-- Added a .update method to HashTableNT (like dict.update), #28.
+- Fixed "KV array is full" crash on 32-bit platforms (and maybe also some other
+  integer-size related issues), #27.
+- Added an ``.update()`` method to HashTableNT (like ``dict.update()``), #28.

 Version 0.0.1 2024-10-31
 ------------------------

README.rst

Lines changed: 32 additions & 40 deletions
@@ -1,71 +1,63 @@
 BorgHash
-=========
+========

-Memory-efficient hashtable implementations as a Python library,
-implemented in Cython.
+Memory-efficient hashtable implementations as a Python library implemented in Cython.

 HashTable
 ---------

-``HashTable`` is a rather low-level implementation, usually one rather wants to
-use the ``HashTableNT`` wrapper. But read on to get the basics...
+``HashTable`` is a fairly low-level implementation; usually one will want to use the ``HashTableNT`` wrapper. Read on for the basics...

 Keys and Values
 ~~~~~~~~~~~~~~~

-The keys MUST be perfectly random ``bytes`` of arbitrary, but constant length,
-like from a cryptographic hash (sha256, hmac-sha256, ...).
-The implementation relies on this "perfectly random" property and does not
-implement an own hash function, but just takes 32 bits from the given key.
+The keys MUST be perfectly random ``bytes`` of arbitrary but fixed length, like from a cryptographic hash (SHA-256, HMAC-SHA-256, ...).
+The implementation relies on this "perfectly random" property and does not implement its own hash function; it just takes 32 bits from the given key.

-The values are binary ``bytes`` of arbitrary, but constant length.
+The values are ``bytes`` of arbitrary but fixed length.

-The length of the keys and values is defined when creating a ``HashTable``
-instance (after that, the length must always match that defined length).
+The lengths of the keys and values are defined when creating a ``HashTable`` instance; thereafter, the lengths must always match the defined size.

 Implementation details
 ~~~~~~~~~~~~~~~~~~~~~~

-To have little memory overhead overall, the hashtable only stores uint32_t
-indexes into separate keys and values arrays (short: kv arrays).
+To have little memory overhead overall, the hashtable only stores ``uint32_t``
+indices into separate keys and values arrays (short: kv arrays).

-A new key just gets appended to the keys array. The corresponding value gets
-appended to the values array. After that, the key and value do not change their
+A new key is appended to the keys array. The corresponding value is appended to the values array. After that, the key and value do not change their
 index as long as they exist in the hashtable and the ht and kv arrays are in
 memory. Even when kv pairs are deleted from ``HashTable``, the kv arrays never
-shrink and the indexes of other kv pairs don't change.
+shrink and the indices of other kv pairs don't change.

-This is because we want to have stable array indexes for the keys/values so the
-indexes can be used outside of ``HashTable`` as memory-efficient references.
+This is because we want to have stable array indices for the keys/values, so the
+indices can be used outside of ``HashTable`` as memory-efficient references.

 Memory allocated
 ~~~~~~~~~~~~~~~~

-For a hashtable load factor of 0.1 - 0.5, a kv array grow factor of 1.3 and
+For a hashtable load factor of 0.1–0.5, a kv array growth factor of 1.3, and
 N kv pairs, memory usage in bytes is approximately:

 - Hashtable: from ``N * 4 / 0.5`` to ``N * 4 / 0.1``
-- Keys/Values: from ``N * len(key+value) * 1.0`` to ``N * len(key+value) * 1.3``
-- Overall: from ``N * (8 + len(key+value))`` to ``N * (40 + len(key+value) * 1.3)``
+- Keys/Values: from ``N * len(key + value) * 1.0`` to ``N * len(key + value) * 1.3``
+- Overall: from ``N * (8 + len(key + value))`` to ``N * (40 + len(key + value) * 1.3)``

-When the hashtable or the kv arrays are resized, there will be short memory
-usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying of
-data and memory usage spikes, if possible.
+When the hashtable or the kv arrays are resized, there will be brief memory-usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying data and to minimize memory-usage spikes, if possible.

 HashTableNT
 -----------

 ``HashTableNT`` is a convenience wrapper around ``HashTable``:

-- accepts and returns ``namedtuple`` values
-- implements persistence: can read (write) the hashtable from (to) a file.
+- Accepts and returns ``namedtuple`` values.
+- Implements persistence: can read the hashtable from a file and write it to a file.

 Keys and Values
 ~~~~~~~~~~~~~~~

 Keys: ``bytes``, see ``HashTable``.

-Values: any fixed type of ``namedtuple`` that can be serialized to ``bytes``
+Values: any fixed ``namedtuple`` type that can be serialized to ``bytes``
 by Python's ``struct`` module using a given format string.

 When setting a value, it is automatically serialized. When a value is returned,
@@ -75,11 +67,11 @@ Persistence
 ~~~~~~~~~~~

 ``HashTableNT`` has ``.write()`` and ``.read()`` methods to save/load its
-content to/from a file, using an efficient binary format.
+contents to/from a file, using an efficient binary format.

 When a ``HashTableNT`` is saved to disk, only the non-deleted entries are
-persisted and when it is loaded from disk, a new hashtable and new, dense
-kv arrays are built - thus, kv indexes will be different!
+persisted. When it is loaded from disk, a new hashtable and new, dense
+kv arrays are built; thus, kv indices will be different!

 API
 ---
@@ -96,15 +88,15 @@ Example code

 ::

-    # HashTableNT mapping 256bit key [bytes] --> Chunk value [namedtuple]
+    # HashTableNT mapping 256-bit key [bytes] --> Chunk value [namedtuple]
     Chunk = namedtuple("Chunk", ["refcount", "size"])
     ChunkFormat = namedtuple("ChunkFormat", ["refcount", "size"])
     chunk_format = ChunkFormat(refcount="I", size="I")

-    # 256bit (32Byte) key, 2x 32bit (4Byte) values
+    # 256-bit (32-byte) key, 2x 32-bit (4-byte) values
     ht = HashTableNT(key_size=32, value_type=Chunk, value_format=chunk_format)

-    key = b"x" * 32 # the key is usually from a cryptographic hash fn
+    key = b"x" * 32 # the key is usually from a cryptographic hash function
     value = Chunk(refcount=1, size=42)
     ht[key] = value
     assert ht[key] == value
@@ -131,9 +123,9 @@ Want a demo?

 Run ``borghash-demo`` after installing the ``borghash`` package.

-It will show you the demo code, run it and print the results for your machine.
+It will show you the demo code, run it, and print the results for your machine.

-Results on an Apple MacBook Pro (M3 Pro CPU) are like:
+Results on an Apple MacBook Pro (M3 Pro CPU) look like:

 ::

@@ -144,18 +136,18 @@ Results on an Apple MacBook Pro (M3 Pro CPU) are like:
 State of this project
 ---------------------

-**API is still unstable and expected to change as development goes on.**
+**API is still unstable and expected to change as development continues.**

 **As long as the API is unstable, there will be no data migration tools,
-like e.g. for reading an existing serialized hashtable.**
+e.g., for reading an existing serialized hashtable.**

-There might be missing features or optimization potential, feedback welcome!
+There might be missing features or optimization potential; feedback is welcome!

 Borg?
 -----

 Please note that this code is currently **not** used by the stable release of
-BorgBackup (aka "borg"), but might be used by borg master branch in the future.
+BorgBackup (aka "borg"), but it might be used by Borg's master branch in the future.

 License
 -------
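
Note on the "Memory allocated" bounds in the README.rst diff: the three formulas can be checked with a few lines of Python. This is only a sketch. The function `estimate_memory` and its parameter names are hypothetical and not part of borghash, and the short resize spikes the README mentions are ignored.

```python
def estimate_memory(n, key_len, value_len,
                    min_load=0.1, max_load=0.5, kv_growth=1.3):
    """Return (low, high) approximate memory usage in bytes for n kv pairs.

    Hypothetical helper that mirrors the README formulas; not a borghash API.
    """
    index_size = 4  # the hashtable stores one uint32_t index per slot
    kv_len = key_len + value_len
    # Best case: table at its highest load factor, kv arrays exactly full.
    low = n * index_size / max_load + n * kv_len * 1.0
    # Worst case: table at its lowest load factor, kv arrays just grown.
    high = n * index_size / min_load + n * kv_len * kv_growth
    return low, high


low, high = estimate_memory(1_000_000, key_len=32, value_len=8)
print(f"{low / 1e6:.0f} MB to {high / 1e6:.0f} MB")
```

For one million entries with a 32-byte key and an 8-byte value (two ``I`` fields, as in the README example), the bounds come out at roughly 48 and 92 million bytes.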

setup.py

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 try:
     from Cython.Build import cythonize
 except ImportError:
-    cythonize = None # we don't have cython installed
+    cythonize = None # we don't have Cython installed

 ext = '.pyx' if cythonize else '.c'

src/borghash/HashTable.pyx

Lines changed: 17 additions & 17 deletions
@@ -1,8 +1,8 @@
 """
-HashTable: low-level ht mapping fully random bytes keys to bytes values.
-key and value length can be chosen, but is fixed afterwards.
-the keys and values are stored in arrays separate from the hashtable.
-the hashtable only stores the 32bit indexes into the key/value arrays.
+HashTable: low-level hash table mapping fully random bytes keys to bytes values.
+Key and value lengths can be chosen, but are fixed thereafter.
+The keys and values are stored in arrays separate from the hashtable.
+The hashtable only stores the 32-bit indices into the key/value arrays.
 """
 from __future__ import annotations
 from typing import BinaryIO, Iterator, Any
@@ -49,7 +49,7 @@ cdef class HashTable:
                  shrink_factor: float = 0.4, grow_factor: float = 2.0,
                  kv_grow_factor: float = 1.3) -> None:
         # the load of the ht (.table) shall be between 0.25 and 0.5, so it is fast and has few collisions.
-        # it is cheap to have a low hash table load, because .table only stores uint32_t indexes into the
+        # it is cheap to have a low hash table load, because .table only stores uint32_t indices into the
         # .keys and .values array.
         # the keys/values arrays have bigger elements and are not hash tables, thus collisions and load
         # factor are no concern there. the kv_grow_factor can be relatively small.
@@ -96,7 +96,7 @@ cdef class HashTable:
         free(self.values)

     def clear(self) -> None:
-        """empty HashTable, start from scratch"""
+        """Empty the HashTable and start from scratch."""
         self.capacity = 0
         self.used = 0
         self._resize_table(self.initial_capacity)
@@ -107,7 +107,7 @@ cdef class HashTable:
         return self.used

     cdef size_t _get_index(self, uint8_t* key):
-        """key must be perfectly random distributed bytes, so we don't need a hash function here."""
+        """Key must be perfectly random bytes, so we don't need a hash function here."""
         cdef uint32_t key32 = (key[0] << 24) | (key[1] << 16) | (key[2] << 8) | key[3]
         return key32 % self.capacity

@@ -149,7 +149,7 @@ cdef class HashTable:
             self._resize_kv(int(self.kv_capacity * self.kv_grow_factor))
             if self.kv_used >= self.kv_capacity:
                 # Should never happen. See "RESERVED" constant - we allow almost 4Gi kv entries.
-                # For a typical 256bit key and a small 32bit value that would already consume 176GiB+
+                # For a typical 256-bit key and a small 32-bit value that would already consume 176GiB+
                 # memory (plus spikes to even more when hashtable or kv arrays get resized).
                 raise RuntimeError("KV array is full")

@@ -260,7 +260,7 @@ cdef class HashTable:
         self.tombstones = 0

     cdef void _resize_kv(self, size_t new_capacity):
-        # We must never use kv indexes >= RESERVED, thus we'll never need more capacity either.
+        # We must never use kv indices >= RESERVED; thus, we'll never need more capacity either.
         cdef size_t capacity = min(new_capacity, <size_t> RESERVED - 1)
         self.stats_resize_kv += 1
         # realloc is already highly optimized (in Linux). By using mremap internally only the peak address space usage is "old size" + "new size", while the peak memory usage is only "new size".
@@ -270,8 +270,8 @@ cdef class HashTable:

     def k_to_idx(self, key: bytes) -> int:
         """
-        return the key's index in the keys array (index is stable while in memory).
-        this can be used to "abbreviate" a known key (e.g. 256bit key -> 32bit index).
+        Return the key's index in the keys array (index is stable while in memory).
+        This can be used to "abbreviate" a known key (e.g., 256-bit key -> 32-bit index).
         """
         if len(key) != self.ksize:
             raise ValueError("Key size does not match the defined size")
@@ -283,16 +283,16 @@ cdef class HashTable:

     def idx_to_k(self, idx: int) -> bytes:
         """
-        for a given index, return the key stored at that index in the keys array.
-        this is the reverse of k_to_idx (e.g. 32bit index -> 256bit key).
+        For a given index, return the key stored at that index in the keys array.
+        This is the reverse of k_to_idx (e.g., 32-bit index -> 256-bit key).
         """
         cdef uint32_t kv_index = <uint32_t> idx
         return self.keys[kv_index * self.ksize:(kv_index + 1) * self.ksize]

     def kv_to_idx(self, key: bytes, value: bytes) -> int:
         """
-        return the key's/value's index in the keys/values array (index is stable while in memory).
-        this can be used to "abbreviate" a known key/value pair. (e.g. 256bit key + 32bit value -> 32bit index).
+        Return the key's/value's index in the keys/values array (index is stable while in memory).
+        This can be used to "abbreviate" a known key/value pair (e.g., 256-bit key + 32-bit value -> 32-bit index).
         """
         if len(key) != self.ksize:
             raise ValueError("Key size does not match the defined size")
@@ -309,8 +309,8 @@ cdef class HashTable:

     def idx_to_kv(self, idx: int) -> tuple[bytes, bytes]:
         """
-        for a given index, return the key/value stored at that index in the keys/values array.
-        this is the reverse of kv_to_idx (e.g. 32bit index -> 256bit key + 32bit value).
+        For a given index, return the key/value stored at that index in the keys/values array.
+        This is the reverse of kv_to_idx (e.g., 32-bit index -> 256-bit key + 32-bit value).
         """
         cdef uint32_t kv_index = <uint32_t> idx
         key = self.keys[kv_index * self.ksize:(kv_index + 1) * self.ksize]
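
Note on `_get_index` in the HashTable.pyx diff: the slot computation reads the first four key bytes as a big-endian 32-bit integer and reduces it modulo the table capacity. A pure-Python sketch of the same arithmetic follows. The function name `get_index` and the use of `hashlib` to produce a key are assumptions for illustration. The Cython method works on a raw `uint8_t*` and takes the capacity from `self`.

```python
import hashlib


def get_index(key: bytes, capacity: int) -> int:
    """Map a key to a table slot using its first 4 bytes (illustrative only)."""
    # Same value as (key[0] << 24) | (key[1] << 16) | (key[2] << 8) | key[3]
    key32 = int.from_bytes(key[:4], "big")
    return key32 % capacity


# A SHA-256 digest is the kind of "perfectly random" key the README asks for.
key = hashlib.sha256(b"some chunk data").digest()
slot = get_index(key, capacity=1024)
print(slot)
```

Because only these 32 bits choose the slot, keys whose leading bytes are poorly distributed would pile up in a few slots, which is presumably why the README requires random keys.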

src/borghash/HashTableNT.pyx

Lines changed: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ cdef class HashTableNT:
         if not all(isinstance(fmt, str) and len(fmt) > 0 for fmt in value_format):
             raise ValueError("value_format's elements must be str and non-empty.")
         if byte_order not in BYTE_ORDER:
-            raise ValueError("byte_order must be one of: {','.join(BYTE_ORDER.keys())}")
+            raise ValueError(f"byte_order must be one of: {', '.join(BYTE_ORDER.keys())}")
         self.key_size = key_size
         self.value_type = value_type
         self.value_format = value_format
@@ -124,7 +124,7 @@ cdef class HashTableNT:
         return self._to_namedtuple_value(binary_value)

     def update(self, other=(), /, **kwds):
-        """Like dict.update, but other can also be a HashTableNT instance."""
+        """Like dict.update(), but 'other' can also be a HashTableNT instance."""
         if isinstance(other, HashTableNT):
             for key, value in other.items():
                 self[key] = value
@@ -228,9 +228,9 @@ cdef class HashTableNT:

     def size(self) -> int:
         """
-        do a rough worst-case estimate of the on-disk size when using .write().
+        Do a rough worst-case estimate of the on-disk size when using .write().

-        the serialized size of the metadata is a bit hard to predict, but we cover that with one_time_overheads.
+        The serialized size of the metadata is a bit hard to predict, but we cover that with one_time_overheads.
         """
         one_time_overheads = 4096 # very rough
         N = self.inner.used
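
Note on the `byte_order` error message in the HashTableNT.pyx diff: besides the added space after the comma, the new line gains the `f` prefix. Without it, Python does not interpolate the braces, so the old message would have contained the expression text literally. The sketch below shows the difference. The `BYTE_ORDER` mapping here is a hypothetical stand-in, since its real contents are not part of this diff.

```python
# Hypothetical stand-in; the real BYTE_ORDER mapping is not shown in this diff.
BYTE_ORDER = {"little": "<", "big": ">"}

# Old line: a plain string, so the braces are kept verbatim.
plain = "byte_order must be one of: {','.join(BYTE_ORDER.keys())}"

# New line: an f-string, so the expression inside the braces is evaluated.
formatted = f"byte_order must be one of: {', '.join(BYTE_ORDER.keys())}"

print(plain)
print(formatted)
```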

src/borghash/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 """
-borghash - hashtable implementations in cython.
+borghash - hashtable implementations in Cython.
 """
 from .HashTable import HashTable
 from .HashTableNT import HashTableNT

src/borghash/__main__.py

Lines changed: 4 additions & 4 deletions
@@ -1,5 +1,5 @@
 """
-Demonstration of borghash.
+Demonstration of BorgHash.
 """

 def demo():
@@ -17,12 +17,12 @@ def demo():
     value_type = namedtuple("Chunk", ["refcount", "size"])
     value_format_t = namedtuple("ChunkFormat", ["refcount", "size"])
     value_format = value_format_t(refcount="I", size="I")
-    # 256bit (32Byte) key, 2x 32bit (4Byte) values
+    # 256-bit (32-byte) key, 2x 32-bit (4-byte) values
     ht = HashTableNT(key_size=32, value_type=value_type, value_format=value_format)

     t0 = time()
     for i in range(count):
-        # make up a 256bit key from i, first 32bits need to be well distributed.
+        # Make up a 256-bit key from i; the first 32 bits need to be well distributed.
         key = f"{i:4x}{' '*28}".encode()
         value = value_type(refcount=i, size=i * 2)
         ht[key] = value
@@ -50,7 +50,7 @@ def demo():

     t4 = time()
     for i in range(count):
-        # make up a 256bit key from i, first 32bits need to be well distributed.
+        # Make up a 256-bit key from i; the first 32 bits need to be well distributed.
         key = f"{i:4x}{' '*28}".encode()
         expected_value = value_type(refcount=i, size=i * 2)
         assert ht_read.pop(key) == expected_value
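
Note on the demo key expression in the `__main__.py` diff: `f"{i:4x}{' '*28}"` formats `i` as hexadecimal, right-aligned to a minimum width of 4, and appends 28 spaces. The sketch below wraps the same expression in a hypothetical helper, `make_key`, to show the resulting key length. The value of `count` is defined outside the hunks shown here, so the range used below is an assumption.

```python
def make_key(i: int) -> bytes:
    """Build a demo-style key from i (hypothetical helper, same expression as the demo)."""
    return f"{i:4x}{' ' * 28}".encode()


keys = [make_key(i) for i in range(1000)]  # range chosen for illustration

# Every key is 32 bytes, and the first 4 bytes (the part used for the slot
# index) differ for each i in this range.
assert all(len(k) == 32 for k in keys)
assert len({k[:4] for k in keys}) == len(keys)
```

The width in `{i:4x}` is a minimum, not a maximum: from `i = 0x10000` on, the hex part needs five characters and the key grows to 33 bytes, so the expression yields 32-byte keys only for smaller `i`.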
