The memory footprint of HashTableNT can get rather large overall; it is the sum of:
- the hashtable itself. To work efficiently, it needs to have a lot of unused buckets, but that does not cost too much memory, as we only store the key idx and the value idx in the hashtable, not the full key and full value. To work efficiently, the hashtable should for sure always be completely in RAM.
- the kv array, which stores all the full keys and full values (a key can e.g. be 256 bits from a hash function, a value can be of any size) in a dense memory block pointed to by uint8_t * self.kv and arranged as key0, value0, key1, value1, ...
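To make the dense layout concrete, here is a minimal sketch of how key/value addresses could be computed from an entry index. This is illustrative only: the helper names and the key_size/value_size parameters are assumptions, not the real HashTableNT API.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical accessors for the dense kv block described above,
 * laid out as key0, value0, key1, value1, ... */
static inline uint8_t *kv_key(uint8_t *kv, size_t idx,
                              size_t key_size, size_t value_size)
{
    /* each entry occupies key_size + value_size bytes */
    return kv + idx * (key_size + value_size);
}

static inline uint8_t *kv_value(uint8_t *kv, size_t idx,
                                size_t key_size, size_t value_size)
{
    /* the value follows directly after the key within the entry */
    return kv + idx * (key_size + value_size) + key_size;
}
```

Because the hashtable buckets only hold indexes, resolving a lookup is one multiply-add into this block, and the block itself can live outside the hashtable's memory (e.g. in an mmap'ed file).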
I guess mmap under tight memory conditions only performs well if there is some locality in the memory accesses. If accesses were spread randomly over the complete arrays, it would not perform great, but maybe still better than swapping (with mmap, clean pages can just be discarded and reloaded instead of having to be swapped out).
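A hedged sketch of what mapping the kv block read-only could look like: with PROT_READ / MAP_PRIVATE, pages stay clean, so under memory pressure the kernel can simply drop them and refault them from the file later, which is the behavior described above. The function name and the idea of a standalone kv file are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map a (hypothetical) kv file read-only; returns NULL on failure.
 * Clean read-only pages can be evicted and refaulted by the kernel
 * instead of going through swap. */
uint8_t *map_kv_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after closing the fd */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return (uint8_t *)p;
}
```

If access patterns are known, madvise(2) hints like MADV_SEQUENTIAL or MADV_RANDOM could additionally steer readahead, but that is an optimization on top of this sketch.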
The locality could be achieved by sorting the kv array by application-specific value tuple members (e.g. for borgbackup: sort by pack_id). Sorting the kv array shall not be part of the initial implementation, but could be done as a future optimization.
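A minimal sketch of that future optimization, under assumed sizes and layout: entries are fixed-size key+value records, and pack_id is taken to be the first 4 bytes of the value (all of this is hypothetical, not the real borgbackup layout).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sizes; the real key/value sizes are configurable. */
enum { KEY_SIZE = 32, VALUE_SIZE = 12, ENTRY_SIZE = KEY_SIZE + VALUE_SIZE };

static uint32_t entry_pack_id(const uint8_t *entry)
{
    uint32_t pack_id;
    /* assumption: pack_id is the first 4 bytes of the value, host endian */
    memcpy(&pack_id, entry + KEY_SIZE, sizeof pack_id);
    return pack_id;
}

static int cmp_by_pack_id(const void *a, const void *b)
{
    uint32_t pa = entry_pack_id(a), pb = entry_pack_id(b);
    return (pa > pb) - (pa < pb);
}

/* Sort the dense kv array so entries of the same pack_id are adjacent,
 * giving page-level locality for pack-ordered access. */
void sort_kv_by_pack_id(uint8_t *kv, size_t num_entries)
{
    qsort(kv, num_entries, ENTRY_SIZE, cmp_by_pack_id);
}
```

Note that after such a sort, the value idx entries stored in the hashtable buckets would have to be rebuilt to point at the entries' new positions, which is part of why this is deferred to a later optimization.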
Update: when borgbackup writes new chunks to a new pack, all kv pairs appended to the kv array will belong to the same pack_id. At restore time, files will be extracted in archive order, which also accesses the hashtable in order of pack_ids. So initially we have some natural locality here. Over time, as chunks get deleted and packs get compacted, the ordering will get mixed up, though.