Goal: fast and easy lookups in a local copy of the "Have I Been Pwned" Passwords database.
Download (torrent or direct) pwned-passwords-sha1-ordered-by-hash-v7.7z from hibp-password, then extract pwned-passwords-sha1-ordered-by-hash-v7.txt:
7z x pwned-passwords-sha1-ordered-by-hash-v7.7z
This will take a while.
Then build the index hibp-sha1.index:
cargo run --release --bin hibp-create-sha1-index
Unless the upstream data has changed, this should result in the following files (only hibp-sha1.index is required for hibp-lookup):
$ stat -c '%12s %n' *sha1*
11048552093 hibp-sha1.index
11803081225 pwned-passwords-sha1-ordered-by-hash-v7.7z
27038651229 pwned-passwords-sha1-ordered-by-hash-v7.txt
So the index is even smaller than the compressed download!
Download (torrent or direct) pwned-passwords-ntlm-ordered-by-hash-v7.7z from hibp-password, then extract pwned-passwords-ntlm-ordered-by-hash-v7.txt:
7z x pwned-passwords-ntlm-ordered-by-hash-v7.7z
This will take a while.
Then build the index hibp-ntlm.index:
cargo run --release --bin hibp-create-ntlm-index
Unless the upstream data has changed, this should result in the following files (only hibp-ntlm.index is required for hibp-lookup):
$ stat -c '%12s %n' *ntlm*
8594181042 hibp-ntlm.index
9175932407 pwned-passwords-ntlm-ordered-by-hash-v7.7z
22129977261 pwned-passwords-ntlm-ordered-by-hash-v7.txt
The lookup requires at least one of hibp-sha1.index and hibp-ntlm.index; to look up hashes of a specific type the corresponding index is always required.
Start the lookup process:
cargo run --release --bin hibp-lookup
Then enter passwords or SHA1/NT hashes on stdin (you can also pipe input in); it will tell you whether each hash is contained in the database or not.
If you enter a password it will prefer doing SHA1 lookups; if only hibp-ntlm.index is present it will use NT hashes for the lookup.
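For example, a single password can be checked non-interactively by piping it in (placeholder password):
echo 'correct horse battery staple' | cargo run --release --bin hibp-lookup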
- The hashes are sorted into buckets
- The bucket index is a fixed-length (bitstring) prefix of the hash
- Right now 20 bits are used, i.e. 2^20 buckets (about 1 million) with about 600 entries each; the first 20 bits of the hash form the bucket index.
- In each bucket only the bytes that are not (fully) part of the bucket index are stored for each entry
- In the current configuration this means 2 bytes less to store per entry
- For each bucket store the file offset where the bucket starts; also store where the last bucket ends
- This table of bucket offsets is stored at the end of the file, compressed with DEFLATE; it is loaded into memory when opening the index.
- To search for an entry, simply look up the bucket's start and end file offsets (the end being the start of the following bucket!)
- Theoretically binary or even interpolation search could be done, as the entries are sorted; right now a linear search (with early abort) is used (see the sketch after this list).
- A more complicated search would require an advanced BufRead implementation (the default Rust one resets the buffer on every seek), and linear search seems fast enough.
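As an illustration, a lookup over this layout could look roughly like the following Rust sketch; the names are made up, the bucket offsets are assumed to be already loaded from the table, the payload size is assumed to be zero, and the actual implementation may differ:

// Minimal lookup sketch (not the actual implementation): assumes `offsets`
// holds the 2^depth + 1 bucket boundaries and that no payload is stored.
use std::cmp::Ordering;
use std::io::{Read, Seek, SeekFrom};

/// Take the first `depth` bits of the hash (big-endian) as the bucket index.
fn bucket_index(hash: &[u8], depth: u32) -> usize {
    if depth == 0 {
        return 0; // depth 0 means a single bucket
    }
    let prefix = u32::from_be_bytes([hash[0], hash[1], hash[2], hash[3]]);
    (prefix >> (32 - depth)) as usize
}

fn contains<R: Read + Seek>(
    file: &mut R,
    offsets: &[u64], // 2^depth + 1 bucket boundaries from the table
    depth: u32,
    hash: &[u8], // full binary hash: 20 bytes for SHA1, 16 for NT
) -> std::io::Result<bool> {
    let bucket = bucket_index(hash, depth);
    let (start, end) = (offsets[bucket], offsets[bucket + 1]);
    // Only the bytes not fully covered by the bucket index are stored per entry.
    let dropped = (depth / 8) as usize;
    let needle = &hash[dropped..];
    let mut entry = vec![0u8; needle.len()];

    file.seek(SeekFrom::Start(start))?;
    let mut pos = start;
    // Linear scan with early abort: entries are sorted, so once we read an
    // entry greater than the needle it cannot be present anymore.
    while pos < end {
        file.read_exact(&mut entry)?;
        match entry.as_slice().cmp(needle) {
            Ordering::Equal => return Ok(true),
            Ordering::Greater => return Ok(false),
            Ordering::Less => pos += entry.len() as u64,
        }
    }
    Ok(false)
}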
The file format also allows for a fixed-size "payload" ("value") per entry; this is not used right now.
Given that the hashes should be distributed evenly, there is no reason to implement "optionally" nested indices.
The selection of the index size ("20 bits") is mostly an optimization to compact common prefixes across many entries and to get away with linear search. If the database grows a lot more, the index size can be increased to 24 bits (the implementation rejects larger indices, as the memory needed to hold the loaded table grows with it: 24 bits already requires 128M, i.e. 2^24 + 1 offsets of 8 bytes each); beyond that, binary / interpolation search would need to be implemented.
The file starts with a short header:
- UTF-8 line: hash-index-v0
- UTF-8 line: the content key type (i.e. type of indexed data); sha1 or nt for this application
- UTF-8 line: free-form description of the data / data source
- all of the above lines are terminated by the (first) \n
- key size in bytes (as a single byte); must not be zero
- payload size in bytes (as a single byte); can be zero
The complete header must be at most 4096 bytes big.
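For illustration, parsing this header could look roughly like the following sketch; the names and error handling are made up, and the 4096-byte limit is not enforced here:

// Minimal header-parsing sketch for the format described above.
use std::io::{self, BufRead, Read};

struct Header {
    key_type: String,    // e.g. "sha1" or "nt"
    description: String, // free-form
    key_size: u8,        // must not be zero
    payload_size: u8,    // may be zero
}

// Read one UTF-8 line terminated by the first '\n' (the '\n' is stripped).
fn read_line<R: BufRead>(r: &mut R) -> io::Result<String> {
    let mut buf = Vec::new();
    r.read_until(b'\n', &mut buf)?;
    if buf.last() == Some(&b'\n') {
        buf.pop();
    }
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn read_header<R: BufRead>(r: &mut R) -> io::Result<Header> {
    if read_line(r)? != "hash-index-v0" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unexpected magic line"));
    }
    let key_type = read_line(r)?;
    let description = read_line(r)?;
    let mut sizes = [0u8; 2]; // key size, then payload size
    r.read_exact(&mut sizes)?;
    if sizes[0] == 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "key size must not be zero"));
    }
    Ok(Header { key_type, description, key_size: sizes[0], payload_size: sizes[1] })
}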
Now the buckets (i.e. their entries) follow; technically they could be anywhere in the file, and there can be unused parts in the file (but there can't be any space between buckets).
The location of the buckets is described in the "table"; the DEFLATE-compressed table size is stored as a big-endian unsigned 32-bit number in the last 4 bytes of the index. The (compressed) table itself is stored directly before that.
The uncompressed table contains:
- the table "depth": a single byte, describing the length of the bitstring prefix to use as index (i.e. in bits, not in bytes!)
- must not exceed 24 (otherwise the table gets rather large)
- could be zero - resulting in a single bucket
- for each bucket the file offset (big-endian unsigned 64-bit number) where its entries start; 2^depth + 1 offsets in total
- the following offset is where a bucket's entries end; that is why one additional offset is included at the end, marking the end of the last bucket
- must not contain any other data
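For illustration, loading the table could look roughly like the following sketch; it assumes the flate2 crate and a raw DEFLATE stream, which may not match the actual implementation:

// Sketch of loading the bucket table from the end of an index file.
use flate2::read::DeflateDecoder;
use std::io::{self, Read, Seek, SeekFrom};

fn read_table<R: Read + Seek>(file: &mut R) -> io::Result<(u8, Vec<u64>)> {
    // The last 4 bytes hold the compressed table size as a big-endian u32.
    let file_len = file.seek(SeekFrom::End(0))?;
    file.seek(SeekFrom::End(-4))?;
    let mut len_buf = [0u8; 4];
    file.read_exact(&mut len_buf)?;
    let table_len = u32::from_be_bytes(len_buf) as u64;

    // The compressed table is stored directly before the length field.
    file.seek(SeekFrom::Start(file_len - 4 - table_len))?;
    let mut table = Vec::new();
    DeflateDecoder::new(file.take(table_len)).read_to_end(&mut table)?;

    // First byte: depth in bits (at most 24); then 2^depth + 1 big-endian u64 offsets.
    let depth = *table
        .first()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "empty table"))?;
    if depth > 24 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "table depth exceeds 24"));
    }
    let expected = (1usize << depth) + 1;
    if table.len() != 1 + expected * 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unexpected table size"));
    }
    let offsets = table[1..]
        .chunks_exact(8)
        .map(|c| {
            let mut b = [0u8; 8];
            b.copy_from_slice(c);
            u64::from_be_bytes(b)
        })
        .collect();
    Ok((depth, offsets))
}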
The original text file uses a hexadecimal representation of the hashes; the 7z compression will mostly undo that overhead (i.e. use about half the number of bytes, as if the hash were stored as binary); it should also be able to compress shared prefixes of sequential entries. The original text file also includes a "prevalence" number after each entry.
The created index probably can't be compressed much - within a bucket the entries might still have some "structure" in the first few bits of each entry (as they are sorted and should be distributed quite evenly), but then high entropy should follow - after all, compressing a cryptographic hash should be difficult :)