Skip to content

guard front_hash key lookup against keys shorter than the prefix#2641

Closed
uwezkhan wants to merge 1 commit into
stephenberry:mainfrom
uwezkhan:front-hash-short-key-bounds
Closed

guard front_hash key lookup against keys shorter than the prefix#2641
uwezkhan wants to merge 1 commit into
stephenberry:mainfrom
uwezkhan:front-hash-short-key-bounds

Conversation

@uwezkhan

Copy link
Copy Markdown
Contributor

decode_hash_with_size selects a front_hash path for key sets distinguished by their first 2, 4, or 8 bytes, and that specialization reads front_hash_bytes from the key with memcpy without checking the key length, unlike every sibling hash type which guards its read against end. BSON, MessagePack, CBOR, CSV, and TOML decode in place and call it with end = key.data() + key.size(), so a document key shorter than front_hash_bytes reads past the key and off the end of the input buffer. A one-byte key at the tail of a minimal BSON document reads seven bytes past the allocation, which ASAN flags as a heap-buffer-overflow read of size 8.

The guard rejects keys shorter than front_hash_bytes before the read and returns the not-found index. front_hash is only chosen when every key is at least front_hash_bytes long, so before the change a valid key always carried enough bytes, and after it the only keys turned away are ones too short to match any field, which leaves correct lookups unchanged. Putting the bound in the shared hash callee covers all five readers in one place rather than each reader, and it mirrors the decode_hash<JSON> front_hash variant that already guards against end.

The size-aware decode_hash_with_size front_hash path reads front_hash_bytes from a key with memcpy without checking the key length. The non-padded readers pass end = key.data() + key.size(), so a key shorter than that prefix reads off the end of the input buffer.
@stephenberry

Copy link
Copy Markdown
Owner

Thanks for catching this and for the clean ASAN repro.

The same short-key over-read turns out to affect the sibling key-hash readers too — the sized unique_index path, mod4/xor_mod4/minus_mod4, and three_element_unique_index (with unique_index == 0). So rather than guard front_hash alone, I've opened #2643, which pre-screens the key length once in the shared decode_hash_with_size entry point: a key whose length is outside [min_length, max_length] can't match any field, and rejecting it there bounds every reader (including front_hash) in one place. That supersedes the per-reader guard here.

stephenberry added a commit that referenced this pull request Jun 17, 2026
* Pre-screen key length in the shared hash dispatch

The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand
decode_hash_with_size::op the raw key view (op(key.data(), key.data() +
key.size(), key.size())). Several of the per-type readers dereference key bytes
without bounding the read, so a foreign key shorter than the bytes a reader
touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads
front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4
family reads *it, and three_element_unique_index reads it[0].

Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The
wrapper rejects any key whose length is outside [min_length, max_length] before
dispatching, which bounds every reader's key access in one place: each reader
only dereferences offsets below min_length (front_hash_bytes <= min_length,
unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at
most n bytes, so an out-of-range key can never be hashed out of bounds. The
individual readers therefore carry no per-read bounds checks; unique_per_length
keeps its end check because its length-indexed table maps absent lengths to 255.

Generalizes the front_hash bound from #2641 to the whole reader set. Adds BSON
tests for each reader: a short/empty key reproduces the over-read under ASAN and
valid documents still round-trip; static_asserts pin the selected hash.

* Drop CBOR's redundant call-site key-length pre-filter

decode_hash_with_size now pre-screens the key length against
[min_length, max_length] for every format, so CBOR's call-site
key_len < min_length || key_len > max_length ternary is redundant. The buffer
bound (end - it < key_len) above it stays, since that is what guarantees the key
bytes exist before the hash reads them.

* Drop stencilcount's redundant call-site key-length pre-filter

Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the
key length, so stencilcount's key.size() < min_length || key.size() > max_length
filter is no longer needed. This also matches the sibling stencil.hpp call sites,
which already call op() directly.
@uwezkhan

Copy link
Copy Markdown
Contributor Author

Agreed, #2643 is the better fix. Since front_hash_bytes <= min_length, the single [min_length, max_length] screen subsumes the guard here and also bounds the unique_index, mod4, and three_element readers I didn't touch, so one check instead of five per reader. Good call leaving unique_per_length its own end check, since its length-indexed table maps absent lengths to 255 and that can land in a gap the range filter won't reject.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants