guard front_hash key lookup against keys shorter than the prefix #2641
uwezkhan wants to merge 1 commit into
The size-aware decode_hash_with_size front_hash path reads front_hash_bytes from a key with memcpy without checking the key length. The non-padded readers pass end = key.data() + key.size(), so a key shorter than that prefix reads off the end of the input buffer.
Thanks for catching this and for the clean ASAN repro. The same short-key over-read turns out to affect the sibling key-hash readers too: the sized ...
* Pre-screen key length in the shared hash dispatch

  The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand decode_hash_with_size::op the raw key view (op(key.data(), key.data() + key.size(), key.size())). Several of the per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4 family reads *it, and three_element_unique_index reads it[0].

  Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching, which bounds every reader's key access in one place: each reader only dereferences offsets below min_length (front_hash_bytes <= min_length, unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at most n bytes, so an out-of-range key can never be hashed out of bounds. The individual readers therefore carry no per-read bounds checks; unique_per_length keeps its end check because its length-indexed table maps absent lengths to 255.

  Generalizes the front_hash bound from #2641 to the whole reader set.

  Adds BSON tests for each reader: a short/empty key reproduces the over-read under ASAN and valid documents still round-trip; static_asserts pin the selected hash.

* Drop CBOR's redundant call-site key-length pre-filter

  decode_hash_with_size now pre-screens the key length against [min_length, max_length] for every format, so CBOR's call-site key_len < min_length || key_len > max_length ternary is redundant. The buffer bound (end - it < key_len) above it stays, since that is what guarantees the key bytes exist before the hash reads them.

* Drop stencilcount's redundant call-site key-length pre-filter

  Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the key length, so stencilcount's key.size() < min_length || key.size() > max_length filter is no longer needed. This also matches the sibling stencil.hpp call sites, which already call op() directly.
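For readers skimming the commit message, the wrapper/_impl split from the first commit can be sketched roughly as follows. This is a minimal illustration, not the library's code: key_info, front_hash_impl, lookup, not_found, and the placeholder bucket computation are all hypothetical names and values chosen for the example. The only property it is meant to show is that a single [min_length, max_length] screen in the wrapper lets the reader copy front_hash_bytes without a bounds check of its own, as long as front_hash_bytes <= min_length.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical stand-in for the per-type key metadata (assumed, not the library's type).
struct key_info {
    std::size_t min_length;       // length of the shortest known key
    std::size_t max_length;       // length of the longest known key
    std::size_t front_hash_bytes; // prefix width the reader copies; assumed <= min_length and <= 8
};

constexpr std::size_t not_found = static_cast<std::size_t>(-1);

// Reader with no bounds check of its own: it relies entirely on the wrapper's screen.
inline std::size_t front_hash_impl(const key_info& info, const char* it) {
    std::uint64_t prefix = 0;
    std::memcpy(&prefix, it, info.front_hash_bytes); // touches offsets [0, front_hash_bytes)
    return static_cast<std::size_t>(prefix % 16);    // placeholder bucket computation
}

// Thin wrapper: reject out-of-range lengths before any key byte is touched.
inline std::size_t lookup(const key_info& info, std::string_view key) {
    if (key.size() < info.min_length || key.size() > info.max_length) {
        return not_found;
    }
    return front_hash_impl(info, key.data());
}
```

With min_length = 2 and front_hash_bytes = 2, a one-byte or empty key is turned away by the wrapper, so the memcpy in the reader never runs on it.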
Agreed, #2643 is the better fix. Since front_hash_bytes <= min_length, the single [min_length, max_length] screen subsumes the guard here and also bounds the unique_index, mod4, and three_element readers I didn't touch, so one check instead of five per reader. Good call leaving unique_per_length its own end check, since its length-indexed table maps absent lengths to 255 and that can land in a gap the range filter won't reject.
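To make the unique_per_length point concrete, here is a small sketch of why a range filter alone is not enough when the known key lengths have a gap. The table layout is an assumption made for illustration (the source only says the table is length-indexed and maps absent lengths to 255); offset_for_length, unique_per_length_lookup, and the example keys "id" and "title" are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr std::size_t min_length = 2;   // shortest known key ("id")
constexpr std::size_t max_length = 5;   // longest known key ("title")
constexpr std::uint8_t absent = 255;    // sentinel for lengths with no known key
constexpr std::size_t not_found = static_cast<std::size_t>(-1);

// Assumed layout: offset_for_length[n] is the byte offset that identifies a key of length n.
// Lengths 3 and 4 lie inside [min_length, max_length] but have no key, so they map to 255.
constexpr std::array<std::uint8_t, max_length + 1> offset_for_length = {
    absent, absent, 0, absent, absent, 0};

inline std::size_t unique_per_length_lookup(std::string_view key) {
    const std::size_t n = key.size();
    if (n < min_length || n > max_length) {
        return not_found; // shared range screen
    }
    const std::size_t offset = offset_for_length[n];
    if (offset >= n) {
        return not_found; // the reader's own end check: catches the 255 sentinel in the gap
    }
    return static_cast<unsigned char>(key[offset]); // placeholder hash: the identifying byte
}
```

A three-byte key passes the range screen here, and without the second check the reader would index offset 255 of a three-byte key.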
decode_hash_with_size selects a front_hash path for key sets distinguished by their first 2, 4, or 8 bytes, and that specialization reads front_hash_bytes from the key with memcpy without checking the key length, unlike every sibling hash type, which guards its read against end. BSON, MessagePack, CBOR, CSV, and TOML decode in place and call it with end = key.data() + key.size(), so a document key shorter than front_hash_bytes reads past the key and off the end of the input buffer. A one-byte key at the tail of a minimal BSON document reads seven bytes past the allocation, which ASAN flags as a heap-buffer-overflow read of size 8.
The guard rejects keys shorter than front_hash_bytes before the read and returns the not-found index. front_hash is only chosen when every key is at least front_hash_bytes long, so before the change a valid key always carried enough bytes, and after it the only keys turned away are ones too short to match any field, which leaves correct lookups unchanged. Putting the bound in the shared hash callee covers all five readers in one place rather than in each reader, and it mirrors the decode_hash<JSON> front_hash variant that already guards against end.
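As a rough illustration of the guard described above, the sketch below checks the remaining key bytes against end before the memcpy and returns a not-found index otherwise. It is not the patch itself: front_hash_lookup, load_prefix, lookup, the value 255 for not_found, and the two known keys are assumptions made for the example, and front_hash_bytes is fixed at 8 to match the size-8 read in the ASAN report.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

constexpr std::size_t front_hash_bytes = 8; // assumed prefix width
constexpr std::size_t not_found = 255;      // assumed "no such field" index

// Copy the first front_hash_bytes of p into an integer; the caller guarantees they exist.
inline std::uint64_t load_prefix(const char* p) {
    std::uint64_t v = 0;
    std::memcpy(&v, p, front_hash_bytes);
    return v;
}

// Guarded lookup: keys shorter than the prefix are rejected before any read.
inline std::size_t front_hash_lookup(const char* it, const char* end) {
    if (static_cast<std::size_t>(end - it) < front_hash_bytes) {
        return not_found; // too short to match any field, and too short to read safely
    }
    static constexpr const char* known[] = {"username", "password"}; // hypothetical fields
    const std::uint64_t prefix = load_prefix(it);
    for (std::size_t i = 0; i < 2; ++i) {
        if (prefix == load_prefix(known[i])) {
            return i;
        }
    }
    return not_found;
}

// Mirrors the in-place decoders' call shape: end = key.data() + key.size().
inline std::size_t lookup(std::string_view key) {
    return front_hash_lookup(key.data(), key.data() + key.size());
}
```

Without the length check, a one-byte key would have the memcpy read seven bytes beyond the key, which is the over-read the description reports.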