guard front_hash key lookup against keys shorter than the prefix by uwezkhan · Pull Request #2641 · stephenberry/glaze

uwezkhan · 2026-06-17T13:59:34Z

decode_hash_with_size selects a front_hash path for key sets distinguished by their first 2, 4, or 8 bytes, and that specialization reads front_hash_bytes from the key with memcpy without checking the key length, unlike every sibling hash type which guards its read against end. BSON, MessagePack, CBOR, CSV, and TOML decode in place and call it with end = key.data() + key.size(), so a document key shorter than front_hash_bytes reads past the key and off the end of the input buffer. A one-byte key at the tail of a minimal BSON document reads seven bytes past the allocation, which ASAN flags as a heap-buffer-overflow read of size 8.

The guard rejects keys shorter than front_hash_bytes before the read and returns the not-found index. front_hash is only chosen when every key is at least front_hash_bytes long, so before the change a valid key always carried enough bytes, and after it the only keys turned away are ones too short to match any field, which leaves correct lookups unchanged. Putting the bound in the shared hash callee covers all five readers in one place rather than each reader, and it mirrors the decode_hash<JSON> front_hash variant that already guards against end.

The size-aware decode_hash_with_size front_hash path reads front_hash_bytes from a key with memcpy without checking the key length. The non-padded readers pass end = key.data() + key.size(), so a key shorter than that prefix reads off the end of the input buffer.

stephenberry · 2026-06-17T18:59:32Z

Thanks for catching this and for the clean ASAN repro.

The same short-key over-read turns out to affect the sibling key-hash readers too — the sized unique_index path, mod4/xor_mod4/minus_mod4, and three_element_unique_index (with unique_index == 0). So rather than guard front_hash alone, I've opened #2643, which pre-screens the key length once in the shared decode_hash_with_size entry point: a key whose length is outside [min_length, max_length] can't match any field, and rejecting it there bounds every reader (including front_hash) in one place. That supersedes the per-reader guard here.

* Pre-screen key length in the shared hash dispatch The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand decode_hash_with_size::op the raw key view (op(key.data(), key.data() + key.size(), key.size())). Several of the per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4 family reads *it, and three_element_unique_index reads it[0]. Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching, which bounds every reader's key access in one place: each reader only dereferences offsets below min_length (front_hash_bytes <= min_length, unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at most n bytes, so an out-of-range key can never be hashed out of bounds. The individual readers therefore carry no per-read bounds checks; unique_per_length keeps its end check because its length-indexed table maps absent lengths to 255. Generalizes the front_hash bound from #2641 to the whole reader set. Adds BSON tests for each reader: a short/empty key reproduces the over-read under ASAN and valid documents still round-trip; static_asserts pin the selected hash. * Drop CBOR's redundant call-site key-length pre-filter decode_hash_with_size now pre-screens the key length against [min_length, max_length] for every format, so CBOR's call-site key_len < min_length || key_len > max_length ternary is redundant. The buffer bound (end - it < key_len) above it stays, since that is what guarantees the key bytes exist before the hash reads them. * Drop stencilcount's redundant call-site key-length pre-filter Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the key length, so stencilcount's key.size() < min_length || key.size() > max_length filter is no longer needed. This also matches the sibling stencil.hpp call sites, which already call op() directly.

uwezkhan · 2026-06-18T09:43:00Z

Agreed, #2643 is the better fix. Since front_hash_bytes <= min_length, the single [min_length, max_length] screen subsumes the guard here and also bounds the unique_index, mod4, and three_element readers I didn't touch, so one check instead of five per reader. Good call leaving unique_per_length its own end check, since its length-indexed table maps absent lengths to 255 and that can land in a gap the range filter won't reject.

stephenberry mentioned this pull request Jun 17, 2026

Pre-screen key length in the shared hash dispatch #2643

Merged

stephenberry closed this Jun 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

guard front_hash key lookup against keys shorter than the prefix#2641

guard front_hash key lookup against keys shorter than the prefix#2641
uwezkhan wants to merge 1 commit into
stephenberry:mainfrom
uwezkhan:front-hash-short-key-bounds

uwezkhan commented Jun 17, 2026

Uh oh!

stephenberry commented Jun 17, 2026

Uh oh!

uwezkhan commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

uwezkhan commented Jun 17, 2026

Uh oh!

stephenberry commented Jun 17, 2026

Uh oh!

uwezkhan commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants