Pre-screen key length in the shared hash dispatch#2643
Merged
Conversation
The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand decode_hash_with_size::op the raw key view (op(key.data(), key.data() + key.size(), key.size())). Several of the per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4 family reads *it, and three_element_unique_index reads it[0]. Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching, which bounds every reader's key access in one place: each reader only dereferences offsets below min_length (front_hash_bytes <= min_length, unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at most n bytes, so an out-of-range key can never be hashed out of bounds. The individual readers therefore carry no per-read bounds checks; unique_per_length keeps its end check because its length-indexed table maps absent lengths to 255. Generalizes the front_hash bound from #2641 to the whole reader set. Adds BSON tests for each reader: a short/empty key reproduces the over-read under ASAN and valid documents still round-trip; static_asserts pin the selected hash.
decode_hash_with_size now pre-screens the key length against [min_length, max_length] for every format, so CBOR's call-site key_len < min_length || key_len > max_length ternary is redundant. The buffer bound (end - it < key_len) above it stays, since that is what guarantees the key bytes exist before the hash reads them.
Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the key length, so stencilcount's key.size() < min_length || key.size() > max_length filter is no longer needed. This also matches the sibling stencil.hpp call sites, which already call op() directly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Splits
decode_hash_with_sizeinto a thin wrapper that pre-screens the key length once and a per-typedecode_hash_with_size_implreader. The wrapper rejects any key whose length is outside[min_length, max_length]before dispatching:Why
The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) call
op(key.data(), key.data() + key.size(), key.size())with the raw, unpadded key view. Several per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow):front_hashreadsfront_hash_bytesunique_indexpath readsit[unique_index]mod4/xor_mod4/minus_mod4read*itthree_element_unique_indexreadsit[0]Why one pre-screen suffices
Every reader's maximum read offset is provably below
min_length(front_hash_bytes <= min_length,unique_index < min_length, and the mod4 family reads offset 0 wheremin_length >= 1), or it reads at mostnbytes. A key whose length is outside[min_length, max_length]can never match a reflected key, so rejecting it up front bounds every reader's key access in one place. The individual readers therefore drop their per-read bounds checks.The one exception is
unique_per_length: its read offset is indexed by key length andunique_per_length_infomaps absent lengths to 255, so a foreign key whose length falls in a gap of[min_length, max_length]would readit[255]. The length pre-screen can't catch that, so that reader keeps its ownendcheck (commented as such).Relationship to #2641
This generalizes the
front_hashbound from #2641 to the whole reader set with a single mechanism, and supersedes that PR's front_hash-specific guard. If #2641 lands first I'll rebase to drop the now-redundant guard. CBOR's call-site length pre-filter (cbor/read.hpp) also becomes redundant and can be removed as a follow-up.Tests / verification
Adds BSON tests for each reader (
front_hash, sizedunique_index,mod4,three_element_unique_index): a short/empty key on an exact-size heap allocation reproduces the over-read under ASAN, and valid documents still round-trip.static_asserts pin the selected hash so each case keeps exercising a distinct reader.Verified locally under ASAN + UBSAN with zero regressions: bson 117, json 681, cbor 233, toml 384, csv 126, msgpack 51, plus the reflection/hashing suites. Neutering the pre-screen makes ASAN fire
heap-buffer-overflow READ of size 4(the front_hash reader), confirming the wrapper is the single bound.