Skip to content

Pre-screen key length in the shared hash dispatch#2643

Merged
stephenberry merged 3 commits into
mainfrom
prescreen-key-hash-length
Jun 17, 2026
Merged

Pre-screen key length in the shared hash dispatch#2643
stephenberry merged 3 commits into
mainfrom
prescreen-key-hash-length

Conversation

@stephenberry

Copy link
Copy Markdown
Owner

What

Splits decode_hash_with_size into a thin wrapper that pre-screens the key length once and a per-type decode_hash_with_size_impl reader. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching:

template <uint32_t Format, class T, auto HashInfo, hash_type Type>
struct decode_hash_with_size {
   static constexpr auto N = reflect<T>::size;
   GLZ_ALWAYS_INLINE static constexpr size_t op(auto&& it, auto&& end, const size_t n) noexcept {
      if (n < HashInfo.min_length || n > HashInfo.max_length) [[unlikely]] return N;
      return decode_hash_with_size_impl<Format, T, HashInfo, Type>::op(it, end, n);
   }
};

Why

The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) call op(key.data(), key.data() + key.size(), key.size()) with the raw, unpadded key view. Several per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow):

  • front_hash reads front_hash_bytes
  • the sized unique_index path reads it[unique_index]
  • mod4 / xor_mod4 / minus_mod4 read *it
  • three_element_unique_index reads it[0]

Why one pre-screen suffices

Every reader's maximum read offset is provably below min_length (front_hash_bytes <= min_length, unique_index < min_length, and the mod4 family reads offset 0 where min_length >= 1), or it reads at most n bytes. A key whose length is outside [min_length, max_length] can never match a reflected key, so rejecting it up front bounds every reader's key access in one place. The individual readers therefore drop their per-read bounds checks.

The one exception is unique_per_length: its read offset is indexed by key length and unique_per_length_info maps absent lengths to 255, so a foreign key whose length falls in a gap of [min_length, max_length] would read it[255]. The length pre-screen can't catch that, so that reader keeps its own end check (commented as such).

Relationship to #2641

This generalizes the front_hash bound from #2641 to the whole reader set with a single mechanism, and supersedes that PR's front_hash-specific guard. If #2641 lands first I'll rebase to drop the now-redundant guard. CBOR's call-site length pre-filter (cbor/read.hpp) also becomes redundant and can be removed as a follow-up.

Tests / verification

Adds BSON tests for each reader (front_hash, sized unique_index, mod4, three_element_unique_index): a short/empty key on an exact-size heap allocation reproduces the over-read under ASAN, and valid documents still round-trip. static_asserts pin the selected hash so each case keeps exercising a distinct reader.

Verified locally under ASAN + UBSAN with zero regressions: bson 117, json 681, cbor 233, toml 384, csv 126, msgpack 51, plus the reflection/hashing suites. Neutering the pre-screen makes ASAN fire heap-buffer-overflow READ of size 4 (the front_hash reader), confirming the wrapper is the single bound.

The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand
decode_hash_with_size::op the raw key view (op(key.data(), key.data() +
key.size(), key.size())). Several of the per-type readers dereference key bytes
without bounding the read, so a foreign key shorter than the bytes a reader
touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads
front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4
family reads *it, and three_element_unique_index reads it[0].

Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The
wrapper rejects any key whose length is outside [min_length, max_length] before
dispatching, which bounds every reader's key access in one place: each reader
only dereferences offsets below min_length (front_hash_bytes <= min_length,
unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at
most n bytes, so an out-of-range key can never be hashed out of bounds. The
individual readers therefore carry no per-read bounds checks; unique_per_length
keeps its end check because its length-indexed table maps absent lengths to 255.

Generalizes the front_hash bound from #2641 to the whole reader set. Adds BSON
tests for each reader: a short/empty key reproduces the over-read under ASAN and
valid documents still round-trip; static_asserts pin the selected hash.
decode_hash_with_size now pre-screens the key length against
[min_length, max_length] for every format, so CBOR's call-site
key_len < min_length || key_len > max_length ternary is redundant. The buffer
bound (end - it < key_len) above it stays, since that is what guarantees the key
bytes exist before the hash reads them.
Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the
key length, so stencilcount's key.size() < min_length || key.size() > max_length
filter is no longer needed. This also matches the sibling stencil.hpp call sites,
which already call op() directly.
@stephenberry stephenberry merged commit f2d1a07 into main Jun 17, 2026
53 checks passed
@stephenberry stephenberry deleted the prescreen-key-hash-length branch June 17, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant