Pre-screen key length in the shared hash dispatch by stephenberry · Pull Request #2643 · stephenberry/glaze

stephenberry · 2026-06-17T18:53:44Z

What

Splits decode_hash_with_size into a thin wrapper that pre-screens the key length once and a per-type decode_hash_with_size_impl reader. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching:

template <uint32_t Format, class T, auto HashInfo, hash_type Type>
struct decode_hash_with_size {
   static constexpr auto N = reflect<T>::size; // returned when no reflected key matches
   GLZ_ALWAYS_INLINE static constexpr size_t op(auto&& it, auto&& end, const size_t n) noexcept {
      // A key whose length is outside [min_length, max_length] cannot match any reflected key.
      if (n < HashInfo.min_length || n > HashInfo.max_length) [[unlikely]] return N;
      // Length is in range: hand off to the per-type reader.
      return decode_hash_with_size_impl<Format, T, HashInfo, Type>::op(it, end, n);
   }
};
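To make the wrapper/reader split concrete outside of glaze, here is a minimal standalone sketch. Everything in it is hypothetical (the toy_ names, the three-key set, and the final full-key comparison are assumptions for illustration, not glaze code); it only shows a reader that reads it[0] unchecked becoming safe once the wrapper has screened the length.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Toy key set standing in for a reflected struct's member names (hypothetical).
constexpr std::string_view toy_keys[] = {"id", "name", "email"};
constexpr std::size_t toy_n = 3;          // plays the role of N: "no matching key"
constexpr std::size_t toy_min_length = 2; // shortest key: "id"
constexpr std::size_t toy_max_length = 5; // longest key: "email"

// Hypothetical per-type reader: picks a candidate from it[0] with no bounds
// check, so it is only safe when the caller guarantees n >= 1.
struct toy_impl {
   static constexpr std::size_t op(const char* it, const char* /*end*/, std::size_t n) noexcept {
      std::size_t candidate = toy_n;
      switch (it[0]) {
      case 'i': candidate = 0; break;
      case 'n': candidate = 1; break;
      case 'e': candidate = 2; break;
      default: return toy_n;
      }
      // Full comparison added for this sketch so the result is exact.
      return std::string_view{it, n} == toy_keys[candidate] ? candidate : toy_n;
   }
};

// Thin wrapper: screen the length once, then dispatch to the reader.
struct toy_decode {
   static constexpr std::size_t op(const char* it, const char* end, std::size_t n) noexcept {
      if (n < toy_min_length || n > toy_max_length) return toy_n;
      return toy_impl::op(it, end, n);
   }
};

// Call shape used by the in-place decoders: raw, unpadded key view.
constexpr std::size_t lookup(std::string_view key) noexcept {
   return toy_decode::op(key.data(), key.data() + key.size(), key.size());
}
```

With this sketch, lookup("") returns toy_n without toy_impl ever touching it[0], because toy_min_length >= 1 means the screen has already rejected the empty key.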

Why

The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) call op(key.data(), key.data() + key.size(), key.size()) with the raw, unpadded key view. Several per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow):

front_hash reads front_hash_bytes
the sized unique_index path reads it[unique_index]
mod4 / xor_mod4 / minus_mod4 read *it
three_element_unique_index reads it[0]
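As an illustration of the front_hash case, the sketch below shows a fixed-width prefix load and the length precondition it depends on. The prefix width of 4 and the function names are assumptions for this example; the real reader's load may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Assumed prefix width for this sketch.
constexpr std::size_t front_hash_bytes = 4;

// Unsafe on its own: always copies front_hash_bytes bytes, whatever the key
// length, so a shorter key on an exact-size buffer is read past its end.
inline std::uint32_t load_prefix_unchecked(const char* it) noexcept {
   std::uint32_t h;
   std::memcpy(&h, it, front_hash_bytes);
   return h;
}

// Safe entry point: establishes n >= front_hash_bytes before the load.
// Since front_hash_bytes <= min_length, a check of n >= min_length made once
// by a caller implies the same precondition.
inline std::optional<std::uint32_t> load_prefix(std::string_view key) noexcept {
   if (key.size() < front_hash_bytes) return std::nullopt;
   return load_prefix_unchecked(key.data());
}
```

Only the first front_hash_bytes bytes feed the result, so two keys sharing a 4-byte prefix load the same value, while a 3-byte key is rejected before any byte is read.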

Why one pre-screen suffices

Every reader's maximum read offset is provably below min_length (front_hash_bytes <= min_length, unique_index < min_length, and the mod4 family reads offset 0 where min_length >= 1), or it reads at most n bytes. A key whose length is outside [min_length, max_length] can never match a reflected key, so rejecting it up front bounds every reader's key access in one place. The individual readers therefore drop their per-read bounds checks.
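The argument reduces to one inequality per reader: the largest offset it touches must be below min_length. A small compile-time sketch of that check follows; the helper names and the toy key set are hypothetical and not part of glaze.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

struct length_bounds {
   std::size_t min_length;
   std::size_t max_length;
};

// Computes [min_length, max_length] over a non-empty key set.
template <std::size_t K>
constexpr length_bounds bounds_of(const std::array<std::string_view, K>& keys) noexcept {
   static_assert(K > 0);
   length_bounds b{keys[0].size(), keys[0].size()};
   for (std::size_t i = 1; i < K; ++i) {
      if (keys[i].size() < b.min_length) b.min_length = keys[i].size();
      if (keys[i].size() > b.max_length) b.max_length = keys[i].size();
   }
   return b;
}

// A reader is covered by the pre-screen when its largest read offset is below
// min_length: every key that passes the screen has at least min_length bytes.
constexpr bool reader_is_bounded(std::size_t max_read_offset, length_bounds b) noexcept {
   return max_read_offset < b.min_length;
}

constexpr std::array<std::string_view, 3> toy_keys{"id", "name", "email"};
constexpr length_bounds toy_bounds = bounds_of(toy_keys);

static_assert(toy_bounds.min_length == 2 && toy_bounds.max_length == 5);
static_assert(reader_is_bounded(0, toy_bounds));  // mod4 family: reads offset 0
static_assert(reader_is_bounded(1, toy_bounds));  // a unique_index of 1
static_assert(!reader_is_bounded(2, toy_bounds)); // offset 2 would exceed "id"
```

A front_hash reader with front_hash_bytes <= min_length fits the same test, since its largest offset is front_hash_bytes - 1.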

The one exception is unique_per_length: its read offset is indexed by key length and unique_per_length_info maps absent lengths to 255, so a foreign key whose length falls in a gap of [min_length, max_length] would read it[255]. The length pre-screen can't catch that, so that reader keeps its own end check (commented as such).
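A sketch of why the gap case needs its own check. The table layout, key set, and names below are assumptions for illustration and are not the actual unique_per_length_info structure; the point is only that a length inside [min_length, max_length] can still map to the 255 sentinel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical key set: two keys of length 2, two of length 5, none of 3 or 4.
constexpr std::string_view upl_keys[] = {"id", "ip", "token", "total"};
constexpr std::size_t no_match = 4;
constexpr std::size_t min_length = 2;
constexpr std::size_t max_length = 5;
constexpr std::uint8_t absent = 255;

// offset_for_length[n]: the byte offset that distinguishes keys of length n,
// or 255 when no key has that length.
constexpr std::array<std::uint8_t, 6> offset_for_length{absent, absent, 1, absent, absent, 2};

constexpr std::size_t upl_lookup(const char* it, const char* end, std::size_t n) noexcept {
   if (n < min_length || n > max_length) return no_match; // the shared pre-screen
   const std::size_t pos = offset_for_length[n];
   // Reader's own end check: lengths 3 and 4 pass the pre-screen but map to
   // 255, which would otherwise become a read of it[255].
   if (pos >= static_cast<std::size_t>(end - it)) return no_match;
   std::size_t candidate = no_match;
   switch (it[pos]) {
   case 'd': candidate = 0; break; // "id"
   case 'p': candidate = 1; break; // "ip"
   case 'k': candidate = 2; break; // "token"
   case 't': candidate = 3; break; // "total"
   default: return no_match;
   }
   // Full comparison added for this sketch so the result is exact.
   return std::string_view{it, n} == upl_keys[candidate] ? candidate : no_match;
}

constexpr std::size_t upl(std::string_view key) noexcept {
   return upl_lookup(key.data(), key.data() + key.size(), key.size());
}
```

Here upl("abc") has length 3, which is inside [2, 5], so only the end check stops the lookup from indexing at offset 255.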

Relationship to #2641

This generalizes the front_hash bound from #2641 to the whole reader set with a single mechanism, and supersedes that PR's front_hash-specific guard. If #2641 lands first I'll rebase to drop the now-redundant guard. CBOR's call-site length pre-filter (cbor/read.hpp) also becomes redundant and can be removed as a follow-up.

Tests / verification

Adds BSON tests for each reader (front_hash, sized unique_index, mod4, three_element_unique_index): a short/empty key on an exact-size heap allocation reproduces the over-read under ASAN, and valid documents still round-trip. static_asserts pin the selected hash so each case keeps exercising a distinct reader.
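The exact-size allocation is what lets ASAN see the over-read: with no terminator or padding after the key, any read past n bytes lands in the redzone. A generic sketch of that test shape follows, using a hypothetical helper and a toy lookup rather than the actual BSON tests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>

// Hypothetical helper: copies the key into a heap buffer of exactly key.size()
// bytes (no terminator, no padding) and calls fn(begin, end, n) on the copy.
template <class Fn>
auto with_exact_heap_copy(std::string_view key, Fn&& fn) {
   const std::size_t n = key.size();
   std::unique_ptr<char[]> buf(new char[n]);
   if (n != 0) std::memcpy(buf.get(), key.data(), n);
   const char* begin = buf.get();
   return fn(begin, begin + n, n);
}

// Toy lookup under test for the keys {"id", "name", "email"}; 3 means no match.
inline std::size_t toy_lookup(const char* it, const char* /*end*/, std::size_t n) noexcept {
   if (n < 2 || n > 5) return 3; // removing this line makes the empty-key case read it[0]
   const std::string_view key{it, n};
   switch (it[0]) {
   case 'i': return key == "id" ? 0 : 3;
   case 'n': return key == "name" ? 1 : 3;
   case 'e': return key == "email" ? 2 : 3;
   default: return 3;
   }
}
```

Built with -fsanitize=address, a call such as with_exact_heap_copy("", toy_lookup) should stay clean with the length screen in place and is expected to be reported once the screen is removed, which mirrors the "neutering the pre-screen" check described below.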

Verified locally under ASAN + UBSAN with zero regressions: bson 117, json 681, cbor 233, toml 384, csv 126, msgpack 51, plus the reflection/hashing suites. Neutering the pre-screen makes ASAN fire heap-buffer-overflow READ of size 4 (the front_hash reader), confirming the wrapper is the single bound.

The in-place decoders (BSON, MessagePack, CBOR, CSV, TOML) hand decode_hash_with_size::op the raw key view (op(key.data(), key.data() + key.size(), key.size())). Several of the per-type readers dereference key bytes without bounding the read, so a foreign key shorter than the bytes a reader touches runs past the input buffer (ASAN heap-buffer-overflow): front_hash reads front_hash_bytes, the sized unique_index path reads it[unique_index], the mod4 family reads *it, and three_element_unique_index reads it[0].

Split decode_hash_with_size into a thin wrapper and per-type _impl readers. The wrapper rejects any key whose length is outside [min_length, max_length] before dispatching, which bounds every reader's key access in one place: each reader only dereferences offsets below min_length (front_hash_bytes <= min_length, unique_index < min_length, mod4 reads offset 0 with min_length >= 1) or reads at most n bytes, so an out-of-range key can never be hashed out of bounds. The individual readers therefore carry no per-read bounds checks; unique_per_length keeps its end check because its length-indexed table maps absent lengths to 255.

Generalizes the front_hash bound from #2641 to the whole reader set. Adds BSON tests for each reader: a short/empty key reproduces the over-read under ASAN and valid documents still round-trip; static_asserts pin the selected hash.

decode_hash_with_size now pre-screens the key length against [min_length, max_length] for every format, so CBOR's call-site key_len < min_length || key_len > max_length ternary is redundant. The buffer bound (end - it < key_len) above it stays, since that is what guarantees the key bytes exist before the hash reads them.

Same redundancy as the CBOR cleanup: decode_hash_with_size now pre-screens the key length, so stencilcount's key.size() < min_length || key.size() > max_length filter is no longer needed. This also matches the sibling stencil.hpp call sites, which already call op() directly.

stephenberry added 2 commits June 17, 2026 13:53

stephenberry mentioned this pull request Jun 17, 2026

guard front_hash key lookup against keys shorter than the prefix #2641

Closed

stephenberry merged commit f2d1a07 into main Jun 17, 2026
53 checks passed

stephenberry deleted the prescreen-key-hash-length branch June 17, 2026 19:36

BrewTestBot mentioned this pull request Jun 23, 2026

glaze 7.8.3 Homebrew/homebrew-core#289469

Merged

stephenberry merged 3 commits into main from prescreen-key-hash-length
