perf(alp-rd): optimise RDEncoder dictionary search and split() #65
Open
jrmoynihan wants to merge 3 commits into
Four improvements to the ALP-RD hot path, all contained in alp_rd/mod.rs.

**Dictionary search (find_best_dictionary / build_left_parts_dictionary)**

Replace HashMap<u16, usize> frequency counting with a direct-addressed [u32; 65536] array. Left-bit patterns are u16 values (0–65535), so the value itself is a valid array index: no hashing, no collisions, O(1) insert. The 256 KB buffer is allocated once outside the 16-iteration cut-point loop and reset with fill(0) per call, avoiding 15 redundant heap alloc/free cycles.

Additionally, when the input to RDEncoder::new() is larger than MAX_SAMPLE (4096), stride through it with step_by() so that dictionary search costs O(MAX_SAMPLE × 16) rather than O(N × 16). The dominant left-bit patterns are stable across large datasets; 4096 samples is sufficient to identify the dictionary with negligible quality loss.

Together these reduce RDEncoder::new() from O(N) HashMap operations to a fixed cost independent of N.

**Reverse lookup table (RDEncoder)**

Pre-compute a [u8; 65536] reverse mapping in new(): left_raw_u16 → code + 1 (0 = not in dictionary). The +1 sentinel avoids an Option, keeping the table at 64 KB (fits in L2). This replaces the O(dict_size) codes.iter().position() linear scan called for every element in split().

**Single-pass split()**

Fuse the two loops in split() into one: compute bits once per element, extract left_raw, do a single table lookup, and push the code or exception in the same iteration. Previously the loop computed left_parts and right_parts in one pass, then re-read left_parts to dict-encode in a second pass.

**Inline dictionary in Split**

Change Split.left_dict from Vec<u16> to [u16; 8] + left_dict_len: u8. This eliminates one Vec heap allocation per split() call (one per 1024-element chunk). into_parts() materialises a Vec<u16> on demand; decode() uses the inline slice directly.
Benchmarked on 30 M f32 values:

- RDEncoder::new(): ~3.2 s → ~720 µs (≈4 500× faster)
- split() encode throughput: ~187 ms → ~66 ms (≈2.8× faster)
- Full encode (new + split): ~3.4 s → ~67 ms (≈51× faster)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
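The reverse lookup table and the fused `split()` loop described in this commit can be illustrated with a minimal sketch. It is not the PR's code: `SketchEncoder`, `SketchSplit`, their fields, and the `decode` helper are hypothetical names chosen for illustration, and the sketch assumes f32 input with a right-bit width of at least 16 so that the left part fits in a `u16`.

```rust
/// Hypothetical stand-in for `RDEncoder`; field names are illustrative only.
pub struct SketchEncoder {
    right_bit_width: u8,
    /// Dictionary of left-bit patterns; the index of a pattern is its code.
    codes: Vec<u16>,
    /// Reverse map: left pattern -> code + 1, where 0 means "not in dictionary".
    lookup: Box<[u8; 65536]>,
}

/// Hypothetical stand-in for `Split`.
pub struct SketchSplit {
    pub left_codes: Vec<u8>,
    pub right_parts: Vec<u32>,
    /// (position, raw left pattern) for values whose pattern is not in the dictionary.
    pub exceptions: Vec<(usize, u16)>,
}

impl SketchEncoder {
    pub fn new(codes: &[u16], right_bit_width: u8) -> Self {
        // Assumption: the left part must fit in a u16, so the right part takes 16..=31 bits.
        assert!((16..32).contains(&right_bit_width));
        assert!(!codes.is_empty() && codes.len() <= 8);
        let mut lookup = Box::new([0u8; 65536]);
        for (code, &pattern) in codes.iter().enumerate() {
            // Store code + 1 so that 0 can mean "absent" without an Option.
            lookup[usize::from(pattern)] =
                u8::try_from(code + 1).expect("code + 1 fits in u8");
        }
        Self { right_bit_width, codes: codes.to_vec(), lookup }
    }

    /// Single pass: one bit extraction and one table lookup per element.
    pub fn split(&self, values: &[f32]) -> SketchSplit {
        let right_mask = (1u32 << self.right_bit_width) - 1;
        let mut out = SketchSplit {
            left_codes: Vec::with_capacity(values.len()),
            right_parts: Vec::with_capacity(values.len()),
            exceptions: Vec::new(),
        };
        for (pos, value) in values.iter().enumerate() {
            let bits = value.to_bits();
            let left_raw = (bits >> self.right_bit_width) as u16;
            out.right_parts.push(bits & right_mask);
            match self.lookup[usize::from(left_raw)] {
                0 => {
                    out.left_codes.push(0); // placeholder, patched on decode
                    out.exceptions.push((pos, left_raw));
                }
                code_plus_one => out.left_codes.push(code_plus_one - 1),
            }
        }
        out
    }

    /// Dict-decode, patch exceptions, then recombine left and right parts.
    pub fn decode(&self, split: &SketchSplit) -> Vec<f32> {
        let mut left: Vec<u16> = split
            .left_codes
            .iter()
            .map(|&code| self.codes[usize::from(code)])
            .collect();
        for &(pos, raw) in &split.exceptions {
            left[pos] = raw;
        }
        left.iter()
            .zip(&split.right_parts)
            .map(|(&l, &r)| f32::from_bits((u32::from(l) << self.right_bit_width) | r))
            .collect()
    }
}
```

With a dictionary of `[0x3F80, 0x4000]` and a right-bit width of 16, the values `1.0` and `2.0` map to codes 0 and 1, while `1.5` (left pattern `0x3FC0`) takes the exception path. The table lookup costs one indexed load per element regardless of dictionary size, which is the property the commit relies on.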
- Replace `(code + 1) as u8` with `u8::try_from(...).expect(...)` to surface truncation bugs at the point of occurrence rather than silently.
- Add `#[cfg(debug_assertions)]` loop in `RDEncoder::new()` that verifies the lookup table round-trips: `lookup[codes[i]] == i as u8 + 1`.
- Add `debug_assert!(codes.len() <= MAX_DICT_SIZE)` in `split()` before the `copy_from_slice` into the inline `[u16; MAX_DICT_SIZE]` array.
- Add `debug_assert!(left_dict_len <= MAX_DICT_SIZE)` guards in `into_parts()` and `decode()`.
- Add bounds assertion in `alp_rd_decode` before indexing `dict[code]`, so callers with mismatched dict/left_parts get a clear panic instead of UB.
- Fix `MAX_SAMPLE` doc: striding activates at `>= 2 * MAX_SAMPLE`, not `> MAX_SAMPLE`.
- Fix `find_best_dictionary` doc: clarify that the counting pass is O(MAX_SAMPLE) but the collection pass over the 65 536-entry array is O(65536 × CUT_LIMIT).
- Fix `lookup` field comment: remove CPU-specific "fits in L2" claim; note heap alloc.
- Use `MAX_DICT_SIZE` constant in `Split.left_dict` type annotation.
- Fix `RDEncoder::new()` doc: correct the stride-threshold description.

Four new tests in `alp_rd::test`:

- `test_exception_path_roundtrip`: encodes a value outside the dictionary and verifies the exception path reconstructs the original bits exactly.
- `test_large_input_roundtrip`: exercises striding (N > 2*MAX_SAMPLE) and checks that every chunk decodes bit-for-bit correctly.
- `test_into_parts_dict_materialisation`: calls `into_parts()`, manually decodes via the public `alp_rd_decode` function, and asserts equality.
- `test_subsampling_matches_full_cut_point`: concrete proof of "negligible quality loss": an encoder built on an unstrided MAX_SAMPLE prefix of pseudo-random log-normal data chooses the same `right_bit_width` as an encoder built on the full 3×MAX_SAMPLE dataset with stride=3. Roundtrip is also verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
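The hardening steps listed in this commit (checked narrowing, a lookup round-trip check, a length guard on the inline dictionary, and a bounds assertion before indexing the dictionary) might look roughly like the sketch below. `InlineDict`, `build_lookup`, and `decode_left` are hypothetical names, not the PR's API; only `MAX_DICT_SIZE` comes from the commit message, and its value of 8 is taken from the `[u16; 8]` type in the first commit. The sketch uses `assert!` where the commit uses `debug_assert!`, purely so the guards are active in every build of the example.

```rust
pub const MAX_DICT_SIZE: usize = 8;

/// Hypothetical inline dictionary: a fixed array plus a length, no heap allocation.
pub struct InlineDict {
    entries: [u16; MAX_DICT_SIZE],
    len: u8,
}

impl InlineDict {
    pub fn from_codes(codes: &[u16]) -> Self {
        // Guard before copy_from_slice, which would otherwise panic with a less
        // specific message on a length mismatch.
        assert!(codes.len() <= MAX_DICT_SIZE, "dictionary exceeds MAX_DICT_SIZE");
        let mut entries = [0u16; MAX_DICT_SIZE];
        entries[..codes.len()].copy_from_slice(codes);
        let len = u8::try_from(codes.len()).expect("dictionary length fits in u8");
        Self { entries, len }
    }

    /// Borrow the live entries directly, as a decode path could.
    pub fn as_slice(&self) -> &[u16] {
        &self.entries[..usize::from(self.len)]
    }

    /// Materialise an owned Vec only when a caller asks for one.
    pub fn to_vec(&self) -> Vec<u16> {
        self.as_slice().to_vec()
    }
}

/// Build the pattern -> code + 1 table with a checked conversion instead of `as u8`.
/// Assumes the patterns in `codes` are distinct.
pub fn build_lookup(codes: &[u16]) -> Box<[u8; 65536]> {
    let mut lookup = Box::new([0u8; 65536]);
    for (i, &pattern) in codes.iter().enumerate() {
        lookup[usize::from(pattern)] = u8::try_from(i + 1).expect("code + 1 fits in u8");
    }
    // Round-trip check, compiled only into debug builds.
    #[cfg(debug_assertions)]
    for (i, &pattern) in codes.iter().enumerate() {
        assert_eq!(usize::from(lookup[usize::from(pattern)]), i + 1);
    }
    lookup
}

/// Map codes back to left patterns, failing loudly on a mismatched dictionary.
pub fn decode_left(dict: &[u16], left_codes: &[u8]) -> Vec<u16> {
    left_codes
        .iter()
        .map(|&code| {
            let idx = usize::from(code);
            assert!(
                idx < dict.len(),
                "code {} out of range for dictionary of {} entries",
                idx,
                dict.len()
            );
            dict[idx]
        })
        .collect()
}
```

The design point is that every narrowing or indexing step names its own failure, so a mismatch between dictionary and codes is reported where it occurs rather than surfacing later as corrupted output.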
- Restore field doc comments on `ALPRDDictionary` (dictionary, left_bit_width, right_bit_width) that were stripped during the HashMap→array refactor.
- Add field doc comments on `RDEncoder` (right_bit_width, codes) which were previously undocumented.
- Restore inline comments on `EXC_POSITION_SIZE`/`EXC_SIZE` constants clarifying that the unit is bits, not bytes.
- Restore and update section-marker comments in `build_left_parts_dictionary` that guide readers through the counting → sorting → dict-assignment → exception counting → bit-width derivation steps.
- Restore and update section-marker comments in `alp_rd_decode` for the dict-decode, exception-patch, and recombine steps.
- Restore the right-mask and split-loop comments in `RDEncoder::split()`.
- Expand `into_parts()` doc to name every element of the returned tuple.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
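For readers unfamiliar with the flow those section markers describe, the counting → sorting → dict-assignment → exception counting → bit-width derivation steps can be sketched as below, using the direct-addressed count array and strided sampling from the first commit. This is an illustration under stated assumptions, not the PR's `build_left_parts_dictionary`: the function name, the `DictSketch` struct, the stride formula, the tie-breaking rule, and the bit-width formula are all assumptions.

```rust
const MAX_DICT_SIZE: usize = 8;
const MAX_SAMPLE: usize = 4096;

/// Hypothetical result type; not the PR's `ALPRDDictionary`.
pub struct DictSketch {
    pub dictionary: Vec<u16>,
    pub left_bit_width: u8,
    pub exceptions: u32,
}

pub fn build_dictionary_sketch(
    values: &[f32],
    right_bit_width: u8,
    counts: &mut [u32; 65536], // caller-owned buffer, reused across cut points
) -> DictSketch {
    // Assumption: the left part must fit in a u16.
    assert!((16..32).contains(&right_bit_width));

    // --- Counting: the left pattern is used directly as the array index. ---
    counts.fill(0);
    // Assumed stride rule: integer division yields a stride above 1 only once
    // the input holds at least 2 * MAX_SAMPLE values.
    let stride = (values.len() / MAX_SAMPLE).max(1);
    for value in values.iter().step_by(stride) {
        counts[(value.to_bits() >> right_bit_width) as usize] += 1;
    }

    // --- Sorting: collect the patterns that occurred, most frequent first. ---
    let mut seen: Vec<(u16, u32)> = Vec::new();
    for (pattern, &count) in counts.iter().enumerate() {
        if count > 0 {
            seen.push((pattern as u16, count));
        }
    }
    // Ties are broken by pattern value so the result is deterministic.
    seen.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    // --- Dict assignment: the top entries become codes 0..dict_len. ---
    let dict_len = seen.len().min(MAX_DICT_SIZE);
    let dictionary: Vec<u16> = seen[..dict_len].iter().map(|&(p, _)| p).collect();

    // --- Exception counting: everything outside the dictionary. ---
    let exceptions: u32 = seen[dict_len..].iter().map(|&(_, c)| c).sum();

    // --- Bit-width derivation: bits needed for the largest code, at least 1. ---
    let max_code = dict_len.saturating_sub(1) as u32;
    let left_bit_width = (32 - max_code.leading_zeros()).max(1) as u8;

    DictSketch { dictionary, left_bit_width, exceptions }
}
```

A caller would allocate the buffer once, for example `let mut counts = Box::new([0u32; 65536]);`, and pass `&mut counts` for each candidate `right_bit_width`, which is the reuse pattern the first commit describes.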
Summary
Four targeted improvements to the ALP-RD hot path, all contained in `src/alp_rd/mod.rs`. No public API changes: `RDEncoder::new()`, `RDEncoder::split()`, `Split::decode()`, and `Split::into_parts()` all preserve their existing signatures.

**Direct-addressed frequency array in `build_left_parts_dictionary`**

Replace `HashMap<u16, usize>` with a `&mut [u32; 65536]` passed in from the caller. Left-bit patterns are u16 values (0–65535), so the value itself is a valid array index: no hashing, no collisions, O(1) insert. The buffer is allocated once per `find_best_dictionary` call and reset with `fill(0)` per cut-point iteration, avoiding 15 redundant heap alloc/free cycles across the 16-iteration search.

**Input subsampling in `find_best_dictionary`**

When `sample.len()` is at least `2 * MAX_SAMPLE` (`MAX_SAMPLE` = 4096), stride through the input with `step_by()`. Dictionary search then costs O(MAX_SAMPLE × 16) rather than O(N × 16). The dominant left-bit patterns are stable across large inputs; 4096 samples is sufficient to identify the top-8 dictionary entries with negligible quality loss.

**Reverse lookup table in `RDEncoder`**

Pre-compute a `Box<[u8; 65536]>` in `new()`: `left_raw_u16 → code + 1` (0 = not in dictionary). The +1 sentinel avoids a separate `Option`, keeping the table at 64 KB on the heap. This replaces the `codes.iter().position()` O(dict_size) linear scan called once per element in `split()`.

**Single-pass `split()` and inline dict in `Split`**

Fuse the two loops in `split()` into one: compute bits once per element, extract `left_raw`, do a single table lookup, and push the code or exception in the same iteration.

Change `Split.left_dict` from `Vec<u16>` to `[u16; 8]` + `left_dict_len: u8`. This eliminates one `Vec` heap allocation per `split()` call (one per 1024-element chunk). `into_parts()` materialises a `Vec<u16>` on demand; `decode()` uses the inline slice directly.

Benchmark results
Measured on 30 M f32 values with a log-normal distribution (high dynamic range, many distinct left-bit patterns — a stress case for the dictionary search):
Test plan