perf(alp-rd): optimise RDEncoder dictionary search and split() by jrmoynihan · Pull Request #65 · spiraldb/alp

jrmoynihan · 2026-05-04T21:20:35Z

Summary

Four targeted improvements to the ALP-RD hot path, all contained in src/alp_rd/mod.rs. No public API changes — RDEncoder::new(), RDEncoder::split(), Split::decode(), and Split::into_parts() all preserve their existing signatures.

Direct-addressed frequency array in `build_left_parts_dictionary`

Replace HashMap<u16, usize> with a &mut [u32; 65536] passed in from the caller. Left-bit patterns are u16 values (0–65535), so the value itself is a valid array index — no hashing, no collision, O(1) insert. The buffer is allocated once per find_best_dictionary call and reset with fill(0) per cut-point iteration, avoiding 15 redundant heap alloc/free cycles across the 16-iteration search.
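A minimal sketch of the direct-addressed counting idea, not the code from this PR. The names `PATTERNS`, `count_left_parts`, `top_patterns` and `scan_cut_points` are hypothetical, and the sketch assumes f32 bit patterns with right bit widths 16 through 31, so every left part fits in a u16.

```rust
use std::convert::TryInto;

/// Number of distinct u16 left-bit patterns.
const PATTERNS: usize = 1 << 16;

/// Count how often each left-bit pattern occurs for one cut point.
/// `counts` is a caller-owned scratch buffer that is reset, not reallocated.
fn count_left_parts(bits: &[u32], right_bit_width: u32, counts: &mut [u32; PATTERNS]) {
    counts.fill(0);
    for &b in bits {
        // The pattern value is used directly as the array index.
        let left = (b >> right_bit_width) as u16;
        counts[left as usize] += 1;
    }
}

/// Return up to `max_dict` most frequent patterns, most frequent first.
fn top_patterns(counts: &[u32; PATTERNS], max_dict: usize) -> Vec<u16> {
    let mut seen: Vec<(u16, u32)> = counts
        .iter()
        .enumerate()
        .filter(|&(_, &c)| c > 0)
        .map(|(p, &c)| (p as u16, c))
        .collect();
    // Descending count; ties broken by pattern value for determinism.
    seen.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    seen.into_iter().take(max_dict).map(|(p, _)| p).collect()
}

/// Allocate the 256 KB buffer once and reuse it for all 16 cut points.
fn scan_cut_points(bits: &[u32]) -> Vec<(u32, Vec<u16>)> {
    let mut counts: Box<[u32; PATTERNS]> = vec![0u32; PATTERNS]
        .into_boxed_slice()
        .try_into()
        .expect("length is exactly PATTERNS");
    (16..32)
        .map(|right_bit_width| {
            count_left_parts(bits, right_bit_width, &mut counts);
            (right_bit_width, top_patterns(&counts, 8))
        })
        .collect()
}
```

Calling `scan_cut_points` on a slice of `f32::to_bits` values yields one candidate dictionary per cut point. The buffer lives on the heap because a 256 KB array is large for a stack frame.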

Input subsampling in `find_best_dictionary`

When sample.len() >= 2 * MAX_SAMPLE (MAX_SAMPLE = 4096), stride through the input with step_by(). Dictionary search then costs O(MAX_SAMPLE × 16) rather than O(N × 16). The dominant left-bit patterns are stable across large inputs, so roughly 4096 strided samples are sufficient to identify the top-8 dictionary entries with negligible quality loss.
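The striding can be sketched as follows. This is an illustration under assumptions, not the PR's code: the helper name `subsample` is hypothetical, and the stride is assumed to come from integer division, which leaves inputs shorter than 2 × MAX_SAMPLE unstrided (consistent with the doc fix listed in the follow-up commit).

```rust
/// Upper bound on the dictionary-search sample size targeted by striding.
const MAX_SAMPLE: usize = 4096;

/// Pick evenly spaced elements so that dictionary search sees a bounded sample.
fn subsample<T: Copy>(values: &[T]) -> Vec<T> {
    // Integer division: the stride stays 1 until len reaches 2 * MAX_SAMPLE.
    let stride = (values.len() / MAX_SAMPLE).max(1);
    values.iter().copied().step_by(stride).collect()
}
```

With this rule the sample always holds fewer than 2 × MAX_SAMPLE elements, so the search cost no longer grows with N.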

Reverse lookup table in `RDEncoder`

Pre-compute a Box<[u8; 65536]> in new(): left_raw_u16 → code + 1 (0 = not in dictionary). The +1 sentinel avoids a separate Option, keeping the table at 64 KB (fits in L2 cache). This replaces the codes.iter().position() O(dict_size) linear scan called once per element in split().
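A small sketch of the `code + 1` sentinel table described above. The function names `build_lookup` and `lookup_code` are hypothetical, and the real field layout in `RDEncoder` may differ.

```rust
use std::convert::{TryFrom, TryInto};

/// Build the reverse table: left pattern -> code + 1, with 0 meaning "not in dictionary".
fn build_lookup(codes: &[u16]) -> Box<[u8; 65536]> {
    let mut table: Box<[u8; 65536]> = vec![0u8; 65536]
        .into_boxed_slice()
        .try_into()
        .expect("length is exactly 65536");
    for (code, &pattern) in codes.iter().enumerate() {
        // Checked conversion: fails loudly if code + 1 does not fit in a u8.
        table[pattern as usize] =
            u8::try_from(code + 1).expect("dictionary too large for u8 codes");
    }
    table
}

/// O(1) replacement for `codes.iter().position(..)`.
fn lookup_code(table: &[u8; 65536], left: u16) -> Option<u8> {
    // 0 is the sentinel, so subtracting 1 maps it to None.
    table[left as usize].checked_sub(1)
}
```

For a dictionary `[0x3F80, 0x4000]`, `lookup_code(&table, 0x4000)` gives `Some(1)` and any pattern outside the dictionary gives `None`.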

Single-pass `split()` and inline dict in `Split`

Fuse the two loops in split() into one: compute bits once per element, extract left_raw, do a single table lookup, push code or exception in the same iteration.
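The fused loop might look roughly like this for f32 input. It is a sketch under assumptions: `SplitSketch` and its field names are hypothetical, exceptions are assumed to store a placeholder code of 0, and `right_bit_width` is assumed to lie in 16..32 so the left part fits in a u16.

```rust
/// Output of one split pass (field names are illustrative, not the crate's).
struct SplitSketch {
    left_codes: Vec<u8>,     // dictionary code per element (0 placeholder for exceptions)
    right_parts: Vec<u32>,   // low `right_bit_width` bits of each element
    exc_positions: Vec<u16>, // indices of elements whose left part is not in the dictionary
    exc_values: Vec<u16>,    // raw left parts of those elements
}

/// One pass: compute bits, split, look up, and record code or exception.
/// Assumes values.len() <= 65536 so positions fit in a u16.
fn split_single_pass(values: &[f32], right_bit_width: u32, lookup: &[u8; 65536]) -> SplitSketch {
    let right_mask = (1u32 << right_bit_width) - 1;
    let mut out = SplitSketch {
        left_codes: Vec::with_capacity(values.len()),
        right_parts: Vec::with_capacity(values.len()),
        exc_positions: Vec::new(),
        exc_values: Vec::new(),
    };
    for (i, v) in values.iter().enumerate() {
        let bits = v.to_bits();
        let left = (bits >> right_bit_width) as u16;
        out.right_parts.push(bits & right_mask);
        match lookup[left as usize] {
            // Sentinel 0: left part is not in the dictionary.
            0 => {
                out.left_codes.push(0);
                out.exc_positions.push(i as u16);
                out.exc_values.push(left);
            }
            // Stored value is code + 1.
            c => out.left_codes.push(c - 1),
        }
    }
    out
}
```

Each element is read once, so the intermediate `left_parts` buffer of the two-pass version is not needed.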

Change Split.left_dict from Vec<u16> to [u16; 8] + left_dict_len: u8. Eliminates one Vec heap allocation per split() call (one per 1024-element chunk). into_parts() materialises a Vec<u16> on demand; decode() uses the inline slice directly.
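A sketch of the inline-dictionary layout, assuming a maximum of 8 entries as stated above. The type `InlineDict` and its methods are hypothetical stand-ins for the fields on `Split`.

```rust
/// Maximum number of dictionary entries (8, per the description above).
const MAX_DICT_SIZE: usize = 8;

/// Fixed-size dictionary stored inline, with an explicit length.
struct InlineDict {
    entries: [u16; MAX_DICT_SIZE],
    len: u8,
}

impl InlineDict {
    /// Copy the encoder's codes into the inline array; no heap allocation.
    fn from_codes(codes: &[u16]) -> Self {
        assert!(codes.len() <= MAX_DICT_SIZE, "dictionary exceeds inline capacity");
        let mut entries = [0u16; MAX_DICT_SIZE];
        entries[..codes.len()].copy_from_slice(codes);
        InlineDict { entries, len: codes.len() as u8 }
    }

    /// Borrow only the populated prefix, as a decode path would.
    fn as_slice(&self) -> &[u16] {
        &self.entries[..self.len as usize]
    }

    /// Materialise a Vec only when a caller asks for owned parts.
    fn to_vec(&self) -> Vec<u16> {
        self.as_slice().to_vec()
    }
}
```

The trade-off is a few unused bytes per `Split` in exchange for skipping one allocation per chunk.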

Benchmark results

Measured on 30 M f32 values with a log-normal distribution (high dynamic range, many distinct left-bit patterns — a stress case for the dictionary search):

| Operation | Before | After | Speedup |
| --- | --- | --- | --- |
| RDEncoder::new() | ~3.2 s | ~720 µs | ~4500× |
| split() encode (30 M values) | ~187 ms | ~66 ms | ~2.8× |
| Full encode (new + split) | ~3.4 s | ~67 ms | ~51× |
| Decode | ~20 ms | ~22 ms | unchanged |

Test plan

- cargo test passes (all 4 existing tests)
- cargo clippy --all-targets clean on nightly-2025-02-24
- Roundtrip correctness verified for f32 and f64

Four improvements to the ALP-RD hot path, all contained in alp_rd/mod.rs.

**Dictionary search (find_best_dictionary / build_left_parts_dictionary)**

Replace HashMap<u16, usize> frequency counting with a direct-addressed [u32; 65536] array. Left-bit patterns are u16 values (0–65535), so the value itself is a valid array index: no hashing, no collision, O(1) insert. The 256 KB buffer is allocated once outside the 16-iteration cut-point loop and reset with fill(0) per call, avoiding 15 redundant heap alloc/free cycles.

Additionally, when the input to RDEncoder::new() is larger than MAX_SAMPLE (4096), stride through it with step_by() so that dictionary search costs O(MAX_SAMPLE × 16) rather than O(N × 16). The dominant left-bit patterns are stable across large datasets; 4096 samples is sufficient to identify the dictionary with negligible quality loss. Together these reduce RDEncoder::new() from O(N) HashMap operations to a fixed cost independent of N.

**Reverse lookup table (RDEncoder)**

Pre-compute a [u8; 65536] reverse mapping in new(): left_raw_u16 → code + 1 (0 = not in dictionary). The +1 sentinel avoids an Option, keeping the table at 64 KB (fits in L2). This replaces the O(dict_size) codes.iter().position() linear scan called for every element in split().

**Single-pass split()**

Fuse the two loops in split() into one: compute bits once per element, extract left_raw, do a single table lookup, and push the code or exception in the same iteration. Previously the loop computed left_parts and right_parts in one pass, then re-read left_parts to dict-encode in a second pass.

**Inline dictionary in Split**

Change Split.left_dict from Vec<u16> to [u16; 8] + left_dict_len: u8. This eliminates one Vec heap allocation per split() call (one per 1024-element chunk). into_parts() materialises a Vec<u16> on demand; decode() uses the inline slice directly.

Benchmarked on 30 M f32 values:

- RDEncoder::new(): ~3.2 s → ~720 µs (≈4 500× faster)
- split() encode throughput: ~187 ms → ~66 ms (≈2.8× faster)
- Full encode (new + split): ~3.4 s → ~67 ms (≈51× faster)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

CLAassistant · 2026-05-04T21:20:42Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

- Replace `(code + 1) as u8` with `u8::try_from(...).expect(...)` to surface truncation bugs at the point of occurrence rather than silently.
- Add `#[cfg(debug_assertions)]` loop in `RDEncoder::new()` that verifies the lookup table round-trips: `lookup[codes[i]] == i as u8 + 1`.
- Add `debug_assert!(codes.len() <= MAX_DICT_SIZE)` in `split()` before the `copy_from_slice` into the inline `[u16; MAX_DICT_SIZE]` array.
- Add `debug_assert!(left_dict_len <= MAX_DICT_SIZE)` guards in `into_parts()` and `decode()`.
- Add bounds assertion in `alp_rd_decode` before indexing `dict[code]`, so callers with mismatched dict/left_parts get a clear panic instead of UB.
- Fix `MAX_SAMPLE` doc: striding activates at `>= 2 * MAX_SAMPLE`, not `> MAX_SAMPLE`.
- Fix `find_best_dictionary` doc: clarify that the counting pass is O(MAX_SAMPLE) but the collection pass over the 65 536-entry array is O(65536 × CUT_LIMIT).
- Fix `lookup` field comment: remove CPU-specific "fits in L2" claim; note heap alloc.
- Use `MAX_DICT_SIZE` constant in `Split.left_dict` type annotation.
- Fix `RDEncoder::new()` doc: correct the stride-threshold description.

Four new tests in `alp_rd::test`:

- `test_exception_path_roundtrip`: encodes a value outside the dictionary and verifies the exception path reconstructs the original bits exactly.
- `test_large_input_roundtrip`: exercises striding (N > 2*MAX_SAMPLE) and checks that every chunk decodes bit-for-bit correctly.
- `test_into_parts_dict_materialisation`: calls `into_parts()`, manually decodes via the public `alp_rd_decode` function, and asserts equality.
- `test_subsampling_matches_full_cut_point`: concrete proof of "negligible quality loss": an encoder built on an unstrided MAX_SAMPLE prefix of pseudo-random log-normal data chooses the same `right_bit_width` as an encoder built on the full 3×MAX_SAMPLE dataset with stride=3. Roundtrip is also verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
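To illustrate why the first item in this commit matters, here is a tiny sketch contrasting the two conversions. The function names are hypothetical, and the checked variant returns an `Option` rather than calling `expect` so the difference can be observed without a panic.

```rust
use std::convert::TryFrom;

/// `as` truncates silently: 256 wraps to 0, which collides with the
/// "not in dictionary" sentinel of the lookup table.
fn code_plus_one_lossy(code: usize) -> u8 {
    (code + 1) as u8
}

/// `try_from` reports the overflow instead of wrapping.
fn code_plus_one_checked(code: usize) -> Option<u8> {
    u8::try_from(code + 1).ok()
}
```

With a dictionary capped at `MAX_DICT_SIZE` entries the overflow should never occur in practice, so the checked form mainly guards against future changes to that cap.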

- Restore field doc comments on `ALPRDDictionary` (dictionary, left_bit_width, right_bit_width) that were stripped during the HashMap→array refactor.
- Add field doc comments on `RDEncoder` (right_bit_width, codes) which were previously undocumented.
- Restore inline comments on `EXC_POSITION_SIZE`/`EXC_SIZE` constants clarifying that the unit is bits, not bytes.
- Restore and update section-marker comments in `build_left_parts_dictionary` that guide readers through the counting → sorting → dict-assignment → exception counting → bit-width derivation steps.
- Restore and update section-marker comments in `alp_rd_decode` for the dict-decode, exception-patch, and recombine steps.
- Restore the right-mask and split-loop comments in `RDEncoder::split()`.
- Expand `into_parts()` doc to name every element of the returned tuple.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jrmoynihan and others added 2 commits May 4, 2026 18:41

jrmoynihan wants to merge 3 commits into spiraldb:develop from jrmoynihan:perf/rd-encoder-optimisations
