
perf(alp-rd): optimise RDEncoder dictionary search and split() #65

Open
jrmoynihan wants to merge 3 commits into spiraldb:develop from jrmoynihan:perf/rd-encoder-optimisations


@jrmoynihan


Summary

Four targeted improvements to the ALP-RD hot path, all contained in src/alp_rd/mod.rs. No public API changes — RDEncoder::new(), RDEncoder::split(), Split::decode(), and Split::into_parts() all preserve their existing signatures.

Direct-addressed frequency array in build_left_parts_dictionary

Replace HashMap<u16, usize> with a &mut [u32; 65536] passed in from the caller. Left-bit patterns are u16 values (0–65535), so the value itself is a valid array index — no hashing, no collision, O(1) insert. The buffer is allocated once per find_best_dictionary call and reset with fill(0) per cut-point iteration, avoiding 15 redundant heap alloc/free cycles across the 16-iteration search.
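
To make the counting scheme concrete, here is a minimal sketch of direct-addressed frequency counting with a caller-owned buffer. The names `count_left_parts`, `top_patterns`, and `PATTERNS` are hypothetical and are not taken from the PR; the sketch also assumes `u32` bit patterns with `right_bit_width >= 16`, so that the left part fits in a `u16`.

```rust
/// Number of distinct u16 values; every left-bit pattern is a valid index.
/// (Hypothetical name, used only in this sketch.)
const PATTERNS: usize = 1 << 16;

/// Count how often each left-bit pattern occurs for one cut point.
/// `counts` is caller-owned scratch space, so it can be reused across calls.
/// Assumes `right_bit_width` is in 16..32 so the left part fits in a u16.
fn count_left_parts(values: &[u32], right_bit_width: u32, counts: &mut [u32; PATTERNS]) {
    assert!(right_bit_width >= 16 && right_bit_width < 32);
    counts.fill(0); // reset the existing buffer instead of reallocating
    for &bits in values {
        let left = (bits >> right_bit_width) as u16;
        counts[left as usize] += 1; // the pattern itself is the index
    }
}

/// Return up to `max_dict` (pattern, count) pairs, most frequent first.
/// Ties are broken by the smaller pattern to keep the result deterministic.
fn top_patterns(counts: &[u32; PATTERNS], max_dict: usize) -> Vec<(u16, u32)> {
    let mut seen: Vec<(u16, u32)> = counts
        .iter()
        .enumerate()
        .filter(|&(_, &c)| c > 0)
        .map(|(p, &c)| (p as u16, c))
        .collect();
    seen.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    seen.truncate(max_dict);
    seen
}
```

A caller would allocate the buffer once, then call `count_left_parts` followed by `top_patterns` for each candidate cut point.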

Input subsampling in find_best_dictionary

When sample.len() >= 2 × MAX_SAMPLE (MAX_SAMPLE = 4096), stride through the input with step_by(). The counting pass of the dictionary search then costs O(MAX_SAMPLE × 16) rather than O(N × 16); scanning the 65 536-entry count array adds a fixed cost per cut point. The dominant left-bit patterns are stable across large inputs, so roughly 4096 samples are sufficient to identify the top-8 dictionary entries with negligible quality loss.
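
As a rough illustration of the subsampling step, the sketch below derives the stride by integer division. That derivation is an assumption, chosen because it matches the later note in this PR that striding only activates at `>= 2 * MAX_SAMPLE`; the function name `subsample` is hypothetical.

```rust
/// Cap on the number of values the dictionary search aims to inspect
/// (value taken from the PR description).
const MAX_SAMPLE: usize = 4096;

/// Pick evenly spaced values so the counting cost stays bounded.
/// Hypothetical helper; the PR applies the stride inside find_best_dictionary.
fn subsample(values: &[u32]) -> Vec<u32> {
    // Integer division: the stride stays at 1 until the input holds at
    // least 2 * MAX_SAMPLE values, so smaller inputs are used in full.
    let stride = (values.len() / MAX_SAMPLE).max(1);
    values.iter().copied().step_by(stride).collect()
}
```

With this rule the number of inspected values is always below 2 × MAX_SAMPLE, whatever the input length.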

Reverse lookup table in RDEncoder

Pre-compute a Box<[u8; 65536]> in new(): left_raw_u16 → code + 1 (0 = not in dictionary). Storing code + 1 and reserving 0 as the "absent" sentinel avoids a separate Option, keeping the table at 64 KB (small enough for the L2 cache on most current CPUs). This replaces the codes.iter().position() O(dict_size) linear scan called once per element in split().
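
The reverse table can be sketched as follows. This is an illustration under assumptions, not the PR's code: `build_lookup` and `lookup_code` are hypothetical names, and the table is built through a `Vec` to avoid a large stack temporary.

```rust
use std::convert::TryInto;

/// Build a table mapping a left-bit pattern to `code + 1`; 0 means "absent".
/// Hypothetical helper; in the PR the table is built inside RDEncoder::new().
fn build_lookup(codes: &[u16]) -> Box<[u8; 65536]> {
    assert!(codes.len() < 256, "code + 1 must fit in a u8");
    // Allocate on the heap via Vec rather than building the array on the stack.
    let mut table = vec![0u8; 65536].into_boxed_slice();
    for (code, &pattern) in codes.iter().enumerate() {
        table[pattern as usize] = code as u8 + 1;
    }
    table.try_into().expect("length is exactly 65536")
}

/// Look up the dictionary code for a pattern in O(1).
fn lookup_code(table: &[u8; 65536], pattern: u16) -> Option<u8> {
    // checked_sub maps the 0 sentinel to None and undoes the +1 otherwise.
    table[pattern as usize].checked_sub(1)
}
```

For any pattern, `lookup_code` gives the same answer as `codes.iter().position(...)`, but without scanning the dictionary.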

Single-pass split() and inline dict in Split

Fuse the two loops in split() into one: compute bits once per element, extract left_raw, do a single table lookup, push code or exception in the same iteration.
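
A fused loop of this kind might look like the sketch below. The struct `SplitParts`, the function `split_single_pass`, and the choice to store a placeholder code of 0 at exception positions are all assumptions made for illustration; the PR's actual `Split` layout may differ.

```rust
/// Hypothetical output of a fused split over f32 bit patterns.
struct SplitParts {
    left_codes: Vec<u8>,
    right_parts: Vec<u32>,
    /// (position, raw left bits) for values whose pattern is not in the dictionary.
    exceptions: Vec<(usize, u16)>,
}

/// One pass: compute bits, split them, look up the code, record exceptions.
/// `lookup` holds `code + 1` per pattern, with 0 meaning "not in dictionary".
fn split_single_pass(values: &[f32], right_bit_width: u32, lookup: &[u8]) -> SplitParts {
    assert!(right_bit_width >= 16 && right_bit_width < 32);
    assert_eq!(lookup.len(), 1 << 16);
    let right_mask = (1u32 << right_bit_width) - 1;
    let mut out = SplitParts {
        left_codes: Vec::with_capacity(values.len()),
        right_parts: Vec::with_capacity(values.len()),
        exceptions: Vec::new(),
    };
    for (i, v) in values.iter().enumerate() {
        let bits = v.to_bits(); // computed once per element
        let left_raw = (bits >> right_bit_width) as u16;
        out.right_parts.push(bits & right_mask);
        match lookup[left_raw as usize] {
            0 => {
                // Not in the dictionary: placeholder code (an assumption of
                // this sketch) plus an exception entry for later patching.
                out.left_codes.push(0);
                out.exceptions.push((i, left_raw));
            }
            stored => out.left_codes.push(stored - 1),
        }
    }
    out
}
```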

Change Split.left_dict from Vec<u16> to [u16; 8] + left_dict_len: u8. Eliminates one Vec heap allocation per split() call (one per 1024-element chunk). into_parts() materialises a Vec<u16> on demand; decode() uses the inline slice directly.
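
The inline-dictionary idea can be sketched as a fixed array plus a length. `InlineDict` and its methods are hypothetical names used only here; in the PR the fields live directly on `Split`.

```rust
/// Maximum number of dictionary entries (8, per the PR description).
const MAX_DICT_SIZE: usize = 8;

/// Hypothetical stand-in for the dictionary fields of `Split`:
/// a fixed-size array plus a length, so no heap allocation is needed.
struct InlineDict {
    entries: [u16; MAX_DICT_SIZE],
    len: u8,
}

impl InlineDict {
    /// Copy the encoder's codes into the inline array.
    fn from_codes(codes: &[u16]) -> Self {
        assert!(codes.len() <= MAX_DICT_SIZE, "dictionary too large");
        let mut entries = [0u16; MAX_DICT_SIZE];
        entries[..codes.len()].copy_from_slice(codes);
        InlineDict { entries, len: codes.len() as u8 }
    }

    /// Borrow only the populated prefix; this is what a decoder would read.
    fn as_slice(&self) -> &[u16] {
        &self.entries[..self.len as usize]
    }

    /// Materialise an owned Vec only when a caller asks for one.
    fn to_vec(&self) -> Vec<u16> {
        self.as_slice().to_vec()
    }
}
```

The trade-off is a few unused bytes per `Split` in exchange for skipping one allocation per chunk.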

Benchmark results

Measured on 30 M f32 values with a log-normal distribution (high dynamic range, many distinct left-bit patterns — a stress case for the dictionary search):

| Operation | Before | After | Speedup |
| --- | --- | --- | --- |
| RDEncoder::new() | ~3.2 s | ~720 µs | ~4500x |
| split() encode (30 M values) | ~187 ms | ~66 ms | ~2.8x |
| Full encode (new + split) | ~3.4 s | ~67 ms | ~51x |
| Decode | ~20 ms | ~22 ms | unchanged |

Test plan

  • cargo test passes (all 4 existing tests)
  • cargo clippy --all-targets clean on nightly-2025-02-24
  • Roundtrip correctness verified for f32 and f64

Four improvements to the ALP-RD hot path, all contained in alp_rd/mod.rs.

**Dictionary search (find_best_dictionary / build_left_parts_dictionary)**

Replace HashMap<u16, usize> frequency counting with a direct-addressed
[u32; 65536] array. Left-bit patterns are u16 values (0–65535), so the
value itself is a valid array index — no hashing, no collision, O(1)
insert. The 256 KB buffer is allocated once outside the 16-iteration
cut-point loop and reset with fill(0) per call, avoiding 15 redundant
heap alloc/free cycles.

Additionally, when the input to RDEncoder::new() is larger than
MAX_SAMPLE (4096), stride through it with step_by() so that dictionary
search costs O(MAX_SAMPLE × 16) rather than O(N × 16). The dominant
left-bit patterns are stable across large datasets; 4096 samples is
sufficient to identify the dictionary with negligible quality loss.

Together these reduce RDEncoder::new() from O(N) HashMap operations to
a fixed cost independent of N.

**Reverse lookup table (RDEncoder)**

Pre-compute a [u8; 65536] reverse mapping in new(): left_raw_u16 →
code + 1 (0 = not in dictionary). The +1 sentinel avoids an Option,
keeping the table at 64 KB (fits in L2). This replaces the O(dict_size)
codes.iter().position() linear scan called for every element in split().

**Single-pass split()**

Fuse the two loops in split() into one: compute bits once per element,
extract left_raw, do a single table lookup, and push the code or
exception in the same iteration. Previously the loop computed left_parts
and right_parts in one pass, then re-read left_parts to dict-encode in a
second pass.

**Inline dictionary in Split**

Change Split.left_dict from Vec<u16> to [u16; 8] + left_dict_len: u8.
This eliminates one Vec heap allocation per split() call (one per
1024-element chunk). into_parts() materialises a Vec<u16> on demand;
decode() uses the inline slice directly.

Benchmarked on 30 M f32 values:
- RDEncoder::new(): ~3.2 s → ~720 µs  (≈4 500× faster)
- split() encode throughput: ~187 ms → ~66 ms  (≈2.8× faster)
- Full encode (new + split): ~3.4 s → ~67 ms  (≈51× faster)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CLAassistant


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

jrmoynihan and others added 2 commits May 4, 2026 18:41
- Replace `(code + 1) as u8` with `u8::try_from(...).expect(...)` to surface
  truncation bugs at the point of occurrence rather than silently.
- Add `#[cfg(debug_assertions)]` loop in `RDEncoder::new()` that verifies the
  lookup table round-trips: `lookup[codes[i]] == i as u8 + 1`.
- Add `debug_assert!(codes.len() <= MAX_DICT_SIZE)` in `split()` before the
  `copy_from_slice` into the inline `[u16; MAX_DICT_SIZE]` array.
- Add `debug_assert!(left_dict_len <= MAX_DICT_SIZE)` guards in `into_parts()`
  and `decode()`.
- Add bounds assertion in `alp_rd_decode` before indexing `dict[code]`, so
  callers with mismatched dict/left_parts get a clear panic instead of UB.
- Fix `MAX_SAMPLE` doc: striding activates at `>= 2 * MAX_SAMPLE`, not `> MAX_SAMPLE`.
- Fix `find_best_dictionary` doc: clarify that the counting pass is O(MAX_SAMPLE)
  but the collection pass over the 65 536-entry array is O(65536 × CUT_LIMIT).
- Fix `lookup` field comment: remove CPU-specific "fits in L2" claim; note heap alloc.
- Use `MAX_DICT_SIZE` constant in `Split.left_dict` type annotation.
- Fix `RDEncoder::new()` doc: correct the stride-threshold description.
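
The first two hardening items above can be illustrated with a small sketch. It assumes distinct dictionary entries and uses a hypothetical function name (`build_lookup_checked`) with a `Vec<u8>` table for brevity; it is not the code from the commit.

```rust
use std::convert::TryFrom;

/// Build the `pattern -> code + 1` table with a checked conversion and a
/// debug-only round-trip verification. Assumes `codes` has no duplicates.
fn build_lookup_checked(codes: &[u16]) -> Vec<u8> {
    let mut lookup = vec![0u8; 1 << 16];
    for (code, &pattern) in codes.iter().enumerate() {
        // Panics at the point of the bug instead of silently wrapping,
        // as `(code + 1) as u8` would for code + 1 > 255.
        lookup[pattern as usize] =
            u8::try_from(code + 1).expect("dictionary code + 1 must fit in u8");
    }

    // Compiled only in debug builds: every entry must map back to its index.
    #[cfg(debug_assertions)]
    for (i, &pattern) in codes.iter().enumerate() {
        debug_assert_eq!(usize::from(lookup[pattern as usize]), i + 1);
    }

    lookup
}
```

The difference matters because an `as` cast truncates: 256 cast to `u8` becomes 0, which this table would read as "not in dictionary".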

Four new tests in `alp_rd::test`:
- `test_exception_path_roundtrip`: encodes a value outside the dictionary and
  verifies the exception path reconstructs the original bits exactly.
- `test_large_input_roundtrip`: exercises striding (N > 2*MAX_SAMPLE) and checks
  that every chunk decodes bit-for-bit correctly.
- `test_into_parts_dict_materialisation`: calls `into_parts()`, manually decodes
  via the public `alp_rd_decode` function, and asserts equality.
- `test_subsampling_matches_full_cut_point`: concrete proof of "negligible quality
  loss" — an encoder built on an unstrided MAX_SAMPLE prefix of pseudo-random
  log-normal data chooses the same `right_bit_width` as an encoder built on the
  full 3×MAX_SAMPLE dataset with stride=3.  Roundtrip is also verified.
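
For readers unfamiliar with the exception path these tests exercise, here is a minimal decode sketch covering the dict-decode, exception-patch, and recombine steps, including a bounds check before indexing the dictionary. The function `decode_sketch` and its parameter layout are assumptions for illustration and do not reproduce the signature of `alp_rd_decode`.

```rust
/// Rebuild f32 values from dictionary codes, right parts, and exceptions.
/// `exceptions` holds (position, raw left bits) pairs; hypothetical layout.
fn decode_sketch(
    dict: &[u16],
    left_codes: &[u8],
    right_parts: &[u32],
    right_bit_width: u32,
    exceptions: &[(usize, u16)],
) -> Vec<f32> {
    assert_eq!(left_codes.len(), right_parts.len());
    assert!(right_bit_width >= 16 && right_bit_width < 32);

    // Step 1: dict-decode, with an explicit bounds check for a clear panic.
    let mut left: Vec<u16> = left_codes
        .iter()
        .map(|&code| {
            let code = usize::from(code);
            assert!(code < dict.len(), "code {} out of range for dictionary of {}", code, dict.len());
            dict[code]
        })
        .collect();

    // Step 2: patch positions whose left bits were not in the dictionary.
    for &(pos, raw) in exceptions {
        left[pos] = raw;
    }

    // Step 3: recombine left and right parts into the original bit pattern.
    left.iter()
        .zip(right_parts)
        .map(|(&l, &r)| f32::from_bits((u32::from(l) << right_bit_width) | r))
        .collect()
}
```

A round-trip check then compares `to_bits()` of each decoded value against the original, which is stricter than comparing the floats themselves.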

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Restore field doc comments on `ALPRDDictionary` (dictionary, left_bit_width,
  right_bit_width) that were stripped during the HashMap→array refactor.
- Add field doc comments on `RDEncoder` (right_bit_width, codes) which were
  previously undocumented.
- Restore inline comments on `EXC_POSITION_SIZE`/`EXC_SIZE` constants clarifying
  that the unit is bits, not bytes.
- Restore and update section-marker comments in `build_left_parts_dictionary`
  that guide readers through the counting → sorting → dict-assignment → exception
  counting → bit-width derivation steps.
- Restore and update section-marker comments in `alp_rd_decode` for the
  dict-decode, exception-patch, and recombine steps.
- Restore the right-mask and split-loop comments in `RDEncoder::split()`.
- Expand `into_parts()` doc to name every element of the returned tuple.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>