Skip to content

Conversation

@s3arthak
Copy link

This PR implements suffix-frequency pruning for ZeroAsciiDenseSparse2dTrieOwned as described in issue #7302.

Summary

The existing builder included all suffixes in the dense matrix, even if a suffix appeared in only one or very few prefixes. This often expanded the dense matrix unnecessarily and increased data size.
This PR introduces a heuristic to select only high-frequency suffixes for the dense representation.

New behavior

A suffix is included in the dense matrix only if it appears in:

max(2, ceil(total_prefixes * 2 / 100))

distinct prefixes.

If no suffix meets this threshold, the builder falls back to selecting the top 64 most frequent suffixes (deterministically sorted).
Final dense suffix ordering is lexicographic, preserving stability with BTreeSet-based iteration.

Implementation details

  • Added two module-level constants:
    • MIN_DENSE_PERCENT = 2
    • FALLBACK_TOP_K = 64
  • Computed suffix frequency across prefixes using BTreeMap<&str, usize> for deterministic ordering.
  • Performed delimiter validation before suffix selection.
  • Populated builder.suffixes with only the filtered suffix set before invoking add_prefix.
  • Maintained lexicographic ordering of suffixes to ensure reproducible binary output.
  • Ensures empty input remains supported and behaves consistently with previous logic.

Tests

Added dense_suffix_filter_test.rs validating:

  • Dense suffix count decreases when low-frequency suffixes dominate.
  • Lookup semantics remain correct for both dense and sparse suffixes.
  • Fallback behavior when no suffix reaches threshold.
  • Empty-input behavior remains valid.

All existing and new tests pass:
cargo test
cargo test --all-features
cargo quick

Rationale

The dense representation is beneficial only when many prefixes share suffixes. Low-frequency suffixes greatly increase the dense matrix size without yielding lookup benefits.
Pruning such suffixes leads to more compact, efficient serialized data while preserving correctness.
This change stays internal and does not modify public API semantics.

Notes for Reviewers

  • Behavior is deterministic due to use of BTreeMap/BTreeSet and explicit sorting.
  • The heuristic is conservative (2% or at least 2 prefixes) and aligns with the problem described in Filter low-frequency suffixes from Dense ZeroTrie #7302.
  • The fallback (top-64 suffixes) prevents degenerate dense matrices on small datasets.

This PR is ready for review.

@s3arthak s3arthak requested a review from a team as a code owner December 11, 2025 20:54
@CLAassistant
Copy link

CLAassistant commented Dec 11, 2025

CLA assistant check
All committers have signed the CLA.

@Manishearth
Copy link
Member

Your PR does not contain any code

@sffc sffc marked this pull request as draft December 11, 2025 21:51
@s3arthak s3arthak marked this pull request as ready for review December 11, 2025 22:34
@s3arthak
Copy link
Author

Note: Updated Cargo.toml to add the dense-prune feature flag required for optional pruning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants