zerotrie: prune low-frequency suffixes from dense matrix (fix #7302) #7307
This PR implements suffix-frequency pruning for ZeroAsciiDenseSparse2dTrieOwned as described in issue #7302.
Summary
The existing builder included all suffixes in the dense matrix, even if a suffix appeared in only one or very few prefixes. This often expanded the dense matrix unnecessarily and increased data size.
This PR introduces a heuristic to select only high-frequency suffixes for the dense representation.
New behavior
A suffix is included in the dense matrix only if it appears in a sufficient share of the distinct prefixes, as governed by MIN_DENSE_PERCENT.
If no suffix meets this threshold, the builder falls back to selecting the top 64 most frequent suffixes (deterministically sorted).
Final dense suffix ordering is lexicographic, preserving stability with BTreeSet-based iteration.
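The selection rule above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the helper name select_dense_suffixes, its signature, and the input shape (a map from prefix to its set of suffixes) are hypothetical, and the exact threshold comparison (count * 100 >= min_percent * number of prefixes) is an assumption based on the constant name MIN_DENSE_PERCENT.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Values taken from the PR description.
pub const MIN_DENSE_PERCENT: usize = 2;
pub const FALLBACK_TOP_K: usize = 64;

/// Hypothetical helper: chooses which suffixes get a dense-matrix column.
/// `entries` maps each prefix to the set of suffixes seen under it.
/// The thresholds are parameters so the rule can be exercised on small inputs.
pub fn select_dense_suffixes<'a>(
    entries: &BTreeMap<&'a str, BTreeSet<&'a str>>,
    min_percent: usize,
    top_k: usize,
) -> BTreeSet<&'a str> {
    // Count the number of distinct prefixes each suffix appears in.
    // A BTreeMap keeps iteration order deterministic.
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for suffixes in entries.values() {
        for &suffix in suffixes {
            *counts.entry(suffix).or_insert(0) += 1;
        }
    }
    let num_prefixes = entries.len();

    // Assumed threshold: the suffix occurs in at least `min_percent`
    // percent of the distinct prefixes.
    let selected: BTreeSet<&'a str> = counts
        .iter()
        .filter(|&(_, &count)| count * 100 >= min_percent * num_prefixes)
        .map(|(&suffix, _)| suffix)
        .collect();
    if !selected.is_empty() {
        return selected;
    }

    // Fallback: take the `top_k` most frequent suffixes, breaking ties
    // lexicographically so the result is deterministic.
    let mut ranked: Vec<(&'a str, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    // Collecting into a BTreeSet yields the final lexicographic ordering.
    ranked.into_iter().take(top_k).map(|(suffix, _)| suffix).collect()
}
```

Under these assumptions, a call such as select_dense_suffixes(&entries, MIN_DENSE_PERCENT, FALLBACK_TOP_K) would return the dense suffix set in lexicographic order regardless of which branch produced it.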
Implementation details
MIN_DENSE_PERCENT = 2
FALLBACK_TOP_K = 64
Suffix frequencies are counted in a BTreeMap<&str, usize> for deterministic ordering.
builder.suffixes is populated with only the filtered suffix set before invoking add_prefix.
Tests
Added dense_suffix_filter_test.rs, which validates the suffix filtering described above.
All existing and new tests pass:
cargo test
cargo test --all-features
cargo quick
Rationale
The dense representation is beneficial only when many prefixes share suffixes. Low-frequency suffixes greatly increase the dense matrix size without yielding lookup benefits.
Pruning such suffixes leads to more compact, efficient serialized data while preserving correctness.
This change stays internal and does not modify public API semantics.
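To make the size argument concrete, here is a small back-of-the-envelope sketch. It assumes the dense matrix reserves one cell for every (prefix, dense suffix) pair; the function dense_fill_ratio is hypothetical and is not part of the zerotrie crate.

```rust
/// Hypothetical helper: fraction of dense-matrix cells that hold a real entry.
/// `column_counts[i]` is the number of distinct prefixes in which the i-th
/// dense suffix actually occurs. Assumes one cell per (prefix, suffix) pair.
pub fn dense_fill_ratio(num_prefixes: usize, column_counts: &[usize]) -> f64 {
    let cells = num_prefixes * column_counts.len();
    if cells == 0 {
        return 0.0;
    }
    let used: usize = column_counts.iter().sum();
    used as f64 / cells as f64
}
```

For example, with 100 prefixes and three suffix columns used by 90, 80, and 1 prefixes, only 171 of 300 cells are occupied. Dropping the rare column leaves 170 of 200 cells occupied, so the matrix shrinks by a third while losing a single dense entry.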
Notes for Reviewers
This PR is ready for review.