-
Notifications
You must be signed in to change notification settings - Fork 255
Feature / Filter-low-frequency suffixes from Dense ZeroTrie #7439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature / Filter-low-frequency suffixes from Dense ZeroTrie #7439
Conversation
sffc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your interest in icu4x!
This MIN_DENSE_PERCENT approach seems like a decent heuristic to get us going. As usual, please make sure to add test coverage.
utils/zerotrie/src/dense.rs
Outdated
| if cursor.is_empty() { | ||
| return row_index; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This is an unnecessary change that I believe is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I'll remove this
utils/zerotrie/src/dense.rs
Outdated
| // The row and column indexes should be in-range | ||
| debug_assert!(false); | ||
| return None; | ||
| let index = row_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Using checked_add and checked_mul is beneficial, but not related to the subject of this PR. Please split those changes to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, i've created #7442 , working their on this.
utils/zerotrie/src/builder/dense.rs
Outdated
| .map(|(&suffix, &count)| (suffix, count)) | ||
| .collect(); | ||
|
|
||
| // If none meet the threshold, fallback to picking top-K by frequency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If none meet the threshold, then I think we should just leave the dense matrix empty.
|
Hey @sffc thanks for the review, i've tested the code (didnt push the tests) |
|
Hey @sffc , i've commited the changes, PTAL. |
This PR addresses #7302
Implements suffix-frequency pruning to exclude rare suffixes from the dense matrix representation of ZeroAsciiDenseSparse2dTrie.