Replace regex splitting with hand-rolled SplitByCharacterClass #144
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #144 +/- ##
==========================================
+ Coverage 80.58% 80.94% +0.35%
==========================================
Files 9 9
Lines 1628 1679 +51
==========================================
+ Hits 1312 1359 +47
- Misses 316 320 +4
Merging this PR will improve performance by 37.78%

| | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| ⚡ | `cut_for_search` | 59.5 µs | 50.4 µs | +18.03% |
| ⚡ | `textrank` | 76.8 µs | 66.9 µs | +14.84% |
| ⚡ | `no_hmm` | 37.9 µs | 27.6 µs | +37.15% |
| ⚡ | `tfidf` | 56.8 µs | 46.5 µs | +22.15% |
| ⚡ | `search_mode` | 59.6 µs | 50.3 µs | +18.67% |
| ⚡ | `with_hmm` | 51.9 µs | 41.6 µs | +24.67% |
| ⚡ | `tag` | 54.9 µs | 44.6 µs | +23.09% |
| ⚡ | `default_mode` | 51.9 µs | 41.6 µs | +24.72% |
| ⚡ | `cut_all` | 34.4 µs | 26.3 µs | +30.84% |
| ⚡ | `single_thread` | 22 ms | 16 ms | +37.78% |
| ⚡ | `multi_thread` | 22 ms | 16.6 ms | +32.24% |

Comparing replace-regex-with-char-class (9babb6a) with main (2ff6b3e)
Pull request overview
This PR replaces several hot-path regex-based splitters in the Jieba tokenizer with inline character-class predicates and a new SplitByCharacterClass iterator, aiming to reduce CPU time spent in regex scanning during tokenization.
Changes:
- Removed the `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_DEFAULT`, and `RE_SKIP_CUT_ALL` regex splitters in `lib.rs`, replacing them with `is_*` classifiers plus `SplitByCharacterClass`.
- Updated the main `cut` and `cut_all` top-level tokenization flows to use the new character-class splitter.
- Replaced HMM `RE_HAN` regex splitting with an `is_hmm_han` classifier, keeping `RE_SKIP` as regex and introducing a dedicated `HmmSkipSplitter`.
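
To make the splitter idea concrete, here is a minimal sketch of an iterator that cuts a string into alternating runs according to a character predicate. It is not the PR's code: the names `SplitByClass` and `Run` are hypothetical stand-ins, and the real `SplitByCharacterClass` may expose a different item type or track byte positions differently.

```rust
/// A run of text, tagged by whether its characters satisfied the predicate.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run<'a> {
    Matched(&'a str),
    Unmatched(&'a str),
}

/// Hypothetical stand-in for the PR's `SplitByCharacterClass`:
/// yields maximal runs of characters that agree on `is_member`.
struct SplitByClass<'a, F: Fn(char) -> bool> {
    text: &'a str,
    pos: usize, // byte offset of the next unread character
    is_member: F,
}

impl<'a, F: Fn(char) -> bool> SplitByClass<'a, F> {
    fn new(text: &'a str, is_member: F) -> Self {
        Self { text, pos: 0, is_member }
    }
}

impl<'a, F: Fn(char) -> bool> Iterator for SplitByClass<'a, F> {
    type Item = Run<'a>;

    fn next(&mut self) -> Option<Run<'a>> {
        let text: &'a str = self.text;
        let rest = &text[self.pos..];
        // An empty remainder ends the iteration.
        let first = rest.chars().next()?;
        let matched = (self.is_member)(first);
        // Byte offset of the first character whose class differs from the
        // class of `first`; if there is none, the run extends to the end.
        let len = rest
            .char_indices()
            .find(|&(_, c)| (self.is_member)(c) != matched)
            .map_or(rest.len(), |(i, _)| i);
        self.pos += len;
        let run = &rest[..len];
        Some(if matched { Run::Matched(run) } else { Run::Unmatched(run) })
    }
}

/// Usage example: split on Unicode alphanumerics (an arbitrary predicate
/// chosen for the demo, not one of the PR's classifiers).
fn alphanumeric_runs(text: &str) -> Vec<Run<'_>> {
    SplitByClass::new(text, |c: char| c.is_alphanumeric()).collect()
}
```

Each call to `next` makes a single forward pass over one run and only calls the predicate per character, which is the kind of work that replaces the regex forward/reverse scans in this approach.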
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `jieba/src/lib.rs` | Introduces `SplitByCharacterClass` and character classifiers; updates `cut`/`cut_all` tokenization to avoid regex. |
| `jieba/src/hmm.rs` | Replaces HMM Han regex with `is_hmm_han`; keeps regex for `RE_SKIP` via `HmmSkipSplitter`. |
2626a13 to 75f4181
Replace `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_CUT_ALL`, and `RE_SKIP_DEFAULT` regexes in `lib.rs` with inline character classifiers (`is_han_default`, `is_han_cut_all`, `is_skip_cut_all`) and a generic `SplitByCharacterClass` iterator.

Also replace HMM `RE_HAN` with `is_hmm_han` classifier, keeping only `RE_SKIP` as regex due to its complex pattern `([a-zA-Z0-9]+(?:.\d+)?%?)`.

Profiling showed regex `find_fwd`/`find_rev` accounted for ~29% of CPU time; this drops to <1% with the character class approach.

Benchmark improvements (vs previous commit on `add-byte-positions`):
- `no_hmm`: 1.22 µs → 1.02 µs (-16%)
- `with_hmm`: 1.82 µs → 1.50 µs (-18%)
- `cut_for_search`: 2.31 µs → 2.00 µs (-14%)
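
As an illustration of what an inline character classifier can look like, the sketch below expresses a regex-style character class as a `matches!` predicate and uses it to cut off the leading run of a string. The function names `is_han_like` and `split_leading_run` are hypothetical, and the ranges are a simplified assumption (two CJK blocks plus ASCII alphanumerics), not the exact sets used by `is_han_default` or the other classifiers in this PR.

```rust
/// Hypothetical classifier, roughly equivalent to the regex character class
/// `[\u{3400}-\u{4DBF}\u{4E00}-\u{9FFF}a-zA-Z0-9]`.
/// The ranges are illustrative assumptions, not the PR's actual sets.
fn is_han_like(c: char) -> bool {
    matches!(
        c,
        '\u{3400}'..='\u{4DBF}'   // CJK Unified Ideographs Extension A
            | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
            | 'a'..='z'
            | 'A'..='Z'
            | '0'..='9'
    )
}

/// Splits `text` into its leading run of `is_han_like` characters and the
/// remainder, without constructing or running a regex.
fn split_leading_run(text: &str) -> (&str, &str) {
    // `str::find` with a closure returns the byte offset of the first
    // character for which the closure is true, here the first non-member.
    let end = text
        .find(|c: char| !is_han_like(c))
        .unwrap_or(text.len());
    text.split_at(end)
}
```

A predicate like this compiles down to a handful of range comparisons per character, which is plausibly why a simple class is cheap to hand-roll while a structured pattern such as `RE_SKIP` is left to the regex engine.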
75f4181 to 9babb6a