Replace regex splitting with hand-rolled SplitByCharacterClass #144

Merged
messense merged 1 commit into main from replace-regex-with-char-class on Apr 19, 2026

Conversation

@messense (Owner)

Replace `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_CUT_ALL`, and `RE_SKIP_DEFAULT` regexes in `lib.rs` with inline character classifiers (`is_han_default`, `is_han_cut_all`, `is_skip_cut_all`) and a generic `SplitByCharacterClass` iterator.
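
For readers unfamiliar with the technique, here is a minimal sketch of what a character-class splitter of this kind could look like. It is not the code merged in this PR: the `Block` enum, the field names, and the reduced character set in `is_han_default` (a single CJK range plus ASCII alphanumerics and a few symbols) are assumptions made for illustration only.

```rust
/// A maximal run of input that either satisfied the classifier or did not.
/// (Hypothetical type, not taken from the PR.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Block<'t> {
    Matched(&'t str),
    Unmatched(&'t str),
}

/// Splits `text` into maximal runs of characters that agree on `classify`.
pub struct SplitByCharacterClass<'t, F: Fn(char) -> bool> {
    text: &'t str,
    pos: usize, // byte offset of the next unread character
    classify: F,
}

impl<'t, F: Fn(char) -> bool> SplitByCharacterClass<'t, F> {
    pub fn new(text: &'t str, classify: F) -> Self {
        Self { text, pos: 0, classify }
    }
}

impl<'t, F: Fn(char) -> bool> Iterator for SplitByCharacterClass<'t, F> {
    type Item = Block<'t>;

    fn next(&mut self) -> Option<Block<'t>> {
        let text: &'t str = self.text;
        let rest = &text[self.pos..];
        // The first character decides the class of this run.
        let first = rest.chars().next()?;
        let matched = (self.classify)(first);
        // The run ends at the first character whose class differs.
        let len = rest
            .char_indices()
            .find(|&(_, c)| (self.classify)(c) != matched)
            .map_or(rest.len(), |(i, _)| i);
        self.pos += len;
        let run = &rest[..len];
        Some(if matched { Block::Matched(run) } else { Block::Unmatched(run) })
    }
}

/// Simplified stand-in for `is_han_default`; the exact ranges are an assumption.
pub fn is_han_default(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
        || c.is_ascii_alphanumeric()
        || matches!(c, '+' | '#' | '&' | '.' | '_' | '%' | '-')
}
```

Under these assumptions, `SplitByCharacterClass::new("我们 love 中文!", is_han_default)` yields alternating `Matched` and `Unmatched` blocks in a single forward pass, with no regex automaton involved. Each character is classified by a few range comparisons, which is the kind of work the profile numbers below suggest is much cheaper than regex scanning.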

Also replace HMM `RE_HAN` with `is_hmm_han` classifier, keeping only `RE_SKIP` as regex due to its complex pattern `([a-zA-Z0-9]+(?:.\d+)?%?)`.
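
To illustrate why a single-range Han pattern is easy to replace while `RE_SKIP` is not, here is a hedged sketch. The range U+4E00..=U+9FD5 is an assumption (it mirrors the range used by the original Python jieba HMM module, and may differ from this crate), and `split_hmm_han` is a hypothetical helper, not a function from the PR.

```rust
/// Hypothetical stand-in for `is_hmm_han`; the range is an assumption.
fn is_hmm_han(c: char) -> bool {
    ('\u{4E00}'..='\u{9FD5}').contains(&c)
}

/// Splits `text` into `(is_han, run)` pairs of maximal same-class runs.
fn split_hmm_han(text: &str) -> Vec<(bool, &str)> {
    let mut out = Vec::new();
    let mut start = 0; // byte offset where the current run began
    let mut current: Option<bool> = None; // class of the current run
    for (i, c) in text.char_indices() {
        let han = is_hmm_han(c);
        match current {
            // Class changed: close the previous run and start a new one.
            Some(prev) if prev != han => {
                out.push((prev, &text[start..i]));
                start = i;
                current = Some(han);
            }
            // First character of the input.
            None => current = Some(han),
            // Same class as before: keep extending the run.
            _ => {}
        }
    }
    if let Some(prev) = current {
        out.push((prev, &text[start..]));
    }
    out
}
```

With this sketch, `split_hmm_han("今天3.5%涨幅")` separates the Han runs from the non-Han run `3.5%`. A per-character predicate is enough for that split, whereas recognising `3.5%` as one token needs a repeated class, an optional group, and an optional suffix, which is presumably why the regex was kept for that step.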

Profiling showed regex `find_fwd`/`find_rev` accounted for ~29% of CPU time; this drops to <1% with the character class approach.

Benchmark improvements (vs previous commit on `add-byte-positions`):

  • `no_hmm`: 1.22 µs → 1.02 µs (-16%)
  • `with_hmm`: 1.82 µs → 1.50 µs (-18%)
  • `cut_for_search`: 2.31 µs → 2.00 µs (-14%)

@codecov

codecov Bot commented Apr 19, 2026

Codecov Report

❌ Patch coverage is 95.55556% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.94%. Comparing base (2ff6b3e) to head (9babb6a).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| jieba/src/lib.rs | 95.48% | 6 Missing ⚠️ |
| jieba/src/hmm.rs | 95.74% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #144      +/-   ##
==========================================
+ Coverage   80.58%   80.94%   +0.35%     
==========================================
  Files           9        9              
  Lines        1628     1679      +51     
==========================================
+ Hits         1312     1359      +47     
- Misses        316      320       +4     

@codspeed-hq

codspeed-hq Bot commented Apr 19, 2026

Merging this PR will improve performance by 37.78%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 11 improved benchmarks
✅ 1 untouched benchmark

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| cut_for_search | 59.5 µs | 50.4 µs | +18.03% |
| textrank | 76.8 µs | 66.9 µs | +14.84% |
| no_hmm | 37.9 µs | 27.6 µs | +37.15% |
| tfidf | 56.8 µs | 46.5 µs | +22.15% |
| search_mode | 59.6 µs | 50.3 µs | +18.67% |
| with_hmm | 51.9 µs | 41.6 µs | +24.67% |
| tag | 54.9 µs | 44.6 µs | +23.09% |
| default_mode | 51.9 µs | 41.6 µs | +24.72% |
| cut_all | 34.4 µs | 26.3 µs | +30.84% |
| single_thread | 22 ms | 16 ms | +37.78% |
| multi_thread | 22 ms | 16.6 ms | +32.24% |
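
The report does not state how the Efficiency column is computed. The figures appear consistent with a speedup ratio, `base / head - 1`, evaluated on unrounded timings. The sketch below checks that reading against the rounded values shown in the table; both function names are hypothetical and the 0.5 percentage point tolerance is an assumption to absorb rounding.

```rust
/// Relative speedup of HEAD over BASE in percent: (base / head - 1) * 100.
fn speedup_percent(base: f64, head: f64) -> f64 {
    (base / head - 1.0) * 100.0
}

/// Returns true when the reported figure is within 0.5 percentage points
/// of the value recomputed from the rounded timings.
fn matches_reported(base: f64, head: f64, reported: f64) -> bool {
    (speedup_percent(base, head) - reported).abs() < 0.5
}
```

Note that this is a different measure from the percentages in the PR description, which express the reduction in time relative to the old timing, so the two sets of figures are not directly comparable.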

Comparing replace-regex-with-char-class (9babb6a) with main (2ff6b3e)

Copilot AI left a comment

Pull request overview

This PR replaces several hot-path regex-based splitters in the Jieba tokenizer with inline character-class predicates and a new `SplitByCharacterClass` iterator, aiming to reduce CPU time spent in regex scanning during tokenization.

Changes:

  • Removed `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_DEFAULT`, and `RE_SKIP_CUT_ALL` regex splitters in `lib.rs`, replacing them with `is_*` classifiers plus `SplitByCharacterClass`.
  • Updated the main `cut` and `cut_all` top-level tokenization flows to use the new character-class splitter.
  • Replaced HMM `RE_HAN` regex splitting with an `is_hmm_han` classifier, keeping `RE_SKIP` as regex and introducing a dedicated `HmmSkipSplitter`.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| jieba/src/lib.rs | Introduces `SplitByCharacterClass` and character classifiers; updates `cut`/`cut_all` tokenization to avoid regex. |
| jieba/src/hmm.rs | Replaces HMM Han regex with `is_hmm_han`; keeps regex for `RE_SKIP` via `HmmSkipSplitter`. |

Comment thread jieba/src/lib.rs Outdated
Comment thread jieba/src/hmm.rs Outdated
@messense messense force-pushed the replace-regex-with-char-class branch from 2626a13 to 75f4181 Compare April 19, 2026 05:23
@messense messense force-pushed the replace-regex-with-char-class branch from 75f4181 to 9babb6a Compare April 19, 2026 05:33
@messense messense merged commit 1f4e325 into main Apr 19, 2026
10 checks passed
@messense messense deleted the replace-regex-with-char-class branch April 19, 2026 05:35