Replace regex splitting with hand-rolled `SplitByCharacterClass` by messense · Pull Request #144 · messense/jieba-rs

messense · 2026-04-19T05:12:39Z

Replace the `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_CUT_ALL`, and `RE_SKIP_DEFAULT` regexes in `lib.rs` with inline character classifiers (`is_han_default`, `is_han_cut_all`, `is_skip_cut_all`) and a generic `SplitByCharacterClass` iterator.

Also replace the HMM `RE_HAN` regex with an `is_hmm_han` classifier, keeping only `RE_SKIP` as a regex due to its complex pattern `([a-zA-Z0-9]+(?:.\d+)?%?)`.

Profiling showed that regex `find_fwd`/`find_rev` accounted for ~29% of CPU time; this drops to <1% with the character class approach.

Benchmark improvements (vs previous commit on `add-byte-positions`):

- `no_hmm`: 1.22 µs → 1.02 µs (-16%)
- `with_hmm`: 1.82 µs → 1.50 µs (-18%)
- `cut_for_search`: 2.31 µs → 2.00 µs (-14%)

codecov · 2026-04-19T05:13:49Z

Codecov Report

❌ Patch coverage is 95.55556% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.94%. Comparing base (2ff6b3e) to head (9babb6a).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| jieba/src/lib.rs | 95.48% | 6 Missing ⚠️ |
| jieba/src/hmm.rs | 95.74% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #144      +/-   ##
==========================================
+ Coverage   80.58%   80.94%   +0.35%     
==========================================
  Files           9        9              
  Lines        1628     1679      +51     
==========================================
+ Hits         1312     1359      +47     
- Misses        316      320       +4

codspeed-hq · 2026-04-19T05:13:59Z

Merging this PR will improve performance by 37.78%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

⚡ 11 improved benchmarks
✅ 1 untouched benchmark

Performance Changes

| | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|
| ⚡ | `cut_for_search` | 59.5 µs | 50.4 µs | +18.03% |
| ⚡ | `textrank` | 76.8 µs | 66.9 µs | +14.84% |
| ⚡ | `no_hmm` | 37.9 µs | 27.6 µs | +37.15% |
| ⚡ | `tfidf` | 56.8 µs | 46.5 µs | +22.15% |
| ⚡ | `search_mode` | 59.6 µs | 50.3 µs | +18.67% |
| ⚡ | `with_hmm` | 51.9 µs | 41.6 µs | +24.67% |
| ⚡ | `tag` | 54.9 µs | 44.6 µs | +23.09% |
| ⚡ | `default_mode` | 51.9 µs | 41.6 µs | +24.72% |
| ⚡ | `cut_all` | 34.4 µs | 26.3 µs | +30.84% |
| ⚡ | `single_thread` | 22 ms | 16 ms | +37.78% |
| ⚡ | `multi_thread` | 22 ms | 16.6 ms | +32.24% |

Comparing `replace-regex-with-char-class` (9babb6a) with `main` (2ff6b3e)
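
The Efficiency figures in the CodSpeed table appear consistent with a relative speedup of `BASE / HEAD - 1`. That reading is an assumption, since the report does not state the formula. The sketch below recomputes a few rows from the rounded timings shown; `efficiency_percent` is a hypothetical helper, and small differences from the reported values are expected because the displayed timings are rounded.

```rust
/// Relative speedup in percent, computed as base / head - 1.
/// Assumption: this matches how the "Efficiency" column is derived;
/// the report itself does not state the formula.
fn efficiency_percent(base: f64, head: f64) -> f64 {
    (base / head - 1.0) * 100.0
}

fn main() {
    // Rounded timings in microseconds, copied from the CodSpeed table.
    let rows = [
        ("cut_for_search", 59.5, 50.4),
        ("no_hmm", 37.9, 27.6),
        ("cut_all", 34.4, 26.3),
    ];
    for (name, base, head) in rows {
        println!("{name}: {:+.2}%", efficiency_percent(base, head));
    }
}
```

Note that the percentages in the PR description are reductions in run time relative to the old value, so they are not directly comparable with these speedup figures.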

Copilot

Pull request overview

This PR replaces several hot-path regex-based splitters in the Jieba tokenizer with inline character-class predicates and a new SplitByCharacterClass iterator, aiming to reduce CPU time spent in regex scanning during tokenization.

Changes:

- Removed the `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_DEFAULT`, and `RE_SKIP_CUT_ALL` regex splitters in `lib.rs`, replacing them with `is_*` classifiers plus `SplitByCharacterClass`.
- Updated the main `cut` and `cut_all` top-level tokenization flows to use the new character-class splitter.
- Replaced HMM `RE_HAN` regex splitting with an `is_hmm_han` classifier, keeping `RE_SKIP` as a regex and introducing a dedicated `HmmSkipSplitter`.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `jieba/src/lib.rs` | Introduces `SplitByCharacterClass` and character classifiers; updates `cut`/`cut_all` tokenization to avoid regex. |
| `jieba/src/hmm.rs` | Replaces HMM Han regex with `is_hmm_han`; keeps regex for `RE_SKIP` via `HmmSkipSplitter`. |

messense requested a review from Copilot April 19, 2026 05:12

Copilot started reviewing on behalf of messense April 19, 2026 05:13 View session

Copilot AI reviewed Apr 19, 2026

Comment thread jieba/src/lib.rs Outdated

Comment thread jieba/src/hmm.rs Outdated

messense force-pushed the replace-regex-with-char-class branch from 2626a13 to 75f4181 Compare April 19, 2026 05:23

messense force-pushed the replace-regex-with-char-class branch from 75f4181 to 9babb6a Compare April 19, 2026 05:33

messense merged commit 1f4e325 into main Apr 19, 2026
10 checks passed

messense deleted the replace-regex-with-char-class branch April 19, 2026 05:35
