feat: add posseg (part-of-speech tagging) HMM for OOV words #146

Merged
messense merged 1 commit into main from feat/posseg on Apr 19, 2026

@messense (Owner)

Implement a compound-state HMM Viterbi (4 positions × 64 POS tags = 256 states) for POS-tagging unknown/OOV Chinese words, ported from Python jieba's posseg module.
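To make the 4 × 64 = 256 figure concrete, here is a minimal sketch of how a (position, tag) pair could be packed into a single state index. The position-major layout, the B/M/E/S reading of the four positions, and the names `encode_state` and `decode_state` are assumptions for illustration only; the actual layout in `posseg.rs` may differ.

```rust
/// Sizes taken from the PR description: 4 positions x 64 POS tags.
const NUM_POSITIONS: usize = 4; // assumed to be B, M, E, S as in jieba
const NUM_TAGS: usize = 64;
const NUM_STATES: usize = NUM_POSITIONS * NUM_TAGS; // 256

/// Hypothetical position-major packing: index = position * 64 + tag.
fn encode_state(position: usize, tag: usize) -> usize {
    debug_assert!(position < NUM_POSITIONS && tag < NUM_TAGS);
    position * NUM_TAGS + tag
}

/// Inverse of `encode_state`: recover (position, tag) from a state index.
fn decode_state(state: usize) -> (usize, usize) {
    debug_assert!(state < NUM_STATES);
    (state / NUM_TAGS, state % NUM_TAGS)
}
```

With 256 states, every index fits in a `u8`, which keeps backpointer tables compact if that matters for the benchmark.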

  • Add `posseg.rs` with dense 256×256 transition matrix for O(1) lookup
  • Embed probability data (start, trans, emit, char_state_tab) via `include_flate`, gated on `default-dict` feature
  • Improve `tag()` to use posseg HMM for OOV CJK words instead of falling back to `"x"` (e.g. `"张尧"` now correctly tagged as `"nr"`/person name)
  • Add conversion script (`scripts/convert_posseg.py`) for regenerating `posseg.txt` from Python jieba's pickle files
  • Add benchmark for `tag_with_oov` (~2.5µs per call)
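The dense-matrix idea in the first bullet can be sketched as a generic Viterbi decoder. This is not the code from `posseg.rs`: the function name `viterbi_dense`, the row-major layout, and the use of `f64` log-probabilities are assumptions chosen for illustration. Storing transitions as a flat `n * n` slice turns each lookup into a single index computation (`prev * n + next`) rather than a map probe.

```rust
/// Viterbi over `n` states with a dense, row-major `n * n` transition matrix.
/// All values are log-probabilities:
/// - `start[s]`: log-prob of starting in state `s`
/// - `trans[p * n + s]`: log-prob of moving from state `p` to state `s`
/// - `emit[t][s]`: log-prob of state `s` emitting the observation at step `t`
/// Returns the most likely state sequence, one state per observation.
fn viterbi_dense(n: usize, start: &[f64], trans: &[f64], emit: &[Vec<f64>]) -> Vec<usize> {
    let steps = emit.len();
    if steps == 0 {
        return Vec::new();
    }
    // Best score of any path ending in each state at the current step.
    let mut score: Vec<f64> = (0..n).map(|s| start[s] + emit[0][s]).collect();
    // back[t][s] = best predecessor of state `s` at step `t`.
    let mut back = vec![vec![0usize; n]; steps];
    for t in 1..steps {
        let mut next = vec![f64::NEG_INFINITY; n];
        for s in 0..n {
            let mut best = f64::NEG_INFINITY;
            let mut best_prev = 0;
            for p in 0..n {
                // O(1) dense lookup of the p -> s transition.
                let cand = score[p] + trans[p * n + s];
                if cand > best {
                    best = cand;
                    best_prev = p;
                }
            }
            next[s] = best + emit[t][s];
            back[t][s] = best_prev;
        }
        score = next;
    }
    // Pick the best final state, then follow backpointers to the start.
    let mut state = 0;
    for s in 1..n {
        if score[s] > score[state] {
            state = s;
        }
    }
    let mut path = vec![state; steps];
    for t in (1..steps).rev() {
        state = back[t][state];
        path[t - 1] = state;
    }
    path
}
```

For the PR's setting, `n` would be 256 and each decoded state would map back to a (position, tag) pair. A real implementation would likely restrict the candidate states per character (the description mentions `char_state_tab`) instead of scanning all 256 at every step.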

@codspeed-hq

codspeed-hq Bot commented Apr 19, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks
🆕 1 new benchmark

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| 🆕 tag_with_oov | N/A | 78.2 µs | N/A |

Comparing feat/posseg (18a9fff) with main (9e3965b)

@codecov

codecov Bot commented Apr 19, 2026

Codecov Report

❌ Patch coverage is 94.01993% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.85%. Comparing base (9e3965b) to head (18a9fff).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| jieba/src/posseg.rs | 94.92% | 14 Missing ⚠️ |
| jieba/src/lib.rs | 84.00% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #146      +/-   ##
==========================================
+ Coverage   80.99%   82.85%   +1.85%     
==========================================
  Files           9       10       +1     
  Lines        1684     1971     +287     
==========================================
+ Hits         1364     1633     +269     
- Misses        320      338      +18     

@messense merged commit 87db1f9 into main on Apr 19, 2026 (10 checks passed)
@messense deleted the feat/posseg branch on April 19, 2026 07:13