Add non-complex segmenter constructors #7268

robertbastian · 2025-12-03T22:22:11Z

Users already do this with custom data (e.g. linebender/parley#436); we should provide an easier way.

Manishearth

A bit worried about the large number of ctors we have on Segmenter.

@sffc , thoughts?

sffc

There's already an issue for this: #3612

I don't like the name "empty". Maybe it's okay. Since the only thing this PR does is add the new constructors, I don't see value in merging it until we've agreed on the name for it.

sffc · 2025-12-04T00:06:26Z

Also, reviewing the Parley thread, I don't think they should be using these new constructors? One of the reasons they want to use icu_segmenter is because it supports non-rule-based segmentation. Disabling it seems counterproductive. If they are observing a multi-MB size increase, it means they might be linking the dictionary data or something. The LSTM is much less than that.

robertbastian · 2025-12-04T10:17:45Z

> I don't like the name "empty"

The "empty" constructor is on the internal ComplexPayloads type, it's not public API.

> One of the reasons they want to use icu_segmenter is because it supports non-rule-based segmentation. Disabling it seems counterproductive.

They do want to put complex segmentation behind a feature flag or BYOD though, which will be much easier if they don't have to do this through a single API by conditionally changing data provider behaviour.
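A minimal sketch of the feature-flag approach described above, under stated assumptions: `SegmenterKind`, `build_segmenter_kind`, and the Cargo feature name `complex_scripts` are all hypothetical and are not part of icu_segmenter or Parley. It only illustrates picking a constructor at compile time instead of changing data provider behaviour at run time.

```rust
/// Hypothetical stand-in for "which constructor was used"; not the icu_segmenter API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SegmenterKind {
    /// Rule-based only: no LSTM or dictionary data is referenced.
    NonComplex,
    /// Rules plus complex-script models.
    Auto,
}

/// Compiled only when the (hypothetical) `complex_scripts` feature is enabled.
#[cfg(feature = "complex_scripts")]
pub fn build_segmenter_kind() -> SegmenterKind {
    SegmenterKind::Auto
}

/// Compiled when the feature is disabled, so the complex path is never built.
#[cfg(not(feature = "complex_scripts"))]
pub fn build_segmenter_kind() -> SegmenterKind {
    SegmenterKind::NonComplex
}
```

With a dedicated constructor, each `cfg` branch can call a different function, so the disabled branch never references the complex-script data at all.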

sffc

This is good modulo the function name which we can discuss today.

naming resolved

components/segmenter/src/line.rs

components/segmenter/src/word.rs

ffi/capi/src/segmenter_line.rs

ffi/capi/src/segmenter_word.rs

Co-authored-by: Shane F. Carr <[email protected]>

Users already do this with custom data (e.g. linebender/parley#436); we should provide an easier way. Fixes unicode-org#3612

sffc · 2025-12-18T21:27:14Z

Would you mind listing the new APIs added by this PR?

I think it is the new_for_non_complex_scripts constructors and some default fns on options structs that already implemented the Default trait. Right?

robertbastian · 2025-12-19T14:54:50Z

Changed APIs:

new LineSegmenter::new_for_non_complex_scripts + buffer/unstable variants
new WordSegmenter::new_for_non_complex_scripts + buffer/unstable variants
LineBreakOptions::default becomes const
WordBreakOptions::default becomes const
WordBreakInvariantOptions::default becomes const
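For readers unfamiliar with the "default becomes const" pattern, here is a hedged sketch. `DemoBreakOptions` and its `strict` field are hypothetical and are not taken from icu_segmenter. The sketch assumes the usual approach: an inherent `const fn default` sits alongside the existing `Default` impl, because trait methods cannot be called in const contexts on stable Rust.

```rust
/// Hypothetical options struct; the field is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DemoBreakOptions {
    pub strict: bool,
}

impl DemoBreakOptions {
    /// Inherent `const fn`, so it can be called in const contexts.
    pub const fn default() -> Self {
        Self { strict: false }
    }
}

impl Default for DemoBreakOptions {
    fn default() -> Self {
        // Inherent associated functions take priority over trait ones in
        // path resolution, so this calls the `const fn` above and does not recurse.
        Self::default()
    }
}

/// Usable in a `const` item only because the inherent `default` is `const`.
pub const DEMO_DEFAULTS: DemoBreakOptions = DemoBreakOptions::default();
```

Existing callers of `Default::default()` keep working, while callers that need a compile-time value can use the inherent function.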

mihnita · 2025-12-19T16:54:13Z

And that's how a library ends up with inconsistent, untidy APIs.

One of the basic rules for an i18n library should be that all locales are treated the same when I look at the public APIs.
In the implementation a lib can look and say: this locale is for a non-complex script (or requires a dictionary, or has an ML model, whatever) and treat it differently.

As a dev I should not need to know what kind of locale something is.
Otherwise I am forced to write code like this:

if is_complex_script(locale)
   seg = WordSegmenter.new_some_other_kind(locale)
else
   seg = WordSegmenter.new_for_non_complex_scripts(locale)

And how do I know if a locale "is_complex_script"?
I hard-code a list somewhere?

The proper solution is to hide all of this in the regular constructor, in the library:

new_universal_constructor(locale)
   if is_complex_script(locale)
      seg = WordSegmenter.new_some_other_kind(locale)
   else
      seg = WordSegmenter.new_for_non_complex_scripts(locale)

private new_for_non_complex_scripts(...)
private new_some_other_kind(...)

robertbastian · 2025-12-19T17:16:14Z

> Otherwise I am forced to write code like this

> The proper solution is to hide all of this in the regular constructor, in the library

You don't seem to have all the context here. We already have the proper solution: our main constructor is called try_new_auto and automatically handles all scripts. The new constructor is specifically for clients who do not want to pay the binary and data size cost of including LSTMs or dictionaries, because they know they will never need to segment text in complex languages (which are listed in the docs). Those users exist; not every project is localized into all the world's languages or renders arbitrary user input.
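The trade-off between the two constructors can be sketched roughly as follows. This is a toy model, not the icu_segmenter implementation: `ToyComplexPayloads`, `ToySegmenter`, `load_all`, and `supports_complex_scripts` are hypothetical names, loosely modelled on the internal `ComplexPayloads::empty` mentioned earlier in the thread.

```rust
/// Hypothetical stand-in for the internal complex-script payloads.
pub struct ToyComplexPayloads {
    /// Names of loaded models; a real implementation would hold data payloads.
    models: Vec<&'static str>,
}

impl ToyComplexPayloads {
    /// No models at all: nothing to load, nothing to link.
    pub const fn empty() -> Self {
        Self { models: Vec::new() }
    }

    /// Pretends to load every complex-script model.
    pub fn load_all() -> Self {
        Self { models: vec!["lstm", "dictionary"] }
    }
}

/// Hypothetical segmenter that only records which payloads it holds.
pub struct ToySegmenter {
    complex: ToyComplexPayloads,
}

impl ToySegmenter {
    /// Handles all scripts, at the cost of carrying the complex models.
    pub fn new_auto() -> Self {
        Self { complex: ToyComplexPayloads::load_all() }
    }

    /// Rule-based only; never touches complex-script data.
    pub const fn new_for_non_complex_scripts() -> Self {
        Self { complex: ToyComplexPayloads::empty() }
    }

    /// True if any complex-script model is available.
    pub fn supports_complex_scripts(&self) -> bool {
        !self.complex.models.is_empty()
    }
}
```

In this model the caller never inspects the locale. The choice is made once per application, based on whether complex-script support is needed at all.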

new_non_complex

c93669e

robertbastian force-pushed the segmenter branch from 1809310 to c93669e Compare December 3, 2025 22:35

robertbastian marked this pull request as ready for review December 3, 2025 22:41

robertbastian requested review from a team, Manishearth, aethanyc, makotokato and sffc as code owners December 3, 2025 22:41

robertbastian removed request for aethanyc and makotokato December 3, 2025 22:41

Manishearth approved these changes Dec 3, 2025

sffc previously requested changes Dec 4, 2025

sffc mentioned this pull request Dec 4, 2025

Add a segmenter constructor for pure rule-based line and word segmentation #3612

Closed

robertbastian requested a review from sffc December 4, 2025 10:18

sffc reviewed Dec 4, 2025

Merge remote-tracking branch 'upstream/main' into segmenter

41ccd85

robertbastian force-pushed the segmenter branch 2 times, most recently from f98d0a1 to 2f5653e Compare December 9, 2025 14:28

const

f4187e9

robertbastian force-pushed the segmenter branch from 2f5653e to f4187e9 Compare December 9, 2025 14:31

robertbastian mentioned this pull request Dec 9, 2025

Use ICU4X built-in data linebender/parley#482

Draft

rename

b708034

robertbastian force-pushed the segmenter branch from ca2a04b to b708034 Compare December 18, 2025 12:22

robertbastian requested a review from sffc December 18, 2025 12:22

robertbastian added 3 commits December 18, 2025 14:07

Merge branch 'main' into segmenter

7bb46a6

bump

c3ae442

ugh

0d2971d

robertbastian requested a review from Manishearth December 18, 2025 16:53

sffc approved these changes Dec 18, 2025

robertbastian and others added 2 commits December 18, 2025 19:51

Apply suggestions from code review

6611d85

Co-authored-by: Shane F. Carr <[email protected]>

gen

9a4d136

Manishearth approved these changes Dec 18, 2025

robertbastian merged commit 775be2f into unicode-org:main Dec 18, 2025
31 checks passed

robertbastian deleted the segmenter branch December 18, 2025 19:09

robertbastian added a commit to robertbastian/icu4x that referenced this pull request Dec 18, 2025

Add non-complex segmenter constructors (unicode-org#7268)

09a6d2d

