@robertbastian
Member

@robertbastian robertbastian commented Dec 3, 2025

Users already do this with custom data (e.g. linebender/parley#436); we should provide an easier way.

Fixes #3612

Member

@Manishearth Manishearth left a comment

A bit worried about the large number of ctors we have on Segmenter.

@sffc , thoughts?

sffc previously requested changes Dec 4, 2025
Member

@sffc sffc left a comment

There's already an issue for this: #3612

I don't like the name "empty". Maybe it's okay. Since the only thing this PR does is add the new constructors, I don't see value in merging it until we've agreed on the name for it.

@sffc
Member

sffc commented Dec 4, 2025

Also, reviewing the Parley thread, I don't think they should be using these new constructors? One of the reasons they want to use icu_segmenter is because it supports non-rule-based segmentation. Disabling it seems counterproductive. If they are observing a multi-MB size increase, it means they might be linking the dictionary data or something. The LSTM is much less than that.

@robertbastian
Member Author

> I don't like the name "empty"

The "empty" constructor is on the internal ComplexPayloads type, it's not public API.

> One of the reasons they want to use icu_segmenter is because it supports non-rule-based segmentation. Disabling it seems counterproductive.

They do want to put complex segmentation behind a feature flag or BYOD though, which will be much easier if they don't have to do this through a single API by conditionally changing data provider behaviour.
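
One way to picture the feature-flag setup described above is the following minimal sketch. It does not use the icu_segmenter API: `SegmenterKind`, `make_segmenter`, and the Cargo feature name `complex_scripts` are all hypothetical, chosen only to show how a client could select a constructor at compile time instead of altering data provider behaviour.

```rust
/// Hypothetical stand-in for "which segmenter did we build"; not an ICU4X type.
#[derive(Debug, PartialEq, Eq)]
pub enum SegmenterKind {
    /// Rule-based only: no LSTM or dictionary data is referenced.
    RuleBasedOnly,
    /// Rules plus complex-script models.
    WithComplexScripts,
}

/// Compiled only when the (hypothetical) `complex_scripts` feature is enabled.
#[cfg(feature = "complex_scripts")]
pub fn make_segmenter() -> SegmenterKind {
    SegmenterKind::WithComplexScripts
}

/// Compiled when the feature is disabled, so the code path that would pull in
/// complex-script data is never part of the build.
#[cfg(not(feature = "complex_scripts"))]
pub fn make_segmenter() -> SegmenterKind {
    SegmenterKind::RuleBasedOnly
}
```

Because exactly one `make_segmenter` exists in any given build, call sites stay the same regardless of whether the feature is enabled.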

@robertbastian robertbastian requested a review from sffc December 4, 2025 10:18
Member

@sffc sffc left a comment

This is good modulo the function name which we can discuss today.

@robertbastian robertbastian force-pushed the segmenter branch 2 times, most recently from f98d0a1 to 2f5653e Compare December 9, 2025 14:28
@robertbastian robertbastian dismissed sffc’s stale review December 18, 2025 16:45

naming resolved

@robertbastian robertbastian merged commit 775be2f into unicode-org:main Dec 18, 2025
31 checks passed
@robertbastian robertbastian deleted the segmenter branch December 18, 2025 19:09
robertbastian added a commit to robertbastian/icu4x that referenced this pull request Dec 18, 2025
@sffc
Member

sffc commented Dec 18, 2025

Would you mind listing the new APIs added by this PR?

I think it is the new_for_non_complex_scripts constructors and some default fns on options structs that already implemented the Default trait. Right?

@robertbastian
Member Author

robertbastian commented Dec 19, 2025

Changed APIs:

  • new LineSegmenter::new_for_non_complex_scripts + buffer/unstable variants
  • new WordSegmenter::new_for_non_complex_scripts + buffer/unstable variants
  • LineBreakOptions::default becomes const
  • WordBreakOptions::default becomes const
  • WordBreakInvariantOptions::default becomes const
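
For readers unfamiliar with the "default becomes const" pattern, here is a minimal sketch of what it could look like. `BreakOptions` and its fields are hypothetical and are not the real ICU4X options structs. The idea is that an inherent `const fn default()` sits next to the existing `Default` impl, so the default value can also be used in `const` and `static` initializers.

```rust
/// Hypothetical options struct standing in for types like `LineBreakOptions`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BreakOptions {
    pub strict: bool,
    pub max_lookahead: u8,
}

impl BreakOptions {
    /// Inherent `const fn`, usable in constant contexts.
    pub const fn default() -> Self {
        Self { strict: false, max_lookahead: 4 }
    }
}

impl Default for BreakOptions {
    /// Delegates to the inherent function. Inherent associated functions take
    /// precedence in path resolution, so this call is not recursive.
    fn default() -> Self {
        Self::default()
    }
}

/// Only possible because the inherent `default` is a `const fn`.
pub const DEFAULTS: BreakOptions = BreakOptions::default();
```

Existing callers of `Default::default()` keep working unchanged, since the trait impl is still present.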

@mihnita
Contributor

mihnita commented Dec 19, 2025

And that's how a library ends up with inconsistent, untidy APIs.

One of the basic rules for an i18n library should be that all locales are treated the same when I look at the public APIs.
In the implementation a lib can look and say: this locale is for a non-complex script (or requires a dictionary, or has an ML model, whatever) and treat it differently.

As a dev I should not know what kind of a locale something is.
Otherwise I am forced to write code like this:

if is_complex_script(locale)
   seg = WordSegmenter.new_some_other_kind(locale)
else
   seg = WordSegmenter.new_for_non_complex_scripts(locale)

And how do I know if a locale "is_complex_script"?
I hard-code a list somewhere?

The proper solution is to hide all of this in the regular constructor, in the library:

new_universal_constructor(locale)
   if is_complex_script(locale)
      seg = WordSegmenter.new_some_other_kind(locale)
   else
      seg = WordSegmenter.new_for_non_complex_scripts(locale)

private new_for_non_complex_scripts(...)
private new_some_other_kind(...)

@robertbastian
Member Author

> Otherwise I am forced to write code like this

> The proper solution is to hide all of this in the regular constructor, in the library

You don't seem to have all the context here. We have the proper solution: our main constructor is called try_new_auto and automatically handles all scripts. The new constructor is specifically for clients who do not want to pay the binary and data size cost of including LSTMs or dictionaries, because they know they will never need to segment text in complex languages (which are listed in the docs). Those users exist; not every project is localized into all the world's languages or renders arbitrary user input.
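
The trade-off described here can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySegmenter`, `TOY_DICTIONARY`, and `split_with_dictionary` are hypothetical, the "dictionary" is two hard-coded Thai words, and none of this reflects how icu_segmenter is implemented. It shows an all-scripts constructor that carries complex-script data, and a second constructor that omits it and falls back to simple rules.

```rust
/// Hypothetical two-word "dictionary" standing in for complex-script data.
const TOY_DICTIONARY: &[&str] = &["ภาษา", "ไทย"];

/// Toy segmenter; illustrative only, not the ICU4X type.
pub struct ToySegmenter {
    /// `None` means no complex-script data was included.
    dictionary: Option<&'static [&'static str]>,
}

impl ToySegmenter {
    /// Analogue of an "auto" constructor: handles every script.
    pub fn new_auto() -> Self {
        Self { dictionary: Some(TOY_DICTIONARY) }
    }

    /// Analogue of a constructor for callers that never see complex scripts.
    pub fn new_for_non_complex_scripts() -> Self {
        Self { dictionary: None }
    }

    /// Splits on whitespace, then refines tokens with the dictionary if present.
    pub fn segment<'a>(&self, text: &'a str) -> Vec<&'a str> {
        let mut out = Vec::new();
        for token in text.split_whitespace() {
            match self.dictionary.and_then(|d| split_with_dictionary(token, d)) {
                Some(pieces) => out.extend(pieces),
                None => out.push(token),
            }
        }
        out
    }
}

/// Greedy longest-match split; returns `None` if the token cannot be fully covered.
fn split_with_dictionary<'a>(token: &'a str, dict: &[&str]) -> Option<Vec<&'a str>> {
    let mut pieces = Vec::new();
    let mut rest = token;
    while !rest.is_empty() {
        // Pick the longest non-empty dictionary word that prefixes the remaining text.
        let word = dict
            .iter()
            .filter(|w| !w.is_empty() && rest.starts_with(**w))
            .max_by_key(|w| w.len())?;
        let (head, tail) = rest.split_at(word.len());
        pieces.push(head);
        rest = tail;
    }
    Some(pieces)
}
```

With the input "hello ภาษาไทย", the all-scripts variant splits the Thai run into two words, while the reduced variant leaves it as one token. That is acceptable only for callers who know such text never reaches them.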

Development

Successfully merging this pull request may close these issues.

Add a segmenter constructor for pure rule-based line and word segmentation

4 participants