Custom segmentation for words.separated.like.this? #7373

recatek · 2026-01-05T16:18:51Z

I'm trying to replicate the behavior seen in code editors, browser bars, etc. where words.separated.like.this can be effectively segmented for navigation and selection, either as words|.|separated|.|like|.|this (most text editors) or as words.|separated.|like.|this (browser bar). However, I noticed that by default WordSegmenter treats this as a single word unless the period is followed by a space. Is there a way to override this behavior?
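
To make the two target behaviors concrete, here is a minimal sketch using only the standard library. It is not ICU4X API: `DotStyle` and `dot_breakpoints` are hypothetical names made up for illustration, and the sketch assumes `.` is the only delimiter of interest. It returns byte offsets in the same shape a segmenter's breakpoint iterator would produce.

```rust
/// Which side(s) of a '.' produce a boundary. Hypothetical type, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq)]
enum DotStyle {
    /// words|.|separated|.|like|.|this (most text editors)
    Editor,
    /// words.|separated.|like.|this (browser bar)
    Browser,
}

/// Returns byte-offset breakpoints for `text`, always including 0 and `text.len()`.
/// Only '.' is treated as a delimiter; nothing else about word breaking is modeled.
fn dot_breakpoints(text: &str, style: DotStyle) -> Vec<usize> {
    let mut breaks = vec![0];
    for (i, ch) in text.char_indices() {
        if ch == '.' {
            if style == DotStyle::Editor {
                // Editor style also breaks *before* the dot.
                breaks.push(i);
            }
            // Both styles break *after* the dot.
            breaks.push(i + ch.len_utf8());
        }
    }
    breaks.push(text.len());
    // Offsets are pushed in nondecreasing order, so dedup removes repeats
    // caused by leading, trailing, or consecutive dots.
    breaks.dedup();
    breaks
}
```

For "words.separated.like.this", the editor style gives [0, 5, 6, 15, 16, 20, 21, 25] and the browser style gives [0, 6, 16, 21, 25].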

recatek (Author) · 2026-01-05T17:22:35Z

Some additional digging on this myself:

It looks like the language ID "en-US-POSIX" has this behavior in ICU4C, as demonstrated here: https://icu4c-demos.unicode.org/icu-bin/icusegments#1/en_US_POSIX (the demo lists it as "English (United States, Computer)").

Typing in words.separated.like.this yields breakpoints words|.|separated|.|like|.|this. However, running this Rust code does not replicate that behavior:

use icu::locale::langid;
use icu::segmenter::WordSegmenter;
use icu::segmenter::options::WordBreakOptions;

fn main() {
    let mut options = WordBreakOptions::default();
    let langid = &langid!("en-US-POSIX");
    options.content_locale = Some(langid);
    let segmenter = WordSegmenter::try_new_auto(options).unwrap();

    let breakpoints: Vec<usize> = segmenter
        .as_borrowed()
        .segment_str("words.separated.like.this")
        .collect();

    println!("{:?}", breakpoints);
}

This prints [0, 25], the same as for other language IDs. Is there something I'm doing wrong here? Or is there a better/alternative way?
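
One possible stopgap, sketched below, is to post-process whatever breakpoints the segmenter returns and merge in extra boundaries around each `.`. This is an assumption-laden workaround, not something ICU4X provides: `refine_at_dots` is a hypothetical helper, it takes the breakpoints as a plain slice so it does not depend on any segmenter API, and it produces the editor-style split only.

```rust
/// Merges a boundary before and after every '.' into an existing breakpoint list.
/// Hypothetical helper: `breakpoints` is assumed to hold byte offsets into `text`,
/// e.g. the values collected from a word segmenter.
fn refine_at_dots(text: &str, breakpoints: &[usize]) -> Vec<usize> {
    let mut refined: Vec<usize> = breakpoints.to_vec();
    for (i, ch) in text.char_indices() {
        if ch == '.' {
            refined.push(i); // boundary before the dot
            refined.push(i + ch.len_utf8()); // boundary after the dot
        }
    }
    // Restore ascending order and drop boundaries that were already present.
    refined.sort_unstable();
    refined.dedup();
    refined
}
```

Passing "words.separated.like.this" with the [0, 25] result from above yields [0, 5, 6, 15, 16, 20, 21, 25]. Note that this also splits inputs such as "3.14" or "e.g." at every dot, which the default rules deliberately avoid, so it only suits identifier-like text.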

1 reply

sffc Jan 5, 2026
Maintainer

CC @macchiati

There's been discussion about segmentation profiles, which could tweak behavior for cases like this. Browser engines like Chrome already implement the behavior you requested:

[...new Intl.Segmenter("en", { granularity: "word" }).segment("hello aaa.bb.ccc world")]
// returns an array with 9 segments

I've been wanting to get Chrome's tailorings upstreamed into CLDR as a segmentation profile. Here is the ticket you can watch:

https://unicode-org.atlassian.net/browse/CLDR-15839

As for en-US-POSIX, I do see the tailoring here:

https://github.com/unicode-org/cldr/blob/main/common/segments/en_US_POSIX.xml

I'm not sure why ICU4X isn't implementing it. CC @aethanyc @makotokato
