Some additional digging on this myself: the language ID "en-US-POSIX" appears to have this behavior, as demonstrated here: https://icu4c-demos.unicode.org/icu-bin/icusegments#1/en_US_POSIX (the demo calls it "English (United States, Computer)"). Typing in

    use icu::locale::langid;
    use icu::segmenter::WordSegmenter;
    use icu::segmenter::options::WordBreakOptions;

    fn main() {
        let mut options = WordBreakOptions::default();
        let langid = &langid!("en-US-POSIX");
        options.content_locale = Some(langid);
        let segmenter = WordSegmenter::try_new_auto(options).unwrap();
        let breakpoints: Vec<usize> = segmenter
            .as_borrowed()
            .segment_str("words.separated.like.this")
            .collect();
        println!("{:?}", breakpoints);
    }

This prints
I'm trying to replicate the behavior seen in code editors, browser bars, etc., where words.separated.like.this can be effectively segmented for navigation and selection, either as words|.|separated|.|like|.|this (most text editors) or as words.|separated.|like.|this (browser bar). However, I noticed that by default WordSegmenter treats this as a single word unless the period is followed by a space. Is there a way to override this behavior?
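To make the two target segmentations concrete, here is a minimal standard-library sketch that computes the desired byte offsets by hand. It is not the ICU4X API and does not use WordSegmenter at all. The function names editor_breaks, browser_breaks, and render are hypothetical, and the sketch assumes '.' is the only separator of interest.

```rust
/// Editor style: a break before and after every '.', plus start and end.
/// (Hypothetical helper; plain string scanning, no ICU4X involved.)
fn editor_breaks(s: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    for (i, c) in s.char_indices() {
        if c == '.' {
            // Skip the "before" break when it coincides with the previous one
            // (start of string or consecutive periods).
            if *breaks.last().unwrap() != i {
                breaks.push(i);
            }
            breaks.push(i + 1); // '.' is one byte, so i + 1 is a char boundary
        }
    }
    if *breaks.last().unwrap() != s.len() {
        breaks.push(s.len());
    }
    breaks
}

/// Browser-bar style: a break only after every '.', plus start and end.
fn browser_breaks(s: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    for (i, c) in s.char_indices() {
        if c == '.' {
            breaks.push(i + 1);
        }
    }
    if *breaks.last().unwrap() != s.len() {
        breaks.push(s.len());
    }
    breaks
}

/// Joins the segments delimited by `breaks` with '|' for display.
fn render(s: &str, breaks: &[usize]) -> String {
    breaks
        .windows(2)
        .map(|w| &s[w[0]..w[1]])
        .collect::<Vec<_>>()
        .join("|")
}
```

Calling render(s, &editor_breaks(s)) and render(s, &browser_breaks(s)) on words.separated.like.this reproduces the two forms written out above. In principle the same offsets could be merged with whatever breakpoints a word segmenter returns, though that combination is not shown here.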