Skip to content

Data struct with codepoint builder for radicals#7805

Merged
robertbastian merged 8 commits intounicode-org:mainfrom
opnuub:radicals-struct
Apr 20, 2026
Merged

Data struct with codepoint builder for radicals#7805
robertbastian merged 8 commits intounicode-org:mainfrom
opnuub:radicals-struct

Conversation

@opnuub
Copy link
Copy Markdown
Contributor

@opnuub opnuub commented Mar 24, 2026

Issue: #6941

Modified from #7722 and #7646

Changelog

  • icu_segmenter: Add unstable Unihan radical provider data and baked support
    • New types: icu_segmenter::provider::UnihanIrgData<'data>, icu_segmenter::provider::SegmenterUnihanRadicalV1
    • New associated const: icu_segmenter::provider::Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1
    • The experimental_segmenter example now uses radaboost for the Chinese radical model and adds thadaboost for Thai
  • icu_provider_source: Add Unihan radical trie generation for icu_segmenter::provider::SegmenterUnihanRadicalV1
    • SourceDataProvider can now load this marker from Unihan IRG data
  • icu_provider_registry: Export icu_segmenter::provider::SegmenterUnihanRadicalV1

@robertbastian
Copy link
Copy Markdown
Member

This is the third PR doing the same thing. There is significant discussion in #7646, why do you keep creating new PRs?

@opnuub
Copy link
Copy Markdown
Contributor Author

opnuub commented Apr 7, 2026

This is the third PR doing the same thing. There is significant discussion in #7646, why do you keep creating new PRs?

because there's a merge conflict and I don't think I'm supposed to git rebase

@robertbastian
Copy link
Copy Markdown
Member

Ideally you'd resolve merge conflicts using a merge commit, but rebasing is better than creating a new PR that drops all previous context.

@opnuub
Copy link
Copy Markdown
Contributor Author

opnuub commented Apr 7, 2026

Ideally you'd resolve merge conflicts using a merge commit, but rebasing is better than creating a new PR that drops all previous context.

Got it, thank you. I'll follow this practice in the future.

Comment on lines +18 to +21
mod lstm;
mod radical;
pub use lstm::*;
pub use radical::*;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mod lstm;
mod radical;
pub use lstm::*;
pub use radical::*;
mod lstm;
#[cfg(feature = "unstable")]
pub mod radical;
pub use lstm::*;

);

icu_provider::data_marker!(
/// `SegmenterUnihanRadicalV1`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian commented on the old PR about these docs. Please write what the marker represents.

use std::collections::HashMap;

fn load_irg_from_baked() -> &'static UnihanIrgData<'static> {
Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline. you can even inline this all the way into Predictor::for_test

Comment thread provider/source/src/segmenter/unihan.rs Outdated

let unihan_cache = self.unihan()?;
let ucd = self.ucd()?;
let irg_map = unihan_cache.irg_sources(ucd)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still disagree with the meat of this implementation living outside this module, with the need of this having to be cached, and with the overhead of creating an intermediate LiteMap

Comment thread provider/source/src/segmenter/unihan.rs Outdated

let unihan_cache = self.unihan()?;
let ucd = self.ucd()?;
let irg_map = unihan_cache.irg_sources(ucd)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you still have irg naming everywhere, these variables/functions should be called radicals now

Comment thread provider/source/src/segmenter/unihan.rs Outdated
}

#[test]
fn test_chinese_irg_values_trie() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good test, because it tests the DataProvider<SegmenterUnihanRadicalV1> implementation. the test above test the same, however, but through accessing the internal cache; why?

Comment thread provider/source/src/segmenter/unihan.rs Outdated
Comment on lines +20 to +24
identifier_status: &AbstractFs,
trie_type: crate::TrieType,
) -> Result<UnihanRadicalsData<'static>, DataError> {
let identifier_status = identifier_status.read_to_string("security/IdentifierStatus.txt")?;
let identifier_status = identifier_status
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
identifier_status: &AbstractFs,
trie_type: crate::TrieType,
) -> Result<UnihanRadicalsData<'static>, DataError> {
let identifier_status = identifier_status.read_to_string("security/IdentifierStatus.txt")?;
let identifier_status = identifier_status
ucd: &AbstractFs,
trie_type: crate::TrieType,
) -> Result<UnihanRadicalsData<'static>, DataError> {
let identifier_status = ucd
.read_to_string("security/IdentifierStatus.txt")?

Comment thread provider/source/src/segmenter/unihan.rs Outdated
u32::from_str_radix(end, 16).expect("Invalid IdentifierStatus codepoint format"),
)
})
.collect::<Vec<_>>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should store this in a CodePointInversionList, that's designed to store a set of unicode ranges, and makes the lookup a lot easier than what you're doing with partition_point

Comment thread provider/source/Cargo.toml Outdated
icu_provider_export = { workspace = true, features = ["fs_exporter", "baked_exporter", "rayon"] }
icu_provider = { workspace = true, features = ["deserialize_postcard_1"] }
icu_segmenter = { path = "../../components/segmenter", features = ["lstm"] }
icu_segmenter = { path = "../../components/segmenter", features = ["lstm", "unstable"] }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed? there doesn't seem to be any test-only code that uses unstable segmenter code

you do need to enable the feature when this crate's unstable feature is enabled, which you'll have to do through the icu crate

@opnuub opnuub force-pushed the radicals-struct branch 2 times, most recently from 05d7440 to be766e8 Compare April 16, 2026 18:56
@opnuub opnuub requested review from robertbastian and sffc April 19, 2026 09:53
@robertbastian robertbastian merged commit c6670f4 into unicode-org:main Apr 20, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants