Data struct with codepoint builder for radicals #7805
robertbastian merged 8 commits into unicode-org:main
This is the third PR doing the same thing. There is significant discussion in #7646; why do you keep creating new PRs?
Because there's a merge conflict and I don't think I'm supposed to git rebase.
Ideally you'd resolve merge conflicts using a merge commit, but rebasing is better than creating a new PR that drops all previous context.
Got it, thank you. I'll follow this practice in the future.
| mod lstm; | ||
| mod radical; | ||
| pub use lstm::*; | ||
| pub use radical::*; |
| mod lstm; | |
| mod radical; | |
| pub use lstm::*; | |
| pub use radical::*; | |
| mod lstm; | |
| #[cfg(feature = "unstable")] | |
| pub mod radical; | |
| pub use lstm::*; |
| ); | ||
|
|
||
| icu_provider::data_marker!( | ||
| /// `SegmenterUnihanRadicalV1` |
@robertbastian commented on the old PR about these docs. Please write what the marker represents.
| use std::collections::HashMap; | ||
|
|
||
| fn load_irg_from_baked() -> &'static UnihanIrgData<'static> { | ||
| Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1 |
Inline this. You can even inline it all the way into Predictor::for_test.
|
|
||
| let unihan_cache = self.unihan()?; | ||
| let ucd = self.ucd()?; | ||
| let irg_map = unihan_cache.irg_sources(ucd)?; |
I still disagree with the meat of this implementation living outside this module, with the need for it to be cached, and with the overhead of creating an intermediate LiteMap.
|
|
||
| let unihan_cache = self.unihan()?; | ||
| let ucd = self.ucd()?; | ||
| let irg_map = unihan_cache.irg_sources(ucd)?; |
You still have irg naming everywhere; these variables and functions should use radicals naming now.
| } | ||
|
|
||
| #[test] | ||
| fn test_chinese_irg_values_trie() { |
This is a good test, because it tests the DataProvider<SegmenterUnihanRadicalV1> implementation. The test above tests the same thing, but by accessing the internal cache; why?
| identifier_status: &AbstractFs, | ||
| trie_type: crate::TrieType, | ||
| ) -> Result<UnihanRadicalsData<'static>, DataError> { | ||
| let identifier_status = identifier_status.read_to_string("security/IdentifierStatus.txt")?; | ||
| let identifier_status = identifier_status |
| identifier_status: &AbstractFs, | |
| trie_type: crate::TrieType, | |
| ) -> Result<UnihanRadicalsData<'static>, DataError> { | |
| let identifier_status = identifier_status.read_to_string("security/IdentifierStatus.txt")?; | |
| let identifier_status = identifier_status | |
| ucd: &AbstractFs, | |
| trie_type: crate::TrieType, | |
| ) -> Result<UnihanRadicalsData<'static>, DataError> { | |
| let identifier_status = ucd | |
| .read_to_string("security/IdentifierStatus.txt")? |
| u32::from_str_radix(end, 16).expect("Invalid IdentifierStatus codepoint format"), | ||
| ) | ||
| }) | ||
| .collect::<Vec<_>>(); |
You should store this in a CodePointInversionList, which is designed to store a set of Unicode ranges and makes the lookup a lot easier than what you're doing with partition_point.
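To illustrate why a range-set structure simplifies the lookup, here is a minimal std-only sketch of the inversion-list idea. It is not the CodePointInversionList API from icu_collections: the names InversionList, from_inclusive_ranges, and parse_range are hypothetical, and the sample lines only imitate the `start..end ; Status # comment` layout of IdentifierStatus.txt.

```rust
/// A set of code points stored as sorted boundaries
/// [start0, end0, start1, end1, ...], each range being start..end (end exclusive).
/// Hypothetical type for illustration; not the icu_collections API.
#[derive(Debug, Default)]
struct InversionList {
    bounds: Vec<u32>,
}

impl InversionList {
    /// Builds the set from inclusive ranges, which may be unsorted, adjacent, or overlapping.
    /// Assumes every `end` is a valid code point (at most 0x10FFFF), so `end + 1` cannot overflow.
    fn from_inclusive_ranges(mut ranges: Vec<(u32, u32)>) -> Self {
        ranges.sort_unstable();
        let mut bounds: Vec<u32> = Vec::new();
        for (start, end) in ranges {
            let end_exclusive = end + 1;
            if let Some(last) = bounds.last_mut() {
                if start <= *last {
                    // The new range touches or overlaps the previous one: extend it.
                    *last = (*last).max(end_exclusive);
                    continue;
                }
            }
            bounds.extend([start, end_exclusive]);
        }
        Self { bounds }
    }

    /// A code point is in the set iff an odd number of boundaries are <= it.
    fn contains(&self, cp: u32) -> bool {
        self.bounds.partition_point(|&b| b <= cp) % 2 == 1
    }
}

/// Parses one data line such as `0030..0039 ; Allowed # ...` into an inclusive range.
/// Returns `None` for blank lines, comment lines, and malformed hex.
fn parse_range(line: &str) -> Option<(u32, u32)> {
    let data = line.split('#').next()?.trim();
    let field = data.split(';').next()?.trim();
    if field.is_empty() {
        return None;
    }
    // A single code point is treated as a one-element range.
    let (start, end) = field.split_once("..").unwrap_or((field, field));
    Some((
        u32::from_str_radix(start, 16).ok()?,
        u32::from_str_radix(end, 16).ok()?,
    ))
}

fn main() {
    // Made-up sample input in the same layout as the data file.
    let text = "# comment\n0041..005A ; Allowed # A-Z\n0030..0039 ; Allowed\n00B7 ; Allowed\n";
    let ranges: Vec<(u32, u32)> = text.lines().filter_map(parse_range).collect();
    let set = InversionList::from_inclusive_ranges(ranges);

    assert!(set.contains('A' as u32));
    assert!(set.contains(0xB7));
    assert!(!set.contains('a' as u32));
}
```

The binary search still exists, but it lives inside one `contains` method on a flat boundary list rather than at each call site, which is the convenience the review comment points to.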
| icu_provider_export = { workspace = true, features = ["fs_exporter", "baked_exporter", "rayon"] } | ||
| icu_provider = { workspace = true, features = ["deserialize_postcard_1"] } | ||
| icu_segmenter = { path = "../../components/segmenter", features = ["lstm"] } | ||
| icu_segmenter = { path = "../../components/segmenter", features = ["lstm", "unstable"] } |
Is this needed? There doesn't seem to be any test-only code that uses unstable segmenter code.
You do need to enable the feature when this crate's unstable feature is enabled, which you'll have to do through the icu crate.
05d7440 to be766e8
Issue: #6941
Modified from #7722 and #7646
Changelog
icu_segmenter: Add unstable Unihan radical provider data and baked support
  icu_segmenter::provider::UnihanIrgData<'data>, icu_segmenter::provider::SegmenterUnihanRadicalV1
  icu_segmenter::provider::Baked::SINGLETON_SEGMENTER_UNIHAN_RADICAL_V1
  experimental_segmenter example now uses radaboost for the Chinese radical model and adds thadaboost for Thai
icu_provider_source: Add Unihan radical trie generation for icu_segmenter::provider::SegmenterUnihanRadicalV1
  SourceDataProvider can now load this marker from Unihan IRG data
icu_provider_registry: Export icu_segmenter::provider::SegmenterUnihanRadicalV1