Open
Description
Hey, could you add TokenFilter support to LinderaTokenizer so that it can wrap another tokenizer such as SimpleTokenizer? We could then use it as shown below, which would handle mixed English and Chinese/Korean/Japanese text better without changing LinderaTokenStream.
impl Default for TokenizerManager {
    /// Creates a `TokenizerManager` prepopulated with
    /// the default pre-configured tokenizers of `tantivy`.
    fn default() -> TokenizerManager {
        let manager = TokenizerManager::new();
        manager.register("raw", RawTokenizer::default());
        manager.register(
            "default",
            TextAnalyzer::builder(SimpleTokenizer::default())
                .filter(RemoveLongFilter::limit(40))
                .filter(LowerCaser)
                .build(),
        );
        manager.register(
            "en_stem",
            TextAnalyzer::builder(SimpleTokenizer::default())
                .filter(RemoveLongFilter::limit(40))
                .filter(LowerCaser)
                .filter(Stemmer::new(Language::English))
                .build(),
        );
        manager.register("whitespace", WhitespaceTokenizer::default());
        manager
    }
}
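To make the request more concrete, here is a rough, std-only sketch of the wrapping idea. Nothing in it is tantivy's or lindera-tantivy's actual API: the `Tokenize` trait, `SimpleLike`, `PerCharCjk`, `Mixed`, `is_cjk` and `lowercased` are all hypothetical names, and `PerCharCjk` is only a placeholder for real dictionary-based segmentation. The idea is to split the input into CJK and non-CJK runs, hand each run to the matching tokenizer, and then apply filters to the combined output.

```rust
/// Hypothetical stand-in for a tokenizer (NOT tantivy's `Tokenizer` trait).
trait Tokenize {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Splits on non-alphanumeric characters, loosely like `SimpleTokenizer`.
struct SimpleLike;

impl Tokenize for SimpleLike {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|s| !s.is_empty())
            .map(str::to_string)
            .collect()
    }
}

/// Placeholder for a dictionary-based segmenter: one token per character.
struct PerCharCjk;

impl Tokenize for PerCharCjk {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect()
    }
}

/// Rough script check (assumption: these three blocks are enough for a demo).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF       // Hiragana and Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

/// Hypothetical wrapper: routes CJK runs to `cjk`, everything else to `other`.
struct Mixed<A, B> {
    cjk: A,
    other: B,
}

impl<A: Tokenize, B: Tokenize> Mixed<A, B> {
    /// Tokenizes one single-script run with the matching inner tokenizer.
    fn flush(&self, run: &str, run_is_cjk: bool, out: &mut Vec<String>) {
        if run_is_cjk {
            out.extend(self.cjk.tokenize(run));
        } else {
            out.extend(self.other.tokenize(run));
        }
    }
}

impl<A: Tokenize, B: Tokenize> Tokenize for Mixed<A, B> {
    fn tokenize(&self, text: &str) -> Vec<String> {
        let mut out = Vec::new();
        let mut start = 0;
        let mut current: Option<bool> = None;
        for (i, c) in text.char_indices() {
            let k = is_cjk(c);
            if let Some(prev) = current {
                if prev != k {
                    // Script changed: emit the run that just ended.
                    self.flush(&text[start..i], prev, &mut out);
                    start = i;
                }
            }
            current = Some(k);
        }
        if let Some(k) = current {
            self.flush(&text[start..], k, &mut out);
        }
        out
    }
}

/// Filter step applied after tokenization, in the spirit of `LowerCaser`.
fn lowercased(tokens: Vec<String>) -> Vec<String> {
    tokens.into_iter().map(|t| t.to_lowercase()).collect()
}
```

With something like this, `lowercased(Mixed { cjk: PerCharCjk, other: SimpleLike }.tokenize(text))` would run the English parts through the simple tokenizer and the filter chain, while the CJK parts go through the segmenter. In the real crate the filters would presumably be attached through `TextAnalyzer::builder(...).filter(...)` as in the snippet above.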