Skip to content

add TokenFilter support #87

@rustmailer

Description

@rustmailer

Hey, could you add TokenFilter support to LinderaTokenizer so it can wrap something like SimpleTokenizer? That way, we could use it like below, making it better for mixed English-Chinese/Korean/Japanese text without changing LinderaTokenStream.

impl Default for TokenizerManager {
    /// Creates an `TokenizerManager` prepopulated with
    /// the default pre-configured tokenizers of `tantivy`.
    fn default() -> TokenizerManager {
        let manager = TokenizerManager::new();
        manager.register("raw", RawTokenizer::default());
        manager.register(
            "default",
            TextAnalyzer::builder(SimpleTokenizer::default())
                .filter(RemoveLongFilter::limit(40))
                .filter(LowerCaser)
                .build(),
        );
        manager.register(
            "en_stem",
            TextAnalyzer::builder(SimpleTokenizer::default())
                .filter(RemoveLongFilter::limit(40))
                .filter(LowerCaser)
                .filter(Stemmer::new(Language::English))
                .build(),
        );
        manager.register("whitespace", WhitespaceTokenizer::default());
        manager
    }
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions