Korean tokenizer plugin: garu-minisearch-tokenizer #312

Description

@ongjin

Hi @lucaong 👋

I published a Korean morphological tokenizer for MiniSearch: garu-minisearch-tokenizer.

The default whitespace tokenizer doesn't really work for Korean: particles glue onto nouns, so a query for 학교 misses 학교에, 학교를, etc. Verb inflections like 먹었다 and 먹는다 never match each other either. This plugin replaces the tokenizer with proper morphological analysis via garu-ko, a 1.9 MB WASM analyzer.
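To make the mismatch concrete, here is a minimal sketch of the problem. It uses a plain whitespace split as a simplified stand-in for the default tokenizer (an assumption for illustration; `whitespaceTokenize` is a hypothetical helper, not MiniSearch code):

```javascript
// Simplified stand-in for a whitespace-based tokenizer (assumption, not MiniSearch's code).
const whitespaceTokenize = (text) => text.split(/\s+/).filter(Boolean)

// "학교에 갔다" = "went to school"; the particle 에 is attached to the noun 학교.
const docTokens = whitespaceTokenize('학교에 갔다')
const queryTokens = whitespaceTokenize('학교')

// Exact term matching: every query term must appear among the document terms.
const matches = queryTokens.every((q) => docTokens.includes(q))

console.log(docTokens)
console.log(matches)
```

The document is indexed under 학교에, so the bare noun 학교 finds nothing unless prefix or fuzzy search is turned on.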

Usage drops right into your existing tokenize option:

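import MiniSearch from 'minisearch'
import { createTokenizer } from 'garu-minisearch-tokenizer'

// createTokenizer() is awaited, so run this in an ES module (top-level await) or inside an async function
const tokenize = await createTokenizer()
// Hand the tokenizer to MiniSearch through its tokenize option
const ms = new MiniSearch({ fields: ['title', 'body'], tokenize })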
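For anyone unfamiliar with the `tokenize` option: it is a function that takes a string and returns an array of terms. The sketch below shows that shape with a toy suffix stripper. `toyTokenize` and `PARTICLES` are hypothetical names invented for illustration, and this is not how garu-ko works: real morphological analysis is dictionary-based and handles far more than a fixed particle list.

```javascript
// Hypothetical, incomplete particle list, for illustration only.
const PARTICLES = ['에서', '에', '를', '을', '는', '은', '이', '가']

// Same shape as a MiniSearch `tokenize` function: string in, array of terms out.
function toyTokenize(text) {
  return text.split(/\s+/).filter(Boolean).map((word) => {
    // Strip the first listed particle the word ends with, keeping at least one character.
    const particle = PARTICLES.find((p) => word.length > p.length && word.endsWith(p))
    return particle ? word.slice(0, -particle.length) : word
  })
}

const indexed = toyTokenize('학교에 갔다')
const query = toyTokenize('학교')

console.log(indexed)
console.log(query.every((q) => indexed.includes(q)))
```

Naive stripping like this will also cut words that merely end in one of those syllables, and it does nothing for verb inflections, which is the gap a real analyzer fills.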

Would you consider adding a mention in the README or wiki for Korean-language MiniSearch users? Happy to send a PR if that helps.

Links:
