Hi @lucaong,
I published a Korean morphological tokenizer for MiniSearch: garu-minisearch-tokenizer.
The default whitespace tokenizer doesn't really work for Korean: particles glue onto nouns, so a query for 학교 misses 학교에, 학교를, etc. Verb inflections like 먹었다 and 먹는다 never match each other either. This plugin replaces the tokenizer with proper morphological analysis via garu-ko, a 1.9 MB WASM analyzer.
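To make the mismatch concrete, here is a minimal sketch. It uses plain whitespace splitting and exact-term lookup as a stand-in, not MiniSearch's actual internals, and the helper names `whitespaceTokenize` and `matches` are made up for illustration:

```js
// Illustrative stand-in for a whitespace tokenizer (not MiniSearch's real implementation).
const whitespaceTokenize = (text) => text.split(/\s+/).filter(Boolean)

// Exact-term lookup: a query matches only if it equals one of the tokens.
const matches = (query, text) => whitespaceTokenize(text).includes(query)

const sentence = '나는 학교에 갔다' // "I went to school"

console.log(whitespaceTokenize(sentence)) // [ '나는', '학교에', '갔다' ]
console.log(matches('학교', sentence))    // false: the noun is fused with the particle 에
console.log(matches('학교에', sentence))  // true: only the exact surface form matches
```

A morphological tokenizer is meant to close that gap by emitting the noun stem as its own token, so the bare query can match.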
Usage drops right into your existing tokenize option:
```js
import MiniSearch from 'minisearch'
import { createTokenizer } from 'garu-minisearch-tokenizer'

const tokenize = await createTokenizer()
const ms = new MiniSearch({ fields: ['title', 'body'], tokenize })
```
Would you consider adding a mention in the README or wiki for Korean-language MiniSearch users? Happy to send a PR if that helps.
Links: