Korean tokenizer plugin: garu-minisearch-tokenizer #312

Hi @lucaong 👋

I published a Korean morphological tokenizer for MiniSearch: [garu-minisearch-tokenizer](https://www.npmjs.com/package/garu-minisearch-tokenizer).

The default whitespace tokenizer doesn't work well for Korean: particles attach directly to nouns, so a query for `학교` misses `학교에`, `학교를`, etc. Inflected verb forms such as `먹었다` and `먹는다` never match each other either. This plugin replaces the tokenizer with morphological analysis via [garu-ko](https://www.npmjs.com/package/garu-ko), a 1.9 MB WASM analyzer.
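To make the mismatch concrete, here is a small self-contained sketch. It does not use MiniSearch or garu-ko: `whitespaceTokenize` is a simplified stand-in for a whitespace tokenizer, and `stripParticle` is a hypothetical toy that only drops a few common trailing particles to show the kind of result a real morphological analyzer gives you. It is not how garu-ko works internally.

```ts
// Simplified stand-in for a whitespace tokenizer (assumption: not MiniSearch's actual code).
const whitespaceTokenize = (text: string): string[] =>
  text.split(/\s+/).filter((t) => t.length > 0)

// Hypothetical toy: strip one trailing particle from a token.
// A real analyzer such as garu-ko does full morphological analysis instead.
const PARTICLES = ['에', '를', '을', '은', '는', '이', '가']
const stripParticle = (token: string): string => {
  const p = PARTICLES.find((s) => token.length > s.length && token.endsWith(s))
  return p ? token.slice(0, -p.length) : token
}

// Exact-term lookup, roughly how an inverted index matches a query term.
const matches = (tokens: string[], query: string): boolean => tokens.includes(query)

const doc = '나는 학교에 간다' // "I go to school"

const rawTokens = whitespaceTokenize(doc)       // ['나는', '학교에', '간다']
const stemTokens = rawTokens.map(stripParticle) // ['나', '학교', '간다']

const rawHit = matches(rawTokens, '학교')   // false: only '학교에' is indexed
const stemHit = matches(stemTokens, '학교') // true: the particle is gone
```

With whitespace splitting alone the noun is indexed only in its particle-attached form, so the bare-noun query finds nothing.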

Usage drops right into your existing `tokenize` option:

```ts
import MiniSearch from 'minisearch'
import { createTokenizer } from 'garu-minisearch-tokenizer'

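// createTokenizer() returns a promise, so await it before building the index
const tokenize = await createTokenizer()
// Pass the resulting function as MiniSearch's `tokenize` option
const ms = new MiniSearch({ fields: ['title', 'body'], tokenize })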
```

Would you consider adding a mention in the README or wiki for Korean-language MiniSearch users? Happy to send a PR if that helps.

Links:
- npm: https://www.npmjs.com/package/garu-minisearch-tokenizer
- GitHub: https://github.com/ongjin/garu/tree/main/integrations/minisearch-tokenizer
- Garu (the analyzer): https://github.com/ongjin/garu
