
[RFC] Korean Tokenization: Jamo Decomposition as Pre-tokenizer (A Blueprint for Compositional Scripts) #1278

@nicezic

Description

API Impact

Korean text currently consumes 2-3x more tokens than semantically equivalent English text due to how syllable blocks are tokenized. This directly impacts API users: higher costs, faster context window exhaustion, and degraded multilingual performance — all without any gain in semantic resolution. A simple preprocessing step (jamo decomposition) could substantially reduce token overhead for Korean and other compositional scripts.

Summary

Korean (Hangul) characters should be decomposed into their constituent jamo (consonant/vowel components) before tokenization. While this proposal focuses on Korean, this sub-character preprocessing step serves as a highly scalable blueprint for improving token efficiency and pattern generalization across other compositional scripts (e.g., Tibetan, Devanagari) in multilingual models.

This is low-hanging fruit. Current approaches expand vocabulary tables by tens of thousands of entries to brute-force Korean coverage, at significant compute and memory cost. A roughly 10-line structural preprocessing step in the pre-tokenizer can achieve better coverage, cut token overhead substantially, and improve multilingual performance at the same time.

Background: How Hangul Works

Hangul is a compositional writing system. Every syllable block is a combination of 2-3 components:

한 = ㅎ (initial) + ㅏ (vowel) + ㄴ (final)
글 = ㄱ (initial) + ㅡ (vowel) + ㄹ (final)
가 = ㄱ (initial) + ㅏ (vowel)          (no final)

The entire system uses only 68 jamo slots: 19 initial consonants + 21 vowels + 28 finals (counting the empty final).

These combine into 11,172 possible syllable blocks in Unicode (U+AC00–U+D7A3). The composition is pure arithmetic:

code = 0xAC00 + (initial * 588) + (vowel * 28) + final
# Decomposition is the reverse — 3 lines of code
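As a concrete check of this arithmetic, the jamo indices below follow the standard Unicode ordering (ㅎ is initial #18, ㅏ is vowel #0, ㄴ is final #4):

```python
# Compose 한 from its jamo indices: ㅎ (initial #18) + ㅏ (vowel #0) + ㄴ (final #4).
code = 0xAC00 + (18 * 588) + (0 * 28) + 4
assert chr(code) == "한"  # U+D55C

# 가 has no final consonant: ㄱ (initial #0) + ㅏ (vowel #0) + empty final (0).
assert chr(0xAC00 + (0 * 588) + (0 * 28) + 0) == "가"  # U+AC00, the first syllable block
```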

The Problem

Current BPE/Unigram tokenizers treat each of the 11,172 syllable blocks as independent symbols with no structural relationship:

한 (U+D55C) → token #8234
할 (U+D560) → token #9102
함 (U+D568) → token #7891

These three share initial ㅎ + vowel ㅏ (differing only in final consonant), but the tokenizer sees zero relationship. It's like treating "cat", "car", "can" as completely unrelated symbols instead of recognizing the "ca-" prefix.
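The shared structure is directly observable with the standard library: Unicode NFD normalization applies exactly this canonical Hangul decomposition, making the common ㅎ + ㅏ prefix of the three syllables visible as a shared two-jamo prefix.

```python
import unicodedata

# NFD decomposes each precomposed syllable into conjoining jamo.
decomposed = {s: unicodedata.normalize("NFD", s) for s in "한할함"}

# All three decompositions begin with the same initial + vowel pair.
prefixes = {jamo[:2] for jamo in decomposed.values()}
assert len(prefixes) == 1
```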

Result:

  • Korean text consumes 2-3x more tokens than equivalent English text.
  • Pattern generalization across similar-sounding words is lost.
  • Vocabulary table wastes space on 11,172 entries vs 68 jamo.
  • Korean users pay more for API usage for the same semantic content.

Proposed Solution

Add an optional Hangul jamo decomposition step in the normalizer/pre-tokenizer pipeline:

Input:    "한글 처리"
Current:  [한] [글] [처] [리]              → 4+ tokens (opaque blocks)
Proposed: [ㅎㅏㄴ] [ㄱㅡㄹ] [ㅊㅓ] [ㄹㅣ]  → jamo sequences (composable)

The decomposition is trivial — Unicode arithmetic, ~10 lines of code:

def decompose_hangul(char: str) -> str:
    """Decompose a precomposed Hangul syllable into conjoining jamo."""
    code = ord(char) - 0xAC00
    if not (0 <= code < 11172):
        return char  # not a Hangul syllable block; pass through unchanged
    initial = code // 588          # 588 = 21 vowels * 28 final slots
    vowel = (code % 588) // 28
    final = code % 28              # 0 means no final consonant
    # Conjoining jamo blocks: initials start at U+1100, vowels at U+1161,
    # finals at U+11A8 (final index 1 maps to U+11A8).
    result = chr(0x1100 + initial) + chr(0x1161 + vowel)
    if final > 0:
        result += chr(0x11A7 + final)
    return result
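Notably, Python's standard library already implements this exact algorithm: Unicode NFD normalization decomposes precomposed syllable blocks into conjoining jamo, and NFC recomposes them losslessly. A quick sanity check:

```python
import unicodedata

text = "한글 처리"
jamo = unicodedata.normalize("NFD", text)  # syllables -> conjoining jamo
# 한 and 글 expand to 3 jamo each, 처 and 리 to 2 each, plus the space.
assert len(jamo) == 11
assert unicodedata.normalize("NFC", jamo) == text  # lossless round trip
```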

Expected Benefits

Metric                         | Current                  | With Jamo Decomposition
Vocabulary entries for Korean  | ~11,172 syllable blocks  | 68 jamo
Token count for Korean text    | 2-3x vs English          | Potentially close to parity
Cross-word pattern sharing     | None (opaque blocks)     | Structural (shared jamo = shared tokens)
Implementation cost            | n/a                      | Minimal (Unicode arithmetic)

Considerations

  • Token count per Korean word may increase (1 syllable → 2-3 jamo tokens), but vocabulary compression + pattern generalization should offset this.
  • Reconstruction is lossless — jamo → syllable block is deterministic.
  • This could be opt-in (language-detected or user-specified).
  • Multilingual Scalability: The same sub-character decomposition approach can be adapted to other compositional scripts (e.g., Tibetan, Devanagari, Thai) for broader multilingual token optimization.
  • Empirical Validation: Research such as KR-BERT (Lee et al., 2020) has already demonstrated the effectiveness of sub-character BPE for Korean. Modern global tokenizers still rely on byte-level BPE (BBPE), which ignores the mathematical compositionality of these scripts. Supporting jamo decomposition in the pre-tokenizer is a low-cost architectural win.
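The lossless-reconstruction claim above is easy to verify exhaustively: composition is just the inverse arithmetic. A minimal sketch (the helper names `compose_hangul` and `decompose_indices` are illustrative, not part of any existing API):

```python
def compose_hangul(initial: int, vowel: int, final: int = 0) -> str:
    """Inverse of the decomposition arithmetic: jamo indices -> syllable block."""
    return chr(0xAC00 + initial * 588 + vowel * 28 + final)

def decompose_indices(char: str) -> tuple:
    """Syllable block -> (initial, vowel, final) jamo indices."""
    code = ord(char) - 0xAC00
    return code // 588, (code % 588) // 28, code % 28

# Round-trip all 11,172 syllable blocks: decomposition is deterministic
# and composition recovers every original exactly.
assert all(
    compose_hangul(*decompose_indices(chr(c))) == chr(c)
    for c in range(0xAC00, 0xAC00 + 11172)
)
```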

References & Empirical Evidence

  • KR-BERT (Lee et al., 2020): Shows that sub-character BPE drastically reduces vocabulary size while outperforming character-level models. (arXiv:2008.03979)
  • Jeon et al., 2023: "Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition" (arXiv:2311.03928)
  • ACL 2024 Findings: "Korean Character Representations Based on the Combination Rules of Subcharacters" (ACL Anthology)
  • Unicode Hangul Syllable Decomposition: Official algorithmic specification. (Unicode Standard, Chapter 3.12)
  • Reference Implementation: MorphSubDecomp-Korean — Sub-character decomposition pre-tokenizer pipeline
