
[RFC] Korean Tokenization: Jamo Decomposition as Pre-tokenizer (A Blueprint for Compositional Scripts) #1278

@nicezic

Description

API Impact

Korean text currently consumes 2-3x more tokens than semantically equivalent English text due to how syllable blocks are tokenized. This directly impacts API users: higher costs, faster context window exhaustion, and degraded multilingual performance — all without any gain in semantic resolution. A simple preprocessing step (jamo decomposition) could substantially reduce token overhead for Korean and other compositional scripts.

Summary

Korean (Hangul) characters should be decomposed into their constituent jamo (consonant/vowel components) before tokenization. While this proposal focuses on Korean, this sub-character preprocessing step serves as a highly scalable blueprint for improving token efficiency and pattern generalization across other compositional scripts (e.g., Tibetan, Devanagari) in multilingual models.

This is low-hanging fruit. Current approaches expand vocabulary tables by tens of thousands of entries to brute-force Korean coverage, at significant compute and memory cost. A roughly 10-line structural preprocessing step in the pre-tokenizer can achieve better coverage, cut token overhead substantially, and improve multilingual performance at the same time.

Background: How Hangul Works

Hangul is a compositional writing system. Every syllable block is a combination of 2-3 components:

한 = ㅎ (initial) + ㅏ (vowel) + ㄴ (final)
글 = ㄱ (initial) + ㅡ (vowel) + ㄹ (final)
가 = ㄱ (initial) + ㅏ (vowel)          (no final)

The entire system uses only 68 jamo slots: 19 initial consonants + 21 vowels + 28 finals (counting the empty final).

These combine into 11,172 possible syllable blocks in Unicode (U+AC00–U+D7A3). The composition is pure arithmetic:

code = 0xAC00 + (initial * 588) + (vowel * 28) + final
# Decomposition is the reverse — 3 lines of code
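As a concrete check of this arithmetic, the jamo indices below follow the standard Unicode ordering (ㅎ is initial #18, ㅏ is vowel #0, ㄴ is final #4):

```python
# Compose 한 from its jamo indices: ㅎ (initial #18) + ㅏ (vowel #0) + ㄴ (final #4).
code = 0xAC00 + (18 * 588) + (0 * 28) + 4
assert chr(code) == "한"  # U+D55C

# 가 has no final consonant: ㄱ (initial #0) + ㅏ (vowel #0) + empty final (0).
assert chr(0xAC00 + (0 * 588) + (0 * 28) + 0) == "가"  # U+AC00, the first syllable block
```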

The Problem

Current BPE/Unigram tokenizers treat each of the 11,172 syllable blocks as independent symbols with no structural relationship:

한 (U+D55C) → token #8234
할 (U+D560) → token #9102
함 (U+D568) → token #7891

These three share initial ㅎ + vowel ㅏ (differing only in final consonant), but the tokenizer sees zero relationship. It's like treating "cat", "car", "can" as completely unrelated symbols instead of recognizing the "ca-" prefix.
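The shared structure is directly observable with the standard library: Unicode NFD normalization applies exactly this canonical Hangul decomposition, making the common ㅎ + ㅏ prefix of the three syllables visible as a shared two-jamo prefix.

```python
import unicodedata

# NFD decomposes each precomposed syllable into conjoining jamo.
decomposed = {s: unicodedata.normalize("NFD", s) for s in "한할함"}

# All three decompositions begin with the same initial + vowel pair.
prefixes = {jamo[:2] for jamo in decomposed.values()}
assert len(prefixes) == 1
```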

Result:

  • Korean text consumes 2-3x more tokens than equivalent English text.
  • Pattern generalization across similar-sounding words is lost.
  • Vocabulary table wastes space on 11,172 entries vs 68 jamo.
  • Korean users pay more for API usage for the same semantic content.

Proposed Solution

Add an optional Hangul jamo decomposition step in the normalizer/pre-tokenizer pipeline:

Input:    "한글 처리"
Current:  [한] [글] [처] [리]              → 4+ tokens (opaque blocks)
Proposed: [ㅎㅏㄴ] [ㄱㅡㄹ] [ㅊㅓ] [ㄹㅣ]  → jamo sequences (composable)

The decomposition is trivial — Unicode arithmetic, ~10 lines of code:

def decompose_hangul(char: str) -> str:
    """Decompose a precomposed Hangul syllable into conjoining jamo."""
    code = ord(char) - 0xAC00
    if not (0 <= code < 11172):
        return char  # not a Hangul syllable block; pass through unchanged
    initial = code // 588          # 588 = 21 vowels * 28 final slots
    vowel = (code % 588) // 28
    final = code % 28              # 0 means no final consonant
    # Conjoining jamo blocks: initials start at U+1100, vowels at U+1161,
    # finals at U+11A8 (final index 1 maps to U+11A8).
    result = chr(0x1100 + initial) + chr(0x1161 + vowel)
    if final > 0:
        result += chr(0x11A7 + final)
    return result
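Notably, Python's standard library already implements this exact algorithm: Unicode NFD normalization decomposes precomposed syllable blocks into conjoining jamo, and NFC recomposes them losslessly. A quick sanity check:

```python
import unicodedata

text = "한글 처리"
jamo = unicodedata.normalize("NFD", text)  # syllables -> conjoining jamo
# 한 and 글 expand to 3 jamo each, 처 and 리 to 2 each, plus the space.
assert len(jamo) == 11
assert unicodedata.normalize("NFC", jamo) == text  # lossless round trip
```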

Expected Benefits

Metric                         | Current                  | With Jamo Decomposition
Vocabulary entries for Korean  | ~11,172 syllable blocks  | 68 jamo
Token count for Korean text    | 2-3x vs English          | Potentially close to parity
Cross-word pattern sharing     | None (opaque blocks)     | Structural (shared jamo = shared tokens)
Implementation cost            | n/a                      | Minimal (Unicode arithmetic)

Considerations

  • Token count per Korean word may increase (1 syllable → 2-3 jamo tokens), but vocabulary compression + pattern generalization should offset this.
  • Reconstruction is lossless — jamo → syllable block is deterministic.
  • This could be opt-in (language-detected or user-specified).
  • Multilingual Scalability: The same sub-character decomposition approach can be adapted to other compositional scripts (e.g., Tibetan, Devanagari, Thai) for broader multilingual token optimization.
  • Empirical Validation: Research such as KR-BERT (Lee et al., 2020) has already demonstrated the effectiveness of sub-character BPE for Korean. Modern global tokenizers still rely on byte-level BPE (BBPE), which ignores the mathematical compositionality of these scripts. Supporting jamo decomposition in the pre-tokenizer is a low-cost architectural win.
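The lossless-reconstruction claim above is easy to verify exhaustively: composition is just the inverse arithmetic. A minimal sketch (the helper names `compose_hangul` and `decompose_indices` are illustrative, not part of any existing API):

```python
def compose_hangul(initial: int, vowel: int, final: int = 0) -> str:
    """Inverse of the decomposition arithmetic: jamo indices -> syllable block."""
    return chr(0xAC00 + initial * 588 + vowel * 28 + final)

def decompose_indices(char: str) -> tuple:
    """Syllable block -> (initial, vowel, final) jamo indices."""
    code = ord(char) - 0xAC00
    return code // 588, (code % 588) // 28, code % 28

# Round-trip all 11,172 syllable blocks: decomposition is deterministic
# and composition recovers every original exactly.
assert all(
    compose_hangul(*decompose_indices(chr(c))) == chr(c)
    for c in range(0xAC00, 0xAC00 + 11172)
)
```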

References & Empirical Evidence

  • KR-BERT (Lee et al., 2020): Shows that sub-character BPE drastically reduces vocabulary size while outperforming character-level models. (arXiv:2008.03979)
  • Jeon et al., 2023: "Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition" (arXiv:2311.03928)
  • ACL 2024 Findings: "Korean Character Representations Based on the Combination Rules of Subcharacters" (ACL Anthology)
  • Unicode Hangul Syllable Decomposition: Official algorithmic specification. (Unicode Standard, Chapter 3.12)
  • Reference Implementation: MorphSubDecomp-Korean — Sub-character decomposition pre-tokenizer pipeline
