## API Impact
Korean text currently consumes 2-3x more tokens than semantically equivalent English text due to how syllable blocks are tokenized. This directly impacts API users: higher costs, faster context window exhaustion, and degraded multilingual performance — all without any gain in semantic resolution. A simple preprocessing step (jamo decomposition) could substantially reduce token overhead for Korean and other compositional scripts.
## Summary
Korean (Hangul) characters should be decomposed into their constituent jamo (consonant/vowel components) before tokenization. While this proposal focuses on Korean, this sub-character preprocessing step serves as a highly scalable blueprint for improving token efficiency and pattern generalization across other compositional scripts (e.g., Tibetan, Devanagari) in multilingual models.
This is low-hanging fruit. Current approaches expand vocabulary tables by tens of thousands of entries to brute-force Korean coverage — at substantial compute and memory cost. A roughly 10-line structural preprocessing step in the pre-tokenizer can achieve better coverage, cut token overhead substantially, and improve multilingual performance at the same time.
## Background: How Hangul Works
Hangul is a compositional writing system. Every syllable block is a combination of 2-3 components:
```
한 = ㅎ (initial) + ㅏ (vowel) + ㄴ (final)
글 = ㄱ (initial) + ㅡ (vowel) + ㄹ (final)
가 = ㄱ (initial) + ㅏ (vowel)              (no final)
```
The entire system uses only 68 jamo: 19 initials + 21 vowels + 28 finals (including none).
These combine into 11,172 possible syllable blocks in Unicode (U+AC00–U+D7A3). The composition is pure arithmetic:
```
code = 0xAC00 + (initial * 588) + (vowel * 28) + final
```
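As a quick sanity check (illustrative only), the arithmetic reproduces the code point of 한 and the total block count:

```python
# Jamo indices for 한: initial ㅎ (18 of 19), vowel ㅏ (0 of 21), final ㄴ (4 of 28)
initial, vowel, final = 18, 0, 4
code = 0xAC00 + (initial * 588) + (vowel * 28) + final
print(hex(code), chr(code))  # 0xd55c 한

# 19 initials × 21 vowels × 28 finals (incl. "no final") = 11,172 blocks,
# exactly the size of the U+AC00–U+D7A3 range
assert 19 * 21 * 28 == 0xD7A3 - 0xAC00 + 1 == 11172
```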
Decomposition is simply the reverse — about three lines of arithmetic.

## The Problem
Current BPE/Unigram tokenizers treat each of the 11,172 syllable blocks as independent symbols with no structural relationship:
```
한 (U+D55C) → token #8234
할 (U+D560) → token #9102
함 (U+D568) → token #7891
```
These three share initial ㅎ + vowel ㅏ (differing only in final consonant), but the tokenizer sees zero relationship. It's like treating "cat", "car", "can" as completely unrelated symbols instead of recognizing the "ca-" prefix.
Result:
- Korean text consumes 2-3x more tokens than equivalent English text.
- Pattern generalization across similar-sounding words is lost.
- Vocabulary table wastes space on 11,172 entries vs 68 jamo.
- Korean users pay more for API usage for the same semantic content.
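The shared structure is easy to see with Python's standard library: Unicode NFD normalization performs exactly this canonical jamo decomposition, and the three syllables above differ only in their last jamo:

```python
import unicodedata

# NFD splits each Hangul syllable block into its conjoining jamo
for syllable in "한할함":
    jamo = unicodedata.normalize("NFD", syllable)
    print(syllable, "→", [hex(ord(j)) for j in jamo])
# All three decompositions begin with the same two jamo (ㅎ, ㅏ);
# only the final consonant differs.
```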
## Proposed Solution
Add an optional Hangul jamo decomposition step in the normalizer/pre-tokenizer pipeline:
```
Input:    "한글 처리"
Current:  [한] [글] [처] [리]           → 4+ tokens (opaque blocks)
Proposed: [ㅎㅏㄴ] [ㄱㅡㄹ] [ㅊㅓ] [ㄹㅣ] → jamo sequences (composable)
```
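A minimal sketch of such an opt-in step (the function name is hypothetical, and built-in NFD normalization stands in for the arithmetic here): decompose only characters in the Hangul syllable range and pass everything else through, so mixed-language text is unaffected:

```python
import unicodedata

def jamo_pretokenize(text: str) -> str:
    """Hypothetical opt-in normalizer step: decompose Hangul syllables to jamo."""
    return "".join(
        # NFD performs the canonical syllable-to-jamo decomposition
        unicodedata.normalize("NFD", ch) if 0xAC00 <= ord(ch) <= 0xD7A3 else ch
        for ch in text
    )

print(jamo_pretokenize("한글 처리"))   # jamo sequence, space preserved
print(jamo_pretokenize("hello 한글"))  # ASCII passes through untouched
```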
The decomposition is trivial — Unicode arithmetic, ~10 lines of code:
```python
# Jamo tables built from the Unicode conjoining-jamo ranges:
# initials U+1100–U+1112, vowels U+1161–U+1175, finals U+11A8–U+11C2
INITIALS = [chr(0x1100 + i) for i in range(19)]
VOWELS = [chr(0x1161 + i) for i in range(21)]
FINALS = [""] + [chr(0x11A8 + i) for i in range(27)]  # index 0 = no final

def decompose_hangul(char: str) -> str:
    code = ord(char) - 0xAC00
    if not (0 <= code < 11172):
        return char  # pass non-Hangul characters through unchanged
    initial = code // 588
    vowel = (code % 588) // 28
    final = code % 28
    result = INITIALS[initial] + VOWELS[vowel]
    if final > 0:
        result += FINALS[final]
    return result
```

## Expected Benefits
| Metric | Current | With Jamo Decomposition |
|---|---|---|
| Vocabulary entries for Korean | ~11,172 syllable blocks | 68 jamo |
| Token count for Korean text | 2-3x vs English | Potentially close to parity |
| Cross-word pattern sharing | None (opaque blocks) | Structural (shared jamo = shared tokens) |
| Implementation cost | — | Minimal (Unicode arithmetic) |
## Considerations
- Token count per Korean word may increase (1 syllable → 2-3 jamo tokens), but vocabulary compression + pattern generalization should offset this.
- Reconstruction is lossless — jamo → syllable block is deterministic.
- This could be opt-in (language-detected or user-specified).
- Multilingual scalability: the same sub-character decomposition can be adapted to other compositional scripts (e.g., Tibetan, Devanagari, Thai) for broader multilingual token optimization.
- Empirical validation: research such as KR-BERT (Lee et al., 2020) has already demonstrated the effectiveness of sub-character BPE for Korean. Modern global tokenizers still rely on byte-level BPE (BBPE), which ignores the mathematical compositionality of these scripts; supporting jamo decomposition in the pre-tokenizer is a low-cost architectural win.
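The lossless-reconstruction point above can be checked directly: NFD decomposes syllables to conjoining jamo and NFC recomposes them, giving an exact round trip:

```python
import unicodedata

text = "한글 처리"
decomposed = unicodedata.normalize("NFD", text)   # syllables → jamo sequence
recomposed = unicodedata.normalize("NFC", decomposed)  # jamo → syllables
assert recomposed == text  # exact round trip: nothing is lost
print(len(text), "syllable-level chars →", len(decomposed), "jamo-level chars")
```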
## References & Empirical Evidence
- KR-BERT (Lee et al., 2020): shows that sub-character BPE sharply reduces vocabulary size while outperforming character-level models. (arXiv:2008.03979)
- Jeon et al., 2023: "Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition" (arXiv:2311.03928)
- ACL 2024 Findings: "Korean Character Representations Based on the Combination Rules of Subcharacters" (ACL Anthology)
- Unicode Hangul Syllable Decomposition: Official algorithmic specification. (Unicode Standard, Chapter 3.12)
- Reference Implementation: MorphSubDecomp-Korean — Sub-character decomposition pre-tokenizer pipeline