A focused guide for bugs that originate from IME composition on Windows, macOS, Linux, iOS, and Android. Scope includes CJK IMEs (Pinyin, Wubi, Kana/Kanji, 2-set/3-set), Indic transliteration, RTL keyboards, and mixed fullwidth/halfwidth states. Use this when text looks fine to the eye but retrieval or validation behaves inconsistently across devices.
- Reports say “works on Mac, fails on Windows IME” or “mobile input breaks search.”
- Fields contain invisible marks after copy or composition (ZWJ, ZWNJ, NBSP, RLM/LRM).
- Users toggle fullwidth digits or punctuation and recall suddenly collapses.
- Romanization IMEs produce composed characters that differ from pasted text.
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Multilingual overview: Multilingual Guide
- Tokenizer mismatch: Tokenizer Mismatch
- CJK word breaks: CJK Segmentation & Wordbreak
- RTL markers and controls: RTL & Bidi Controls
- Script mixing: Script Mixing
- Diacritics: Diacritics & Folding
- Snippet schema: Data Contracts · Retrieval Traceability
- ΔS(question, retrieved) ≤ 0.45
- Coverage of target section ≥ 0.70
- λ_observe remains convergent across 3 paraphrases, 2 seeds, and 2 devices
- E_resonance stays flat on long input windows
- “Cannot reproduce” until tester types through an IME rather than pasting.
- Same glyphs, different bytes. Equality checks fail, search misses.
- Index recall drops after mobile users enable fullwidth digits.
- Mixed NBSP and normal space in otherwise identical queries.
- Sporadic RTL flip caused by stray RLM/LRM from bidirectional typing.
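The "same glyphs, different bytes" symptom is easy to reproduce. The sketch below uses Python's standard `unicodedata` module. The sample strings and the `codepoints` helper are illustrative assumptions, not taken from a real bug report.

```python
import unicodedata

# Two renderings of the same Hangul syllable: precomposed vs. conjoining jamo.
pasted = "\ud55c"                # U+D55C, precomposed (NFC)
typed = "\u1112\u1161\u11ab"     # decomposed jamo, as some input paths emit (NFD)

# Fullwidth digits look like digits but are different code points.
fullwidth_digits = "\uff11\uff12\uff13"


def codepoints(s):
    """Hypothetical helper: list code points so invisible differences show up."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(pasted == typed)                                   # False
print(unicodedata.normalize("NFC", typed) == pasted)     # True
print(fullwidth_digits == "123")                         # False
print(codepoints(typed))
```

Dumping code points like this is usually the fastest way to confirm that a "cannot reproduce" report is an input-path difference rather than a retrieval bug.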
- Normalize early: On every input boundary apply NFC, width fold, and punctuation fold. Remove ZWJ, ZWNJ, LRM, and RLM unless explicitly allowed by schema.
- Stabilize tokenization: Lock the analyzers and tokenizers used for both indexing and querying. If ΔS remains high and flat after IME normalization, revisit the metric and analyzer pairing in the store. See Retrieval Playbook.
- Contract the payload: For forms and tool calls, require fields that capture canonical and raw strings: `raw`, `normalized`, `locale`, `ime_mode`, `width_state`. Enforce this in your Data Contracts.
- Probe λ: Run the same query by paste, by IME typing, and by mobile. If λ flips only for IME-typed paths, you have an input normalization gap.
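The "normalize early" step above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: the name `normalize_input`, the `allow_marks` flag, and the exact punctuation set are hypothetical, and a real schema may need a wider or narrower fold.

```python
import unicodedata

# Width fold: fullwidth ASCII variants (U+FF01..U+FF5E) map to ASCII,
# and the ideographic space maps to a normal space.
_FOLD = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
_FOLD[0x3000] = 0x20

# Punctuation fold (assumed set): smart quotes, ellipsis, en and em dashes.
_FOLD.update({
    0x2018: "'", 0x2019: "'",
    0x201C: '"', 0x201D: '"',
    0x2026: "...",
    0x2013: "-", 0x2014: "-",
})

# Marks stripped by default: ZWNJ, ZWJ, LRM, RLM.
_STRIP = {0x200C: None, 0x200D: None, 0x200E: None, 0x200F: None}


def normalize_input(raw, allow_marks=False):
    """Fold width and punctuation, strip marks unless allowed, then apply NFC."""
    table = _FOLD if allow_marks else {**_FOLD, **_STRIP}
    # NFC runs last so sequences that become composable after folding
    # or stripping are still composed.
    return unicodedata.normalize("NFC", raw.translate(table))


print(normalize_input("\uff21\uff22\uff23\uff11\uff12\uff13"))  # ABC123
```

Running NFC after the fold and strip matters: removing a ZWNJ between a base letter and a combining mark can leave a sequence that only then becomes composable.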
Use this contract for any user text that enters retrieval or matching.
{
"text": {
"raw": "<exact keystroke result>",
"normalized": "<NFC + width_fold + punct_fold + bidi_strip>",
"locale": "zh-TW | zh-CN | ja-JP | ko-KR | hi-IN | ...",
"ime_mode": "pinyin | wubi | kana | romaji | 2set | 3set | translit | rtl",
"width_state": "half | full | mixed",
"bidi_marks": ["RLM","LRM","ZWJ","ZWNJ","NBSP"]
}
}

Store both raw and normalized. Index normalized. Retain raw for audits and display.
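One way to populate this contract from a raw string is sketched below. The function names `detect_width_state` and `build_text_payload` are hypothetical, the normalizer is passed in by the caller, and the width heuristic (based on `unicodedata.east_asian_width`, ignoring whitespace and wide CJK characters) is an assumption rather than a rule from this page.

```python
import json
import unicodedata

# Marks reported in "bidi_marks", in the same order as the schema above.
_MARKS = {
    "\u200f": "RLM", "\u200e": "LRM",
    "\u200d": "ZWJ", "\u200c": "ZWNJ",
    "\u00a0": "NBSP",
}


def detect_width_state(s):
    """Assumed heuristic: 'F' counts as fullwidth, 'Na'/'H' as halfwidth."""
    widths = {unicodedata.east_asian_width(ch) for ch in s if not ch.isspace()}
    has_full = "F" in widths
    has_half = bool(widths & {"Na", "H"})
    if has_full and has_half:
        return "mixed"
    return "full" if has_full else "half"


def build_text_payload(raw, locale, ime_mode, normalize):
    """Build the text contract; `normalize` is the caller's normalizer."""
    return {
        "text": {
            "raw": raw,
            "normalized": normalize(raw),
            "locale": locale,
            "ime_mode": ime_mode,
            "width_state": detect_width_state(raw),
            "bidi_marks": [name for ch, name in _MARKS.items() if ch in raw],
        }
    }


# Usage with a coarse stand-in normalizer (NFKC only, for illustration).
payload = build_text_payload(
    "\uff11\uff12\uff13\u200f", "ja-JP", "kana",
    normalize=lambda s: unicodedata.normalize("NFKC", s),
)
print(json.dumps(payload, ensure_ascii=True, indent=2))
```

Recording `width_state` and `bidi_marks` from the raw string, before normalization, keeps the audit trail useful even though only `normalized` is indexed.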
| Issue | Action | Notes |
|---|---|---|
| Composition variance (NFD vs NFC) | Convert to NFC | Prevents byte inequality for identical glyphs |
| Fullwidth digits and Latin | Width fold to ASCII | Keep CJK letters untouched |
| Smart quotes, ellipsis, dashes | Punctuation fold to ASCII set | Avoid tokenizer splits that differ by device |
| Zero-width characters (ZWJ, ZWNJ) | Strip by default | Allow only if explicitly required by language rules |
| Bidi controls (LRM, RLM) | Strip at input for LTR schemas | Keep only in rich text fields, never in keys |
| NBSP, thin space | Map to normal space | Collapse runs of spaces to a single space |
| Kana halfwidth/fullwidth | Fold within script | Keep semantic marks like voiced sound when needed |
| Romanization IMEs | Canonicalize case and spacing | For JP/KR/Indic transliteration paths |
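Two rows of the table, kana folding within the script and whitespace mapping, can be sketched as follows. The helper names `fold_kana` and `fold_spaces` are hypothetical. Applying NFKC only to runs in the halfwidth block U+FF61..U+FF9F is one possible approach that leaves fullwidth digits and Latin untouched for a separate width-fold step.

```python
import re
import unicodedata

# Halfwidth CJK punctuation and halfwidth katakana, including the
# halfwidth voiced (U+FF9E) and semi-voiced (U+FF9F) sound marks.
_HALFWIDTH_KANA_RUN = re.compile(r"[\uFF61-\uFF9F]+")

# NBSP, thin space, and ordinary space; runs collapse to one space.
_SPACE_RUN = re.compile(r"[\u00a0\u2009 ]+")


def fold_kana(s):
    """Fold halfwidth katakana to fullwidth, composing voiced sound marks."""
    folded = _HALFWIDTH_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), s
    )
    # A final NFC composes a voiced mark that followed a fullwidth kana.
    return unicodedata.normalize("NFC", folded)


def fold_spaces(s):
    """Map NBSP and thin space to a normal space and collapse runs."""
    return _SPACE_RUN.sub(" ", s)


print(fold_kana("\uff76\uff9e\uff72\uff84\uff9e"))  # ガイド
print(fold_spaces("a\u00a0\u2009 b"))               # a b
```

Because the voiced sound mark is composed into the base kana rather than dropped, the semantic distinction the table asks to keep is preserved.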
- Triplet equality: paste vs IME vs mobile should produce identical `normalized`.
- Search parity: same top-k ordering after normalization across devices.
- Width flip test: force fullwidth digits and punctuation, verify recall remains constant.
- Bidi contamination: inject RLM/LRM in the middle, verify strip or deterministic handling.
- ΔS plateaus: if ΔS remains ≥ 0.60 after normalization, suspect metric mismatch or fragmentation and jump to Embedding ≠ Semantic and Vectorstore Fragmentation.
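The triplet equality, width flip, and bidi contamination checks above can be automated with a small harness. This is a sketch under assumptions: the function names are hypothetical, and `stand_in_normalize` (mark stripping plus NFKC) is a deliberately coarse placeholder for whatever normalizer your pipeline actually uses.

```python
import unicodedata

_STRIP = {0x200C: None, 0x200D: None, 0x200E: None, 0x200F: None}


def stand_in_normalize(s):
    """Placeholder normalizer: strip ZWNJ/ZWJ/LRM/RLM, then apply NFKC."""
    return unicodedata.normalize("NFKC", s.translate(_STRIP))


def triplet_equal(normalize, pasted, ime_typed, mobile):
    """Triplet equality: all three input paths normalize to one string."""
    return len({normalize(pasted), normalize(ime_typed), normalize(mobile)}) == 1


def survives_bidi_injection(normalize, text):
    """Bidi contamination: LRM or RLM injected mid-string must not change output."""
    mid = len(text) // 2
    clean = normalize(text)
    return all(
        normalize(text[:mid] + mark + text[mid:]) == clean
        for mark in ("\u200e", "\u200f")
    )


# Width flip plus NBSP plus a stray RLM, all expected to converge.
print(triplet_equal(
    stand_in_normalize,
    "order 123",                        # pasted
    "order\u00a0\uff11\uff12\uff13",    # IME path: NBSP and fullwidth digits
    "order 12\u200f3",                  # mobile path: stray RLM
))  # True
print(survives_bidi_injection(stand_in_normalize, "\u6771\u4eac123"))  # True
```

Passing the normalizer in as an argument lets the same checks run against the production normalizer in CI, with the identity function serving as a negative control.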
- Tokenizer and analyzer coupling: Tokenizer Mismatch
- Script collisions and mixed runs: Script Mixing
- CJK segmentation: CJK Segmentation & Wordbreak
- RTL handling: RTL & Bidi Controls
- Traceable answers: Retrieval Traceability
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |