Keyboard Input Methods — Guardrails and Fix Pattern

🧭 Quick Return to Map

You are in a sub-page of LanguageLocale.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A focused guide for bugs that originate from IME composition on Windows, macOS, Linux, iOS, and Android. Scope includes CJK IMEs (Pinyin, Wubi, Kana/Kanji, 2-set/3-set), Indic transliteration, RTL keyboards, and mixed fullwidth/halfwidth states. Use this when text looks fine to the eye but retrieval or validation behaves inconsistently across devices.

When to use this page

  • Reports say “works on Mac, fails on Windows IME” or “mobile input breaks search.”
  • Fields contain invisible marks after copy or composition (ZWJ, ZWNJ, NBSP, RLM/LRM).
  • Users toggle fullwidth digits or punctuation and recall suddenly collapses.
  • Romanization IMEs produce composed characters that differ from pasted text.

Open these first

Core acceptance

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • λ_observe remains convergent across 3 paraphrases, 2 seeds, and 2 devices
  • E_resonance stays flat on long input windows

Failure smells

  • “Cannot reproduce” until the tester types through an IME instead of pasting.
  • Same glyphs, different bytes. Equality checks fail, search misses.
  • Index recall drops after mobile users enable fullwidth digits.
  • Mixed NBSP and normal space in otherwise identical queries.
  • Sporadic RTL flip caused by stray RLM/LRM from bidirectional typing.
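To confirm a "same glyphs, different bytes" report, it usually helps to list the code points of each string. The helper below is a minimal sketch using only Python's standard `unicodedata` module; the name `describe` and the sample strings are illustrative assumptions, not part of this page's tooling.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point so invisible marks become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Hypothetical samples: the two strings render the same but differ in code points.
pasted = "caf\u00e9"          # precomposed e-acute
typed = "cafe\u0301\u200f"    # e + combining acute, plus a stray RLM

print(pasted == typed)        # the raw strings are not equal
for entry in describe(typed):
    print(entry)
```

Running this on both the pasted and the IME-typed value of a failing field shows whether the difference is a composition variant, a width variant, or a stray control mark.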

Fix in 60 seconds

  1. Normalize early. On every input boundary, apply NFC, width fold, and punctuation fold. Remove ZWJ, ZWNJ, LRM, and RLM unless the schema explicitly allows them.

  2. Stabilize tokenization. Lock the analyzers and tokenizers used for both indexing and querying. If ΔS remains high and flat after IME normalization, revisit the metric and analyzer pairing in the store. See Retrieval Playbook.

  3. Contract the payload. For forms and tool calls, require fields that capture both canonical and raw strings: raw, normalized, locale, ime_mode, width_state. Enforce this in your Data Contracts.

  4. Probe λ. Run the same query by paste, by IME typing, and by mobile. If λ flips only on IME-typed paths, you have an input normalization gap.
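Step 1 can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the `allow_marks` parameter, and the exact punctuation table are hypothetical, and the width fold covers only the fullwidth ASCII variants (U+FF01 to U+FF5E), which includes fullwidth punctuation as well as digits and Latin letters.

```python
import re
import unicodedata

# Marks removed by default: ZWNJ, ZWJ, LRM, RLM.
STRIP_MARKS = frozenset("\u200c\u200d\u200e\u200f")

# Illustrative fold table (an assumption; extend it to match your schema).
PUNCT_FOLD = {
    "\u2018": "'", "\u2019": "'",                # curly single quotes
    "\u201c": '"', "\u201d": '"',                # curly double quotes
    "\u2026": "...",                             # ellipsis
    "\u2013": "-", "\u2014": "-",                # en and em dash
    "\u00a0": " ", "\u2009": " ", "\u3000": " ", # NBSP, thin space, ideographic space
}


def width_fold(text: str) -> str:
    """Map fullwidth ASCII variants (U+FF01..U+FF5E) to ASCII; leave other characters alone."""
    return "".join(
        chr(ord(ch) - 0xFEE0) if "\uff01" <= ch <= "\uff5e" else ch
        for ch in text
    )


def normalize_input(raw: str, allow_marks: frozenset = frozenset()) -> str:
    """NFC, width fold, punctuation fold, mark strip, then collapse runs of spaces."""
    text = unicodedata.normalize("NFC", raw)
    text = width_fold(text)
    text = "".join(PUNCT_FOLD.get(ch, ch) for ch in text)
    drop = STRIP_MARKS - allow_marks
    text = "".join(ch for ch in text if ch not in drop)
    text = re.sub(r" {2,}", " ", text)
    # Folding and stripping can bring a base letter and a combining mark
    # together, so NFC is applied once more at the end.
    return unicodedata.normalize("NFC", text)


print(normalize_input("\uff21\uff22\uff23\u3000\uff11\uff12\uff13"))
```

Passing `allow_marks=frozenset({"\u200d"})` keeps ZWJ for schemas that need it, which matches the "unless explicitly allowed" clause in step 1.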


IME-safe schema (copy block)

Use this contract for any user text that enters retrieval or matching.

{
  "text": {
    "raw": "<exact keystroke result>",
    "normalized": "<NFC + width_fold + punct_fold + bidi_strip>",
    "locale": "zh-TW | zh-CN | ja-JP | ko-KR | hi-IN | ...",
    "ime_mode": "pinyin | wubi | kana | romaji | 2set | 3set | translit | rtl",
    "width_state": "half | full | mixed",
    "bidi_marks": ["RLM","LRM","ZWJ","ZWNJ","NBSP"]
  }
}

Store both raw and normalized. Index normalized. Retain raw for audits and display.
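One way to populate this contract is sketched below. It is an illustration only: `build_text_record` and `width_state` are hypothetical names, the caller supplies the already normalized string, and `bidi_marks` is read here as "the marks detected in raw", which is an interpretation of the schema rather than something the schema states. Width is classified with `unicodedata.east_asian_width`, and text with no fullwidth forms defaults to `half`.

```python
import json
import unicodedata

# Code point to label, using the labels from the schema.
MARK_NAMES = {
    "\u200f": "RLM",
    "\u200e": "LRM",
    "\u200d": "ZWJ",
    "\u200c": "ZWNJ",
    "\u00a0": "NBSP",
}


def width_state(raw: str) -> str:
    """Classify raw text as 'half', 'full', or 'mixed', ignoring whitespace."""
    widths = {unicodedata.east_asian_width(ch) for ch in raw if not ch.isspace()}
    has_full = "F" in widths                   # fullwidth forms
    has_half = bool(widths & {"Na", "H"})      # narrow (ASCII) or halfwidth forms
    if has_full and has_half:
        return "mixed"
    return "full" if has_full else "half"      # assumption: default to 'half'


def build_text_record(raw: str, normalized: str, locale: str, ime_mode: str) -> dict:
    """Assemble the payload; 'normalized' comes from whatever normalizer you use."""
    return {
        "text": {
            "raw": raw,
            "normalized": normalized,
            "locale": locale,
            "ime_mode": ime_mode,
            "width_state": width_state(raw),
            "bidi_marks": [name for ch, name in MARK_NAMES.items() if ch in raw],
        }
    }


record = build_text_record("\uff11\uff12\uff13\u200fabc", "123abc", "ja-JP", "romaji")
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping the detection on `raw` rather than `normalized` preserves the audit trail: the record still shows which marks and width state the device produced after the indexed value has been cleaned.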


Normalization and folding rules

| Issue | Action | Notes |
|---|---|---|
| Composition variance (NFD vs NFC) | Convert to NFC | Prevents byte inequality for identical glyphs |
| Fullwidth digits and Latin | Width fold to ASCII | Keep CJK letters untouched |
| Smart quotes, ellipsis, dashes | Punctuation fold to ASCII set | Avoid tokenizer splits that differ by device |
| Zero-width characters (ZWJ, ZWNJ) | Strip by default | Allow only if explicitly required by language rules |
| Bidi controls (LRM, RLM) | Strip at input for LTR schemas | Keep only in rich text fields, never in keys |
| NBSP, thin space | Map to normal space | Collapse runs of spaces to a single space |
| Kana halfwidth/fullwidth | Fold within script | Keep semantic marks like voiced sound when needed |
| Romanization IMEs | Canonicalize case and spacing | For JP/KR/Indic transliteration paths |
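The kana row is the one that NFC alone does not cover, because halfwidth katakana are compatibility variants rather than canonical ones. A minimal sketch, assuming the goal is to fold halfwidth kana to standard katakana without touching anything else: apply NFKC only to runs in the halfwidth block (U+FF61 to U+FF9F), then finish with NFC. The function name `fold_halfwidth_kana` is hypothetical.

```python
import re
import unicodedata

# Halfwidth CJK punctuation and katakana, including the halfwidth
# voiced (U+FF9E) and semi-voiced (U+FF9F) sound marks.
_HALFWIDTH_KANA_RUN = re.compile(r"[\uff61-\uff9f]+")


def fold_halfwidth_kana(text: str) -> str:
    """Fold halfwidth kana to standard katakana; other characters keep their width."""
    folded = _HALFWIDTH_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )
    # A final NFC pass composes a sound mark that followed a character
    # outside the matched run.
    return unicodedata.normalize("NFC", folded)


# Halfwidth KA + voiced mark becomes the single letter GA (U+30AC).
print(fold_halfwidth_kana("\uff76\uff9e") == "\u30ac")
```

Restricting NFKC to the matched runs is a deliberate choice here: applied to the whole string, NFKC would also rewrite ligatures, superscripts, and other compatibility characters that the table does not mention.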

Tests you should run

  • Triplet equality: paste, IME typing, and mobile input should all produce an identical normalized value.
  • Search parity: same top-k ordering after normalization across devices.
  • Width flip test: force fullwidth digits and punctuation, verify recall remains constant.
  • Bidi contamination: inject RLM/LRM in the middle, verify strip or deterministic handling.
  • ΔS plateaus: if ΔS remains ≥ 0.60 after normalization, suspect metric mismatch or fragmentation and jump to Embedding ≠ Semantic and Vectorstore Fragmentation.
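The triplet equality and bidi contamination checks can be automated with a small harness. The sketch below is illustrative: `triplet_report` is a hypothetical helper that accepts any normalizer, and `nfc_only` is a deliberately incomplete stand-in used to show what a failing report looks like.

```python
import unicodedata
from typing import Callable


def triplet_report(normalize: Callable[[str], str], samples: dict[str, str]) -> dict:
    """Group input paths by their normalized value; one group means the paths agree."""
    groups: dict[str, list[str]] = {}
    for path, raw in samples.items():
        groups.setdefault(normalize(raw), []).append(path)
    return {"equal": len(groups) == 1, "groups": groups}


def nfc_only(text: str) -> str:
    """Stand-in normalizer that applies NFC and nothing else."""
    return unicodedata.normalize("NFC", text)


# Hypothetical captures of the same query from three input paths.
samples = {
    "paste": "caf\u00e9",           # precomposed
    "ime": "cafe\u0301",            # decomposed
    "mobile": "caf\u00e9\u200f",    # precomposed with a trailing RLM
}

report = triplet_report(nfc_only, samples)
print(report["equal"])
for value, paths in report["groups"].items():
    print(paths, [f"U+{ord(ch):04X}" for ch in value])
```

With the NFC-only stand-in, the paste and IME paths agree while the mobile path stays separate because of the RLM, which is the signal that the normalizer still needs a bidi strip step.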

Escalate with these pages


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” and the OS boots instantly |

Explore More

| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.