🧭 Quick Return to Map
You are in a sub-page of LanguageLocale.
To reorient, go back here:
- LanguageLocale — localization, regional settings, and context adaptation
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize retrieval and ranking for Chinese, Japanese, and Korean text, where whitespace is not a reliable token boundary.
Use this page to localize segmentation failures, choose the right analyzer, and verify the fix against measurable targets.
Typical symptoms:

- High character-level similarity but wrong meaning, or empty recall on whole queries.
- BM25 looks random; tiny single-character tokens dominate the index.
- Citations cut through the middle of words; snippet offsets don’t match what users see.
- Mixed CJK + Latin queries split unpredictably across runs or providers.
Open these first:

- Visual map and recovery → rag-architecture-and-recovery.md
- End-to-end retrieval knobs → retrieval-playbook.md
- Why this snippet (traceability) → retrieval-traceability.md
- Payload schema & cite-then-explain → data-contracts.md
- Chunking checklist for semantic boundaries → chunking-checklist.md
- Embedding ≠ meaning (sanity) → embedding-vs-semantic.md
- Related locale pages: tokenizer_mismatch.md · script_mixing.md · digits_width_punctuation.md · diacritics_and_folding.md · locale_drift.md
Acceptance targets:

- ΔS(question, retrieved) ≤ 0.45 on 3 paraphrases
- Coverage of the target section ≥ 0.70
- λ remains convergent across 2 seeds
- Tokenization sanity: OOV rate falls by ≥ 40% versus whitespace tokenization; tokens/char ≤ 0.7 on CJK pages
- E_resonance stays flat on long windows
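The tokens/char and OOV targets above can be computed directly. A minimal sketch (the helper names here are illustrative, not part of any WFGY API; the toy strings stand in for a real gold set):

```python
def tokens_per_char(tokens: list[str], text: str) -> float:
    """Ratio of emitted tokens to source characters (target: <= 0.7 on CJK)."""
    return len(tokens) / max(len(text), 1)

def oov_rate(tokens: list[str], vocab: set[str]) -> float:
    """Share of tokens missing from the index vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# A whitespace analyzer degenerates to one token per character on CJK text:
text = "語言模型檢索"
char_tokens = list(text)                    # what a naive analyzer emits
word_tokens = ["語言", "模型", "檢索"]       # what a real segmenter should emit

assert tokens_per_char(char_tokens, text) == 1.0   # fails the <= 0.7 target
assert tokens_per_char(word_tokens, text) == 0.5   # passes
```

Log both numbers before and after the analyzer change; if tokens/char does not drop, the new analyzer never took effect.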
| Symptom | Likely cause | Open this |
|---|---|---|
| Query returns almost nothing; recall jumps when you add spaces | Index built with whitespace/Latin analyzer on CJK | chunking-checklist.md, retrieval-playbook.md |
| Top-k filled with 1-char shards, citations cut mid-word | No CJK word-break at index or search time | retrieval-traceability.md, data-contracts.md |
| BM25 unstable; hybrid worse than single retriever | Search-time analyzer ≠ index-time analyzer | retrieval-playbook.md |
| Romanized terms and CJK compound in one query break apart | Mixed script + width + punctuation rules differ | script_mixing.md, digits_width_punctuation.md |
| High similarity, wrong meaning | Character-level overlap, no semantic units | embedding-vs-semantic.md |
**Step 1 — Pick the right analyzer and lock it**

- Chinese: use a dictionary-based or statistical segmenter at both index and search time.
- Japanese: use a MeCab/Kuromoji-class tokenizer with POS tagging; keep the base form.
- Korean: use a Nori-class analyzer; index decompound and compound forms consistently.
**Step 2 — Normalize before segmenting**

- Apply NFKC for width and compatibility forms (see the page links above).
- Keep punctuation folding consistent across index and search.
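The normalization step is one stdlib call in Python. A minimal sketch (the wrapper name is illustrative):

```python
import unicodedata

def normalize_cjk(text: str) -> str:
    """NFKC folds full-width Latin, half-width kana, and compatibility forms."""
    return unicodedata.normalize("NFKC", text)

# Full-width "ＧＰＵ２" and half-width kana "ｶﾀｶﾅ" fold to canonical forms:
assert normalize_cjk("ＧＰＵ２") == "GPU2"
assert normalize_cjk("ｶﾀｶﾅ") == "カタカナ"
```

Run the same call at ingestion and at query time; a one-sided NFKC is itself an index/search mismatch.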
**Step 3 — Unify index-time and query-time configs**

- Same language, same tokenizer, same stop/fold rules. No “smart defaults”.
**Step 4 — Chunk on semantic units, not line breaks**

- Respect sentence and phrase boundaries after segmentation.
- Store `offsets`, `tokens`, and `section_id` in the snippet schema.
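A minimal payload sketch for that snippet schema (field names follow the text above; the example document and IDs are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Snippet:
    """One retrievable chunk; fields match the snippet schema in the text."""
    text: str
    section_id: str
    offsets: tuple[int, int]  # [start, end) character offsets in the source doc
    tokens: list[str] = field(default_factory=list)  # segmenter output, not chars

doc = "語言模型需要分詞。檢索才會穩定。"
s = Snippet(text=doc[0:9], section_id="intro#1", offsets=(0, 9),
            tokens=["語言", "模型", "需要", "分詞", "。"])

# A cited span must reproduce exactly from its offsets — no mid-word cuts:
assert doc[s.offsets[0]:s.offsets[1]] == s.text
```

Storing word-level `tokens` alongside character `offsets` is what lets you prove a citation never splits a word.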
**Step 5 — Probe**

- Log tokens/char, unique-term ratio, OOV rate, and ΔS before and after.
- If ΔS stays ≥ 0.60 despite good segmentation, revisit metric or index mismatch.
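For the ΔS probe, a common approximation is one minus the cosine similarity of the question and retrieved-span embeddings. A sketch under that assumption (swap in your pipeline’s own ΔS definition if it differs; the embedding vectors here are placeholders):

```python
import math

def delta_s(vec_q: list[float], vec_r: list[float]) -> float:
    """ΔS approximated as 1 - cosine similarity (assumption, not the WFGY spec)."""
    dot = sum(a * b for a, b in zip(vec_q, vec_r))
    norm = (math.sqrt(sum(a * a for a in vec_q))
            * math.sqrt(sum(b * b for b in vec_r)))
    return 1.0 - dot / norm if norm else 1.0

# Identical vectors -> ΔS = 0; orthogonal -> ΔS = 1 (deep in the >= 0.60 zone).
assert abs(delta_s([1.0, 0.0], [1.0, 0.0])) < 1e-9
assert abs(delta_s([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-9
```

Run it on the same question/span pair before and after the analyzer fix; segmentation that works should pull ΔS down, not just reshuffle top-k.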
**Elasticsearch / OpenSearch**

- CN: install and set a CJK analyzer; index and search must use the same analyzer.
- JP: kuromoji with the baseform filter; disable random synonyms unless audited.
- KR: nori; keep the decompound mode consistent at index and query time.
- Verify with `_analyze` samples; reindex after any analyzer change.
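A quick spot-check with the stock `_analyze` endpoint, e.g. `POST /my_cjk_index/_analyze` with a body like the following (the index name is a placeholder; pick the analyzer you actually configured):

```json
{
  "analyzer": "kuromoji",
  "text": "東京都に住んでいます"
}
```

If the response lists one token per character, the analyzer in effect is not the one you think it is; fix the mapping and reindex before touching anything downstream.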
**pgvector / Postgres**

- Segmentation happens before embedding, so pre-segment text in ETL.
- Keep the same pipeline for ingestion and live queries.
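A minimal sketch of that single shared pipeline (the `segment` body is a toy character-pairing placeholder; swap in a real segmenter such as jieba, MeCab, or Nori):

```python
import unicodedata

def segment(text: str) -> list[str]:
    """Placeholder segmenter — pairs characters so the pipeline runs end to end.
    Replace with a real CJK segmenter in production."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def preprocess(text: str) -> str:
    """NFKC -> segment -> space-join; call the SAME function at ingest and query."""
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(segment(folded))

# Width folding and segmentation happen together, once, in one code path:
assert preprocess("ＡＩ模型") == "AI 模型"
```

The point is the single `preprocess` entry point: two copies of this logic (one in ETL, one in the query service) is exactly how index/search mismatches are born.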
**Weaviate / Qdrant / Chroma / Milvus / FAISS**

- The vector store won’t fix segmentation. Preprocess: NFKC → CJK segmenter → chunk.
- Log the preprocessing hash in metadata; fail closed on mismatch.
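One way to implement that fail-closed check, sketched with stdlib hashing (the config keys and version strings are illustrative):

```python
import hashlib
import json

PIPELINE = {"normalize": "NFKC", "segmenter": "jieba-0.42.1", "chunk": "sentence"}

def pipeline_hash(cfg: dict) -> str:
    """Stable hash of the preprocessing config; store it on every vector's metadata."""
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:12]

def check_compatible(stored_hash: str, cfg: dict) -> None:
    """Fail closed: refuse to query vectors produced by a different pipeline."""
    if stored_hash != pipeline_hash(cfg):
        raise RuntimeError("preprocessing mismatch: reindex before querying")

check_compatible(pipeline_hash(PIPELINE), PIPELINE)  # same pipeline: passes
```

`sort_keys=True` makes the hash independent of dict ordering, so the same config always produces the same fingerprint.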
**Vespa / Typesense / Elastic-compatible**

- Use the platform’s CJK tokenizer if available; otherwise pre-segment and index the segmented text as the field value.
**Three-way segmentation A/B/C**

Try three segmenters; compute ΔS and tokens/char on a small gold set. Pick the lowest ΔS with stable λ.
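A minimal harness for that selection rule (the segmenter names and ΔS numbers are invented for illustration; plug in your real probe results):

```python
def pick_segmenter(scores: dict[str, list[float]]) -> str:
    """scores: segmenter name -> ΔS per gold question. Lowest mean ΔS wins."""
    return min(scores, key=lambda name: sum(scores[name]) / len(scores[name]))

gold_runs = {
    "whitespace":  [0.71, 0.66, 0.74],  # char shards: stuck in the high-ΔS zone
    "segmenter_a": [0.41, 0.44, 0.39],
    "segmenter_b": [0.48, 0.52, 0.47],
}
assert pick_segmenter(gold_runs) == "segmenter_a"
```

Also check the λ stability condition by hand: a segmenter whose mean ΔS wins but whose per-question spread is wide has not actually converged.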
**Anchor triangulation**

Compare ΔS to the correct anchor versus a decoy section. If both are close, you are still at character overlap, not word-level meaning.
**Rerank sanity**

After proper segmentation, reranking should lift precision. If it does not, check for analyzer mismatch between the index and query paths.
Copy-paste prompt:

```txt
You have TXT OS and the WFGY Problem Map loaded.

My CJK issue:
- symptom: [one line]
- traces: ΔS(question,retrieved)=..., tokens/char=..., OOV_before=..., OOV_after=...

Tell me:
1. which layer failed (segmentation, normalization, index/search mismatch),
2. which exact WFGY page to open,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test (3 paraphrases × 2 seeds) to verify the fix.

Use BBMC/BBCR/BBPF/BBAM when relevant.
```
- Related: rtl_bidi_directionality.md (Arabic/Hebrew mixing, mirroring, numerals)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped you, starring it improves discovery so more builders can find the docs and tools.