🧭 Quick Return to Map
You are in a sub-page of Embeddings.
To reorient, go back here:
- Embeddings — vector representations and semantic search
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
A focused guide to remove silent drift caused by mismatched tokenization, casing, and text cleanup. Use this to align query-time and index-time preprocessing, then verify with measurable targets.
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Why this snippet (traceability schema): Retrieval Traceability
- Embedding vs meaning: Embedding ≠ Semantic
- Payload schema for audits: Data Contracts
- Query split and hybrid failures: Pattern: Query Parsing Split
- Reranking and ordering control: Rerankers
- Related, normalization and length scaling: Normalization & Scaling
Typical symptoms of a tokenization or casing mismatch:

- High similarity yet wrong meaning after a casing change.
- Same document retrieved for lowercase but not for Title Case.
- Index built with a different tokenizer than the client.
- Unicode variants or punctuation collapse change the results.
- Hybrid retrieval recalls the document, but top-k order flips between runs.
Acceptance targets after the fix:

- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of the correct section ≥ 0.70.
- λ remains convergent across two seeds.
- Tokenization invariance check: replacing case or applying Unicode NFC does not move the top anchor beyond rank 3 and keeps the ΔS shift ≤ 0.10.
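The targets above are stated in terms of ΔS. As a rough stand-in for triage (an assumption here, not the full WFGY definition), ΔS between two embeddings can be probed as 1 − cosine similarity:

```python
import math

def delta_s(vec_a, vec_b):
    """Approximate ΔS as 1 - cosine similarity (assumption, not the official metric)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions give 0.0; orthogonal vectors give 1.0.
print(delta_s([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(delta_s([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Any proxy is fine for the invariance checks below, as long as you use the same one on every paraphrase.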
Common failures and the direct fix:

- Wrong-meaning hits only when case changes
  → Lock a single case policy and tokenizer in both paths. Verify with Retrieval Traceability and enforce with Data Contracts.
- HyDE or BM25 path finds it, embedding path misses it
  → Audit query preprocessing vs index preprocessing. If they differ, unify and retest. See Retrieval Playbook and Pattern: Query Parsing Split.
- Unicode look-alikes or punctuation variants break recall
  → Normalize to NFC and collapse zero-width characters at both index and query. Re-embed affected shards. Cross check with Embedding ≠ Semantic.
- Order flips across runs after a client upgrade
  → Version the tokenizer and preprocessing in the snippet payload. If versions mismatch, rebuild or add a translation shim. Enforce with Data Contracts.
Fix steps:

- Freeze the tokenizer and case policy
  - Choose one tokenizer build and one case policy: `lower`, `preserve`, or `smart`.
  - Record both in the contract: `{"tokenizer":"spm-1.2.3","tokenizer_hash":"...","case_policy":"lower"}`.
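A minimal sketch of recording that contract, assuming the tokenizer ships as a single model file whose bytes can be hashed (the path and version string are illustrative):

```python
import hashlib

def tokenizer_contract(model_path: str, version: str, case_policy: str) -> dict:
    """Record the tokenizer build and case policy so index and client can be compared."""
    with open(model_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "tokenizer": version,
        "tokenizer_hash": f"sha256:{digest}",
        "case_policy": case_policy,
    }

# Usage (hypothetical path):
# contract = tokenizer_contract("models/spm.model", "spm-1.2.3", "lower")
```

Hashing the model bytes rather than trusting the version string catches silent rebuilds of the same nominal version.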
- Normalize before tokenization
  - Apply Unicode NFC, collapse repeated whitespace, remove zero-width characters, and standardize quotes and dashes.
  - Apply the same rules to both index and query.
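The normalization step above can be sketched as a single function that both paths share; the exact character sets are an assumption and should be extended for your corpus:

```python
import re
import unicodedata

# Zero-width characters to drop (illustrative, not exhaustive).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))
# Curly quotes and en/em dashes to straighten.
QUOTES_DASHES = str.maketrans({"\u2018": "'", "\u2019": "'",
                               "\u201c": '"', "\u201d": '"',
                               "\u2013": "-", "\u2014": "-"})

def normalize(text: str) -> str:
    """Apply identical cleanup at index time and query time."""
    text = unicodedata.normalize("NFC", text)   # Unicode NFC
    text = text.translate(ZERO_WIDTH)           # drop zero-width characters
    text = text.translate(QUOTES_DASHES)        # standardize quotes and dashes
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text
```

Importing this one function in both the indexer and the query client is what makes the symmetry enforceable.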
- Keep segmentation symmetric
  - If you split code identifiers (camelCase, snake_case), do it in both paths.
  - If you strip stopwords or punctuation, do it in both paths. Prefer not to strip unless you must.
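If you do split identifiers, a shared helper like this sketch keeps the two paths symmetric (the regex boundary rule is one common choice, not the only one):

```python
import re

def split_identifiers(text: str) -> str:
    """Split snake_case and camelCase identifiers; run identically on both paths."""
    text = text.replace("_", " ")                        # snake_case -> words
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)  # camelCase boundary
    return text

print(split_identifiers("getUserId max_retry_count"))  # get User Id max retry count
```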
- Log and test invariance
  - Run a three-paraphrase probe per query: original, lowercased, and NFC-normalized.
  - Accept only if the anchor section stays within the top 3 and the ΔS shift is ≤ 0.10.
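The three-paraphrase probe can be sketched as below; `search` is a stand-in for your retriever, assumed to return `(doc_id, delta_s)` pairs ranked best-first:

```python
import unicodedata

def invariance_probe(search, query, anchor_id, top_k=3, max_shift=0.10):
    """Three-paraphrase probe: original, lowercased, and NFC-normalized.

    Returns True only if the anchor stays in the top-k for every paraphrase
    and its ΔS never shifts more than max_shift from the baseline.
    """
    baseline = dict(search(query)).get(anchor_id)
    if baseline is None:                       # anchor never retrieved at all
        return False
    for q in (query, query.lower(), unicodedata.normalize("NFC", query)):
        results = search(q)
        if anchor_id not in [doc_id for doc_id, _ in results[:top_k]]:
            return False                       # anchor fell out of the top-k
        if abs(dict(results)[anchor_id] - baseline) > max_shift:
            return False                       # ΔS shifted too far
    return True
```

Run this per query in CI after any tokenizer or preprocessing change, and log failures with the contract fields described below.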
- Add a rerank safety net
  - If invariance is hard to reach, add a lexical or cross-encoder rerank after vector recall. See Rerankers.
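As a toy illustration of the lexical safety net, token overlap can reorder vector-recall candidates; in production a cross-encoder score would replace the overlap count:

```python
def lexical_rerank(query, candidates):
    """Reorder vector-recall candidates by token overlap with the query.

    `candidates` is a list of (doc_id, text) pairs from vector recall.
    """
    q_tokens = set(query.lower().split())
    def overlap(item):
        return len(q_tokens & set(item[1].lower().split()))
    return sorted(candidates, key=overlap, reverse=True)

docs = [("a", "totally unrelated text"), ("b", "tokenizer casing mismatch fix")]
print(lexical_rerank("fix tokenizer casing", docs)[0][0])  # b
```

Because the lexical pass sees the raw surface form, it often rescues queries that the embedding path loses to a casing or Unicode mismatch.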
Add these to every snippet and to the query audit log. Use them during triage.
```json
{
  "preproc_version": "v3",
  "unicode_norm": "NFC",
  "case_policy": "lower",
  "punctuation_policy": "keep",
  "identifier_split": "camel+snake",
  "tokenizer": "spm-1.2.3",
  "tokenizer_hash": "sha256:...",
  "ngram": "none"
}
```
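During triage, a minimal sketch like this (field names taken from the contract above) surfaces exactly which preprocessing fields disagree between the index and the query client:

```python
PREPROC_KEYS = ("preproc_version", "unicode_norm", "case_policy",
                "punctuation_policy", "identifier_split", "tokenizer",
                "tokenizer_hash", "ngram")

def preproc_mismatches(index_contract: dict, query_contract: dict) -> list:
    """Return the contract fields where index time and query time disagree."""
    return [k for k in PREPROC_KEYS
            if index_contract.get(k) != query_contract.get(k)]

idx = {"tokenizer": "spm-1.2.3", "case_policy": "lower"}
qry = {"tokenizer": "spm-1.2.4", "case_policy": "lower"}
print(preproc_mismatches(idx, qry))  # ['tokenizer']
```

Any non-empty result means the two paths are not symmetric: rebuild the index or shim the client before debugging anything else.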
Run these invariance tests during triage:

- Case flip test: run query A vs lowercase(A). Accept if the ΔS shift is ≤ 0.10 and the anchor rank is ≤ 3.
- Unicode fold test: replace curly quotes with straight quotes and normalize to NFC. Accept if the ranks stay stable.
- Tokenizer skew test: run the same query through the client and index tokenizers and compare token ids. Any mismatch is a fail that requires a rebuild or a client shim.
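The skew test reduces to comparing token-id sequences. A sketch with toy tokenizers (stand-ins for your real client and index encoders, e.g. two sentencepiece builds):

```python
def tokenizer_skew(client_tokenize, index_tokenize, query: str) -> bool:
    """Return True if the client and index tokenizers disagree on token ids."""
    return client_tokenize(query) != index_tokenize(query)

# Toy tokenizers that differ only in case handling illustrate a skew:
vocab = {"hello": 1, "Hello": 2, "world": 3}
client = lambda q: [vocab.get(t, 0) for t in q.lower().split()]
index = lambda q: [vocab.get(t, 0) for t in q.split()]
print(tokenizer_skew(client, index, "Hello world"))  # True
```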
Copy-paste prompt:

```txt
I uploaded TXT OS and the WFGY Problem Map.

My embedding issue:
- symptom: wrong top-k only when casing or unicode changes
- traces: ΔS(question,retrieved)=..., anchor=..., tokenizer=..., case_policy=...

Tell me:
1) which layer is failing and why,
2) which WFGY page to open,
3) the minimal steps to align tokenization and casing,
4) how to verify with a three-paraphrase probe.
Use Data Contracts and Rerankers if needed.
```
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.