🧭 Quick Return to Map
You are in a sub-page of Embeddings.
To reorient, go back here:
- Embeddings — vector representations and semantic search
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
This page defines a hard interface that keeps your chunker and your embedding encoder in semantic lockstep. Use it when the chunks look fine but retrieval quality wobbles, or when “high similarity yet wrong meaning” shows up after an index rebuild.
- Visual map and recovery: RAG Architecture & Recovery
- End to end knobs: Retrieval Playbook
- Why this snippet: Retrieval Traceability
- Snippet schema details: Data Contracts
- Chunking checklist: Semantic Chunking Checklist
- OCR quality gate: OCR Parsing Checklist
- Hallucination repair: Hallucination
- Embedding vs meaning: Embedding ≠ Semantic
- Vector store health: Vectorstore Fragmentation
- Query splits and ordering: Query Parsing Split · Rerankers
Typical symptoms that call for this contract:
- Chunks pass manual inspection while top-k is semantically off.
- An index rebuild changes results even with identical data.
- Non-English corpora degrade after “helpful” normalization.
- OCR sources drift due to hyphenation, headers, or artifacts.
Acceptance targets after the fix:
- ΔS(question, retrieved) ≤ 0.45
- Coverage of target section ≥ 0.70
- λ remains convergent across three paraphrases and two seeds
- E_resonance stays flat on long windows
The producer (chunker) must write these fields. The consumer (embedder) must read and honor them. Store the object as JSON alongside the vector.
```json
{
  "chunk_id": "str, stable and unique",
  "parent_id": "str, stable id of page/section/file",
  "source_id": "str, canonical source key",
  "section_id": "str, logical section anchor if available",
  "text": "str, exactly what will be embedded",
  "offsets": { "start": 1234, "end": 1678 },
  "page_no": 12,
  "lang": "ISO 639-1 or -3 code, e.g. 'en', 'zh', 'de'",
  "chunk_method": "fixed|sentence|semantic|hybrid",
  "window": { "max_tokens": 512, "stride": 384, "overlap": 128 },
  "tokenizer": {
    "name": "cl100k_base|llama3|... exact label",
    "version": "semver or commit",
    "case": "preserve|lower",
    "unicode_norm": "none|NFC|NFKC",
    "strip_punct": false,
    "keep_newlines": true
  },
  "embedder": {
    "model": "exact model id",
    "revision": "weights or date tag",
    "pooling": "cls|mean|last|custom",
    "normalize_l2": true
  },
  "metadata": {
    "source_url": "optional canonical link",
    "title": "optional",
    "breadcrumbs": ["chapter", "section"]
  },
  "hashes": {
    "text_sha256": "sha256 of text pre-embedding",
    "contract_sha256": "sha256 of the whole object minus hashes"
  }
}
```

**Contract rule.** Whatever is in `text` is exactly what gets embedded. If any pre-processing differs between producer and consumer, you must rewrite `text` and refresh `text_sha256`.
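The two hash fields can be computed with a minimal sketch like the one below. One assumption to flag: the contract does not mandate a byte serialization for `contract_sha256`, so this sketch picks canonical JSON (sorted keys, no whitespace) — choose one form and keep it fixed across producer and consumer.

```python
import hashlib
import json

def text_sha256(text: str) -> str:
    # Hash exactly the bytes that will be embedded: UTF-8, no extra normalization.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def contract_sha256(contract: dict) -> str:
    # Hash the whole object minus the "hashes" field.
    # Assumption: canonical JSON (sorted keys, compact separators) as the byte form.
    body = {k: v for k, v in contract.items() if k != "hashes"}
    canon = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

chunk = {"chunk_id": "c1", "text": "Example body.", "hashes": {}}
chunk["hashes"] = {
    "text_sha256": text_sha256(chunk["text"]),
    "contract_sha256": contract_sha256(chunk),
}
```

Because `contract_sha256` excludes `hashes`, it can be computed before or after `text_sha256` without changing the result.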
Producer checklist (chunker):
- Decide the unit first: page, section, or sentence window. Do not mix units within the same index.
- Emit `text` after final normalization. Never rely on the embedder to repeat normalization.
- Preserve citations and code blocks if users will query by them. Remove navigation boilerplate.
- For OCR, fix soft hyphens, line wraps, and column order before writing `text`.
- Keep overlap explicit in `window`. Future rebuilds must not change it silently.
- Record tokenizer identity and casing policy.
- Compute `text_sha256` and a contract hash.
- Assign stable `chunk_id` and `parent_id`.
- Add `lang`. Run a language detector only once during ingestion, then persist the result.
- Store page and section anchors for traceability and UI jumps.
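The stable-id items above can be satisfied with any deterministic function of source identity and offsets. A hypothetical scheme, as one sketch:

```python
import hashlib

def stable_chunk_id(source_id: str, start: int, end: int) -> str:
    # Deterministic: the same source and offsets always yield the same id,
    # so a rebuild over identical data cannot silently renumber chunks.
    key = f"{source_id}:{start}-{end}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

A 16-hex-character prefix is an arbitrary truncation choice here; keep the full digest if your corpus is large enough for prefix collisions to matter.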
Consumer checklist (embedder):
- Embed exactly `text`. No extra cleanup.
- Use the `embedder.model` and `tokenizer` from the contract. If you change either, rebuild vectors.
- Respect `normalize_l2`. Keep pooling the same across the whole index.
- Refuse to embed when the contract hash or tokenizer name changed.
- Refuse to embed beyond `window.max_tokens`. Truncate by tokenizer, not by characters.
- Keep the vector dimensionality constant within a store. A new dimension means a new collection.
- Persist a copy of the full contract next to the vector row for audits.
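The two refusal rules can be sketched as a pre-embed gate. This is a sketch under one assumption: `token_count` is supplied by the caller after re-encoding `text` with the contract tokenizer (loading real tokenizers is out of scope here).

```python
import hashlib

def check_before_embed(chunk: dict, expected_tokenizer: str, token_count: int) -> None:
    # Refuse on hash mismatch: text changed after the contract was written.
    got = hashlib.sha256(chunk["text"].encode("utf-8")).hexdigest()
    if got != chunk["hashes"]["text_sha256"]:
        raise ValueError("text_sha256 mismatch: re-chunk before embedding")
    # Refuse on tokenizer mismatch: token counts and boundaries no longer line up.
    if chunk["tokenizer"]["name"] != expected_tokenizer:
        raise ValueError("tokenizer changed: rebuild the index instead")
    # Refuse past the window: truncation must be a deliberate producer step.
    if token_count > chunk["window"]["max_tokens"]:
        raise ValueError("chunk exceeds window.max_tokens")
```

Raising instead of silently truncating keeps the failure visible in ingestion logs, which is the point of a hard interface.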
Verification before serving:
- Re-tokenize `text` and verify `token_count ≤ window.max_tokens`.
- Recompute `text_sha256` and compare. If it mismatches, halt.
- Run ΔS(original_page, reconstructed_snippet) on a small gold set. Expect ≤ 0.45.
- Sample fifteen multilingual chunks. Verify casing and unicode flags match the contract.
- Check near-duplicate collapse by `text_sha256` and by cosine on the vectors.
- Probe λ across three paraphrases and two seeds. No flip states after reranking.
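The near-duplicate check can be sketched in pure Python (in production you would run it through your vector store's similarity search). The 0.98 cosine threshold is an assumption; tune it on your own corpus.

```python
import hashlib
import math

def cosine(a, b):
    # Plain cosine similarity over two equal-length float lists.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def near_duplicates(chunks, vectors, cos_threshold=0.98):
    # Exact dupes collapse by text hash; near-dupes by cosine similarity.
    seen, dupes = {}, []
    for i, c in enumerate(chunks):
        h = hashlib.sha256(c["text"].encode("utf-8")).hexdigest()
        if h in seen:
            dupes.append((seen[h], i, "exact"))
            continue
        for j in seen.values():
            if cosine(vectors[i], vectors[j]) >= cos_threshold:
                dupes.append((j, i, "near"))
                break
        seen[h] = i
    return dupes
```

Hashing first means identical `text` never costs a vector comparison, which matters when the sample set grows.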
Common failures and where to fix them:
- Wrong-meaning hits with high similarity → see Embedding ≠ Semantic and confirm the contract tokenizer aligns with the model.
- Rebuild changes results although data did not change → verify `tokenizer.version`, `embedder.revision`, and `window` are identical; if not, re-embed and re-index. See Retrieval Playbook.
- Non-English drift after “helpful” lowercasing or punctuation stripping → switch to `tokenizer.case=preserve` and `unicode_norm=NFC`, then re-embed the affected language slice. See Semantic Chunking Checklist.
- OCR sources hallucinate across columns or break words → repair with the OCR gate first, then rebuild. See OCR Parsing Checklist.
- High recall yet unstable top-k order → pin query parsing, then add a reranker. See Query Parsing Split and Rerankers.
- Index feels “holey” near boundaries → increase overlap or switch to a sentence or semantic window, then verify coverage. See RAG Architecture & Recovery.
Migration when the contract changes:
- Freeze writes.
- Export the current contract set.
- Compute the diff of `tokenizer`, `embedder`, and `window`.
- Re-embed in a new collection.
- Dual-read and A/B for one week of traffic.
- Cut over when ΔS and coverage targets pass on the live eval set.
- Garbage-collect the old collection.
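The diff step reduces to a check over the three rebuild-sensitive fields of the contract above. A minimal sketch (the helper name is hypothetical):

```python
def rebuild_required(old: dict, new: dict) -> list:
    # Any change to tokenizer, embedder, or window invalidates existing vectors,
    # so the diff only needs to inspect those three top-level fields.
    changed = []
    for field in ("tokenizer", "embedder", "window"):
        if old.get(field) != new.get(field):
            changed.append(field)
    return changed
```

An empty return means the rebuild can be skipped and only new documents need embedding.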
```python
# Pseudocode for CI
for chunk in sample_chunks:
    tok = load_tokenizer(chunk["tokenizer"]["name"], chunk["tokenizer"]["version"])
    ids = tok.encode(chunk["text"])
    assert len(ids) <= chunk["window"]["max_tokens"]
    assert sha256(chunk["text"]) == chunk["hashes"]["text_sha256"]
    vec = embed(chunk["text"], model=chunk["embedder"]["model"], rev=chunk["embedder"]["revision"])
    if chunk["embedder"]["normalize_l2"]:
        vec = l2norm(vec)
    assert len(vec) == expected_dim  # fixed per model
```

Post-rebuild evaluation:
- Retrieve on a ten-question gold set.
- Expect coverage ≥ 0.70 and ΔS ≤ 0.45.
- λ does not flip across two seeds.
- Repeat after seven days to ensure stability drift did not reappear.
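ΔS and λ come from the WFGY engine, but the coverage target alone can be approximated without it. A crude sketch, assuming whitespace tokens as a stand-in (a real eval would reuse the contract tokenizer):

```python
def coverage(target_section: str, retrieved: list) -> float:
    # Share of the target section's tokens that appear anywhere in the
    # retrieved snippets. A loose proxy for the >= 0.70 coverage target.
    want = set(target_section.lower().split())
    got = set()
    for snippet in retrieved:
        got |= set(snippet.lower().split())
    return len(want & got) / len(want) if want else 1.0
```

Set-based overlap ignores token order and frequency, so treat a passing score as necessary rather than sufficient.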
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.