Evaluation disclaimer (retrieval recipes)
These recipes show how to probe and score retrieval quality under specific assumptions.
The resulting numbers are scenario-bound heuristics and should not be presented as general proof of system quality.
A practical kit to score retrieval quality with small but reliable datasets. Use these recipes to detect metric mismatch, ordering variance, hybrid regressions, and chunk misalignment before they leak into answers.
- ΔS(question, retrieved) ≤ 0.45
- Coverage to the intended section ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
- Citation precision ≥ 0.85 and recall ≥ 0.75 on the gold set
References: RAG Architecture & Recovery · Retrieval Playbook · Retrieval Traceability · Data Contracts
Create 40 to 120 items. Each item has:
- question and three paraphrases
- target_section and one decoy_section
- anchor_snippet that represents the minimal evidence
- answers_not_allowed for near misses
- expected_citations as a list of {snippet_id, offsets}
Chunking guidance: Chunking Checklist
Data schema example:
{
  "qid": "Q037",
  "question": "How do I rotate API keys safely?",
  "paraphrases": [
    "Best practice for API key rotation?",
    "Rotate credentials without downtime, how?",
    "Safe credential rotation steps?"
  ],
  "target_section": "security/keys/rotation",
  "decoy_section": "security/keys/storage",
  "anchor_snippet": "Rotate old->new with overlap window and staged revocation...",
  "expected_citations": [
    {"snippet_id": "S-114", "offsets": [320, 480]}
  ],
  "answers_not_allowed": [
    "store keys in env only", "rotate monthly without overlap"
  ]
}
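Before scoring anything, it can help to check that every gold item actually carries the fields listed above. The following is a minimal sketch, not part of the original kit: `REQUIRED_FIELDS` and `validate_item` are hypothetical names, and the "exactly three paraphrases" and "target differs from decoy" checks are assumptions read off the item description.

```python
# Hypothetical validator for gold-set items; field names follow the schema example.
REQUIRED_FIELDS = {
    "qid": str,
    "question": str,
    "paraphrases": list,
    "target_section": str,
    "decoy_section": str,
    "anchor_snippet": str,
    "expected_citations": list,
    "answers_not_allowed": list,
}


def validate_item(item):
    """Return a list of problems; an empty list means the item looks usable."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in item:
            problems.append(f"missing field: {name}")
        elif not isinstance(item[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    if problems:
        return problems  # structural errors first, content checks later

    if len(item["paraphrases"]) != 3:
        problems.append("expected exactly three paraphrases")
    if item["target_section"] == item["decoy_section"]:
        problems.append("decoy_section must differ from target_section")
    for cite in item["expected_citations"]:
        offsets = cite.get("offsets", [])
        # Assumption: offsets are a [start, end) pair with start < end.
        if "snippet_id" not in cite or len(offsets) != 2 or offsets[0] >= offsets[1]:
            problems.append(f"bad citation entry: {cite}")
    return problems
```

Running this over the whole set before the first evaluation pass keeps malformed items from showing up later as unexplained low coverage.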
- ΔS(question, retrieved) and ΔS(retrieved, anchor): normalized semantic distance in [0, 1]. Thresholds: stable < 0.40, transitional 0.40–0.60, risk ≥ 0.60. See: Retrieval Playbook
- Coverage: tokens from cited spans that overlap the ground anchor, divided by tokens in the anchor.
- Citation precision and recall: precision = correct cited spans over all cited spans. Recall = correct cited spans over all ground spans.
- λ_convergence: observe λ states across paraphrases and seeds. Divergence flags prompt variance or ordering drift. See: Context Drift
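The coverage, precision, and recall definitions above can be sketched over offset intervals. This is an illustration under stated assumptions, not the kit's reference implementation: `score_spans` is a hypothetical helper, spans use the `{snippet_id, offsets}` shape from the schema, offsets are treated as half-open `[start, end)` ranges in whatever unit you index by (tokens or characters), gold spans are assumed not to overlap each other, a cited span counts as "correct" when it overlaps any gold span on the same snippet, and recall counts gold spans hit by at least one citation.

```python
def _overlap(a, b):
    """Length of the intersection of two [start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def _merge(intervals):
    """Merge overlapping intervals so shared units are counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def score_spans(cited, gold):
    """Return (coverage, precision, recall) for cited spans against gold spans."""
    by_snippet = {}
    for c in cited:
        by_snippet.setdefault(c["snippet_id"], []).append(c["offsets"])

    gold_units = covered_units = matched_gold = 0
    for g in gold:
        gold_units += g["offsets"][1] - g["offsets"][0]
        hits = _merge(by_snippet.get(g["snippet_id"], []))
        overlap = sum(_overlap(h, g["offsets"]) for h in hits)
        covered_units += overlap
        matched_gold += overlap > 0  # this gold span was hit at least once

    correct_cited = sum(
        any(
            c["snippet_id"] == g["snippet_id"]
            and _overlap(c["offsets"], g["offsets"]) > 0
            for g in gold
        )
        for c in cited
    )
    coverage = covered_units / gold_units if gold_units else 0.0
    precision = correct_cited / len(cited) if cited else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return coverage, precision, recall
```

Merging cited intervals before measuring overlap keeps coverage at or below 1.0 when two citations point at the same region.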
Goal: verify metric and index health before any hybrid tricks.
Steps
- Fix one embedding family and one metric.
- Run k in {5, 10, 20}.
- Log ΔS, coverage, precision, recall, λ for each run.
- If ΔS stays high and flat while coverage is low, suspect metric or index mismatch.
Open next: Embedding ≠ Semantic
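The last step of this recipe (ΔS high and flat across k while coverage stays low) can be checked mechanically over the logged runs. A minimal sketch, assuming rows shaped like the log records in this page: `flag_metric_mismatch` is a hypothetical helper, `high=0.60` reuses the risk threshold, `low_cov=0.70` reuses the coverage target, and `flat_tol=0.05` is an arbitrary assumption you should tune.

```python
from collections import defaultdict


def flag_metric_mismatch(rows, high=0.60, flat_tol=0.05, low_cov=0.70):
    """Return qids whose ΔS is high at every k, barely moves, and coverage stays low."""
    by_qid = defaultdict(list)
    for row in rows:
        by_qid[row["qid"]].append(row)

    flagged = []
    for qid, runs in by_qid.items():
        ds = [r["ΔS_qr"] for r in runs]
        cov = [r["coverage"] for r in runs]
        high_everywhere = min(ds) >= high       # never leaves the risk zone
        flat = max(ds) - min(ds) <= flat_tol    # changing k does not help
        low_coverage = max(cov) < low_cov       # anchor is never reached
        if high_everywhere and flat and low_coverage:
            flagged.append(qid)
    return flagged
```

Items returned here are candidates for the metric or index mismatch check, since adding more results did not bring the anchor any closer.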
Goal: separate recall from ordering stability.
Steps
- Freeze retriever and analyzer.
- Add a deterministic reranker and compare top-k order.
- Measure flip rate of citations and λ under two seeds.
Open next: Rerankers
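One way to measure the flip rate mentioned in the steps above is to compare the top-k snippet order per item across two seeds. This is a sketch under assumptions: `flip_rate` is a hypothetical helper, each argument maps `qid` to a list like the `retrieval_order` field in the log schema, and an item counts as a flip when its top-k lists differ in membership or order.

```python
def flip_rate(orders_seed_a, orders_seed_b, k=3):
    """Fraction of shared qids whose top-k snippet order differs between two seeds."""
    shared = orders_seed_a.keys() & orders_seed_b.keys()
    if not shared:
        return 0.0
    flips = sum(orders_seed_a[q][:k] != orders_seed_b[q][:k] for q in shared)
    return flips / len(shared)
```

Computing this once without the reranker and once with it separates ordering instability from recall problems: if the rate drops to zero with a deterministic reranker, the variance came from retrieval order rather than from missing snippets.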
Goal: prove hybrid helps or remove it.
Steps
- Evaluate sparse only, dense only, and hybrid.
- Compare ΔS and coverage per item.
- If hybrid is worse, split query parsing and rebalance weights.
Open next: pattern_query_parsing_split.md
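The per-item comparison in this recipe can be sketched as a small pass over the logged rows. Assumptions: `hybrid_regressions` is a hypothetical helper, rows carry the `system` field from the log schema, the labels `sparse_only` and `hybrid` are illustrative (only `dense_only` appears in this page), and "worse" is taken to mean both higher ΔS and lower coverage than the best single retriever.

```python
def hybrid_regressions(rows):
    """Return sorted qids where hybrid is worse than the best single retriever."""
    by_qid = {}
    for r in rows:
        by_qid.setdefault(r["qid"], {})[r["system"]] = r

    regressions = []
    for qid, systems in by_qid.items():
        singles = [systems[s] for s in ("sparse_only", "dense_only") if s in systems]
        if "hybrid" not in systems or not singles:
            continue  # nothing to compare against for this item
        best_ds = min(s["ΔS_qr"] for s in singles)
        best_cov = max(s["coverage"] for s in singles)
        hyb = systems["hybrid"]
        if hyb["ΔS_qr"] > best_ds and hyb["coverage"] < best_cov:
            regressions.append(qid)
    return sorted(regressions)
```

If the returned list covers a large share of the gold set, that supports removing hybrid or rebalancing its weights rather than tuning individual queries.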
Goal: ensure anchors match boundaries.
Steps
- For each gold item, compute ΔS to the anchor and to the decoy.
- If both are close, re-chunk with anchor alignment and rebuild.
Open next: Chunking Checklist · chunk_alignment.md
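The anchor versus decoy check above can be illustrated with embedding vectors. This sketch makes two loud assumptions: ΔS is computed as 1 minus cosine similarity, clipped to [0, 1] (this page only says "normalized semantic distance"), and "close" means the decoy is less than `margin=0.10` farther away than the anchor. `delta_s` and `needs_rechunk` are hypothetical names, and the vectors would come from your own embedding model.

```python
import math


def delta_s(u, v):
    """Assumed form of ΔS: 1 - cosine similarity, clipped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # treat a zero vector as maximally distant
    return min(1.0, max(0.0, 1.0 - dot / norm))


def needs_rechunk(query_vec, anchor_vec, decoy_vec, margin=0.10):
    """True when the decoy is almost as close to the query as the anchor is."""
    d_anchor = delta_s(query_vec, anchor_vec)
    d_decoy = delta_s(query_vec, decoy_vec)
    return (d_decoy - d_anchor) < margin
```

A `True` result also covers the case where the decoy is closer than the anchor, which is the stronger signal that chunk boundaries cut through the evidence.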
Goal: detect namespace skew and partial ingestion.
Steps
- Run the same question across two namespaces or stores that should be equivalent.
- Compare recall of the anchor snippet.
- If recall is high only in one place, fix ingestion and dedupe.
Open next: pattern_vectorstore_fragmentation.md
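The namespace comparison above reduces to computing anchor recall per store and checking the gap. A minimal sketch, assuming each store's results are available as a mapping from `qid` to retrieved snippet ids and the gold set maps `qid` to the anchor's snippet id; `anchor_recall`, `namespace_skew`, and the `tol=0.10` gap are hypothetical choices.

```python
def anchor_recall(retrieved_by_qid, gold_by_qid):
    """Share of gold items whose anchor snippet id appears in the retrieved list."""
    if not gold_by_qid:
        return 0.0
    hits = sum(
        gold_id in retrieved_by_qid.get(qid, [])
        for qid, gold_id in gold_by_qid.items()
    )
    return hits / len(gold_by_qid)


def namespace_skew(store_a, store_b, gold_by_qid, tol=0.10):
    """Compare anchor recall across two stores that should be equivalent."""
    recall_a = anchor_recall(store_a, gold_by_qid)
    recall_b = anchor_recall(store_b, gold_by_qid)
    return {
        "recall_a": recall_a,
        "recall_b": recall_b,
        "skewed": abs(recall_a - recall_b) > tol,
    }
```

Listing the qids that hit in one store but miss in the other is a natural next step, since those items usually point at the documents that were partially ingested.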
# Pseudocode only: extract_citations, deltaS, join_text, score_citations,
# and observe_lambda are placeholders to be supplied by your own stack.
def eval_item(store, reranker, item, k, seed):
    q = item["question"]
    ctx = store.retrieve(q, k=k, seed=seed)
    # A reranker of None means "keep the retrieval order".
    ordered = reranker.rank(q, ctx) if reranker else ctx
    cites = extract_citations(ordered)
    d_qr = deltaS(q, join_text(ordered))
    d_ra = deltaS(join_text(ordered), item["anchor_snippet"])
    cov, prec, rec = score_citations(cites, item["expected_citations"], item["anchor_snippet"])
    lam = observe_lambda(q, ordered, seed=seed)
    return {
        "qid": item["qid"], "k": k, "seed": seed,
        "ΔS_qr": d_qr, "ΔS_ra": d_ra, "coverage": cov,
        "precision": prec, "recall": rec, "λ_state": lam
    }

# Full grid: every item against every store, reranker, k, and seed.
def run_suite(items, stores, rerankers, ks, seeds):
    results = []
    for it in items:
        for s in stores:
            for r in rerankers:
                for k in ks:
                    for seed in seeds:
                        results.append(eval_item(s, r, it, k, seed))
    return results

Log schema
{
  "qid": "Q037",
  "system": "dense_only",
  "reranker": "none",
  "k": 10,
  "seed": 23,
  "ΔS_qr": 0.38,
  "ΔS_ra": 0.22,
  "coverage": 0.78,
  "precision": 0.92,
  "recall": 0.81,
  "λ_state": "convergent",
  "retrieval_order": ["S-114","S-012","S-077"],
  "analyzer": "lowercase",
  "metric": "cosine",
  "prompt_hash": "P-9c1f",
  "index_hash": "I-fc21"
}
Traceability contracts for fields: Retrieval Traceability · Data Contracts
- ΔS ≤ 0.45 and coverage ≥ 0.70 on three paraphrases per item
- Citation precision ≥ 0.85 and recall ≥ 0.75
- λ convergent on two seeds
- No unresolved items with high ΔS and low coverage
Evaluation math and templates: eval_rag_precision_recall.md
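The acceptance thresholds listed above can be turned into a simple gate over log records. This is a sketch, assuming rows use the field names from the log schema and that the ΔS target applies to `ΔS_qr`, that is ΔS(question, retrieved); `TARGETS`, `acceptance_failures`, and `item_passes` are hypothetical names.

```python
# Thresholds copied from the acceptance list; "max" means value must not exceed it.
TARGETS = {
    "ΔS_qr": ("max", 0.45),
    "coverage": ("min", 0.70),
    "precision": ("min", 0.85),
    "recall": ("min", 0.75),
}


def acceptance_failures(row):
    """Return human-readable reasons why one log record misses the targets."""
    failures = []
    for field, (kind, limit) in TARGETS.items():
        value = row[field]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failures.append(f"{field}={value} violates {kind} {limit}")
    if row["λ_state"] != "convergent":
        failures.append(f"λ_state={row['λ_state']}")
    return failures


def item_passes(rows):
    """An item passes only if every paraphrase and seed run meets all targets."""
    return bool(rows) and all(not acceptance_failures(r) for r in rows)
```

Passing all runs for one `qid` (three paraphrases, two seeds) to `item_passes` gives a per-item verdict, and the failure strings identify which items remain unresolved.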
- High similarity yet wrong meaning → embedding-vs-semantic.md
- Snippet selected does not match citation → retrieval-traceability.md and data-contracts.md
- Hybrid worse than single retriever → pattern_query_parsing_split.md and rerankers.md
- Coverage good offline but collapses online → pattern_vectorstore_fragmentation.md
- Eval flakiness after deploy → bootstrap-ordering.md and predeploy-collapse.md