🧭 Quick Return to Map
You are in a sub-page of Eval.
To reorient, go back here:
- Eval — model evaluation and benchmarking
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
**Evaluation disclaimer (eval harness)**
This page sketches a harness for running structured evaluations on AI pipelines.
Any metrics or labels that pass through such a harness remain heuristic outputs of models, scripts and annotators.
They do not become scientific proof just because they flow through this structure.
Use the harness to compare variants inside a controlled scenario, and avoid presenting those numbers as universal claims about model quality beyond that scenario.
A minimal yet strict harness for running repeatable evaluations of RAG and agent pipelines. It fixes the two usual failures: non-reproducible runs, and noisy metrics that cannot explain drift. Everything here maps to WFGY pages with measurable targets.
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Why this snippet schema: Retrieval Traceability
- Payload schema and fences: Data Contracts
- Chunk quality before metrics: Chunking Checklist
- Similarity vs meaning: Embedding ≠ Semantic
- ΔS(question, retrieved) ≤ 0.45 on the gold set
- Coverage of the target section ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
- Re-runs with identical seed produce metrics drift ≤ 0.5 percentage point
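The four acceptance targets above can be checked mechanically. Below is a minimal sketch of such a gate; the field names (`delta_s`, `coverage`, `lambda_convergent`, `drift_pp`) are assumptions for illustration, not part of any published WFGY API.

```python
# Hypothetical acceptance gate over one run's summary metrics.
# Field names are assumptions chosen to mirror the target list above.

def passes_acceptance(run: dict) -> bool:
    """Return True only when every acceptance target holds."""
    return (
        run["delta_s"] <= 0.45          # ΔS(question, retrieved) on the gold set
        and run["coverage"] >= 0.70     # coverage of the target section
        and run["lambda_convergent"]    # λ stable across 3 paraphrases, 2 seeds
        and run["drift_pp"] <= 0.5      # identical-seed re-run drift, percentage points
    )

print(passes_acceptance(
    {"delta_s": 0.41, "coverage": 0.76, "lambda_convergent": True, "drift_pp": 0.2}
))  # → True
```

Wire this into CI so a failing gate blocks the run before any numbers are reported.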
```
eval/
  datasets/
    gold/
      qa.jsonl            # minimal gold set
      citations.jsonl     # expected snippet anchors
    probes/
      paraphrases.jsonl   # 3 paraphrases per item
  runs/
    2025-08-29_seed42/
      config.yaml
      metrics.csv
      traces.jsonl
  config/
    harness.yaml          # store, retriever, reranker, seeds, k
```
`datasets/gold/qa.jsonl`, one JSON object per line:

```json
{
  "id": "Q_0001",
  "question": "How is vector contamination detected in FAISS indexes",
  "answer_ref": "PM:vectorstore-metrics-and-faiss-pitfalls#detect-contamination",
  "expected_doc": "ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "section_id": "detect-contamination"
}
```

`datasets/gold/citations.jsonl`:
```json
{
  "id": "Q_0001",
  "snippet_id": "S_18823",
  "section_id": "detect-contamination",
  "source_url": "https://github.com/onestardao/WFGY/blob/main/ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "offsets": [1380, 1540],
  "tokens": [310, 352]
}
```

Contract rules come from Retrieval Traceability and Data Contracts.
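Before any run, it pays to validate both gold files against their expected keys. A minimal sketch of such a check follows; the required-key sets are taken from the example records above, and this is an illustrative validator, not the canonical contract.

```python
import json

# Required keys mirror the example records in qa.jsonl and citations.jsonl.
QA_KEYS = {"id", "question", "answer_ref", "expected_doc", "section_id"}
CITATION_KEYS = {"id", "snippet_id", "section_id", "source_url", "offsets", "tokens"}

def validate_jsonl(lines, required):
    """Yield (line_no, missing_keys) for every record that breaks the contract."""
    for i, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            yield i, sorted(missing)

# A record missing section_id trips the validator:
qa = ['{"id": "Q_0001", "question": "q", "answer_ref": "a", "expected_doc": "d"}']
print(list(validate_jsonl(qa, QA_KEYS)))  # → [(1, ['section_id'])]
```

Run it over both files in the warm-up fence so contract violations fail before retrieval starts.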
- `seed`: integer. Set for the retriever, reranker, and LLM sampler if available.
- `k`: top-k per retriever. Test 5, 10, 20.
- `λ_observe`: record the λ state for retrieve, assemble, and reason. See lambda_observe.md.
- ΔS probe: compute ΔS(question, retrieved) and ΔS(retrieved, expected anchor). See deltaS_thresholds.md.
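Put together, `config/harness.yaml` might look like the sketch below. The key names are assumptions that mirror the knobs above; adapt them to your own runner.

```yaml
# Hypothetical config/harness.yaml — illustrative, not a published schema.
store: faiss
retriever: hybrid
reranker: cross-encoder
seed: 42
k: 10
probes:
  paraphrases: 3
  lambda_observe: [retrieve, assemble, reason]
  delta_s_flag: 0.60
```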
1. Warm-up fence. Verify the index hash, vector readiness, and secrets. If anything is not ready, stop. Open: Bootstrap Ordering.
2. Retrieval step. Run with a fixed metric and analyzer. Save raw hits with the snippet fields from the contract page.
3. ΔS and λ probes. Log both per item. If ΔS ≥ 0.60, flag the item as a structural risk.
4. Reasoning step. The LLM reads TXT OS and uses the cite-then-explain schema. Refuse answers without citations.
5. Metrics. Compute precision, recall, citation hit, and coverage. See eval_rag_precision_recall.md and the Retrieval Playbook.
6. Trace sink. Write `traces.jsonl` with `id`, `seed`, `k`, `ΔS`, `λ_state`, `snippet_id`, `section_id`, `INDEX_HASH`.
7. Gate. If coverage < 0.70 or ΔS > 0.45, fail the run. See regression_gate.md.
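The seven steps above can be sketched as one loop. In this sketch the `retrieve()` and `reason()` callables are injected because their real signatures live in your pipeline; every name here is an assumption for illustration, not WFGY's actual API.

```python
# Illustrative run loop for the seven pipeline steps.
# retrieve(question, seed, k) -> (hits, delta_s); reason(item, hits) -> {"coverage": float}

def run_harness(gold, retrieve, reason, seed=42, k=10):
    """Run every gold item, collect metrics and traces, return the gate verdict."""
    metrics, traces = [], []
    for item in gold:
        hits, delta_s = retrieve(item["question"], seed=seed, k=k)  # steps 2-3
        risk = delta_s >= 0.60                                      # ΔS structural-risk flag
        answer = reason(item, hits)                                 # step 4, cite then explain
        metrics.append({"id": item["id"], "delta_s": delta_s,
                        "coverage": answer["coverage"]})            # step 5
        traces.append({"id": item["id"], "seed": seed, "k": k,      # step 6, trace sink
                       "delta_s": delta_s, "structural_risk": risk})
    gate = all(m["delta_s"] <= 0.45 and m["coverage"] >= 0.70
               for m in metrics)                                    # step 7, the gate
    return metrics, traces, gate
```

Serialize `metrics` to `metrics.csv` and `traces` to `traces.jsonl` under `runs/<date>_seed<seed>/`, and fail the run whenever `gate` is false.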
- Place a ten-item gold set into `datasets/gold/qa.jsonl` and `citations.jsonl`.
- Copy `config/harness.yaml` from a previous good run. Set `seed: 42`, `k: 10`.
- Run your script to produce `runs/<date>_seed42/metrics.csv` and `traces.jsonl`.
- Verify the acceptance targets above. If any gate fails, jump to the right fix below.
- Wrong meaning despite high similarity. Open: Embedding ≠ Semantic.
- Citations do not match the referenced section. Open: Retrieval Traceability and Data Contracts.
- Hybrid retrieval worse than a single retriever. Open: pattern_query_parsing_split.md and rerankers.md.
- Runs flip across deployments, or the first run crashes. Open: deployment-deadlock.md and predeploy-collapse.md.
- Long chains collapse. Open: context-drift.md and entropy-collapse.md.
- Block merge if any of these is true:
  - ΔS median > 0.45 on gold
  - Coverage < 0.70
  - λ flips on 2 of 3 paraphrases
  - Metrics drift from the last green run > 0.5 percentage point
- Store artifacts: `metrics.csv`, `traces.jsonl`, `harness.yaml`, `INDEX_HASH`, `MODEL_HASH`.
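A regression gate like the one above is easy to encode. The sketch below compares a candidate run against the last green run and reports every tripped rule; the metric field names are assumptions chosen to mirror the bullet list, not a published schema.

```python
# Hypothetical merge gate mirroring the four block rules above.

def merge_blocked(run, last_green):
    """Return the list of tripped gate rules; an empty list means merge is allowed."""
    tripped = []
    if run["delta_s_median"] > 0.45:
        tripped.append("ΔS median > 0.45 on gold")
    if run["coverage"] < 0.70:
        tripped.append("coverage < 0.70")
    if run["lambda_flips"] >= 2:
        tripped.append("λ flips on 2 of 3 paraphrases")
    drift = abs(run["score_pp"] - last_green["score_pp"])
    if drift > 0.5:
        tripped.append(f"metrics drift {drift:.1f} pp from last green run")
    return tripped

print(merge_blocked(
    {"delta_s_median": 0.48, "coverage": 0.80, "lambda_flips": 0, "score_pp": 81.0},
    {"score_pp": 81.2},
))  # → ['ΔS median > 0.45 on gold']
```

Returning the tripped rules, rather than a bare boolean, gives the failing CI job an auditable reason string for free.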
```txt
You have TXTOS and the WFGY Problem Map loaded.

Question: "{question}"
Retrieved snippets: [{snippet_id, section_id, source_url, offsets, tokens}]

Do:
1) Cite then explain. If a citation is missing or mismatched, fail fast and return the minimal structural fix.
2) If ΔS(question, retrieved) ≥ 0.60, propose the smallest repair. Use retrieval-playbook, retrieval-traceability, data-contracts, rerankers.
3) Return JSON:
   {"citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..."}
Keep it short and auditable.
```
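The JSON reply schema in step 3 can be checked before anything downstream consumes it. A minimal sketch, assuming the allowed λ states follow the prompt's `→|←|<>|×` notation:

```python
import json

# Illustrative validator for the prompt's JSON reply schema.
ALLOWED_LAMBDA = {"→", "←", "<>", "×"}

def check_reply(raw: str) -> dict:
    """Parse the model's reply; fail fast on missing citations or unknown λ state."""
    reply = json.loads(raw)
    if not reply.get("citations"):
        raise ValueError("refuse: answer has no citations")
    if reply.get("λ_state") not in ALLOWED_LAMBDA:
        raise ValueError(f"unknown λ_state: {reply.get('λ_state')!r}")
    return reply

ok = check_reply('{"citations": ["S_18823"], "answer": "...", '
                 '"λ_state": "→", "ΔS": 0.38, "next_fix": ""}')
print(ok["ΔS"])  # → 0.38
```

The fail-fast behavior matches rule 1: a reply without citations never reaches the metrics step.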
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.