SeekLink includes a small blind-test framework for retrieval changes. It is not a public leaderboard. It is a release guard: when search behavior changes, run the same labeled queries before and after the change and compare quality and latency.
The bundled fixture lives in:
```text
tests/corpus/                      # small bilingual Markdown vault
tests/blind/queries.yaml           # labeled queries
tests/blind/queries.filtered.yaml
tests/blind/run.py                 # runner
tests/blind/results/               # release-quality reference outputs only
```
Use `.scratch/` for local sweeps, downloaded datasets, temporary runs, and
private-vault measurements. Do not commit intermediate experiment output.
| Config | Meaning |
|---|---|
| `A` | Current product behavior: hybrid search plus the default reranker path when available |
| `B` | Candidate query-expansion path, reserved for future experiments |
| `C` | Hand-written expansion reference using the `expansion:` field in `queries.yaml` |
Config A is the release baseline. Config C is a reference check for whether
hand-written expansion looks promising; it can underperform A when expansion
drifts or adds latency. Config B should not ship unless it beats A on
quality without breaking the latency budget.
Each query has hard expected paths and optional graded relevance:
```yaml
- query: "记忆保持力"
  intent: "find notes about long-term memory retention techniques"
  expected_paths:
    - "notes/fsrs-algorithm.md"
    - "notes/spaced-repetition.md"
  relevance:
    "notes/fsrs-algorithm.md": 3
    "notes/spaced-repetition.md": 3
    "logs/2026-W15.md": 2
  tags: [cjk, common]
  filters:
    folder: "notes"
    tags: [memory]
  answer_contains:
    "notes/spaced-repetition.md": "spacing effect"
  expansion:
    - "间隔重复 遗忘曲线 FSRS"
    - "how to retain memory long term"
```

Rules:
- Use real user queries when possible.
- `expected_paths` are hard must-hit labels for Recall/MRR.
- `relevance` grades are optional and used for nDCG.
- Grades: `3` direct answer, `2` strong supporting context, `1` related, `0` irrelevant.
- Tags should identify slices such as `cjk`, `english`, `mixed`, `short`, `long`, `technical`, `alias`, `filtered`, or `logs`.
- `filters.folder` and `filters.tags` are source filters passed to product search. Use them for filtered retrieval checks, not for query slicing.
- `answer_contains` is optional. It labels short phrases that should appear in the returned chunk for a path, giving a lightweight answerability signal for agent workflows.
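The labeling rules above can be checked mechanically before a run. The following is a minimal sketch, assuming a query entry has already been parsed from YAML into a plain dict; `validate_query` is a hypothetical helper and is not part of the runner.

```python
from typing import Any


def validate_query(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry follows the rules."""
    problems: list[str] = []

    # Every entry needs a query string and at least one hard must-hit path.
    if not str(entry.get("query", "")).strip():
        problems.append("query is empty")
    if not entry.get("expected_paths"):
        problems.append("expected_paths needs at least one path")

    # Graded relevance is optional, but each grade must be an integer from 0 to 3.
    for path, grade in (entry.get("relevance") or {}).items():
        if type(grade) is not int or not 0 <= grade <= 3:
            problems.append(f"relevance grade for {path!r} must be an integer 0-3")

    # answer_contains is optional; when present, each phrase must be non-empty.
    for path, phrase in (entry.get("answer_contains") or {}).items():
        if not str(phrase).strip():
            problems.append(f"answer_contains phrase for {path!r} is empty")

    return problems
```

Running this over every entry in a query file before a sweep catches label typos early, so a malformed label does not silently lower Recall or nDCG.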
The runner records per-query:
- `hits`, `titles`, `snippets`, and `scores`
- `recall_at_10`
- `mrr`
- `precision_at_5`
- `average_precision_at_10`
- `ndcg_at_10`
- `answerable_at_10` and `answerable_mrr` when `answer_contains` labels exist
- `last_expected_rank`
- `latency_ms`
- reranker budget metadata
- first-stage channel diagnostics for config `A`
- a `failure_bucket` label that classifies each query as a rank-1 hit, top-10 ordering gap, candidate-generation miss, rerank-budget miss, reranker-ordering miss, missing expected source, or not diagnosed
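For readers unfamiliar with the ranking metrics in this list, here is a minimal sketch of how Recall@10, MRR, and nDCG@10 are commonly computed from a ranked list of paths. It is not taken from the runner: the function names are hypothetical, the exponential gain `2^grade - 1` is one common nDCG convention, and the sketch assumes the ranked list is already deduplicated by path.

```python
import math


def recall_at_k(ranked: list[str], expected: list[str], k: int = 10) -> float:
    """Fraction of expected paths that appear in the top k results."""
    if not expected:
        return 0.0
    top = set(ranked[:k])
    return sum(1 for path in expected if path in top) / len(expected)


def mrr(ranked: list[str], expected: list[str], k: int = 10) -> float:
    """Reciprocal rank of the first expected path in the top k, else 0."""
    wanted = set(expected)
    for rank, path in enumerate(ranked[:k], start=1):
        if path in wanted:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """DCG of the returned order divided by DCG of the ideal order."""

    def dcg(grades: list[int]) -> float:
        return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

    gains = [relevance.get(path, 0) for path in ranked[:k]]  # unlabeled paths count as 0
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With the labels from the sample query, a result list that returns `logs/2026-W15.md` first and `notes/fsrs-algorithm.md` second still has full recall, but MRR drops to 0.5 and nDCG falls below 1 because a grade-2 note outranks the grade-3 notes.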
The aggregate output includes mean Recall@10, MRR, nDCG@10, latency, p95
latency, and answerability metrics when labels exist. It also includes
`diagnostics.failure_buckets`, a compact count of the per-query labels.
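As an illustration of the aggregation step, the sketch below rolls per-query records into means, a p95 latency, and bucket counts. The per-query field names come from the list above. The output key names, the nearest-rank p95 method, and the `aggregate` helper itself are assumptions, not the runner's actual schema.

```python
import math
from collections import Counter
from statistics import mean


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (an assumed method; the runner may differ)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def aggregate(per_query: list[dict]) -> dict:
    """Summarize per-query records; output key names are hypothetical."""
    latencies = [q["latency_ms"] for q in per_query]
    return {
        "mean_recall_at_10": mean(q["recall_at_10"] for q in per_query),
        "mean_mrr": mean(q["mrr"] for q in per_query),
        "mean_ndcg_at_10": mean(q["ndcg_at_10"] for q in per_query),
        "mean_latency_ms": mean(latencies),
        "p95_latency_ms": p95(latencies),
        "diagnostics": {
            # Compact count of the per-query failure_bucket labels.
            "failure_buckets": dict(Counter(q["failure_bucket"] for q in per_query)),
        },
    }
```

On a fixture this small, p95 is dominated by one or two slow queries, so it is worth reading next to the mean rather than in isolation.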
Use `failure_bucket` first, then inspect `first_stage` when a bucket needs
detail:
- Candidate-generation miss: the expected note never appears in first-stage candidates.
- Filtered-vector miss: the query used folder/tag filters and the expected note did not enter the filtered candidate pool.
- Rerank-budget miss: the expected note appears in first-stage results but not inside the reranker candidate budget.
- Reranker-ordering miss: the expected note reaches the reranker candidate pool but does not land in the top 10.
- First-stage top-10 miss: reranking is disabled and the expected note is below the top-10 output.
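The decision order behind these buckets can be sketched as a small classifier. This is an illustration under assumptions, not the runner's logic: it handles a single expected path, the bucket strings and parameter names are hypothetical, and it omits the "missing expected source" and "not diagnosed" cases.

```python
def classify_failure(
    expected: str,
    final_ranked: list[str],
    first_stage: list[str],
    rerank_k: int,
    rerank_enabled: bool = True,
    used_filters: bool = False,
) -> str:
    """Assign one bucket to a query, checking the cheapest explanations first."""
    # The note made the top-10 output: either a clean hit or an ordering gap.
    if expected in final_ranked[:10]:
        return "rank_1_hit" if final_ranked[0] == expected else "top10_ordering_gap"

    # The note never entered the first-stage candidate pool.
    if expected not in first_stage:
        return "filtered_vector_miss" if used_filters else "candidate_generation_miss"

    # Without reranking, the first-stage order is the output order.
    if not rerank_enabled:
        return "first_stage_top10_miss"

    # The note was retrieved but sat below the reranker's candidate budget.
    if first_stage.index(expected) >= rerank_k:
        return "rerank_budget_miss"

    # The reranker saw the note and still ranked it outside the top 10.
    return "reranker_ordering_miss"
```

The ordering matters: a rerank-budget miss points at the budget setting, while a reranker-ordering miss points at the reranker model or its scoring, so the two should not be merged.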
Install dev dependencies first:
```bash
uv sync --dev
```

Run the current product baseline:
```bash
uv run python tests/blind/run.py \
  --config A \
  --queries tests/blind/queries.yaml \
  --vault tests/corpus \
  --out .scratch/blind/A_current.json
```

Run without reranking for diagnostics:
```bash
uv run python tests/blind/run.py \
  --config A \
  --no-rerank \
  --queries tests/blind/queries.yaml \
  --vault tests/corpus \
  --out .scratch/blind/A_no_rerank.json
```

Run the filtered-search fixture:
```bash
uv run python tests/blind/run.py \
  --config A \
  --no-rerank \
  --queries tests/blind/queries.filtered.yaml \
  --vault tests/corpus \
  --out .scratch/blind/A_filtered.json
```

Run a reranker-budget sweep:
```bash
uv run python tests/blind/run.py \
  --config A \
  --rerank-k 5 \
  --queries tests/blind/queries.yaml \
  --vault tests/corpus \
  --out .scratch/blind/A_rerank5.json
```

Run the hand-written expansion reference:
```bash
uv run python tests/blind/run.py \
  --config C \
  --queries tests/blind/queries.yaml \
  --vault tests/corpus \
  --out .scratch/blind/C_reference.json
```

Only copy a result into `tests/blind/results/` when it is the final
release-quality measurement you want users and contributors to read.
Run the blind test when a change can affect ranking or line spans:
- `seeklink/search.py`
- `seeklink/ingest.py`
- `seeklink/chunker.py`
- `seeklink/tokenizer.py`
- embedding model defaults
- reranker model defaults or reranker scoring behavior
- schema changes that affect indexed text or metadata
For pure CLI formatting, daemon lifecycle, or docs changes, targeted pytest coverage is usually enough.
Query expansion is not part of the default product path. If a future expansion
candidate uses config B, require all of the following before shipping it:
- Mean Recall@10 improves by at least 10 percentage points over config `A`.
- Fewer than 20% of queries regress on Recall@10.
- Main query slices do not regress by more than 5 percentage points.
- Human blind review prefers `B` by at least 0.5 points on a 1-5 scale.
- p95 latency is at most `min(3 * p95(A), 2500ms)`.
If config C does not clearly beat A, expansion probably is not the right
lever; look at chunking, metadata, filters, or the embedder instead.
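The ship gate for config `B` can be expressed as a checklist function. The sketch below is illustrative only: `RunSummary` and `ship_gate` are hypothetical names, the summary fields are not the runner's output schema, and it assumes slice regressions are measured on mean Recall@10, which the gate itself does not specify.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    """Hypothetical summary of one config's run."""

    recall_at_10: list[float]       # per-query Recall@10, same query order in both runs
    slice_recall: dict[str, float]  # mean Recall@10 per main query slice
    p95_latency_ms: float
    human_score: float              # mean blind-review score on the 1-5 scale


def ship_gate(a: RunSummary, b: RunSummary) -> dict[str, bool]:
    """Evaluate each gate condition for candidate b against baseline a."""
    regressed = sum(1 for ra, rb in zip(a.recall_at_10, b.recall_at_10) if rb < ra)
    return {
        # Recall is stored as a fraction, so 10 percentage points is 0.10.
        "recall_gain": mean(b.recall_at_10) - mean(a.recall_at_10) >= 0.10,
        "few_regressions": regressed / len(a.recall_at_10) < 0.20,
        "slices_hold": all(
            b.slice_recall[s] >= a.slice_recall[s] - 0.05 for s in a.slice_recall
        ),
        "human_preference": b.human_score - a.human_score >= 0.5,
        "latency_budget": b.p95_latency_ms <= min(3 * a.p95_latency_ms, 2500.0),
    }
```

Returning one boolean per condition, rather than a single pass/fail, makes it clear which requirement blocked a candidate; `all(ship_gate(a, b).values())` gives the overall verdict.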
Public repo:
- fixture vault
- labeled fixture queries
- runner code
- final baseline / shipping / expansion-reference results
Private or local only:
- downloaded external datasets
- large public-vault mirrors used for stress tests
- private-vault labels and outputs
- intermediate sweeps
- exploratory research notes
Use .scratch/ for the private/local side.