| 1 | +# Query expansion blind test framework (v2, post-review) |
| 2 | + |
| 3 | +## Purpose |
| 4 | + |
| 5 | +Validate that query expansion (planned for v0.4) actually improves real |
| 6 | +search quality against a CJK-first personal vault **before** we commit to |
| 7 | +shipping it. Query expansion introduces: |
| 8 | + |
| 9 | +- A new generative model (Qwen3-0.6B, ~500 MB quantized) |
| 10 | +- New dependency on MLX and/or llama.cpp for inference |
| 11 | +- Query-time latency (~200-500 ms extra per search) |
| 12 | + |
| 13 | +That cost is only justified if the win over baseline is real and consistent. |
| 14 | +If the test shows "indistinguishable from baseline" or "regresses on too |
| 15 | +many queries", **we cancel the feature** rather than ship it and hope |
| 16 | +nobody notices. |
| 17 | + |
| 18 | +## Three configurations |
| 19 | + |
| 20 | +| ID | Pipeline | What it measures | |
| 21 | +|----|----------|------------------| |
| 22 | +| **A** | seeklink search + reranker (daemon-path behavior) | Baseline. *Must* match product behavior — the runner constructs a real `Reranker()` and passes it to `search()`, same as `daemon.py` does. | |
| 23 | +| **B** | seeklink + Qwen3-0.6B expansion (v0.4 candidate) | Ship candidate | |
| 24 | +| **C** | seeklink + hand-crafted expansion, RRF-fused | Upper bound | |
| 25 | + |
| 26 | +A and C are fixed points. B's distance from C tells us how much the small |
| 27 | +model is leaving on the table; B's distance from A tells us whether |
| 28 | +shipping B is worth it at all. |
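Config C's "RRF-fused" step can be sketched as below. This is a minimal illustration of reciprocal rank fusion, not the runner's actual code: the function name `rrf_fuse`, the constant `k=60` (the value commonly used with RRF), and the assumption that each input ranking is a list of unique paths are all choices made here for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of paths with reciprocal rank fusion.

    Each path scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] += 1.0 / (k + rank)
    # Sort by descending score; break ties by path so output is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# One ranking for the original query plus one per hand-crafted expansion.
fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c"]])
```

A path that ranks moderately well under several expansions can outrank one that ranks first under a single expansion, which is the behavior the upper-bound config relies on.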
| 29 | + |
| 30 | +**Important**: config A is *not* "raw seeklink search without reranker". |
| 31 | +seeklink's cold-start CLI path has historically omitted the reranker (a |
| 32 | +known bug being fixed alongside v0.3); the daemon path has always used it. |
| 33 | +The blind test measures the daemon-path product behavior, because that's |
| 34 | +what users actually experience. |
| 35 | + |
| 36 | +## Runtime requirement |
| 37 | + |
| 38 | +`tests/blind/run.py` imports `yaml`. PyYAML is not currently in |
| 39 | +`pyproject.toml`. Before running, add it as a dev dependency: |
| 40 | + |
| 41 | +```toml |
| 42 | +[dependency-groups] |
| 43 | +dev = [ |
| 44 | + # ... existing ... |
| 45 | + "pyyaml>=6.0", |
| 46 | +] |
| 47 | +``` |
| 48 | + |
| 49 | +Then `uv sync --dev`. |
| 50 | + |
| 51 | +## Test data format |
| 52 | + |
| 53 | +`tests/blind/queries.yaml`: |
| 54 | + |
| 55 | +```yaml |
| 56 | +- query: "记忆保持力" |
| 57 | + intent: "find notes about long-term memory retention techniques" |
| 58 | + expected_paths: |
| 59 | + - "notes/fsrs-algorithm.md" |
| 60 | + - "notes/spaced-repetition.md" |
| 61 | + - "logs/rhizome-dev/2026-W15.md" |
| 62 | + tags: [cjk, common] |
| 63 | + expansion: |
| 64 | + - "间隔重复 遗忘曲线 FSRS" |
| 65 | + - "how to retain memory long term" |
| 66 | + - "通过间隔算法优化长期记忆保留" |
| 67 | + |
| 68 | +- query: "Zettelkasten vs 卡片盒笔记" |
| 69 | + intent: "compare methodology in user's literature review" |
| 70 | + expected_paths: |
| 71 | + - "notes/zettelkasten.md" |
| 72 | + tags: [cjk-en-mixed] |
| 73 | + expansion: |
| 74 | + - "atomic notes permanent notes" |
| 75 | + - "卡片盒笔记法" |
| 76 | +``` |
| 77 | + |
| 78 | +### How to build this file |
| 79 | + |
| 80 | +**20-30 queries total.** Fewer than 15 and single-query noise dominates the |
| 81 | +averages. |
| 82 | + |
| 83 | +1. Real-user queries only. Pull from shell history, rhizome logs, or |
| 84 | + memory. No synthetic queries. |
| 85 | +2. For each, list 2-5 `expected_paths` you'd be annoyed to find missing |
| 86 | + from the top 10. Hard must-hit semantics, not "would be nice". |
| 87 | +3. **Skip queries where a substring of the query exactly matches a note |
| 88 | + title.** Those hit the title channel trivially and test nothing about |
| 89 | + expansion. Prefer queries where notes use different vocabulary than the |
| 90 | + query itself. |
| 91 | +4. Fill in `expansion:` with 2-3 hand-crafted alternates: lexical form, |
| 92 | + semantic paraphrase, hypothetical answer sentence (HyDE style). |
| 93 | +5. Tag each query for slicing: `cjk`, `english`, `cjk-en-mixed`, `long`, |
| 94 | + `short`, `ambiguous`, `technical`, `common`. |
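Rule 3 above can be checked mechanically before a query goes into the file. The sketch below assumes you already have a list of note titles from the vault; the helper name `titles_in_query` and the case-insensitive comparison are assumptions made for illustration.

```python
def titles_in_query(query: str, note_titles: list[str]) -> list[str]:
    """Return the note titles that occur verbatim (case-insensitively) in the query.

    A non-empty result means the query would hit the title channel trivially
    and is a candidate for exclusion under rule 3.
    """
    q = query.casefold()
    return [t for t in note_titles if t.strip() and t.casefold() in q]


# Hypothetical titles; in practice these would come from the vault index.
flagged = titles_in_query("Zettelkasten vs 卡片盒笔记", ["Zettelkasten", "FSRS algorithm"])
```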
| 95 | + |
| 96 | +**Ground-truth stability**: commit `queries.yaml` alongside a vault-state |
| 97 | +marker (e.g. the current `rhizome log` head SHA). If you re-run against an |
| 98 | +edited vault, note the drift. |
| 99 | + |
| 100 | +## Metrics |
| 101 | + |
| 102 | +For each `(query, config)` pair (recorded by the runner): |
| 103 | + |
| 104 | +- `hits` — top-10 result paths in rank order |
| 105 | +- `titles` — top-10 titles (for the human blind scorer) |
| 106 | +- `snippets` — top-10 content previews (for the human blind scorer) |
| 107 | +- `scores` — fused scores (not directly compared across configs) |
| 108 | +- `latency_ms` — wall-clock for the full query call chain (model load |
| 109 | + excluded — runner initializes once and warms up) |
| 110 | +- `recall_at_10` — fraction of `expected_paths` in top-10 |
| 111 | +- `mrr` — reciprocal rank of first expected hit in top-10 (0 if none) |
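The two per-query quality metrics follow directly from the definitions above. A minimal sketch, assuming `hits` is the ranked list of result paths and `expected` is the query's `expected_paths`; the function names are illustrative and not taken from `run.py`.

```python
from typing import Sequence


def recall_at_k(hits: Sequence[str], expected: Sequence[str], k: int = 10) -> float:
    """Fraction of expected paths that appear among the top-k hits."""
    wanted = set(expected)
    if not wanted:
        return 0.0  # assumption: a query with no ground truth scores 0
    top = set(hits[:k])
    return len(wanted & top) / len(wanted)


def mrr_at_k(hits: Sequence[str], expected: Sequence[str], k: int = 10) -> float:
    """Reciprocal rank of the first expected hit in the top-k, 0.0 if none."""
    wanted = set(expected)
    for rank, path in enumerate(hits[:k], start=1):
        if path in wanted:
            return 1.0 / rank
    return 0.0


hits = ["x.md", "notes/fsrs-algorithm.md", "y.md", "notes/spaced-repetition.md"]
expected = [
    "notes/fsrs-algorithm.md",
    "notes/spaced-repetition.md",
    "logs/rhizome-dev/2026-W15.md",
]
recall = recall_at_k(hits, expected)  # two of three expected paths found
mrr = mrr_at_k(hits, expected)        # first expected hit is at rank 2
```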
| 112 | + |
| 113 | +Aggregates: |
| 114 | + |
| 115 | +- Mean `recall_at_10`, mean `mrr`, mean `latency_ms`, p95 `latency_ms` |
| 116 | +- Per-query delta (`B - A`, `C - A`, `C - B`) → find where expansion hurts |
| 117 | +- Per-tag breakdown (computed offline from `results` JSON) — especially |
| 118 | + `cjk` vs `english` to catch asymmetric wins/regressions |
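The per-query deltas and per-tag breakdown can be computed offline along these lines. The input shapes are assumptions: `recall_a` and `recall_b` map each query string to its `recall_at_10`, and `tags` maps each query string to its tag list from `queries.yaml`. The real results JSON layout may differ.

```python
from collections import defaultdict
from statistics import mean


def per_query_delta(base: dict, other: dict) -> dict:
    """other - base for every query present in both result sets."""
    return {q: other[q] - base[q] for q in base if q in other}


def per_tag_mean(recall: dict, tags: dict) -> dict:
    """Mean recall within each tag bucket; a query counts once per tag it carries."""
    buckets = defaultdict(list)
    for query, value in recall.items():
        for tag in tags.get(query, []):
            buckets[tag].append(value)
    return {tag: mean(values) for tag, values in buckets.items()}


recall_a = {"q1": 0.5, "q2": 1.0}
recall_b = {"q1": 1.0, "q2": 0.5}
tags = {"q1": ["cjk"], "q2": ["english", "long"]}

delta_b_a = per_query_delta(recall_a, recall_b)  # negative entries are regressions
by_tag_b = per_tag_mean(recall_b, tags)
```

Sorting `delta_b_a` by value puts the queries where expansion hurts most at the top of the list.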
| 119 | + |
| 120 | +## Runner |
| 121 | + |
| 122 | +`tests/blind/run.py` loads `queries.yaml`, runs each query against one |
| 123 | +config, writes results JSON. Invocation: |
| 124 | + |
| 125 | +```bash |
| 126 | +# Baseline — works today |
| 127 | +python tests/blind/run.py \ |
| 128 | + --config A \ |
| 129 | + --queries tests/blind/queries.yaml \ |
| 130 | + --vault ~/Rhizome \ |
| 131 | + --out tests/blind/results/A.json |
| 132 | + |
| 133 | +# Ship candidate — requires v0.4 expansion hook (runner raises until then) |
| 134 | +python tests/blind/run.py --config B ... |
| 135 | + |
| 136 | +# Upper bound — uses hand-crafted `expansion:` field, fuses by RRF |
| 137 | +python tests/blind/run.py --config C ... |
| 138 | + |
| 139 | +# Diagnostic: baseline without reranker (NOT the official baseline) |
| 140 | +python tests/blind/run.py --config A --no-reranker ... |
| 141 | +``` |
| 142 | + |
| 143 | +Runner: |
| 144 | + |
| 145 | +- Calls `init_app(vault)` and constructs `Reranker()` exactly **once** per |
| 146 | + invocation (before the query loop). Warms the reranker with a dummy |
| 147 | + call so the first measured latency isn't the model load. |
| 148 | +- Closes the DB once, in a `finally` block. |
| 149 | +- Records per-query latency using `time.perf_counter()`. Model-load time |
| 150 | + is excluded by warmup. |
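The warm-up and timing behavior described above might look like the following sketch. It is not the actual `run.py`: `search_fn` and `close_fn` are hypothetical callables standing in for the real search call and DB close, and the warm-up query string is arbitrary.

```python
import time


def run_timed(search_fn, queries, close_fn=lambda: None, warmup_query="warmup"):
    """Run every query once and record wall-clock latency in milliseconds.

    search_fn and close_fn are placeholders for the real search and DB-close
    calls. One unmeasured warm-up call absorbs the model-load cost.
    """
    results = []
    try:
        search_fn(warmup_query)  # not recorded
        for query in queries:
            start = time.perf_counter()
            hits = search_fn(query)
            latency_ms = (time.perf_counter() - start) * 1000.0
            results.append({"query": query, "hits": hits, "latency_ms": latency_ms})
    finally:
        close_fn()  # runs exactly once, even if a query raises
    return results
```

`time.perf_counter()` is a monotonic, high-resolution clock, so the measured intervals are unaffected by system clock adjustments during a run.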
| 151 | + |
| 152 | +The human blind-scoring pass is a separate script (not yet written). It |
| 153 | +takes the three results files (`A.json`, `B.json`, `C.json`), shuffles the |
| 154 | +config order per query, presents one query + 5 results (path + title + |
| 155 | +snippet) at a time without config labels, and records a 1-5 score per config. |
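Since the scoring script does not exist yet, the following is only one possible way to do the per-query shuffle so that the scorer cannot infer the config from presentation order. The function name `blind_order` and the seeding scheme are assumptions; seeding by query keeps the order reproducible, so scores can be mapped back to configs afterwards.

```python
import random


def blind_order(query: str, configs=("A", "B", "C"), seed: int = 0) -> list:
    """Return the configs in a shuffled order that is stable for a given (seed, query)."""
    rng = random.Random(f"{seed}:{query}")  # string seeds are accepted by random.Random
    order = list(configs)
    rng.shuffle(order)
    return order


order = blind_order("记忆保持力")
```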
| 156 | + |
| 157 | +## Acceptance criteria for shipping B (query expansion feature) |
| 158 | + |
| 159 | +**All five must hold for B to ship:** |
| 160 | + |
| 161 | +1. **Mean Recall@10 of B ≥ mean Recall@10 of A + 0.10** (at least +10 pp lift) |
| 162 | +2. **B regresses on ≤ 20% of queries** (Recall@10(B) < Recall@10(A)) |
| 163 | +3. **Per-tag protection**: for each of the following tag buckets, B's mean |
| 164 | + Recall@10 within that bucket must be ≥ A's mean within that bucket − 0.05: |
| 165 | + - `cjk` (pure Chinese queries) |
| 166 | + - `cjk-en-mixed` |
| 167 | + - `english` |
| 168 | + - `short` (≤ 2 tokens) |
| 169 | + - `long` (≥ 6 tokens) |
| 170 | + This catches "B crushes English queries, destroys CJK" — not OK for a |
| 171 | + CJK-first vault. |
| 172 | +4. **Mean human blind score of B ≥ mean of A + 0.5** on the 1-5 scale |
| 173 | +5. **`p95(B) ≤ min(3 × p95(A), 2500 ms)`** — whichever bound is lower |
| 174 | + binds. On current M3 + reranker-on hardware, `p95(A)` is ~1-2 s, so |
| 175 | + `3 × p95(A)` is 3-6 s and the 2500 ms **absolute ceiling** is the real |
| 176 | + gate. The `3×` term only starts binding if A itself gets faster (e.g. |
| 177 | + future reranker optimization drops A's p95 below ~833 ms). Writing both |
| 178 | + bounds protects against either regression direction. |
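Criteria 1, 2 and 5 reduce to a few lines of arithmetic. The sketch below assumes per-query `recall_at_10` values keyed by query string and a list of per-query latencies in milliseconds; the function names are illustrative, and `p95` uses the nearest-rank definition, which is one of several common percentile conventions.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile; raises IndexError on an empty list."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def latency_ceiling_ms(p95_a_ms, factor=3.0, absolute_ms=2500.0):
    """Criterion 5: the lower of the relative and absolute bounds applies."""
    return min(factor * p95_a_ms, absolute_ms)


def recall_gates_pass(recall_a, recall_b, min_lift=0.10, max_regress_frac=0.20):
    """Criteria 1 and 2: mean lift is large enough and few queries regress."""
    queries = list(recall_a)
    n = len(queries)
    lift = sum(recall_b[q] - recall_a[q] for q in queries) / n
    regressed = sum(1 for q in queries if recall_b[q] < recall_a[q])
    return lift >= min_lift and regressed / n <= max_regress_frac


ceiling_today = latency_ceiling_ms(1500.0)   # absolute 2500 ms bound applies
ceiling_fast_a = latency_ceiling_ms(500.0)   # 3x bound applies once A is fast
```

With `p95(A)` at 1500 ms the absolute ceiling is the gate; at 500 ms the relative bound of 1500 ms takes over, matching the ~833 ms crossover noted above.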
| 179 | + |
| 180 | +**Cancel criteria** (any one triggers "do not ship B"): |
| 181 | + |
| 182 | +- B's Recall@10 is within noise of A (`|B - A| < 0.05` on mean, and no tag |
| 183 | + bucket shows `> 0.10` improvement) |
| 184 | +- Per-tag failure: any tag bucket regresses by `> 0.05` on Recall@10 |
| 185 | +- Latency p95 exceeds either the relative or absolute ceiling |
| 186 | +- Human score shows mixed signal: B is higher on some queries and lower on |
| 187 | + others with no tag-level explanation |
| 188 | + |
| 189 | +**Sanity ceiling check**: if C is also indistinguishable from A (`|C - A| |
| 190 | +< 0.05`), expansion is not the problem — retrieval or embedder is. |
| 191 | +Abandon v0.4 and look at the embedder (v0.5+) or retrieval channels. |
| 192 | + |
| 193 | +## Open questions (resolve before the first real run) |
| 194 | + |
| 195 | +- **Ground truth scope.** Hard must-hit only, or "should appear" (weaker)? |
| 196 | + → Propose: hard must-hit only. Weaker signal = more subjective. |
| 197 | +- **Expansion prompt template.** qmd uses `/no_think Expand this search |
| 198 | + query: {query}` with GBNF output grammar, backed by a **fine-tuned** |
| 199 | + Qwen3-1.7B. Base Qwen3-0.6B has no such training; needs a richer |
| 200 | + few-shot prompt. Draft the prompt once; commit alongside queries.yaml. |
| 201 | +- **Inference backend for B.** MLX (macOS) or llama.cpp (cross-platform)? |
| 202 | + → Run both, pick the one that hits the p95 budget. Record which. |
| 203 | +- **Randomness.** Qwen3 at temperature 0.7 is non-deterministic. Propose: |
| 204 | + temperature 0.3, no manual seed, but log each query's actual expansions |
| 205 | +in the `expansions_used` field for reproducibility. For B's final |
| 206 | +acceptance run, consider `N=3` runs and report the median. |
| 207 | + |
| 208 | +## Out of scope for this framework |
| 209 | + |
| 210 | +- Automated labeling (no — Simon labels ground truth by hand) |
| 211 | +- CI-integrated regression (no — this is a pre-release gate, not a |
| 212 | + continuous monitor) |
| 213 | +- Comparison against external tools (qmd, ripgrep, etc.) — different |
| 214 | + vaults, apples to oranges |