Commit 1156545

Merge v0.3-wip: title-gated rerank blending + line-range retrieval
See CHANGELOG 0.3.0 entry. Blind-test measured MRR 0.932 → 0.977 (+4.5 pp), no regressions. Two iterations of codex review (Options A, B, C → final loose-gate C). 204/204 tests pass.
2 parents d7fb895 + 9de9964 commit 1156545

54 files changed

Lines changed: 12594 additions & 21 deletions

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -7,6 +7,34 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.3.0] - 2026-04-23
11+
12+
### Added
13+
- **Title-gated rerank blending.** When the title-channel's best match is in the rerank candidate pool, blend `alpha · normalized_rrf + (1 − alpha) · rerank_score` with `alpha = 0.60/0.50/0.40` by rank bucket. This protects confident exact-title / alias hits (e.g. searching `Zettelkasten`, `RRF`, `遗忘曲线`) from being demoted by a content-focused reranker. When no title hit is present, the reranker takes over fully — same as pre-v0.3 behavior — so poor first-stage ordering (e.g. `把文档切块放进向量库` where the correct answer is at RRF rank 11) is still recoverable. Measured on a 22-query blind test vs the same baseline: mean MRR 0.932 → 0.977 (+4.5 pp), mean Recall@10 unchanged, zero regressions. See `docs/v0.3-plan.md` for the iteration history (Options A / B / C) and `tests/blind/results/` for the raw JSON.
14+
- **Line-range retrieval end-to-end.**
15+
- `SearchResult` now carries `line_start` and `line_end` (1-indexed, inclusive), computed by mapping chunk `char_start` / `char_end` back through the frontmatter strip to on-disk line numbers.
16+
- Daemon search responses include `line_start` / `line_end`.
17+
- CLI `_print_search_results` displays `path:line_start title` so `path:LINE` can be piped straight into `seeklink get`.
18+
- New `seeklink get PATH[:LINE] [-l N]` command reads the current on-disk file with universal-newline translation and prints the requested line range. Defaults: whole file (no `:LINE`), 100 lines starting at `LINE` (no `-l`), N lines (`-l`). Rejects path escapes, warns on beyond-EOF and `LINE < 1`.
19+
- Helper `body_offset_to_file_line(full_text, body_char_offset) → int` handles the frontmatter offset; also correct when the frontmatter was deleted from disk after indexing.
20+
- **Blind-test framework** at `tests/blind/`: 32-file CJK+EN corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner (`tests/blind/run.py`) that cold-starts seeklink once per invocation, warms the reranker, measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: A (baseline), B (v0.4 query expansion — not yet implemented), C (hand-crafted expansion, RRF-fused; upper bound). Used to validate this release; gates v0.4.
21+
- **v0.3 plan + blind-test framework docs** at `docs/v0.3-plan.md` and `docs/blind-test.md`.
22+
- **FRONTMATTER_RE** is now a public export from `seeklink.ingest` so the search layer can reuse the same regex for offset mapping.
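A minimal sketch of the offset mapping described for `body_offset_to_file_line`, under stated assumptions: the regex below is a hypothetical stand-in for `seeklink.ingest.FRONTMATTER_RE` (the real pattern is not shown here), and treating a file with no frontmatter as a zero shift is one plausible reading of "also correct when the frontmatter was deleted from disk".

```python
import re

# Hypothetical stand-in for seeklink.ingest.FRONTMATTER_RE; the real
# pattern is not shown in this changelog.
FRONTMATTER_RE = re.compile(r"\A---\r?\n.*?\r?\n---\r?\n", re.DOTALL)


def body_offset_to_file_line(full_text: str, body_char_offset: int) -> int:
    """Map a character offset in the stripped body to a 1-indexed file line."""
    match = FRONTMATTER_RE.match(full_text)
    # No frontmatter on disk (never present, or removed): the shift is zero.
    shift = match.end() if match else 0
    # Clamp so a stale offset cannot run past the end of the current file.
    file_offset = min(shift + body_char_offset, len(full_text))
    # Line number = newlines strictly before the offset, plus one.
    return full_text.count("\n", 0, file_offset) + 1
```

With a three-line frontmatter block, body offset 0 lands on file line 4, which is the line a text editor would show for the first body line.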
23+
24+
### Fixed
25+
- **Cold-start vs daemon parity.** Cold-start `seeklink search` (the path triggered when `--vault` is passed or the daemon is unreachable) now constructs a `Reranker()` and passes it to `search()`, matching the daemon's behavior. Previously the same query returned different rankings depending on whether a daemon happened to be running — a silent correctness bug. `Reranker()` construction is safe on platforms without MLX (Linux, Intel macOS) because the instance self-disables at model-load time.
26+
- **Line-range accounting for newline-terminated files.** `seeklink get file:LINE` on a file that ends with `\n` no longer miscounts the trailing newline as an extra logical line. Line 6 of a 5-line (newline-terminated) file now correctly emits the `beyond-EOF` warning instead of returning a blank line.
27+
- **Title-only match with deleted file.** When a search result references a source whose file has been removed from disk (title-only match via alias to a stale source), `compute_lines_for_results` no longer returns `line_start=1` — it degrades to `0/0` so agents aren't handed a `path:1` that won't resolve. Consistent with other missing-file paths.
28+
29+
### Dev
30+
- PyYAML added as a dev dependency (required by `tests/blind/run.py`).
31+
- Test suite: 185 → 203 tests (18 new). 3 for position-aware blending, 13 for `get` command + `body_offset_to_file_line` helper, 3 for end-to-end `SearchResult.line_start/line_end` population, 1 for trailing-newline EOF accounting. All green.
32+
33+
### Deferred to v0.3.1+
34+
- `SEEKLINK_DEBUG=1` blended-score logging (proposed in v0.3 plan, skipped to avoid scope creep).
35+
- Per-result `mtime > indexed_at` drift warnings on the daemon path (cold-start already warns globally via `check_freshness`). Daemon-side follow-up tracked in `TODOS.md`.
36+
- Linux reranker via llama.cpp / GGUF (`QuantFactory/Qwen3-Reranker-0.6B-GGUF` exists; wiring it into seeklink is deferred until after v0.3).
37+
1038
## [0.2.2] - 2026-04-19
1139

1240
### Fixed

README.md

Lines changed: 35 additions & 3 deletions
@@ -161,6 +161,21 @@ seeklink status --vault PATH
161161

162162
Shows index stats and freshness warnings. If files have changed since last index, prints a warning to stderr.
163163

164+
### `seeklink get`
165+
166+
Print a line range of a vault file directly to stdout. Designed for agents that have a search hit like `notes/fsrs.md:42` and want to read a precise window without fetching the whole file.
167+
168+
```
seeklink get PATH[:LINE] [-l N] [--vault PATH]

seeklink get notes/fsrs.md             # entire file
seeklink get notes/fsrs.md:120         # 100 lines starting at line 120
seeklink get notes/fsrs.md:120 -l 40   # 40 lines starting at line 120
seeklink get notes/fsrs.md -l 50       # first 50 lines
```
176+
177+
Line numbers match `search` output. CRLF files print with universal newlines. Path escapes (`../..`) are rejected.
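The window-read behavior above can be sketched roughly as follows. This is an illustration, not seeklink's implementation: `read_line_window` is a hypothetical name, and the CLI defaults (whole file, 100 lines) are left to the caller. It shows the three properties the text names: path-escape rejection, universal-newline reading, and not counting a trailing newline as an extra line.

```python
from pathlib import Path
from typing import List, Optional


def read_line_window(
    vault: Path, rel_path: str, line: int = 1, count: Optional[int] = None
) -> List[str]:
    """Return up to `count` lines starting at 1-indexed `line` (hypothetical helper)."""
    root = vault.resolve()
    target = (root / rel_path).resolve()
    # Reject anything that resolves outside the vault, e.g. "../../etc/passwd".
    if not target.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"path escapes vault: {rel_path}")
    # newline=None turns on universal-newline translation (CRLF becomes LF).
    with open(target, encoding="utf-8", newline=None) as fh:
        # Iterating the file yields one item per logical line, so a final
        # "\n" does not create an extra empty line.
        lines = [raw.rstrip("\n") for raw in fh]
    start = max(line, 1) - 1
    end = len(lines) if count is None else start + count
    return lines[start:end]
```

An empty return value for a `line` past the end of the file is where a caller could emit the beyond-EOF warning.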
178+
164179
## How search works
165180

166181
SeekLink runs four search channels in parallel and merges results with Reciprocal Rank Fusion:
@@ -174,11 +189,20 @@ SeekLink runs four search channels in parallel and merges results with Reciproca
174189

175190
Many personal knowledge bases contain a mix of **titled articles** (permanent notes, literature reviews) and **untitled process notes** (daily logs, journal entries, quick captures). A high title weight systematically buries untitled content — even when it's the most relevant result for the query. The default of 1.5 keeps title matching useful for precise `[[alias]]` lookups while letting content-based matches compete on their own merits. Override with `--title-weight` per query if needed.
176191

177-
### Optional: cross-encoder reranking
192+
### Title-gated rerank blending (v0.3+)
178193

179-
When enabled (default on Apple Silicon), the top-20 RRF candidates are re-scored by Qwen3-Reranker-0.6B running on MLX (Metal GPU). This reads each (query, passage) pair with full cross-attention — more accurate than vector similarity alone, at the cost of ~1-2s per query.
194+
When the reranker is enabled, a cross-encoder (`Qwen3-Reranker-0.6B` on MLX, ~1-2s per query) re-scores the top-20 RRF candidates for precision. SeekLink applies **title-gated position blending** on top of this:
180195

181-
Disable with: `export SEEKLINK_RERANKER_MODEL=""`
196+
- **If the title channel's best match is in the candidate pool**, blend `alpha · normalized_rrf + (1 - alpha) · rerank_score` with `alpha = 0.60/0.50/0.40` by rank bucket. This protects exact title / alias hits from being demoted by a content-focused reranker.
197+
- **Otherwise** (no strong title signal), the reranker score is used directly — same as pre-v0.3 behavior. This lets the reranker correct poor first-stage ordering.
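The two branches above can be sketched as below. Several details are assumptions made for illustration only: the bucket boundaries (ranks 1-3, 4-10, 11+), max-normalization of the RRF score, and the function names. The text gives only the `alpha` values and the blend formula.

```python
from typing import Dict, List, Optional

# Hypothetical bucket boundaries: the docs give alpha = 0.60/0.50/0.40
# "by rank bucket" but not where each bucket starts and ends.
ALPHA_BUCKETS = [(3, 0.60), (10, 0.50)]  # (last RRF rank in bucket, alpha)
ALPHA_TAIL = 0.40


def alpha_for_rank(rrf_rank: int) -> float:
    """Return the blend weight for a 1-indexed RRF rank."""
    for last_rank, alpha in ALPHA_BUCKETS:
        if rrf_rank <= last_rank:
            return alpha
    return ALPHA_TAIL


def blend_order(
    rrf_scores: Dict[str, float],
    rerank_scores: Dict[str, float],
    title_best: Optional[str],
) -> List[str]:
    """Order candidates; blend only when the title hit is in the pool."""
    by_rrf = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
    if title_best not in rrf_scores:
        # No title signal in the pool: the reranker decides alone.
        return sorted(by_rrf, key=lambda p: rerank_scores[p], reverse=True)
    top = max(rrf_scores.values()) or 1.0  # max-normalization (assumption)
    blended = {}
    for rank, path in enumerate(by_rrf, start=1):
        a = alpha_for_rank(rank)
        blended[path] = a * (rrf_scores[path] / top) + (1 - a) * rerank_scores[path]
    return sorted(by_rrf, key=blended.get, reverse=True)
```

With the gate closed the ordering follows `rerank_scores` exactly; with it open, a candidate at RRF rank 1 keeps a 0.60 share of its first-stage score, which is what protects an exact title hit from a low content score.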
198+
199+
On the built-in 22-query blind test, this improved mean MRR from 0.932 to 0.977 vs pure-reranker-override, with zero regressions. See `tests/blind/` for the methodology.
200+
201+
Disable reranking entirely with: `export SEEKLINK_RERANKER_MODEL=""`
202+
203+
### Results carry line numbers
204+
205+
Every `search` result returns `path:line_start-line_end` pointing at the best chunk within the current on-disk file. Agents can pipe that into `seeklink get` for a precise window read — no need to slurp entire files just to see context around a hit.
182206

183207
## Frontmatter
184208

@@ -213,6 +237,14 @@ Notes are chunked (~400 tokens), embedded with jina-embeddings-v2-base-zh, and i
213237
| `SEEKLINK_EMBEDDER_MODEL` | `jinaai/jina-embeddings-v2-base-zh` | Embedding model (fastembed-supported) |
214238
| `SEEKLINK_RERANKER_MODEL` | `mlx-community/Qwen3-Reranker-0.6B-mxfp8` | Reranker model (set to `""` to disable) |
215239

240+
## What changed in v0.3
241+
242+
- **Title-gated rerank blending**: when an exact title / alias hit drives rank 1, protect it from reranker demotion; otherwise fall back to pure reranker. Measured MRR gain of +4.5 pp over v0.2 on a 22-query blind test, with no regressions. See "How search works" above.
243+
- **Line-range retrieval**: `search` results now include `line_start` / `line_end`, and a new `seeklink get PATH[:LINE] -l N` command prints line-precise windows. Agents can find-then-read without slurping whole files.
244+
- **Cold-start / daemon parity fix**: cold-start `seeklink search` now constructs a `Reranker()` and passes it to the search pipeline. Previously the same query returned different rankings depending on whether the daemon was running.
245+
- **Frontmatter-aware line mapping**: chunk offsets (stored against frontmatter-stripped body) are remapped to full-file line numbers, so `search` + `get` report lines the way you'd see them in a text editor.
246+
- **Blind-test framework** at `tests/blind/`: 32-file corpus + 22 ground-truth queries + runner that measures Recall@10 / MRR / latency. Used to validate v0.3 before tagging; gates v0.4 (query expansion) the same way.
247+
216248
## What changed in v0.2
217249

218250
- **CLI-first**: MCP server removed. All interaction via `seeklink search/index/status/daemon`.

docs/blind-test.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
1+
# Query expansion blind test framework (v2, post-review)
2+
3+
## Purpose
4+
5+
Validate that query expansion (planned for v0.4) actually improves real
6+
search quality against a CJK-first personal vault **before** we commit to
7+
shipping it. Query expansion introduces:
8+
9+
- A new generative model (Qwen3-0.6B, ~500 MB quantized)
10+
- New dependency on MLX and/or llama.cpp for inference
11+
- Query-time latency (~200-500 ms extra per search)
12+
13+
That cost is only justified if the win over baseline is real and consistent.
14+
If the test shows "indistinguishable from baseline" or "regresses on too
15+
many queries", **we cancel the feature**, not ship it and hope nobody
16+
notices.
17+
18+
## Three configurations
19+
20+
| ID | Pipeline | What it measures |
|----|----------|------------------|
| **A** | seeklink search + reranker (daemon-path behavior) | Baseline. *Must* match product behavior — the runner constructs a real `Reranker()` and passes it to `search()`, same as `daemon.py` does. |
| **B** | seeklink + Qwen3-0.6B expansion (v0.4 candidate) | Ship candidate |
| **C** | seeklink + hand-crafted expansion, RRF-fused | Upper bound |
25+
26+
A and C are fixed points. B's distance from C tells us how much the small
27+
model is leaving on the table; B's distance from A tells us whether
28+
shipping B is worth it at all.
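
For reference, config C's "RRF-fused" step can be sketched as standard Reciprocal Rank Fusion over the ranked lists returned for the original query and each hand-crafted expansion. The constant `k = 60` is the commonly used default and an assumption here, as is the function name; the runner's actual fusion parameters are not documented in this file.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked path lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every path it returns.
            scores[path] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by path for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

A path that appears near the top of several expansion result lists accumulates score from each, so it outranks a path that appears high in only one.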
29+
30+
**Important**: config A is *not* "raw seeklink search without reranker".
31+
seeklink's cold-start CLI path has historically omitted the reranker (a
32+
known bug being fixed alongside v0.3); the daemon path has always used it.
33+
The blind test measures the daemon-path product behavior, because that's
34+
what users actually experience.
35+
36+
## Runtime requirement
37+
38+
`tests/blind/run.py` imports `yaml`. PyYAML is not currently in
39+
`pyproject.toml`. Before running, add it as a dev dependency:
40+
41+
```toml
[dependency-groups]
dev = [
    # ... existing ...
    "pyyaml>=6.0",
]
```
48+
49+
Then `uv sync --dev`.
50+
51+
## Test data format
52+
53+
`tests/blind/queries.yaml`:
54+
55+
```yaml
- query: "记忆保持力"
  intent: "find notes about long-term memory retention techniques"
  expected_paths:
    - "notes/fsrs-algorithm.md"
    - "notes/spaced-repetition.md"
    - "logs/rhizome-dev/2026-W15.md"
  tags: [cjk, common]
  expansion:
    - "间隔重复 遗忘曲线 FSRS"
    - "how to retain memory long term"
    - "通过间隔算法优化长期记忆保留"

- query: "Zettelkasten vs 卡片盒笔记"
  intent: "compare methodology in user's literature review"
  expected_paths:
    - "notes/zettelkasten.md"
  tags: [cjk-en-mixed]
  expansion:
    - "atomic notes permanent notes"
    - "卡片盒笔记法"
```
77+
78+
### How to build this file
79+
80+
**20-30 queries total.** Fewer than 15 and single-query noise dominates the
81+
averages.
82+
83+
1. Real-user queries only. Pull from shell history, rhizome logs, or
84+
memory. No synthetic queries.
85+
2. For each, list 2-5 `expected_paths` you'd be annoyed if not in top 10.
86+
Hard must-hit semantics — not "would be nice".
87+
3. **Skip queries where a substring of the query exactly matches a note
88+
title.** Those hit the title channel trivially and test nothing about
89+
expansion. Prefer queries where notes use different vocabulary than the
90+
query itself.
91+
4. Fill in `expansion:` with 2-3 hand-crafted alternates: lexical form,
92+
semantic paraphrase, hypothetical answer sentence (HyDE style).
93+
5. Tag each query for slicing: `cjk`, `english`, `cjk-en-mixed`, `long`,
94+
`short`, `ambiguous`, `technical`, `common`.
95+
96+
**Ground-truth stability**: commit `queries.yaml` alongside a vault-state
97+
marker (e.g. the current `rhizome log` head SHA). If you re-run against an
98+
edited vault, note the drift.
99+
100+
## Metrics
101+
102+
For each `(query, config)` pair (recorded by the runner):
103+
104+
- `hits` — top-10 result paths in rank order
105+
- `titles` — top-10 titles (for the human blind scorer)
106+
- `snippets` — top-10 content previews (for the human blind scorer)
107+
- `scores` — fused scores (not directly compared across configs)
108+
- `latency_ms` — wall-clock for the full query call chain (model load
109+
excluded — runner initializes once and warms up)
110+
- `recall_at_10` — fraction of `expected_paths` in top-10
111+
- `mrr` — reciprocal rank of first expected hit in top-10 (0 if none)
112+
113+
Aggregates:
114+
115+
- Mean `recall_at_10`, mean `mrr`, mean `latency_ms`, p95 `latency_ms`
116+
- Per-query delta (`B - A`, `C - A`, `C - B`) → find where expansion hurts
117+
- Per-tag breakdown (computed offline from `results` JSON) — especially
118+
`cjk` vs `english` to catch asymmetric wins/regressions
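
The per-query metrics and the p95 aggregate can be sketched as below. These follow the definitions listed above; the nearest-rank percentile is one common convention and an assumption here, since the runner's exact percentile method is not stated.

```python
import math
from typing import Sequence


def recall_at_10(hits: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of expected paths that appear in the top 10 hits."""
    top = set(hits[:10])
    return sum(1 for path in expected if path in top) / len(expected)


def mrr(hits: Sequence[str], expected: Sequence[str]) -> float:
    """Reciprocal rank of the first expected path in the top 10, else 0."""
    wanted = set(expected)
    for rank, path in enumerate(hits[:10], start=1):
        if path in wanted:
            return 1.0 / rank
    return 0.0


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed convention)."""
    ordered = sorted(values)
    idx = max(math.ceil(95 * len(ordered) / 100) - 1, 0)
    return ordered[idx]
```

Mean `recall_at_10` and mean `mrr` are then plain averages of these per-query values over all queries in one config.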
119+
120+
## Runner
121+
122+
`tests/blind/run.py` loads `queries.yaml`, runs each query against one
123+
config, writes results JSON. Invocation:
124+
125+
```bash
# Baseline — works today
python tests/blind/run.py \
    --config A \
    --queries tests/blind/queries.yaml \
    --vault ~/Rhizome \
    --out tests/blind/results/A.json

# Ship candidate — requires v0.4 expansion hook (runner raises until then)
python tests/blind/run.py --config B ...

# Upper bound — uses hand-crafted `expansion:` field, fuses by RRF
python tests/blind/run.py --config C ...

# Diagnostic: baseline without reranker (NOT the official baseline)
python tests/blind/run.py --config A --no-reranker ...
```
142+
143+
Runner:
144+
145+
- Initializes `init_app(vault)` and `Reranker()` exactly **once** per
146+
invocation (before the query loop). Warms the reranker with a dummy
147+
call so the first measured latency isn't the model load.
148+
- Closes the DB once, in a `finally` block.
149+
- Records per-query latency using `time.perf_counter()`. Model-load time
150+
is excluded by warmup.
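
A rough sketch of the warm-then-measure loop described above. `time_queries` and its `search` callable are hypothetical placeholders standing in for the real `search()` plus `Reranker()` wiring, which is not reproduced here.

```python
import time
from typing import Callable, Dict, List


def time_queries(
    search: Callable[[str], List[str]],
    queries: List[str],
    warmup_query: str = "warmup",
) -> List[Dict]:
    """Run each query once and record wall-clock latency in milliseconds."""
    # Untimed dummy call so model load does not land in the first measurement.
    search(warmup_query)
    records = []
    for query in queries:
        t0 = time.perf_counter()
        hits = search(query)
        latency_ms = (time.perf_counter() - t0) * 1000.0
        records.append({"query": query, "hits": hits[:10], "latency_ms": latency_ms})
    return records
```

`time.perf_counter()` is a monotonic high-resolution clock, so the deltas are not affected by system clock adjustments during a run.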
151+
152+
The human blind-scoring pass is a separate script (not yet written). It
takes `results/{A,B,C}.json`, shuffles per query, presents one query plus
5 results (path, title, snippet) at a time without config labels, and
records a 1-5 score per config.
156+
157+
## Acceptance criteria for shipping B (query expansion feature)
158+
159+
**All five must hold for B to ship:**
160+
161+
1. **Mean Recall@10 of B ≥ Recall@10 of A + 0.10** (at least +10 pp lift)
162+
2. **B regresses on ≤ 20% of queries** (Recall@10(B) < Recall@10(A))
163+
3. **Per-tag protection**: for each of the following tag buckets, B's mean
164+
Recall@10 within that bucket must be ≥ A's mean within that bucket − 0.05:
165+
- `cjk` (pure Chinese queries)
166+
- `cjk-en-mixed`
167+
- `english`
168+
- `short` (≤ 2 tokens)
169+
- `long` (≥ 6 tokens)
170+
This catches "B crushes English queries, destroys CJK" — not OK for a
171+
CJK-first vault.
172+
4. **Human blind score mean of B ≥ A + 0.5** on 1-5 scale
173+
5. **`p95(B) ≤ min(3 × p95(A), 2500 ms)`** — whichever bound is lower
174+
binds. On current M3 + reranker-on hardware, `p95(A)` is ~1-2 s, so
175+
`3 × p95(A)` is 3-6 s and the 2500 ms **absolute ceiling** is the real
176+
gate. The `3 × p95(A)` term only starts binding if A itself gets faster (e.g.
177+
future reranker optimization drops A's p95 below ~833 ms). Writing both
178+
bounds protects against either regression direction.
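
Criteria 1, 2, and 5 are mechanical enough to sketch as checks. The function names and input shapes below are illustrative assumptions; the per-tag and human-score criteria are omitted for brevity.

```python
from typing import Dict, List


def latency_gate(p95_a_ms: float, p95_b_ms: float, ceiling_ms: float = 2500.0) -> bool:
    """Criterion 5: p95(B) <= min(3 * p95(A), 2500 ms)."""
    return p95_b_ms <= min(3 * p95_a_ms, ceiling_ms)


def recall_gates(recall_a: List[float], recall_b: List[float]) -> Dict[str, bool]:
    """Criteria 1 and 2 over per-query Recall@10 values in matching order."""
    n = len(recall_a)
    mean_a = sum(recall_a) / n
    mean_b = sum(recall_b) / n
    # A query regresses when B's Recall@10 is strictly below A's.
    regressed = sum(1 for a, b in zip(recall_a, recall_b) if b < a)
    return {
        "lift": mean_b >= mean_a + 0.10,
        "regressions": regressed / n <= 0.20,
    }
```

For example, with `p95(A) = 1500 ms` the relative bound is 4500 ms, so the 2500 ms ceiling binds; with `p95(A) = 500 ms` the relative bound of 1500 ms binds instead.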
179+
180+
**Cancel criteria** (any one triggers "do not ship B"):
181+
182+
- B's Recall@10 is within noise of A (`|B - A| < 0.05` on mean, and no tag
183+
bucket shows `> 0.10` improvement)
184+
- Per-tag failure: any tag bucket regresses by `> 0.05` on Recall@10
185+
- Latency p95 exceeds either the relative or absolute ceiling
186+
- Human score shows mixed signal: B is higher on some queries and lower on
187+
others with no tag-level explanation
188+
189+
**Sanity ceiling check**: if C is also indistinguishable from A (`|C - A|
190+
< 0.05`), expansion is not the problem — retrieval or embedder is.
191+
Abandon v0.4 and look at the embedder (v0.5+) or retrieval channels.
192+
193+
## Open questions (resolve before the first real run)
194+
195+
- **Ground truth scope.** Hard must-hit only, or "should appear" (weaker)?
196+
→ Propose: hard must-hit only. Weaker signal = more subjective.
197+
- **Expansion prompt template.** qmd uses `/no_think Expand this search
198+
query: {query}` with GBNF output grammar, backed by a **fine-tuned**
199+
Qwen3-1.7B. Base Qwen3-0.6B has no such training; needs a richer
200+
few-shot prompt. Draft the prompt once; commit alongside queries.yaml.
201+
- **Inference backend for B.** MLX (macOS) or llama.cpp (cross-platform)?
202+
→ Run both, pick the one that hits the p95 budget. Record which.
203+
- **Randomness.** Qwen3 at temperature 0.7 is non-deterministic. Propose:
204+
temperature 0.3, no manual seed, but log each query's actual expansions
205+
in the `expansions_used` field for reproducibility. For B's final
206+
acceptance run, consider `N=3` and report median.
207+
208+
## Out of scope for this framework
209+
210+
- Automated labeling (no — Simon labels ground truth by hand)
211+
- CI-integrated regression (no — this is a pre-release gate, not a
212+
continuous monitor)
213+
- Comparison against external tools (qmd, ripgrep, etc.) — different
214+
vaults, apples to oranges
