simonsysun
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 28 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 35 additions & 3 deletions
diff --git a/docs/blind-test.md b/docs/blind-test.md
Lines changed: 214 additions & 0 deletions
@@ -7,6 +7,34 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.3.0] - 2026-04-23
+
+### Added
+- **Title-gated rerank blending.** When the title-channel's best match is in the rerank candidate pool, blend `alpha · normalized_rrf + (1 − alpha) · rerank_score` with `alpha = 0.60/0.50/0.40` by rank bucket. This protects confident exact-title / alias hits (e.g. searching `Zettelkasten`, `RRF`, `遗忘曲线`) from being demoted by a content-focused reranker. When no title hit is present, the reranker takes over fully — same as pre-v0.3 behavior — so poor first-stage ordering (e.g. `把文档切块放进向量库` where the correct answer is at RRF rank 11) is still recoverable. Measured on a 22-query blind test vs the same baseline: mean MRR 0.932 → 0.977 (+4.5 pp), mean Recall@10 unchanged, zero regressions. See `docs/v0.3-plan.md` for the iteration history (Options A / B / C) and `tests/blind/results/` for the raw JSON.
+- **Line-range retrieval end-to-end.**
+  - `SearchResult` now carries `line_start` and `line_end` (1-indexed, inclusive), computed by mapping chunk `char_start` / `char_end` back through the frontmatter strip to on-disk line numbers.
+  - Daemon search responses include `line_start` / `line_end`.
+  - CLI `_print_search_results` displays `path:line_start  title` so `path:LINE` can be piped straight into `seeklink get`.
+  - New `seeklink get PATH[:LINE] [-l N]` command reads the current on-disk file with universal-newline translation and prints the requested line range. Defaults: whole file (no `:LINE`), 100 lines starting at `LINE` (no `-l`), N lines (`-l`). Rejects path escapes, warns on beyond-EOF and `LINE < 1`.
+  - Helper `body_offset_to_file_line(full_text, body_char_offset) → int` handles the frontmatter offset; also correct when the frontmatter was deleted from disk after indexing.
+- **Blind-test framework** at `tests/blind/`: 32-file CJK+EN corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner (`tests/blind/run.py`) that cold-starts seeklink once per invocation, warms the reranker, measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: A (baseline), B (v0.4 query expansion — not yet implemented), C (hand-crafted expansion, RRF-fused; upper bound). Used to validate this release; gates v0.4.
+- **v0.3 plan + blind-test framework docs** at `docs/v0.3-plan.md` and `docs/blind-test.md`.
+- **FRONTMATTER_RE** is now a public export from `seeklink.ingest` so the search layer can reuse the same regex for offset mapping.
+
+### Fixed
+- **Cold-start vs daemon parity.** Cold-start `seeklink search` (the path triggered when `--vault` is passed or the daemon is unreachable) now constructs a `Reranker()` and passes it to `search()`, matching the daemon's behavior. Previously the same query returned different rankings depending on whether a daemon happened to be running — a silent correctness bug. `Reranker()` construction is safe on platforms without MLX (Linux, Intel macOS) because the instance self-disables at model-load time.
+- **Line-range accounting for newline-terminated files.** `seeklink get file:LINE` on a file that ends with `\n` no longer miscounts the trailing newline as an extra logical line. Line 6 of a 5-line (newline-terminated) file now correctly emits the `beyond-EOF` warning instead of returning a blank line.
+- **Title-only match with deleted file.** When a search result references a source whose file has been removed from disk (title-only match via alias to a stale source), `compute_lines_for_results` no longer returns `line_start=1` — it degrades to `0/0` so agents aren't handed a `path:1` that won't resolve. Consistent with other missing-file paths.
+
+### Dev
+- PyYAML added as a dev dependency (required by `tests/blind/run.py`).
+- Test suite: 185 → 203 tests (18 new). 3 for position-aware blending, 13 for `get` command + `body_offset_to_file_line` helper, 3 for end-to-end `SearchResult.line_start/line_end` population, 1 for trailing-newline EOF accounting. All green.
+
+### Deferred to v0.3.1+
+- `SEEKLINK_DEBUG=1` blended-score logging (proposed in v0.3 plan, skipped to avoid scope creep).
+- Per-result `mtime > indexed_at` drift warnings on the daemon path (cold-start already warns globally via `check_freshness`). Daemon-side follow-up tracked in `TODOS.md`.
+- Linux reranker via llama.cpp / GGUF (`QuantFactory/Qwen3-Reranker-0.6B-GGUF` exists; wiring it into seeklink is deferred until after v0.3).
+
 ## [0.2.2] - 2026-04-19
 
 ### Fixed
 
@@ -161,6 +161,21 @@ seeklink status --vault PATH
 
 Shows index stats and freshness warnings. If files have changed since last index, prints a warning to stderr.
 
+### `seeklink get`
+
+Print a line range of a vault file directly to stdout. Designed for agents that have a search hit like `notes/fsrs.md:42` and want to read a precise window without fetching the whole file.
+
+```
+seeklink get PATH[:LINE] [-l N] [--vault PATH]
+
+  seeklink get notes/fsrs.md              # entire file
+  seeklink get notes/fsrs.md:120          # 100 lines starting at line 120
+  seeklink get notes/fsrs.md:120 -l 40    # 40 lines starting at line 120
+  seeklink get notes/fsrs.md -l 50        # first 50 lines
+```
+
+Line numbers match `search` output. CRLF files print with universal newlines. Path escapes (`../..`) are rejected.
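
The `get` semantics described above can be sketched in a few lines of Python. This is an illustrative sketch, not seeklink's implementation: `parse_target`, `resolve_in_vault`, and `read_window` are hypothetical names. Only the 100-line default, the universal-newline read, and the path-escape rejection come from the description above.

```python
from pathlib import Path
from typing import Optional

DEFAULT_WINDOW = 100  # per the README: 100 lines when LINE is given without -l


def parse_target(target: str) -> tuple[str, Optional[int]]:
    """Split 'PATH[:LINE]' into (path, line); line is None when absent."""
    path, sep, tail = target.rpartition(":")
    if sep and tail.isdecimal():
        return path, int(tail)
    return target, None


def resolve_in_vault(vault: Path, rel_path: str) -> Path:
    """Resolve rel_path inside the vault and reject escapes such as '../..'."""
    root = vault.resolve()
    candidate = (root / rel_path).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes vault: {rel_path}")
    return candidate


def read_window(path: Path, line: Optional[int] = None,
                count: Optional[int] = None) -> list[str]:
    """Return the requested lines; an empty list means LINE is beyond EOF."""
    # newline=None turns on universal-newline translation, so CRLF reads as "\n".
    with open(path, encoding="utf-8", newline=None) as fh:
        text = fh.read()
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing "\n" ends the last line; it is not an extra line
    if line is None and count is None:
        return lines  # no :LINE and no -l: whole file
    start = 0 if line is None else max(line, 1) - 1  # LINE is 1-indexed
    size = DEFAULT_WINDOW if count is None else count
    return lines[start:start + size]
```

Splitting on `"\n"` and dropping one trailing empty element keeps the line count aligned with what a text editor shows for newline-terminated files.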
+
 ## How search works
 
 SeekLink runs four search channels in parallel and merges results with Reciprocal Rank Fusion:
@@ -174,11 +189,20 @@ SeekLink runs four search channels in parallel and merges results with Reciproca
 
 Many personal knowledge bases contain a mix of **titled articles** (permanent notes, literature reviews) and **untitled process notes** (daily logs, journal entries, quick captures). A high title weight systematically buries untitled content — even when it's the most relevant result for the query. The default of 1.5 keeps title matching useful for precise `[[alias]]` lookups while letting content-based matches compete on their own merits. Override with `--title-weight` per query if needed.
 
-### Optional: cross-encoder reranking
+### Title-gated rerank blending (v0.3+)
 
-When enabled (default on Apple Silicon), the top-20 RRF candidates are re-scored by Qwen3-Reranker-0.6B running on MLX (Metal GPU). This reads each (query, passage) pair with full cross-attention — more accurate than vector similarity alone, at the cost of ~1-2s per query.
+When the reranker is enabled, a cross-encoder (`Qwen3-Reranker-0.6B` on MLX, ~1-2s per query) re-scores the top-20 RRF candidates for precision. SeekLink applies **title-gated position blending** on top of this:
 
-Disable with: `export SEEKLINK_RERANKER_MODEL=""`
+- **If the title channel's best match is in the candidate pool**, blend `alpha · normalized_rrf + (1 - alpha) · rerank_score` with `alpha = 0.60/0.50/0.40` by rank bucket. This protects exact title / alias hits from being demoted by a content-focused reranker.
+- **Otherwise** (no strong title signal), the reranker score is used directly — same as pre-v0.3 behavior. This lets the reranker correct poor first-stage ordering.
+
+On the built-in 22-query blind test, this improved mean MRR from 0.932 to 0.977 vs pure-reranker-override, with zero regressions. See `tests/blind/` for the methodology.
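
A minimal sketch of the gating and blending logic may help make the formula concrete. Several details here are assumptions that the text above does not specify: the bucket boundaries (ranks 1-3, 4-10, 11+), normalizing RRF by the pool maximum, and a reranker score already in `[0, 1]`. `Candidate`, `alpha_for_rank`, and `blend` are hypothetical names, not seeklink's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    path: str
    rrf_score: float     # first-stage fused score
    rerank_score: float  # cross-encoder score, assumed to lie in [0, 1]


def alpha_for_rank(rank: int) -> float:
    """Weight on the first-stage score. Bucket boundaries are assumed."""
    if rank <= 3:
        return 0.60
    if rank <= 10:
        return 0.50
    return 0.40


def blend(candidates: list[Candidate],
          title_best_path: Optional[str]) -> list[Candidate]:
    """candidates must be ordered by RRF rank (best first)."""
    if not candidates:
        return []
    gated = title_best_path is not None and any(
        c.path == title_best_path for c in candidates
    )
    if not gated:
        # No title hit in the pool: the reranker score decides the order.
        return sorted(candidates, key=lambda c: c.rerank_score, reverse=True)
    max_rrf = max(c.rrf_score for c in candidates) or 1.0
    scored = []
    for rank, cand in enumerate(candidates, start=1):
        alpha = alpha_for_rank(rank)
        score = alpha * (cand.rrf_score / max_rrf) + (1 - alpha) * cand.rerank_score
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]
```

With a title hit at RRF rank 1 (`rrf_score=1.0`, `rerank_score=0.5`) and a content hit at rank 2 (`rrf_score=0.5`, `rerank_score=0.9`), the gated blend scores them 0.80 and 0.66, so the title hit stays on top; without the gate the content hit wins.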
+
+Disable reranking entirely with: `export SEEKLINK_RERANKER_MODEL=""`
+
+### Results carry line numbers
+
+Every `search` result returns `path:line_start-line_end` pointing at the best chunk within the current on-disk file. Agents can pipe that into `seeklink get` for a precise window read — no need to slurp entire files just to see context around a hit.
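
The changelog names a helper, `body_offset_to_file_line(full_text, body_char_offset)`, that maps a chunk offset in the frontmatter-stripped body back to an on-disk line number. The sketch below shows one way such a mapping could work. The regex is a hypothetical stand-in for `seeklink.ingest.FRONTMATTER_RE` (the real pattern may differ), and the body is illustrative rather than the shipped implementation.

```python
import re

# Hypothetical stand-in for seeklink.ingest.FRONTMATTER_RE; the real pattern may differ.
FRONTMATTER_RE = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)


def body_offset_to_file_line(full_text: str, body_char_offset: int) -> int:
    """Map a character offset in the stripped body to a 1-indexed file line."""
    match = FRONTMATTER_RE.match(full_text)
    # If the file has no frontmatter (or it was deleted), the shift is zero.
    frontmatter_len = match.end() if match else 0
    file_offset = min(frontmatter_len + max(body_char_offset, 0), len(full_text))
    # Lines before the offset = newlines before it; add 1 for 1-indexing.
    return full_text.count("\n", 0, file_offset) + 1
```

For a file with a three-line frontmatter block, body offset 0 maps to line 4, which is the number an editor would show for the first body line.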
 
 ## Frontmatter
 
@@ -213,6 +237,14 @@ Notes are chunked (~400 tokens), embedded with jina-embeddings-v2-base-zh, and i
 | `SEEKLINK_EMBEDDER_MODEL` | `jinaai/jina-embeddings-v2-base-zh` | Embedding model (fastembed-supported) |
 | `SEEKLINK_RERANKER_MODEL` | `mlx-community/Qwen3-Reranker-0.6B-mxfp8` | Reranker model (set to `""` to disable) |
 
+## What changed in v0.3
+
+- **Title-gated rerank blending**: when an exact title / alias hit drives rank 1, protect it from reranker demotion; otherwise fall back to pure reranker. Measured MRR gain of +4.5 pp over v0.2 on a 22-query blind test, with no regressions. See "How search works" above.
+- **Line-range retrieval**: `search` results now include `line_start` / `line_end`, and a new `seeklink get PATH[:LINE] [-l N]` command prints line-precise windows. Agents can find-then-read without slurping whole files.
+- **Cold-start / daemon parity fix**: cold-start `seeklink search` now constructs a `Reranker()` and passes it to the search pipeline. Previously the same query returned different rankings depending on whether the daemon was running.
+- **Frontmatter-aware line mapping**: chunk offsets (stored against frontmatter-stripped body) are remapped to full-file line numbers, so `search` + `get` report lines the way you'd see them in a text editor.
+- **Blind-test framework** at `tests/blind/`: 32-file corpus + 22 ground-truth queries + runner that measures Recall@10 / MRR / latency. Used to validate v0.3 before tagging; gates v0.4 (query expansion) the same way.
+
 ## What changed in v0.2
 
 - **CLI-first**: MCP server removed. All interaction via `seeklink search/index/status/daemon`.
 
@@ -0,0 +1,214 @@
+# Query expansion blind test framework (v2, post-review)
+
+## Purpose
+
+Validate that query expansion (planned for v0.4) actually improves real
+search quality against a CJK-first personal vault **before** we commit to
+shipping it. Query expansion introduces:
+
+- A new generative model (Qwen3-0.6B, ~500 MB quantized)
+- New dependency on MLX and/or llama.cpp for inference
+- Query-time latency (~200-500 ms extra per search)
+
+That cost is only justified if the win over baseline is real and consistent.
+If the test shows "indistinguishable from baseline" or "regresses on too
+many queries", **we cancel the feature** rather than ship it and hope
+nobody notices.
+
+## Three configurations
+
+| ID | Pipeline | What it measures |
+|----|----------|------------------|
+| **A** | seeklink search + reranker (daemon-path behavior) | Baseline. *Must* match product behavior — the runner constructs a real `Reranker()` and passes it to `search()`, same as `daemon.py` does. |
+| **B** | seeklink + Qwen3-0.6B expansion (v0.4 candidate) | Ship candidate |
+| **C** | seeklink + hand-crafted expansion, RRF-fused | Upper bound |
+
+A and C are fixed points. B's distance from C tells us how much the small
+model is leaving on the table; B's distance from A tells us whether
+shipping B is worth it at all.
+
+**Important**: config A is *not* "raw seeklink search without reranker".
+seeklink's cold-start CLI path has historically omitted the reranker (a
+known bug being fixed alongside v0.3); the daemon path has always used it.
+The blind test measures the daemon-path product behavior, because that's
+what users actually experience.
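
Config C fuses the result lists for the original query and its hand-crafted expansions with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows, assuming the conventional constant `k = 60` (the value seeklink uses is not stated here) and a hypothetical helper name `rrf_fuse`.

```python
from collections import defaultdict

RRF_K = 60  # conventional RRF constant; assumed, not taken from seeklink


def rrf_fuse(ranked_lists: list[list[str]], k: int = RRF_K,
             top_n: int = 10) -> list[str]:
    """Fuse several ranked path lists: score(path) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, path in enumerate(ranking, start=1):
            scores[path] += 1.0 / (k + rank)
    # Sort by descending score; break ties by path for a stable order.
    ordered = sorted(scores, key=lambda p: (-scores[p], p))
    return ordered[:top_n]
```

A path that appears in several lists accumulates score from each, so notes surfaced by both the original query and an expansion rise above notes found by only one.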
+
+## Runtime requirement
+
+`tests/blind/run.py` imports `yaml`. PyYAML is not currently in
+`pyproject.toml`. Before running, add it as a dev dependency:
+
+```toml
+[dependency-groups]
+dev = [
+    # ... existing ...
+    "pyyaml>=6.0",
+]
+```
+
+Then `uv sync --dev`.
+
+## Test data format
+
+`tests/blind/queries.yaml`:
+
+```yaml
+- query: "记忆保持力"
+  intent: "find notes about long-term memory retention techniques"
+  expected_paths:
+    - "notes/fsrs-algorithm.md"
+    - "notes/spaced-repetition.md"
+    - "logs/rhizome-dev/2026-W15.md"
+  tags: [cjk, common]
+  expansion:
+    - "间隔重复 遗忘曲线 FSRS"
+    - "how to retain memory long term"
+    - "通过间隔算法优化长期记忆保留"
+
+- query: "Zettelkasten vs 卡片盒笔记"
+  intent: "compare methodology in user's literature review"
+  expected_paths:
+    - "notes/zettelkasten.md"
+  tags: [cjk-en-mixed]
+  expansion:
+    - "atomic notes permanent notes"
+    - "卡片盒笔记法"
+```
+
+### How to build this file
+
+**20-30 queries total.** Fewer than 15 and single-query noise dominates the
+averages.
+
+1. Real-user queries only. Pull from shell history, rhizome logs, or
+   memory. No synthetic queries.
+2. For each, list 2-5 `expected_paths` whose absence from the top 10 would
+   annoy you. These are hard must-hits, not "would be nice".
+3. **Skip queries where a substring of the query exactly matches a note
+   title.** Those hit the title channel trivially and test nothing about
+   expansion. Prefer queries where notes use different vocabulary than the
+   query itself.
+4. Fill in `expansion:` with 2-3 hand-crafted alternates: lexical form,
+   semantic paraphrase, hypothetical answer sentence (HyDE style).
+5. Tag each query for slicing: `cjk`, `english`, `cjk-en-mixed`, `long`,
+   `short`, `ambiguous`, `technical`, `common`.
+
+**Ground-truth stability**: commit `queries.yaml` alongside a vault-state
+marker (e.g. the current `rhizome log` head SHA). If you re-run against an
+edited vault, note the drift.
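
A small pre-flight check can catch malformed entries before a run. The sketch below validates entries that have already been parsed (for example with `yaml.safe_load`). Treating all five keys as required is an assumption based on the examples above, and `validate_queries` is a hypothetical helper, not part of the runner.

```python
KNOWN_TAGS = {
    "cjk", "english", "cjk-en-mixed", "long",
    "short", "ambiguous", "technical", "common",
}
# Assumed required because every example entry above carries them.
REQUIRED_KEYS = ("query", "intent", "expected_paths", "tags", "expansion")


def validate_queries(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the file looks sane."""
    problems = []
    if not 20 <= len(entries) <= 30:
        problems.append(f"expected 20-30 queries, found {len(entries)}")
    for index, entry in enumerate(entries):
        label = entry.get("query") or f"entry #{index}"
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"{label}: missing or empty '{key}'")
        unknown = set(entry.get("tags") or []) - KNOWN_TAGS
        if unknown:
            problems.append(f"{label}: unknown tags {sorted(unknown)}")
    return problems
```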
+
+## Metrics
+
+For each `(query, config)` pair (recorded by the runner):
+
+- `hits` — top-10 result paths in rank order
+- `titles` — top-10 titles (for the human blind scorer)
+- `snippets` — top-10 content previews (for the human blind scorer)
+- `scores` — fused scores (not directly compared across configs)
+- `latency_ms` — wall-clock for the full query call chain (model load
+  excluded — runner initializes once and warms up)
+- `recall_at_10` — fraction of `expected_paths` in top-10
+- `mrr` — reciprocal rank of first expected hit in top-10 (0 if none)
+
+Aggregates:
+
+- Mean `recall_at_10`, mean `mrr`, mean `latency_ms`, p95 `latency_ms`
+- Per-query delta (`B - A`, `C - A`, `C - B`) → find where expansion hurts
+- Per-tag breakdown (computed offline from `results` JSON) — especially
+  `cjk` vs `english` to catch asymmetric wins/regressions
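
The per-query metrics and the p95 aggregate defined above can be written down directly. This is a sketch under stated assumptions: the function names are hypothetical, and p95 uses the nearest-rank method, which may differ from what `run.py` computes.

```python
def recall_at_k(hits: list[str], expected: list[str], k: int = 10) -> float:
    """Fraction of expected paths that appear in the top-k hits."""
    if not expected:
        return 0.0
    top = set(hits[:k])
    return sum(1 for path in expected if path in top) / len(expected)


def mrr_at_k(hits: list[str], expected: list[str], k: int = 10) -> float:
    """Reciprocal rank of the first expected hit in the top-k, 0 if none."""
    wanted = set(expected)
    for rank, path in enumerate(hits[:k], start=1):
        if path in wanted:
            return 1.0 / rank
    return 0.0


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (assumed method); values must be non-empty."""
    ordered = sorted(values)
    # ceil(0.95 * n) computed in integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]
```

With 22 queries, nearest-rank p95 is the 21st-smallest latency, so a single slow query does not set the gate but two do.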
+
+## Runner
+
+`tests/blind/run.py` loads `queries.yaml`, runs each query against one
+config, writes results JSON. Invocation:
+
+```bash
+# Baseline — works today
+python tests/blind/run.py \
+    --config A \
+    --queries tests/blind/queries.yaml \
+    --vault ~/Rhizome \
+    --out tests/blind/results/A.json
+
+# Ship candidate — requires v0.4 expansion hook (runner raises until then)
+python tests/blind/run.py --config B ...
+
+# Upper bound — uses hand-crafted `expansion:` field, fuses by RRF
+python tests/blind/run.py --config C ...
+
+# Diagnostic: baseline without reranker (NOT the official baseline)
+python tests/blind/run.py --config A --no-reranker ...
+```
+
+Runner:
+
+- Initializes `init_app(vault)` and `Reranker()` exactly **once** per
+  invocation (before the query loop). Warms the reranker with a dummy
+  call so the first measured latency isn't the model load.
+- Closes the DB once, in a `finally` block.
+- Records per-query latency using `time.perf_counter()`. Model-load time
+  is excluded by warmup.
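
The lifecycle described above (initialize once, warm up, time each query, always close) can be sketched without depending on seeklink's internals. In this sketch `search_fn`, `warmup_fn`, and `close_fn` are hypothetical callables standing in for the real search call, the dummy reranker call, and the DB close.

```python
import time
from typing import Callable


def run_queries(
    queries: list[str],
    search_fn: Callable[[str], list[str]],
    warmup_fn: Callable[[], None],
    close_fn: Callable[[], None],
) -> list[dict]:
    """Time each query after a single warmup; close resources exactly once."""
    results = []
    try:
        warmup_fn()  # pay model-load cost before any timed call
        for query in queries:
            started = time.perf_counter()
            hits = search_fn(query)
            latency_ms = (time.perf_counter() - started) * 1000.0
            results.append({"query": query, "hits": hits, "latency_ms": latency_ms})
    finally:
        close_fn()  # runs even if a query raises
    return results
```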
+
+The human blind-scoring pass is a separate script (not yet written): it
+takes `results/{A,B,C}.json`, shuffles the configs per query, presents one
+query plus 5 results (path + title + snippet) at a time without config
+labels, and records a 1-5 score per config.
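
Since the scoring script does not exist yet, the following is only one possible shape for its shuffling step. `build_blind_sheet` and its data layout are hypothetical; the point is that the scorer sees numbered slots while the slot-to-config key is kept separately.

```python
import random
from typing import Optional


def build_blind_sheet(
    results_by_config: dict[str, list[dict]],
    query: str,
    top_n: int = 5,
    seed: Optional[int] = None,
) -> tuple[list[dict], dict[int, str]]:
    """Return (sheet shown to the scorer, hidden key mapping slot -> config)."""
    rng = random.Random(seed)
    configs = list(results_by_config)
    rng.shuffle(configs)  # fresh order per query so slot numbers leak nothing
    sheet, key = [], {}
    for slot, config in enumerate(configs, start=1):
        key[slot] = config
        sheet.append({
            "slot": slot,
            "query": query,
            "results": results_by_config[config][:top_n],
        })
    return sheet, key
```

Passing a logged `seed` makes a scoring session reproducible without revealing the mapping to the scorer.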
+
+## Acceptance criteria for shipping B (query expansion feature)
+
+**All five must hold for B to ship:**
+
+1. **Mean Recall@10 of B ≥ mean Recall@10 of A + 0.10** (at least +10 pp lift)
+2. **B regresses on ≤ 20% of queries**, where a regression is a query with
+   Recall@10(B) < Recall@10(A)
+3. **Per-tag protection**: for each of the following tag buckets, B's mean
+   Recall@10 within that bucket must be ≥ A's mean within that bucket − 0.05:
+   - `cjk` (pure Chinese queries)
+   - `cjk-en-mixed`
+   - `english`
+   - `short` (≤ 2 tokens)
+   - `long` (≥ 6 tokens)
+   This catches "B crushes English queries, destroys CJK" — not OK for a
+   CJK-first vault.
+4. **Human blind score mean of B ≥ A + 0.5** on 1-5 scale
+5. **`p95(B) ≤ min(3 × p95(A), 2500 ms)`** — whichever bound is lower
+   binds. On current M3 + reranker-on hardware, `p95(A)` is ~1-2 s, so
+   `3 × p95(A)` is 3-6 s and the 2500 ms **absolute ceiling** is the real
+   gate. The `3×` term only starts binding if A itself gets faster (e.g.
+   future reranker optimization drops A's p95 below ~833 ms). Writing both
+   bounds protects against either regression direction.
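
The automatable criteria (1, 2, 3, and 5) can be checked mechanically from the results JSON; criterion 4 needs the human scores. The sketch below assumes a hypothetical row layout with `tags`, `recall_a`, and `recall_b` per query, and the function names are illustrative.

```python
PROTECTED_TAGS = ("cjk", "cjk-en-mixed", "english", "short", "long")


def latency_ceiling_ms(p95_a_ms: float) -> float:
    """Criterion 5: the lower of 3 x p95(A) and the 2500 ms absolute ceiling."""
    return min(3.0 * p95_a_ms, 2500.0)


def _mean(values: list[float]) -> float:
    return sum(values) / len(values)


def automated_gates(rows: list[dict], p95_a_ms: float, p95_b_ms: float) -> dict:
    """rows: one dict per query with 'tags', 'recall_a', 'recall_b' (assumed layout)."""
    mean_a = _mean([r["recall_a"] for r in rows])
    mean_b = _mean([r["recall_b"] for r in rows])
    regressed = sum(1 for r in rows if r["recall_b"] < r["recall_a"])
    per_tag_ok = True
    for tag in PROTECTED_TAGS:
        bucket = [r for r in rows if tag in r["tags"]]
        if not bucket:
            continue  # no queries carry this tag; nothing to protect
        bucket_a = _mean([r["recall_a"] for r in bucket])
        bucket_b = _mean([r["recall_b"] for r in bucket])
        if bucket_b < bucket_a - 0.05:
            per_tag_ok = False
    return {
        "recall_lift": mean_b >= mean_a + 0.10,
        "regression_rate": regressed / len(rows) <= 0.20,
        "per_tag": per_tag_ok,
        "latency": p95_b_ms <= latency_ceiling_ms(p95_a_ms),
    }
```

For example, `latency_ceiling_ms(1500)` is 2500 (the absolute ceiling binds), while `latency_ceiling_ms(500)` is 1500 (the relative bound binds).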
+
+**Cancel criteria** (any one triggers "do not ship B"):
+
+- B's Recall@10 is within noise of A (`|B - A| < 0.05` on mean, and no tag
+  bucket shows `> 0.10` improvement)
+- Per-tag failure: any tag bucket regresses by `> 0.05` on Recall@10
+- Latency p95 exceeds either the relative or absolute ceiling
+- Human score shows mixed signal: B is higher on some queries and lower on
+  others with no tag-level explanation
+
+**Sanity ceiling check**: if C is also indistinguishable from A (`|C - A|
+< 0.05`), expansion is not the problem — retrieval or embedder is.
+Abandon v0.4 and look at the embedder (v0.5+) or retrieval channels.
+
+## Open questions (resolve before the first real run)
+
+- **Ground truth scope.** Hard must-hit only, or "should appear" (weaker)?
+  → Propose: hard must-hit only. Weaker signal = more subjective.
+- **Expansion prompt template.** qmd uses `/no_think Expand this search
+  query: {query}` with GBNF output grammar, backed by a **fine-tuned**
+  Qwen3-1.7B. Base Qwen3-0.6B has no such training; needs a richer
+  few-shot prompt. Draft the prompt once; commit alongside queries.yaml.
+- **Inference backend for B.** MLX (macOS) or llama.cpp (cross-platform)?
+  → Run both, pick the one that hits the p95 budget. Record which.
+- **Randomness.** Qwen3 at temperature 0.7 is non-deterministic. Propose:
+  temperature 0.3, no manual seed, but log each query's actual expansions
+  in the `expansions_used` field for reproducibility. For B's final
+  acceptance run, consider `N=3` and report median.
+
+## Out of scope for this framework
+
+- Automated labeling (no — Simon labels ground truth by hand)
+- CI-integrated regression (no — this is a pre-release gate, not a
+  continuous monitor)
+- Comparison against external tools (qmd, ripgrep, etc.) — different
+  vaults, apples to oranges