simonsysun
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 12 additions & 4 deletions
diff --git a/CHANGELOG.zh.md b/CHANGELOG.zh.md
Lines changed: 9 additions & 3 deletions
diff --git a/docs/blind-test.md b/docs/blind-test.md
Lines changed: 10 additions & 9 deletions
diff --git a/seeklink/reranker.py b/seeklink/reranker.py
Lines changed: 83 additions & 4 deletions
diff --git a/seeklink/search.py b/seeklink/search.py
Lines changed: 23 additions & 5 deletions
@@ -12,6 +12,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [0.6.0] - 2026-05-04
 
 ### Changed
+- `seeklink search --rerank-k auto` now uses a middle reranker budget for
+  general CJK technical queries and reserves the deepest budget for filtered
+  searches and chunk/vector-index style CJK queries, reducing optional MLX
+  reranker latency while preserving the bundled blind-fixture quality gates.
+- The optional MLX Qwen3 reranker now scores `yes`/`no` through a two-token
+  classifier head when the model supports tied embeddings, avoiding full
+  vocabulary logits while keeping a legacy fallback via
+  `SEEKLINK_RERANK_SCORING=legacy`.
 - Added copy-paste agent setup guidance to README and clarified `llms.txt`
   discovery cues for local Markdown vault retrieval.
 - Expanded PyPI keywords for agent, local-search, Markdown-search, and
@@ -34,10 +42,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Release verification now includes an 8-query filtered fixture. On the bundled
   fixture vault with reranking disabled, it reports Recall@10 1.000 and
   Answerable@10 1.000; the regular 22-query fixture remains at Recall@10
-  0.985, MRR 0.977, and nDCG@10 0.901 with the optional MLX reranker active.
+  0.985, MRR 0.977, and nDCG@10 0.902 with the optional MLX reranker active.
 - Refreshed `tests/blind/results/` with v0.6 release-quality snapshots: the
   v0.5 baseline, v0.6 shipping run, v0.6 filtered fixture, and v0.6 expansion
-  upper bound.
+  reference.
 
 ## [0.5.0] - 2026-05-04
 
@@ -130,7 +138,7 @@ surface so the repo reads as a shipped tool rather than a work log.
 
 ### Changed
 - Consolidated the 0.3.0 / 0.3.1 narrative into a single release entry (this one). The earlier entries described the same code twice with process detail that did not belong in public release notes.
-- Trimmed `tests/blind/results/` to release-quality baseline, shipping, and upper-bound measurements. Intermediate iteration results removed.
+- Trimmed `tests/blind/results/` to release-quality baseline, shipping, and expansion-reference measurements. Intermediate iteration results removed.
 - Tightened internal code comments and test docstrings so they describe current behavior rather than the iteration history that produced it.
 - README metric claims explicitly labeled as "pilot" with sample size.
 
@@ -144,7 +152,7 @@ surface so the repo reads as a shipped tool rather than a work log.
 - **Line-range retrieval.** `SearchResult` now carries 1-indexed inclusive `line_start` / `line_end` fields mapped through the indexer's frontmatter strip back to on-disk line numbers. CLI `search` prints `SCORE  PATH:LINE  TITLE` so agents can pipe the hit into a precise window read. A new `seeklink get PATH[:LINE] [-l N]` command performs that window read directly from the filesystem — no DB round-trip, no daemon involvement, universal-newline translation, path-escape rejection.
 - **Cold-start `search` reranker parity.** `seeklink search --vault PATH` (the cold-start path) now constructs a reranker and passes it to the search pipeline, matching the daemon. Before this change, the same query returned different rankings depending on whether a daemon happened to be running.
 - **Agent-first documentation.** New "For agents" section in the README (minimum workflow, output contract, exit codes, query-shape hints, daemon JSON fallback). `llms.txt` rewritten as an explicit contract.
-- **Blind-test framework** at `tests/blind/`: 32-file bilingual (CJK + English) fixture corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner that cold-starts once per invocation and measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: `A` (current baseline), `B` (planned query expansion — not yet shipped), `C` (hand-crafted expansion, RRF-fused; upper bound). Used to gate this release.
+- **Blind-test framework** at `tests/blind/`: 32-file bilingual (CJK + English) fixture corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner that cold-starts once per invocation and measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: `A` (current baseline), `B` (planned query expansion — not yet shipped), `C` (hand-crafted expansion, RRF-fused reference). Used to gate this release.
 
 ### Fixed
 - **`seeklink get` trailing-newline accounting.** `get FILE:LINE` on a newline-terminated file no longer counts the trailing `\n` as an extra logical line. `get FILE:6` on a 5-line file correctly emits the beyond-EOF warning instead of returning a blank line.
 
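Note on the `SEEKLINK_RERANK_SCORING` fallback in the changelog entry above: the `seeklink/reranker.py` hunk in this diff normalizes the value leniently (strip, casefold, `-` to `_`) and falls back to `auto` on anything unrecognized. A minimal standalone sketch of that normalization follows. The function name `normalize_scoring_mode` and the logger name are hypothetical and are not part of the project.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("seeklink.reranker.sketch")  # hypothetical logger name

# Mode names as they appear in the reranker.py hunk of this diff.
VALID_SCORING_MODES = {"auto", "cls_head", "legacy"}


def normalize_scoring_mode(raw: str | None) -> str:
    """Return a valid scoring mode, falling back to "auto" (hypothetical helper)."""
    if raw is None:
        return "auto"
    # Same normalization steps as the diff: strip, casefold, dash to underscore.
    mode = raw.strip().casefold().replace("-", "_")
    if mode not in VALID_SCORING_MODES:
        logger.warning("Unsupported scoring mode %r; using auto", raw)
        return "auto"
    return mode
```

Under this sketch, `normalize_scoring_mode(os.environ.get("SEEKLINK_RERANK_SCORING"))` would accept `legacy`, `Legacy`, or `cls-head` and treat a typo as `auto` after logging a warning.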
@@ -12,6 +12,12 @@
 ## [0.6.0] - 2026-05-04
 
 ### 变更
+- `seeklink search --rerank-k auto` 现在会对一般中文技术查询使用中等
+  reranker 预算，只把最深预算保留给过滤搜索和 chunk / 向量索引类中文查询，
+  从而在保持捆绑盲测质量门槛的同时降低可选 MLX reranker 延迟。
+- 可选的 MLX Qwen3 reranker 现在会在模型支持 tied embeddings 时通过
+  两个 token 的分类头计算 `yes` / `no` 分数，避免生成完整词表 logits；
+  如有需要仍可通过 `SEEKLINK_RERANK_SCORING=legacy` 回退到旧路径。
 - 在 README 中新增可直接复制给 agent 的配置说明，并在 `llms.txt` 中强化本地 Markdown 笔记库检索的发现提示。
 - 扩展 PyPI 关键词，覆盖 agent、本地搜索、Markdown 搜索和 llms.txt 发现路径。
 - `seeklink search` 和单文件 `seeklink index` 现在支持 `--no-daemon`；
@@ -24,8 +30,8 @@
 
 ### 开发
 - 盲测 runner 现在支持 source 级 folder/tag 过滤、filtered-vector 诊断，以及可选的 answerability 标签，用于检查 top-10 命中是否真的包含 agent 需要的答案文本。
-- 发布验证现在包含一个 8 条查询的过滤检索 fixture。在捆绑 fixture vault 上关闭 reranker 时，它的 Recall@10 为 1.000，Answerable@10 为 1.000；常规 22 条查询 fixture 在启用可选 MLX reranker 时仍为 Recall@10 0.985、MRR 0.977、nDCG@10 0.901。
-- 将 `tests/blind/results/` 刷新为 v0.6 发布质量快照：v0.5 baseline、v0.6 shipping run、v0.6 filtered fixture，以及 v0.6 expansion upper bound。
+- 发布验证现在包含一个 8 条查询的过滤检索 fixture。在捆绑 fixture vault 上关闭 reranker 时，它的 Recall@10 为 1.000，Answerable@10 为 1.000；常规 22 条查询 fixture 在启用可选 MLX reranker 时仍为 Recall@10 0.985、MRR 0.977、nDCG@10 0.902。
+- 将 `tests/blind/results/` 刷新为 v0.6 发布质量快照：v0.5 baseline、v0.6 shipping run、v0.6 filtered fixture，以及 v0.6 expansion reference。
 
 ## [0.5.0] - 2026-05-04
 
@@ -86,7 +92,7 @@
 
 ### 变更
 - 将 0.3.0 / 0.3.1 的内容合并为单一发布条目（即本条）。早前的条目用流程细节描述了同一份代码，这些细节不属于公开的发布说明。
-- 精简 `tests/blind/results/`，仅保留发布质量的 baseline、shipped 和 upper-bound 测量结果。移除了中间迭代结果。
+- 精简 `tests/blind/results/`，仅保留发布质量的 baseline、shipped 和 expansion-reference 测量结果。移除了中间迭代结果。
 - 收紧了内部代码注释和测试文档字符串，使其描述当前行为而非产生它的迭代历史。
 - README 中的度量数据声明显式标注为"pilot"并附样本量。
 
 
@@ -24,11 +24,12 @@ private-vault measurements. Do not commit intermediate experiment output.
 |---|---|
 | `A` | Current product behavior: hybrid search plus the default reranker path when available |
 | `B` | Candidate query-expansion path, reserved for future experiments |
-| `C` | Hand-written expansion upper bound using the `expansion:` field in `queries.yaml` |
+| `C` | Hand-written expansion reference using the `expansion:` field in `queries.yaml` |
 
-Config `A` is the release baseline. Config `C` answers whether expansion has
-headroom. Config `B` should not ship unless it beats `A` on quality without
-breaking the latency budget.
+Config `A` is the release baseline. Config `C` is a reference check for whether
+hand-written expansion looks promising; it can underperform `A` when expansion
+drifts or adds latency. Config `B` should not ship unless it beats `A` on
+quality without breaking the latency budget.
 
 ## Query Format
 
@@ -158,14 +159,14 @@ uv run python tests/blind/run.py \
   --out .scratch/blind/A_rerank5.json
 ```
 
-Run the hand-written expansion upper bound:
+Run the hand-written expansion reference:
 
 ```bash
 uv run python tests/blind/run.py \
   --config C \
   --queries tests/blind/queries.yaml \
   --vault tests/corpus \
-  --out .scratch/blind/C_upper_bound.json
+  --out .scratch/blind/C_reference.json
 ```
 
 Only copy a result into `tests/blind/results/` when it is the final
@@ -197,8 +198,8 @@ candidate uses config `B`, require all of the following before shipping it:
 4. Human blind review prefers `B` by at least 0.5 points on a 1-5 scale.
 5. p95 latency is at most `min(3 * p95(A), 2500ms)`.
 
-If config `C` is also close to `A`, expansion probably is not the right lever;
-look at chunking, metadata, filters, or the embedder instead.
+If config `C` does not clearly beat `A`, expansion probably is not the right
+lever; look at chunking, metadata, filters, or the embedder instead.
 
 ## Public vs Private Results
 
@@ -207,7 +208,7 @@ Public repo:
 - fixture vault
 - labeled fixture queries
 - runner code
-- final baseline / shipping / upper-bound reference results
+- final baseline / shipping / expansion-reference results
 
 Private or local only:
 
 
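Note on the latency criterion in the shipping checklist above (`min(3 * p95(A), 2500ms)`): the gate can be read as a small function. This is a sketch only; the function names are hypothetical and the blind-test runner may compute the check differently.

```python
def p95_latency_budget_ms(p95_a_ms: float, cap_ms: float = 2500.0) -> float:
    """Budget for config B: three times A's p95, capped at cap_ms."""
    return min(3.0 * p95_a_ms, cap_ms)


def within_latency_gate(p95_b_ms: float, p95_a_ms: float) -> bool:
    """True when B's p95 latency is at most the budget derived from A."""
    return p95_b_ms <= p95_latency_budget_ms(p95_a_ms)
```

With illustrative (not measured) numbers: if `A` had a p95 of 400 ms, `B` would be allowed up to 1200 ms; if `A` had a p95 of 1000 ms, the 2500 ms cap would apply instead of 3000 ms.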
@@ -34,6 +34,11 @@
     "Given a web search query, retrieve relevant passages that answer the query."
 )
 _MAX_PASSAGE_TOKENS = 200
+_SCORING_MODE_ENV = "SEEKLINK_RERANK_SCORING"
+_SCORING_AUTO = "auto"
+_SCORING_CLS_HEAD = "cls_head"
+_SCORING_LEGACY = "legacy"
+_VALID_SCORING_MODES = {_SCORING_AUTO, _SCORING_CLS_HEAD, _SCORING_LEGACY}
 
 
 class Reranker:
@@ -51,8 +56,23 @@ def __init__(self) -> None:
         self._tokenizer = None
         self._token_yes: int | None = None
         self._token_no: int | None = None
+        self._cls_weight = None
         self._lock = threading.Lock()
         self._disabled = self.MODEL_NAME == ""
+        self._scoring_mode = (
+            os.environ.get(_SCORING_MODE_ENV, _SCORING_AUTO)
+            .strip()
+            .casefold()
+            .replace("-", "_")
+        )
+        if self._scoring_mode not in _VALID_SCORING_MODES:
+            logger.warning(
+                "Unsupported %s=%r; using %s",
+                _SCORING_MODE_ENV,
+                self._scoring_mode,
+                _SCORING_AUTO,
+            )
+            self._scoring_mode = _SCORING_AUTO
 
     @property
     def disabled(self) -> bool:
@@ -70,6 +90,8 @@ def _ensure_model(self) -> None:
                 self._model, self._tokenizer = mlx_lm.load(self.MODEL_NAME)
                 self._token_yes = self._tokenizer.convert_tokens_to_ids("yes")
                 self._token_no = self._tokenizer.convert_tokens_to_ids("no")
+                if self._scoring_mode != _SCORING_LEGACY:
+                    self._prepare_cls_head()
             except Exception as e:
                 logger.warning(
                     "Reranker load failed (%s): %s — reranking disabled",
@@ -78,6 +100,33 @@ def _ensure_model(self) -> None:
                 )
                 self._disabled = True
 
+    def _prepare_cls_head(self) -> None:
+        """Prepare a two-token classifier head when the MLX model supports it."""
+        try:
+            import mlx.core as mx
+
+            if self._model is None or self._token_yes is None or self._token_no is None:
+                return
+            body = getattr(self._model, "model", None)
+            embed_tokens = getattr(body, "embed_tokens", None)
+            if body is None or embed_tokens is None:
+                if self._scoring_mode == _SCORING_CLS_HEAD:
+                    logger.warning(
+                        "Reranker cls_head scoring unavailable; using legacy logits"
+                    )
+                return
+            yes_no = embed_tokens(mx.array([self._token_yes, self._token_no]))
+            self._cls_weight = yes_no[0] - yes_no[1]
+            mx.eval(self._cls_weight)
+        except Exception as e:
+            self._cls_weight = None
+            if self._scoring_mode == _SCORING_CLS_HEAD:
+                logger.warning(
+                    "Reranker cls_head scoring failed to initialize (%s); "
+                    "using legacy logits",
+                    e,
+                )
+
     def _token_list(self, text: str) -> list[int]:
         """Tokenize text into a flat Python list."""
         tokens = self._tokenizer.encode(text, return_tensors=None)
@@ -106,10 +155,7 @@ def _truncate_passage(self, passage: str) -> str:
         # Conservative fallback for unusual tokenizers without decode().
         return passage[:1200]
 
-    def _score_one(self, query: str, passage: str) -> float:
-        """Score a single (query, passage) pair. Returns 0-1 probability."""
-        import mlx.core as mx
-
+    def _pair_tokens(self, query: str, passage: str) -> list[int]:
         passage = self._truncate_passage(passage)
         prompt = (
             f"Instruct: {_DEFAULT_INSTRUCTION}\n"
@@ -123,6 +169,30 @@ def _score_one(self, query: str, passage: str) -> float:
         text += "<think>\n"
 
         tokens = self._token_list(text)
+        return tokens
+
+    def _score_one_cls_head(self, tokens: list[int]) -> float | None:
+        """Score as sigmoid(final_hidden . (yes_embedding - no_embedding))."""
+        if self._cls_weight is None:
+            return None
+
+        import mlx.core as mx
+
+        input_ids = mx.array([tokens])
+        hidden = self._model.model(input_ids)
+        last_h = hidden[0, -1, :]
+        logit = mx.sum(last_h * self._cls_weight)
+        mx.eval(logit)
+        value = logit.item()
+        if value >= 0:
+            return 1.0 / (1.0 + math.exp(-value))
+        exp_value = math.exp(value)
+        return exp_value / (1.0 + exp_value)
+
+    def _score_one_legacy(self, tokens: list[int]) -> float:
+        """Score via full-vocabulary yes/no token logits."""
+        import mlx.core as mx
+
         input_ids = mx.array([tokens])
         logits = self._model(input_ids)
         last_logits = logits[0, -1, :]
@@ -135,6 +205,15 @@ def _score_one(self, query: str, passage: str) -> float:
         no_e = math.exp(no_s - max_s)
         return yes_e / (yes_e + no_e)
 
+    def _score_one(self, query: str, passage: str) -> float:
+        """Score a single (query, passage) pair. Returns 0-1 probability."""
+        tokens = self._pair_tokens(query, passage)
+        if self._scoring_mode != _SCORING_LEGACY:
+            score = self._score_one_cls_head(tokens)
+            if score is not None:
+                return score
+        return self._score_one_legacy(tokens)
+
     def rerank(
         self, query: str, passages: list[str]
     ) -> list[float] | None:
 
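Note on why the two scoring paths in `seeklink/reranker.py` should agree: assuming tied embeddings with no output bias, the logit for token `t` is the dot product of the final hidden state with that token's embedding row, so a two-way softmax over the `yes` and `no` logits equals a sigmoid of the hidden state dotted with `(yes_embedding - no_embedding)`. The sketch below checks that identity with plain Python lists. The vectors are made-up illustrative values and the helper names are hypothetical; it does not use MLX or the real model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, same branching as the diff."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_two_token_head(hidden, e_yes, e_no) -> float:
    """sigmoid(h . (e_yes - e_no)): needs only one precomputed vector."""
    diff = [y - n for y, n in zip(e_yes, e_no)]
    return sigmoid(dot(hidden, diff))


def score_two_way_softmax(hidden, e_yes, e_no) -> float:
    """Softmax over the yes/no logits, as in the legacy path."""
    yes_s, no_s = dot(hidden, e_yes), dot(hidden, e_no)
    max_s = max(yes_s, no_s)
    yes_e, no_e = math.exp(yes_s - max_s), math.exp(no_s - max_s)
    return yes_e / (yes_e + no_e)


# Illustrative values only (assumption: tied embeddings, no bias term).
hidden = [0.5, -1.0, 2.0]
e_yes = [1.0, 0.0, 0.5]
e_no = [0.0, 1.0, -0.5]

head_score = score_two_token_head(hidden, e_yes, e_no)
softmax_score = score_two_way_softmax(hidden, e_yes, e_no)
```

The identity does not hold for a model with a separate, untied output projection, which is presumably why the changelog limits the new path to models that support tied embeddings.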
@@ -26,6 +26,7 @@
 
 RerankK = int | Literal["auto"]
 AUTO_RERANK_FAST_K = 5
+AUTO_RERANK_MID_K = 10
 AUTO_RERANK_DEEP_K = 20
 FILTERED_VEC_K_CAP_DEFAULT = 5000
 FILTERED_VEC_K_CAP_ENV = "SEEKLINK_FILTERED_VEC_K_CAP"
@@ -51,6 +52,15 @@
     "embedding",
     "vector",
 )
+_CJK_DEEP_RERANK_TERMS = (
+    "切块",
+    "分块",
+    "向量库",
+    "嵌入",
+    "embedding",
+    "chunk",
+    "vector database",
+)
 _METADATA_COMPANION_STOPWORDS = frozenset({
     "and",
     "are",
@@ -235,6 +245,11 @@ def _contains_technical_rerank_term(text: str) -> bool:
     return any(term in folded for term in _CJK_TECHNICAL_RERANK_TERMS)
 
 
+def _contains_deep_rerank_term(text: str) -> bool:
+    folded = text.casefold()
+    return any(term in folded for term in _CJK_DEEP_RERANK_TERMS)
+
+
 def _resolve_rerank_k_with_reason(
     query: str,
     rerank_k: RerankK,
@@ -244,10 +259,11 @@ def _resolve_rerank_k_with_reason(
 ) -> tuple[int, str]:
     """Resolve a numeric rerank budget for one query.
 
-    The default CLI path uses "auto", a conservative policy from the 22-query
-    pilot: English, source-metadata, and ordinary CJK lookups got most of
-    the reranker benefit by reranking only the top 5, while CJK / mixed
-    technical queries needed deeper candidates to recover recall.
+    The default CLI path uses "auto", a conservative policy derived from the
+    bundled blind fixture. English, source-metadata, and ordinary CJK lookups
+    get most of the reranker benefit from the top 5; CJK technical lookups use
+    a mid budget. Filtered searches and chunk/vector-index style CJK queries
+    use the deepest budget; the latter may need lower-ranked candidates.
     """
     if isinstance(rerank_k, int):
         return rerank_k, "fixed"
@@ -259,8 +275,10 @@ def _resolve_rerank_k_with_reason(
         return AUTO_RERANK_DEEP_K, "filter"
     if title_ranks:
         return AUTO_RERANK_FAST_K, "title"
+    if _contains_cjk(query) and _contains_deep_rerank_term(query):
+        return AUTO_RERANK_DEEP_K, "cjk_deep"
     if _contains_cjk(query) and _contains_technical_rerank_term(query):
-        return AUTO_RERANK_DEEP_K, "cjk_technical"
+        return AUTO_RERANK_MID_K, "cjk_technical"
     return AUTO_RERANK_FAST_K, "default"
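
Note on the resulting `auto` policy in `seeklink/search.py`: a simplified standalone sketch of the branch order is shown below. It is not the project's function. `contains_cjk` is a hypothetical stand-in for `_contains_cjk` (not shown in this diff), `TECHNICAL_TERMS` contains only the two entries visible in the hunk context, and the `filtered` / `title_hit` flags replace the real arguments.

```python
from __future__ import annotations

import re

AUTO_RERANK_FAST_K = 5
AUTO_RERANK_MID_K = 10
AUTO_RERANK_DEEP_K = 20

# Deep terms copied from the diff.
DEEP_TERMS = ("切块", "分块", "向量库", "嵌入", "embedding", "chunk", "vector database")
# Assumption: partial list; the diff shows only the tail of the real tuple.
TECHNICAL_TERMS = ("embedding", "vector")

# Assumption: a basic CJK Unified Ideographs check; the real helper may differ.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def contains_cjk(text: str) -> bool:
    return bool(_CJK_RE.search(text))


def resolve_auto_k(
    query: str, *, filtered: bool = False, title_hit: bool = False
) -> tuple[int, str]:
    """Return (budget, reason) following the branch order in the diff."""
    if filtered:
        return AUTO_RERANK_DEEP_K, "filter"
    if title_hit:
        return AUTO_RERANK_FAST_K, "title"
    folded = query.casefold()
    if contains_cjk(query):
        # The deep check must run first: "embedding" appears in both tuples.
        if any(term in folded for term in DEEP_TERMS):
            return AUTO_RERANK_DEEP_K, "cjk_deep"
        if any(term in folded for term in TECHNICAL_TERMS):
            return AUTO_RERANK_MID_K, "cjk_technical"
    return AUTO_RERANK_FAST_K, "default"
```

Under these assumptions, a CJK query that mentions only `vector` gets the mid budget of 10, while one that mentions `向量库` or `embedding` gets the deep budget of 20, and an English-only query stays at 5.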