Commit 8d34499

perf(reranker): reduce MLX rerank latency
1 parent a7a96a2 commit 8d34499

9 files changed

Lines changed: 711 additions & 589 deletions

CHANGELOG.md

Lines changed: 12 additions & 4 deletions
@@ -12,6 +12,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [0.6.0] - 2026-05-04

 ### Changed
+- `seeklink search --rerank-k auto` now uses a middle reranker budget for
+  general CJK technical queries and reserves the deepest budget for filtered
+  searches and chunk/vector-index style CJK queries, reducing optional MLX
+  reranker latency while preserving the bundled blind-fixture quality gates.
+- The optional MLX Qwen3 reranker now scores `yes`/`no` through a two-token
+  classifier head when the model supports tied embeddings, avoiding full
+  vocabulary logits while keeping a legacy fallback via
+  `SEEKLINK_RERANK_SCORING=legacy`.
 - Added copy-paste agent setup guidance to README and clarified `llms.txt`
   discovery cues for local Markdown vault retrieval.
 - Expanded PyPI keywords for agent, local-search, Markdown-search, and
@@ -34,10 +42,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Release verification now includes an 8-query filtered fixture. On the bundled
   fixture vault with reranking disabled, it reports Recall@10 1.000 and
   Answerable@10 1.000; the regular 22-query fixture remains at Recall@10
-  0.985, MRR 0.977, and nDCG@10 0.901 with the optional MLX reranker active.
+  0.985, MRR 0.977, and nDCG@10 0.902 with the optional MLX reranker active.
 - Refreshed `tests/blind/results/` with v0.6 release-quality snapshots: the
   v0.5 baseline, v0.6 shipping run, v0.6 filtered fixture, and v0.6 expansion
-  upper bound.
+  reference.

 ## [0.5.0] - 2026-05-04

@@ -130,7 +138,7 @@ surface so the repo reads as a shipped tool rather than a work log.

 ### Changed
 - Consolidated the 0.3.0 / 0.3.1 narrative into a single release entry (this one). The earlier entries described the same code twice with process detail that did not belong in public release notes.
-- Trimmed `tests/blind/results/` to release-quality baseline, shipping, and upper-bound measurements. Intermediate iteration results removed.
+- Trimmed `tests/blind/results/` to release-quality baseline, shipping, and expansion-reference measurements. Intermediate iteration results removed.
 - Tightened internal code comments and test docstrings so they describe current behavior rather than the iteration history that produced it.
 - README metric claims explicitly labeled as "pilot" with sample size.

@@ -144,7 +152,7 @@ surface so the repo reads as a shipped tool rather than a work log.
 - **Line-range retrieval.** `SearchResult` now carries 1-indexed inclusive `line_start` / `line_end` fields mapped through the indexer's frontmatter strip back to on-disk line numbers. CLI `search` prints `SCORE PATH:LINE TITLE` so agents can pipe the hit into a precise window read. A new `seeklink get PATH[:LINE] [-l N]` command performs that window read directly from the filesystem — no DB round-trip, no daemon involvement, universal-newline translation, path-escape rejection.
 - **Cold-start `search` reranker parity.** `seeklink search --vault PATH` (the cold-start path) now constructs a reranker and passes it to the search pipeline, matching the daemon. Before this change, the same query returned different rankings depending on whether a daemon happened to be running.
 - **Agent-first documentation.** New "For agents" section in the README (minimum workflow, output contract, exit codes, query-shape hints, daemon JSON fallback). `llms.txt` rewritten as an explicit contract.
-- **Blind-test framework** at `tests/blind/`: 32-file bilingual (CJK + English) fixture corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner that cold-starts once per invocation and measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: `A` (current baseline), `B` (planned query expansion — not yet shipped), `C` (hand-crafted expansion, RRF-fused; upper bound). Used to gate this release.
+- **Blind-test framework** at `tests/blind/`: 32-file bilingual (CJK + English) fixture corpus (`tests/corpus/`), 22 ground-truth queries (`tests/blind/queries.yaml`), runner that cold-starts once per invocation and measures `recall_at_10` / `mrr` / `latency_ms` / `p95`. Three configurations: `A` (current baseline), `B` (planned query expansion — not yet shipped), `C` (hand-crafted expansion, RRF-fused reference). Used to gate this release.

 ### Fixed
 - **`seeklink get` trailing-newline accounting.** `get FILE:LINE` on a newline-terminated file no longer counts the trailing `\n` as an extra logical line. `get FILE:6` on a 5-line file correctly emits the beyond-EOF warning instead of returning a blank line.

CHANGELOG.zh.md

Lines changed: 9 additions & 3 deletions
@@ -12,6 +12,12 @@
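 ## [0.6.0] - 2026-05-04

 ### Changed
+- `seeklink search --rerank-k auto` now uses a medium reranker budget for general Chinese technical queries
+  and reserves the deepest budget for filtered searches and chunk / vector-index style Chinese queries,
+  reducing optional MLX reranker latency while preserving the bundled blind-test quality gates.
+- The optional MLX Qwen3 reranker now computes `yes` / `no` scores through a two-token
+  classifier head when the model supports tied embeddings, avoiding full-vocabulary logits;
+  the old path remains available via `SEEKLINK_RERANK_SCORING=legacy` if needed.
 - Added agent setup instructions to the README that can be copied directly to an agent, and strengthened the discovery cues in `llms.txt` for local Markdown vault retrieval.
 - Expanded PyPI keywords to cover agent, local search, Markdown search, and llms.txt discovery paths.
 - `seeklink search` and single-file `seeklink index` now support `--no-daemon`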
@@ -24,8 +30,8 @@

 ### Development
 - The blind-test runner now supports source-level folder/tag filtering, filtered-vector diagnostics, and optional answerability labels for checking whether top-10 hits actually contain the answer text an agent needs.
-- Release verification now includes an 8-query filtered-retrieval fixture. On the bundled fixture vault with the reranker disabled, its Recall@10 is 1.000 and Answerable@10 is 1.000; the regular 22-query fixture, with the optional MLX reranker enabled, remains at Recall@10 0.985, MRR 0.977, and nDCG@10 0.901.
-- Refreshed `tests/blind/results/` with v0.6 release-quality snapshots: v0.5 baseline, v0.6 shipping run, v0.6 filtered fixture, and v0.6 expansion upper bound.
+- Release verification now includes an 8-query filtered-retrieval fixture. On the bundled fixture vault with the reranker disabled, its Recall@10 is 1.000 and Answerable@10 is 1.000; the regular 22-query fixture, with the optional MLX reranker enabled, remains at Recall@10 0.985, MRR 0.977, and nDCG@10 0.902.
+- Refreshed `tests/blind/results/` with v0.6 release-quality snapshots: v0.5 baseline, v0.6 shipping run, v0.6 filtered fixture, and v0.6 expansion reference.

 ## [0.5.0] - 2026-05-04

@@ -86,7 +92,7 @@
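
 ### Changed
 - Merged the 0.3.0 / 0.3.1 content into a single release entry (this one). The earlier entries described the same code with process detail that does not belong in public release notes.
-- Trimmed `tests/blind/results/` to keep only release-quality baseline, shipped, and upper-bound measurements. Intermediate iteration results were removed.
+- Trimmed `tests/blind/results/` to keep only release-quality baseline, shipped, and expansion-reference measurements. Intermediate iteration results were removed.
 - Tightened internal code comments and test docstrings so they describe current behavior rather than the iteration history that produced it.
 - Metric claims in the README are explicitly labeled "pilot" with the sample size.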

docs/blind-test.md

Lines changed: 10 additions & 9 deletions
@@ -24,11 +24,12 @@ private-vault measurements. Do not commit intermediate experiment output.
 |---|---|
 | `A` | Current product behavior: hybrid search plus the default reranker path when available |
 | `B` | Candidate query-expansion path, reserved for future experiments |
-| `C` | Hand-written expansion upper bound using the `expansion:` field in `queries.yaml` |
+| `C` | Hand-written expansion reference using the `expansion:` field in `queries.yaml` |

-Config `A` is the release baseline. Config `C` answers whether expansion has
-headroom. Config `B` should not ship unless it beats `A` on quality without
-breaking the latency budget.
+Config `A` is the release baseline. Config `C` is a reference check for whether
+hand-written expansion looks promising; it can underperform `A` when expansion
+drifts or adds latency. Config `B` should not ship unless it beats `A` on
+quality without breaking the latency budget.

 ## Query Format

@@ -158,14 +159,14 @@ uv run python tests/blind/run.py \
   --out .scratch/blind/A_rerank5.json
 ```

-Run the hand-written expansion upper bound:
+Run the hand-written expansion reference:

 ```bash
 uv run python tests/blind/run.py \
   --config C \
   --queries tests/blind/queries.yaml \
   --vault tests/corpus \
-  --out .scratch/blind/C_upper_bound.json
+  --out .scratch/blind/C_reference.json
 ```

 Only copy a result into `tests/blind/results/` when it is the final
@@ -197,8 +198,8 @@ candidate uses config `B`, require all of the following before shipping it:
 4. Human blind review prefers `B` by at least 0.5 points on a 1-5 scale.
 5. p95 latency is at most `min(3 * p95(A), 2500ms)`.

-If config `C` is also close to `A`, expansion probably is not the right lever;
-look at chunking, metadata, filters, or the embedder instead.
+If config `C` does not clearly beat `A`, expansion probably is not the right
+lever; look at chunking, metadata, filters, or the embedder instead.

 ## Public vs Private Results

@@ -207,7 +208,7 @@ Public repo:
 - fixture vault
 - labeled fixture queries
 - runner code
-- final baseline / shipping / upper-bound reference results
+- final baseline / shipping / expansion-reference results

 Private or local only:
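
The p95 latency condition in the config `B` gate of `docs/blind-test.md`, `min(3 * p95(A), 2500ms)`, can be sketched as a small check. This is only an illustration of the arithmetic: the function names are hypothetical and are not part of the blind-test runner, and the sample latencies are made up.

```python
def p95_budget_ms(p95_a_ms: float, hard_cap_ms: float = 2500.0) -> float:
    """Latency budget for config B: three times A's p95, capped at 2500 ms."""
    return min(3.0 * p95_a_ms, hard_cap_ms)


def within_latency_gate(p95_b_ms: float, p95_a_ms: float) -> bool:
    """True when B's p95 latency is at most the budget derived from A."""
    return p95_b_ms <= p95_budget_ms(p95_a_ms)


# Illustrative numbers only: with a fast baseline the 3x multiplier binds,
# with a slow baseline the 2500 ms cap binds.
print(p95_budget_ms(400.0))
print(p95_budget_ms(1000.0))
print(within_latency_gate(1100.0, 400.0))
print(within_latency_gate(2600.0, 1000.0))
```

The cap matters once the baseline's p95 exceeds roughly 833 ms, because `3 * p95(A)` then passes 2500 ms.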

seeklink/reranker.py

Lines changed: 83 additions & 4 deletions
@@ -34,6 +34,11 @@
     "Given a web search query, retrieve relevant passages that answer the query."
 )
 _MAX_PASSAGE_TOKENS = 200
+_SCORING_MODE_ENV = "SEEKLINK_RERANK_SCORING"
+_SCORING_AUTO = "auto"
+_SCORING_CLS_HEAD = "cls_head"
+_SCORING_LEGACY = "legacy"
+_VALID_SCORING_MODES = {_SCORING_AUTO, _SCORING_CLS_HEAD, _SCORING_LEGACY}


 class Reranker:
@@ -51,8 +56,23 @@ def __init__(self) -> None:
         self._tokenizer = None
         self._token_yes: int | None = None
         self._token_no: int | None = None
+        self._cls_weight = None
         self._lock = threading.Lock()
         self._disabled = self.MODEL_NAME == ""
+        self._scoring_mode = (
+            os.environ.get(_SCORING_MODE_ENV, _SCORING_AUTO)
+            .strip()
+            .casefold()
+            .replace("-", "_")
+        )
+        if self._scoring_mode not in _VALID_SCORING_MODES:
+            logger.warning(
+                "Unsupported %s=%r; using %s",
+                _SCORING_MODE_ENV,
+                self._scoring_mode,
+                _SCORING_AUTO,
+            )
+            self._scoring_mode = _SCORING_AUTO

     @property
     def disabled(self) -> bool:
@@ -70,6 +90,8 @@ def _ensure_model(self) -> None:
                 self._model, self._tokenizer = mlx_lm.load(self.MODEL_NAME)
                 self._token_yes = self._tokenizer.convert_tokens_to_ids("yes")
                 self._token_no = self._tokenizer.convert_tokens_to_ids("no")
+                if self._scoring_mode != _SCORING_LEGACY:
+                    self._prepare_cls_head()
             except Exception as e:
                 logger.warning(
                     "Reranker load failed (%s): %s — reranking disabled",
@@ -78,6 +100,33 @@ def _ensure_model(self) -> None:
                 )
                 self._disabled = True

+    def _prepare_cls_head(self) -> None:
+        """Prepare a two-token classifier head when the MLX model supports it."""
+        try:
+            import mlx.core as mx
+
+            if self._model is None or self._token_yes is None or self._token_no is None:
+                return
+            body = getattr(self._model, "model", None)
+            embed_tokens = getattr(body, "embed_tokens", None)
+            if body is None or embed_tokens is None:
+                if self._scoring_mode == _SCORING_CLS_HEAD:
+                    logger.warning(
+                        "Reranker cls_head scoring unavailable; using legacy logits"
+                    )
+                return
+            yes_no = embed_tokens(mx.array([self._token_yes, self._token_no]))
+            self._cls_weight = yes_no[0] - yes_no[1]
+            mx.eval(self._cls_weight)
+        except Exception as e:
+            self._cls_weight = None
+            if self._scoring_mode == _SCORING_CLS_HEAD:
+                logger.warning(
+                    "Reranker cls_head scoring failed to initialize (%s); "
+                    "using legacy logits",
+                    e,
+                )
+
     def _token_list(self, text: str) -> list[int]:
         """Tokenize text into a flat Python list."""
         tokens = self._tokenizer.encode(text, return_tensors=None)
@@ -106,10 +155,7 @@ def _truncate_passage(self, passage: str) -> str:
         # Conservative fallback for unusual tokenizers without decode().
         return passage[:1200]

-    def _score_one(self, query: str, passage: str) -> float:
-        """Score a single (query, passage) pair. Returns 0-1 probability."""
-        import mlx.core as mx
-
+    def _pair_tokens(self, query: str, passage: str) -> list[int]:
         passage = self._truncate_passage(passage)
         prompt = (
             f"Instruct: {_DEFAULT_INSTRUCTION}\n"
@@ -123,6 +169,30 @@ def _score_one(self, query: str, passage: str) -> float:
             text += "<think>\n"

         tokens = self._token_list(text)
+        return tokens
+
+    def _score_one_cls_head(self, tokens: list[int]) -> float | None:
+        """Score via final hidden state dot (yes_embedding - no_embedding)."""
+        if self._cls_weight is None:
+            return None
+
+        import mlx.core as mx
+
+        input_ids = mx.array([tokens])
+        hidden = self._model.model(input_ids)
+        last_h = hidden[0, -1, :]
+        logit = mx.sum(last_h * self._cls_weight)
+        mx.eval(logit)
+        value = logit.item()
+        if value >= 0:
+            return 1.0 / (1.0 + math.exp(-value))
+        exp_value = math.exp(value)
+        return exp_value / (1.0 + exp_value)
+
+    def _score_one_legacy(self, tokens: list[int]) -> float:
+        """Score via full-vocabulary yes/no token logits."""
+        import mlx.core as mx
+
         input_ids = mx.array([tokens])
         logits = self._model(input_ids)
         last_logits = logits[0, -1, :]
@@ -135,6 +205,15 @@ def _score_one(self, query: str, passage: str) -> float:
         no_e = math.exp(no_s - max_s)
         return yes_e / (yes_e + no_e)

+    def _score_one(self, query: str, passage: str) -> float:
+        """Score a single (query, passage) pair. Returns 0-1 probability."""
+        tokens = self._pair_tokens(query, passage)
+        if self._scoring_mode != _SCORING_LEGACY:
+            score = self._score_one_cls_head(tokens)
+            if score is not None:
+                return score
+        return self._score_one_legacy(tokens)
+
     def rerank(
         self, query: str, passages: list[str]
     ) -> list[float] | None:
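
The equivalence that the `cls_head` path in `seeklink/reranker.py` relies on can be checked on toy numbers. The sketch below uses plain Python lists rather than MLX. It assumes tied embeddings and an output projection without bias, so each vocabulary logit is the dot product of the final hidden state with that token's embedding row. All values and the helper names `dot`, `score_legacy`, and `score_cls_head` are made up for illustration.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def stable_sigmoid(value: float) -> float:
    """Sigmoid that avoids overflow in math.exp for large |value|."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def score_legacy(hidden, embeddings, yes_id, no_id) -> float:
    """Compute every vocabulary logit, then softmax over the yes/no pair."""
    logits = [dot(hidden, row) for row in embeddings]  # full vocabulary
    yes_s, no_s = logits[yes_id], logits[no_id]
    max_s = max(yes_s, no_s)
    yes_e = math.exp(yes_s - max_s)
    no_e = math.exp(no_s - max_s)
    return yes_e / (yes_e + no_e)


def score_cls_head(hidden, embeddings, yes_id, no_id) -> float:
    """One dot product against (yes_embedding - no_embedding), then sigmoid."""
    cls_weight = [y - n for y, n in zip(embeddings[yes_id], embeddings[no_id])]
    return stable_sigmoid(dot(hidden, cls_weight))


# Toy 4-token vocabulary with 3-dimensional embeddings (illustrative values).
embeddings = [
    [0.2, -0.1, 0.4],
    [1.0, 0.5, -0.3],   # "yes"
    [-0.6, 0.8, 0.1],   # "no"
    [0.3, 0.3, 0.3],
]
hidden = [0.7, -1.2, 0.5]
YES_ID, NO_ID = 1, 2

legacy = score_legacy(hidden, embeddings, YES_ID, NO_ID)
fast = score_cls_head(hidden, embeddings, YES_ID, NO_ID)
print(math.isclose(legacy, fast, rel_tol=1e-9))
```

A two-way softmax depends only on the logit difference, and `h·e_yes - h·e_no = h·(e_yes - e_no)`, so both paths should agree up to floating-point error. The second path never touches the other vocabulary rows.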

seeklink/search.py

Lines changed: 23 additions & 5 deletions
@@ -26,6 +26,7 @@

 RerankK = int | Literal["auto"]
 AUTO_RERANK_FAST_K = 5
+AUTO_RERANK_MID_K = 10
 AUTO_RERANK_DEEP_K = 20
 FILTERED_VEC_K_CAP_DEFAULT = 5000
 FILTERED_VEC_K_CAP_ENV = "SEEKLINK_FILTERED_VEC_K_CAP"
@@ -51,6 +52,15 @@
     "embedding",
     "vector",
 )
+_CJK_DEEP_RERANK_TERMS = (
+    "切块",
+    "分块",
+    "向量库",
+    "嵌入",
+    "embedding",
+    "chunk",
+    "vector database",
+)
 _METADATA_COMPANION_STOPWORDS = frozenset({
     "and",
     "are",
@@ -235,6 +245,11 @@ def _contains_technical_rerank_term(text: str) -> bool:
     return any(term in folded for term in _CJK_TECHNICAL_RERANK_TERMS)


+def _contains_deep_rerank_term(text: str) -> bool:
+    folded = text.casefold()
+    return any(term in folded for term in _CJK_DEEP_RERANK_TERMS)
+
+
 def _resolve_rerank_k_with_reason(
     query: str,
     rerank_k: RerankK,
@@ -244,10 +259,11 @@ def _resolve_rerank_k_with_reason(
 ) -> tuple[int, str]:
     """Resolve a numeric rerank budget for one query.

-    The default CLI path uses "auto", a conservative policy from the 22-query
-    pilot: English, source-metadata, and ordinary CJK lookups got most of
-    the reranker benefit by reranking only the top 5, while CJK / mixed
-    technical queries needed deeper candidates to recover recall.
+    The default CLI path uses "auto", a conservative policy from the bundled
+    blind fixture: English, source-metadata, and ordinary CJK lookups get most
+    of the reranker benefit from the top 5. CJK technical lookups use a mid
+    budget, and only chunk/vector-index style CJK queries use the deepest
+    budget because they may need candidates below the shallow first-stage head.
     """
     if isinstance(rerank_k, int):
         return rerank_k, "fixed"
@@ -259,8 +275,10 @@ def _resolve_rerank_k_with_reason(
         return AUTO_RERANK_DEEP_K, "filter"
     if title_ranks:
         return AUTO_RERANK_FAST_K, "title"
+    if _contains_cjk(query) and _contains_deep_rerank_term(query):
+        return AUTO_RERANK_DEEP_K, "cjk_deep"
     if _contains_cjk(query) and _contains_technical_rerank_term(query):
-        return AUTO_RERANK_DEEP_K, "cjk_technical"
+        return AUTO_RERANK_MID_K, "cjk_technical"
     return AUTO_RERANK_FAST_K, "default"

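
The three-tier `auto` policy in `seeklink/search.py` can be exercised with a self-contained sketch. It is a simplified stand-in, not the project's function: `resolve_auto_rerank_k`, its `filtered` and `title_hit` flags, and the regex-based CJK check are hypothetical (the real `_contains_cjk` and the filter condition are not visible in the diff), and `TECHNICAL_TERMS` holds only the two entries of the technical-term tuple that the diff shows.

```python
import re

AUTO_RERANK_FAST_K = 5
AUTO_RERANK_MID_K = 10
AUTO_RERANK_DEEP_K = 20

# Assumption: a CJK Unified Ideographs range check stands in for _contains_cjk.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")

# Copied from the diff.
DEEP_TERMS = ("切块", "分块", "向量库", "嵌入", "embedding", "chunk", "vector database")
# Only the tail of the real tuple is visible in the diff; illustrative subset.
TECHNICAL_TERMS = ("embedding", "vector")


def resolve_auto_rerank_k(
    query: str, *, filtered: bool = False, title_hit: bool = False
) -> tuple[int, str]:
    """Pick a rerank budget and a reason label, checking deepest-first for CJK."""
    folded = query.casefold()
    has_cjk = bool(_CJK_RE.search(query))
    if filtered:
        return AUTO_RERANK_DEEP_K, "filter"
    if title_hit:
        return AUTO_RERANK_FAST_K, "title"
    if has_cjk and any(term in folded for term in DEEP_TERMS):
        return AUTO_RERANK_DEEP_K, "cjk_deep"
    if has_cjk and any(term in folded for term in TECHNICAL_TERMS):
        return AUTO_RERANK_MID_K, "cjk_technical"
    return AUTO_RERANK_FAST_K, "default"


for q in ("向量库 怎么选", "如何调优 vector 检索", "vector database tuning"):
    print(q, resolve_auto_rerank_k(q))
```

The order of the checks matters. `"embedding"` appears in both term lists, so the deep check has to run before the technical check, otherwise a CJK query mentioning embeddings would be assigned the mid budget.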
