Commit b3e4a5a

Docs updates to reranker section from audit workflow (#228)
* Fix bug with preparing artifact outputs
* Add reranking manifest
* Update workflow automation doc
* Update reranker docs based on audit
1 parent 3aba8e6 commit b3e4a5a

8 files changed

Lines changed: 240 additions & 32 deletions

docs/reranking/custom-reranker.mdx

Lines changed: 15 additions & 3 deletions
@@ -5,9 +5,16 @@ description: Learn how to create custom rerankers in LanceDB by extending the ba
55
icon: "code"
66
---
77

8-
You can build your own custom reranker in LanceDB by subclassing the `Reranker` class and implementing the
9-
`rerank_hybrid()` method. Optionally, you can also implement the `rerank_vector()` and `rerank_fts()`
10-
methods if you want to support reranking for vector and FTS search separately.
8+
You can build your own custom reranker in LanceDB by subclassing the base `Reranker` class. At a
9+
minimum, you need to implement `rerank_hybrid()`, which is the logic that combines vector and
10+
full-text search results. Beyond that, you can optionally implement `rerank_vector()` and
11+
`rerank_fts()` if you want your reranker to also handle pure vector or pure full-text searches.
12+
13+
Decide up front which surfaces — hybrid, pure vector, or pure full-text — your reranker should
14+
cover, and only override the ones you need. The base class leaves `rerank_vector()` and
15+
`rerank_fts()` unimplemented, so calling `.rerank(...)` on a single-modality search you haven't
16+
overridden raises `NotImplementedError` rather than silently returning unsorted results. That's a
17+
useful guard, but worth knowing about before you wire up a query path you didn't plan for.
1118

1219
## Interface
1320

@@ -18,6 +25,11 @@ first copy of the row encountered. This works well in cases that don't require t
1825
and full-text search to combine the results. If you want to use the scores or want to support
1926
`return_score="all"`, you'll need to implement your own merging algorithm.
2027

28+
Whichever methods you override, your reranker has one job on the way out: attach a
29+
`_relevance_score` column with the most relevant rows at the top. LanceDB will reject the result
30+
if that column is missing, and downstream `.limit(...)` calls trust the order you return, so
31+
sort descending before handing the table back.
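To make that output contract concrete, here is a minimal, library-free sketch. It assumes rows are plain dictionaries standing in for the result table, and `finalize` is a hypothetical helper name, not part of LanceDB.

```python
from typing import Dict, List

Row = Dict[str, object]


def finalize(rows: List[Row], scores: List[float]) -> List[Row]:
    """Attach `_relevance_score` to each row and sort best-first.

    Hypothetical helper: plain dicts stand in for the result table.
    """
    if len(rows) != len(scores):
        raise ValueError("need exactly one score per row")
    scored = [{**row, "_relevance_score": score} for row, score in zip(rows, scores)]
    # Sort descending so that any later limit keeps the most relevant rows.
    return sorted(scored, key=lambda row: row["_relevance_score"], reverse=True)


rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
ranked = finalize(rows, [0.2, 0.9, 0.5])
print([row["text"] for row in ranked])
```

In a real reranker the same two steps (add the column, sort descending) would be applied to the table your `rerank_*` method returns.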
32+
2133
Below, we show the pseudocode of a custom reranker that combines the results of semantic and full-text
2234
search using a linear combination of the scores:
2335

docs/reranking/eval.mdx

Lines changed: 19 additions & 2 deletions
@@ -15,9 +15,26 @@ Combining results from multiple searches thus requires a reranking step.
1515

1616
There are two common approaches for reranking search results from multiple sources.
1717

18-
- **Score-based**: Calculate final relevance scores based on a weighted linear combination of individual search algorithm scores. Example: Weighted linear combination of semantic search & keyword-based search results.
18+
- **Score-based**: Calculate final relevance scores from the individual search algorithm scores. Examples: Reciprocal Rank Fusion (the default in LanceDB), and weighted linear combination of semantic & keyword-based search scores.
1919

20-
- **Relevance-based**: Discards the existing scores and calculates the relevance of each search result-query pair. Example: Cross Encoder models
20+
- **Relevance-based**: Discards the existing scores and calculates the relevance of each search result-query pair. Example: Cross Encoder models
21+
22+
<Info>
23+
If you call `.rerank()` on a hybrid query without passing a reranker, LanceDB defaults to
24+
`RRFReranker()` — a score-based reranker that uses Reciprocal Rank Fusion. This is the
25+
score-based path most readers encounter first; `LinearCombinationReranker` is an alternative
26+
score-based strategy you opt into explicitly.
27+
</Info>
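For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal sketch of the general technique, not LanceDB's implementation. The function name `rrf`, the document ids, and the constant `k = 60` (a commonly used value) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf(result_lists: Sequence[Sequence[Hashable]], k: int = 60) -> List[Tuple[Hashable, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["d1", "d2", "d3"]  # hypothetical vector-search ranking
fts_hits = ["d3", "d1", "d4"]     # hypothetical full-text ranking
fused = rrf([vector_hits, fts_hits])
print([doc_id for doc_id, _ in fused])
```

Because only ranks enter the formula, the raw vector distances and full-text scores never need to be on the same scale.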
28+
29+
The hybrid `rerank(...)` method also accepts a `normalize` argument that controls how the raw
30+
vector and FTS scores are made comparable before reranking:
31+
32+
- `normalize="score"` (the default) — normalizes the raw vector and FTS scores directly.
33+
- `normalize="rank"` — converts each result list to ranks first, then normalizes.
34+
35+
This choice materially affects score-based rerankers (such as `LinearCombinationReranker`), so
36+
when you evaluate score-based strategies, treat `normalize` as a tunable hyperparameter
37+
alongside the reranker itself.
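The difference between the two modes can be sketched without any library. This example assumes min-max scaling for both modes and higher-is-better scores; LanceDB's exact normalization is not shown in this diff, so treat `min_max` and `to_ranks` as hypothetical helpers.

```python
from typing import List


def min_max(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def to_ranks(values: List[float]) -> List[float]:
    """Replace each score by its 1-based rank (1 = highest score)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks


fts_scores = [12.0, 3.0, 2.5]  # hypothetical scores with one outlier

# "score"-style: scale the raw scores directly; the outlier dominates.
score_norm = min_max(fts_scores)

# "rank"-style: convert to ranks first, then scale so the best rank maps to 1.0.
rank_norm = [1.0 - r for r in min_max(to_ranks(fts_scores))]

print(score_norm)
print(rank_norm)
```

With score-style normalization the second result ends up close to the third; with rank-style normalization the results are evenly spaced. A weighted combination of scores will therefore behave differently under the two modes.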
2138

2239
Even though there may be many more strategies for reranking, there are no "universally best"
2340
ones that work well for all cases, because they can be dataset- or application-specific.

docs/reranking/index.mdx

Lines changed: 22 additions & 4 deletions
@@ -42,6 +42,16 @@ LanceDB supports several rerankers out of the box. Here are a few examples:
4242

4343
You can find more details about these and other rerankers in the [integrations](/integrations/reranking) section.
4444

45+
<Note>
46+
**SDK coverage differs across languages**
47+
48+
The provider-specific rerankers in the table above
49+
(`CohereReranker`, `CrossEncoderReranker`, `ColbertReranker`, and others under `lancedb.rerankers`)
50+
are currently **Python-only**. The TypeScript and Rust SDKs currently expose the generic `Reranker`
51+
interface (`rerankHybrid` / `rerank_hybrid`) and the built-in `RRFReranker`. To use a
52+
model-based reranker from TypeScript or Rust, you must implement the `Reranker` interface yourself.
53+
</Note>
54+
4555

4656
### Multi-vector reranking
4757
Most rerankers support reranking based on multiple vectors. To rerank based on multiple vectors, you can pass a list of vectors to the `rerank` method. Here's an example of how to rerank based on multiple vector columns using the `CrossEncoderReranker`:
@@ -54,14 +64,22 @@ reranker = CrossEncoderReranker()
5464

5565
query = "hello"
5666

57-
res1 = table.search(query, vector_column_name="vector").limit(3)
58-
res2 = table.search(query, vector_column_name="text_vector").limit(3)
59-
res3 = table.search(query, vector_column_name="meta_vector").limit(3)
67+
# `deduplicate=True` requires `_rowid` on every input result set,
68+
# so call `.with_row_id(True)` on each search before passing it in.
69+
res1 = table.search(query, vector_column_name="vector").limit(3).with_row_id(True)
70+
res2 = table.search(query, vector_column_name="text_vector").limit(3).with_row_id(True)
71+
res3 = table.search(query, vector_column_name="meta_vector").limit(3).with_row_id(True)
6072

61-
reranked = reranker.rerank_multivector([res1, res2, res3], deduplicate=True)
73+
reranked = reranker.rerank_multivector([res1, res2, res3], deduplicate=True)
6274
```
6375
</CodeGroup>
6476

77+
- Passing `deduplicate=True` to `rerank_multivector(...)` raises a `ValueError` if any of the
78+
input result sets is missing the `_rowid` column. Therefore, it's recommended to add `.with_row_id(True)` to every
79+
`table.search(...)` call before reranking, or omit `deduplicate=True` if you don't need it.
80+
- `RRFReranker.rerank_multivector(...)` always requires `_rowid` on its inputs, regardless of
81+
the `deduplicate` flag.
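The reason deduplication depends on `_rowid` can be illustrated with a small stand-in. This is a sketch under assumptions, not LanceDB's code: result sets are lists of dictionaries, `merge_results` is a hypothetical function, and the first copy of each row is kept.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_results(result_sets: List[List[Row]], deduplicate: bool = False) -> List[Row]:
    """Concatenate result sets, optionally dropping repeated rows by `_rowid`."""
    merged = [row for result_set in result_sets for row in result_set]
    if not deduplicate:
        return merged
    # Without a row id there is no reliable key to detect duplicates.
    if any("_rowid" not in row for row in merged):
        raise ValueError("deduplicate=True requires `_rowid` on every input row")
    seen = set()
    unique = []
    for row in merged:
        if row["_rowid"] not in seen:
            seen.add(row["_rowid"])
            unique.append(row)  # keep the first copy encountered
    return unique


res1 = [{"_rowid": 1, "text": "a"}, {"_rowid": 2, "text": "b"}]
res2 = [{"_rowid": 2, "text": "b"}, {"_rowid": 3, "text": "c"}]
deduped = merge_results([res1, res2], deduplicate=True)
print([row["_rowid"] for row in deduped])
```

Passing a result set whose rows lack `_rowid` together with `deduplicate=True` raises `ValueError` in this sketch, mirroring the behavior described above.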
82+
6583
## Creating Custom Rerankers
6684

6785
LanceDB also allows you to create custom rerankers by extending the base `Reranker` class. The custom reranker

workflows/docs-audit/README.md

Lines changed: 15 additions & 7 deletions
@@ -115,14 +115,17 @@ So `--area indexing` maps to `manifests/indexing.toml`. If you add `manifests/se
115115
uv run python scripts/run_audit.py prepare --area search --refresh
116116
```
117117

118-
This creates a new run directory under `artifacts/runs/<run_id>/` and prints a JSON summary to stdout.
118+
This creates a pending run directory under `artifacts/pending/<run_id>/` and prints a JSON summary to stdout.
119119

120-
After the LLM phase writes the expected outputs into that run directory, complete the run with:
120+
After the LLM phase writes the expected outputs into that pending run directory, complete the run with:
121121

122122
```bash
123123
uv run python scripts/run_audit.py complete --run-id <run_id>
124124
```
125125

126+
Completion publishes the directory to `artifacts/runs/<run_id>/`. Directories under `artifacts/runs/`
127+
are completed audit artifacts and should contain `report.md`.
128+
126129
To clean up old generated run artifacts, use:
127130

128131
```bash
@@ -146,7 +149,7 @@ uv run python scripts/run_audit.py prepare \
146149

147150
## Inspecting Artifacts
148151

149-
Each run directory contains:
152+
Each completed run directory under `artifacts/runs/<run_id>/` contains:
150153

151154
- `metadata.json`: run-level metadata, repo refresh results, selection decisions
152155
- `page_bundles/*.json`: deterministic evidence bundles per page
@@ -156,6 +159,10 @@ Each run directory contains:
156159

157160
`artifacts/latest_run.json` points to the most recently completed run.
158161

162+
Pending run directories under `artifacts/pending/<run_id>/` are working directories from `prepare`.
163+
They are used for manifest validation and LLM drafting, and are not considered completed artifacts
164+
until `complete` publishes them.
165+
159166
## Using and Updating Area Manifests
160167

161168
The manifest is the only thing you usually need to change when you want to audit another docs domain. Treat it as a mapping file:
@@ -272,7 +279,7 @@ A practical workflow:
272279
5. Replace each source block with a small set of relevant files.
273280
6. Make sure every `applies_to` entry refers to a real page `id`.
274281
7. Add the area to `enabled_areas` in `config.toml` if your automation depends on that list.
275-
8. Run `prepare --area <new-area>` and inspect the generated `page_bundles/*.json`.
282+
8. Run `prepare --area <new-area>` and inspect the generated `page_bundles/*.json` in the printed pending `run_dir`.
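Step 6 of this workflow can be checked mechanically. The sketch below is a hypothetical helper, not part of `run_audit.py`; it assumes the manifest has already been parsed into a dictionary with `pages` and `sources` lists, each shaped like the TOML tables in a manifest file.

```python
from typing import Dict, List, Tuple


def dangling_applies_to(manifest: Dict) -> List[Tuple[str, str]]:
    """Return (source id, target) pairs whose target is not a declared page id."""
    page_ids = {page["id"] for page in manifest.get("pages", [])}
    problems = []
    for source in manifest.get("sources", []):
        for target in source.get("applies_to", []):
            if target not in page_ids:
                problems.append((source["id"], target))
    return problems


# Hypothetical manifest content, already parsed from TOML.
manifest = {
    "pages": [{"id": "overview"}, {"id": "custom-reranker"}],
    "sources": [
        {"id": "core-api", "applies_to": ["overview", "custom-reranker"]},
        {"id": "tests", "applies_to": ["overview", "evaluation"]},
    ],
}
problems = dangling_applies_to(manifest)
print(problems)
```

An empty result means every `applies_to` entry points at a real page `id`. Here the `tests` source refers to `evaluation`, which the example manifest does not declare.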
276283

277284
### 6. Sanity-check the manifest before using it weekly
278285

@@ -284,9 +291,10 @@ uv run python scripts/run_audit.py prepare --area <new-area>
284291

285292
Then inspect:
286293

287-
- `artifacts/runs/<run_id>/metadata.json`
288-
- `artifacts/runs/<run_id>/selected_pages.json`
289-
- `artifacts/runs/<run_id>/page_bundles/*.json`
294+
- the `run_dir` printed by `prepare` (normally `artifacts/pending/<run_id>`)
295+
- `<run_dir>/metadata.json`
296+
- `<run_dir>/selected_pages.json`
297+
- `<run_dir>/page_bundles/*.json`
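If you script this inspection step, a small helper can pull `run_dir` out of the JSON summary and confirm it is a pending directory. This is a sketch under assumptions: the summary is taken to be a JSON object with a `run_dir` key, and the run id `example-run` is made up.

```python
import json
from pathlib import PurePosixPath
from typing import List


def pending_run_dir(summary_json: str) -> PurePosixPath:
    """Extract `run_dir` from the summary and check it sits under `pending/`."""
    summary = json.loads(summary_json)
    run_dir = PurePosixPath(summary["run_dir"])  # assumed key name
    if run_dir.parent.name != "pending":
        raise ValueError(f"expected a pending run directory, got {run_dir}")
    return run_dir


def files_to_inspect(run_dir: PurePosixPath) -> List[str]:
    """List the files named in the checklist above, relative to run_dir."""
    return [str(run_dir / name) for name in ("metadata.json", "selected_pages.json")]


summary_text = '{"run_dir": "artifacts/pending/example-run"}'  # hypothetical summary
run_dir = pending_run_dir(summary_text)
print(files_to_inspect(run_dir))
```

Failing fast when the path is not under `pending/` guards against reading or writing a published directory under `artifacts/runs/` by mistake.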
290298

291299
If the bundles look noisy, the fix is usually one of:
292300

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
1+
name = "reranking"
2+
description = "Audit the user-guide reranking docs against public SDK reranker APIs, tested behavior, and user-facing examples in lancedb and docs snippets/tests."
3+
docs_repo = "docs"
4+
rotation_unit = "page"
5+
keywords = [
6+
"rerank_hybrid",
7+
"rerank_vector",
8+
"rerank_fts",
9+
"rerank_multivector",
10+
"return_score",
11+
"RRFReranker",
12+
"MRRReranker",
13+
"LinearCombinationReranker",
14+
]
15+
16+
[[pages]]
17+
id = "overview"
18+
title = "Reranking Search Results"
19+
path = "docs/reranking/index.mdx"
20+
keywords = ["reranking", "CohereReranker", "CrossEncoderReranker", "ColbertReranker", "rerank_multivector", "deduplicate"]
21+
22+
[[pages]]
23+
id = "custom-reranker"
24+
title = "Building Custom Rerankers"
25+
path = "docs/reranking/custom-reranker.mdx"
26+
keywords = ["custom reranker", "Reranker", "rerank_hybrid", "rerank_vector", "rerank_fts", "merge_results", "return_score"]
27+
28+
[[pages]]
29+
id = "evaluation"
30+
title = "Evaluating Hybrid Search Performance"
31+
path = "docs/reranking/eval.mdx"
32+
keywords = ["hybrid search", "reranking strategies", "score-based", "relevance-based", "Linear Combination", "Cross Encoder", "Cohere", "ColBERT"]
33+
34+
[[sources]]
35+
id = "lancedb-python-reranker-core"
36+
repo = "lancedb"
37+
kind = "public_python_api"
38+
applies_to = ["overview", "custom-reranker", "evaluation"]
39+
paths = [
40+
"python/python/lancedb/rerankers/__init__.py",
41+
"python/python/lancedb/rerankers/base.py",
42+
"python/python/lancedb/rerankers/linear_combination.py",
43+
"python/python/lancedb/rerankers/mrr.py",
44+
"python/python/lancedb/rerankers/rrf.py",
45+
"python/python/lancedb/query.py",
46+
]
47+
extract_keywords = ["Reranker", "rerank_hybrid", "rerank_vector", "rerank_fts", "merge_results", "rerank_multivector", "return_score", "_relevance_score", "RRFReranker", "MRRReranker", "LinearCombinationReranker"]
48+
49+
[[sources]]
50+
id = "lancedb-python-reranker-tests"
51+
repo = "lancedb"
52+
kind = "public_python_tests"
53+
applies_to = ["overview", "custom-reranker", "evaluation"]
54+
paths = [
55+
"python/python/tests/test_rerankers.py",
56+
"python/python/tests/test_hybrid_query.py",
57+
]
58+
extract_keywords = ["return_score", "_relevance_score", "rerank_multivector", "deduplicate", "RRFReranker", "MRRReranker", "LinearCombinationReranker", "CohereReranker", "CrossEncoderReranker", "ColbertReranker"]
59+
60+
[[sources]]
61+
id = "lancedb-python-provider-rerankers-overview"
62+
repo = "lancedb"
63+
kind = "public_python_api"
64+
applies_to = ["overview"]
65+
paths = [
66+
"python/python/lancedb/rerankers/cohere.py",
67+
"python/python/lancedb/rerankers/colbert.py",
68+
"python/python/lancedb/rerankers/cross_encoder.py",
69+
]
70+
extract_keywords = ["CohereReranker", "ColbertReranker", "CrossEncoderReranker", "model_name", "return_score", "_relevance_score"]
71+
72+
[[sources]]
73+
id = "lancedb-typescript-rust-rerankers"
74+
repo = "lancedb"
75+
kind = "typescript_rust_api"
76+
applies_to = ["overview", "custom-reranker"]
77+
paths = [
78+
"nodejs/lancedb/rerankers/index.ts",
79+
"nodejs/lancedb/rerankers/rrf.ts",
80+
"nodejs/__test__/rerankers.test.ts",
81+
"rust/lancedb/src/rerankers.rs",
82+
"rust/lancedb/src/rerankers/rrf.rs",
83+
"rust/lancedb/src/query.rs",
84+
]
85+
extract_keywords = ["Reranker", "RRFReranker", "rerankHybrid", "rerank_hybrid", "_relevance_score", "custom reranker"]
86+
87+
[[sources]]
88+
id = "lancedb-generated-js-reranker-docs"
89+
repo = "lancedb"
90+
kind = "generated_api_docs"
91+
applies_to = ["overview"]
92+
paths = [
93+
"docs/src/js/namespaces/rerankers/README.md",
94+
"docs/src/js/namespaces/rerankers/interfaces/Reranker.md",
95+
"docs/src/js/namespaces/rerankers/classes/RRFReranker.md",
96+
"docs/src/js/classes/VectorQuery.md",
97+
]
98+
extract_keywords = ["rerankers", "Reranker", "RRFReranker", "create", "rerank"]
99+
100+
[[sources]]
101+
id = "sophon-reranking-example-surface"
102+
repo = "sophon"
103+
kind = "enterprise_surface"
104+
applies_to = ["overview"]
105+
paths = [
106+
"src/dash/src/components/examples/components/ExampleCards.tsx",
107+
]
108+
extract_keywords = ["reranking", "RRF", "hybrid-search", "custom reranking"]

workflows/docs-audit/prompts/weekly_automation.md

Lines changed: 8 additions & 4 deletions
@@ -39,23 +39,27 @@ Then read the manifest file for each area listed in `enabled_areas` in `config.t
3939
- `uv run python scripts/run_audit.py prepare --area <first-area> --refresh`
4040
- For subsequent areas in the same weekly run, skip the refresh to avoid repeating `git pull`:
4141
- `uv run python scripts/run_audit.py prepare --area <next-area>`
42-
4. Read the JSON summary printed by each `prepare` command and locate each new run directory.
43-
5. For each run directory, read `selected_pages.json` and the corresponding files in `page_bundles/`.
42+
4. Read the JSON summary printed by each `prepare` command and locate each pending run directory.
43+
- Use the printed `run_dir`; it should point under `artifacts/pending/<run_id>`.
44+
- Do not create or write directly under `artifacts/runs/<run_id>` before completion.
45+
5. For each pending run directory, read `selected_pages.json` and the corresponding files in `page_bundles/`.
4446
6. For each selected page bundle:
4547
- apply `prompts/page_audit_guidelines.md` as the page-level review rubric
4648
- infer normalized code claims from the evidence bundle
4749
- infer normalized doc claims from the docs bundle
4850
- identify only the missing documentation
49-
7. Write semantic outputs under `llm_outputs/` in each run directory.
51+
7. Write semantic outputs under `llm_outputs/` in each pending run directory.
5052
- one file per page for code claims
5153
- one file per page for doc claims
5254
- one file per page for candidate gaps
53-
8. Write `report.md` in each run directory.
55+
8. Write `report.md` in each pending run directory.
5456
- `report.md` is the docs-gap summary only.
5557
- Do not include refresh status, manifest-maintenance notes, selected-pages bookkeeping, or any other workflow narration in `report.md`.
5658
- Include operational notes only if they materially affected audit quality, such as an unrefreshable repo, missing source files, or a manifest ambiguity that changes confidence in the findings.
5759
9. Complete each run:
5860
- `uv run python scripts/run_audit.py complete --run-id <run_id>`
61+
- Completion publishes the pending directory to `artifacts/runs/<run_id>` and updates `artifacts/latest_run.json`.
62+
- Only completed runs with `report.md` should appear under `artifacts/runs/`.
5963
10. Return a concise markdown summary suitable for the Codex inbox item.
6064

6165
## Manifest maintenance rules
