Commit 7d261e7

prrao87 and claude authored
Document fine-grained IVF search controls and custom index names (#220)
* Document fine-grained nprobes, distance_range, bypass_vector_index, and custom index names

  - Replace Search Configuration bullets with markdown tables covering core knobs and per-index-type nprobes guidance
  - Document minimum_nprobes / maximum_nprobes with a note on adaptive partition scanning under filtered queries
  - Add Advanced Search Controls subsection covering distance_range() for thresholded retrieval and bypass_vector_index() for recall measurement against flat-scan ground truth
  - Document that vector indexes support custom names via name=..., clarifying that the _idx suffix is a default convention
  - Add four runnable pytest snippet tests backing the new examples

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update snippets

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
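
The adaptive partition scanning mentioned in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not LanceDB's implementation: `adaptive_partition_scan`, its arguments, and the list-of-lists partition layout are hypothetical names introduced here, and partitions are assumed to be already ordered by centroid distance to the query.

```python
def adaptive_partition_scan(partitions, keep, limit, minimum_nprobes, maximum_nprobes):
    """Toy model of adaptive nprobes (hypothetical helper, not a LanceDB API).

    Always scans `minimum_nprobes` partitions, then keeps scanning until at
    least `limit` rows pass the filter `keep` or the cap is reached.
    A `maximum_nprobes` of 0 is treated as "no cap".
    """
    cap = len(partitions) if maximum_nprobes == 0 else min(maximum_nprobes, len(partitions))
    hits, scanned = [], 0
    for partition in partitions[:cap]:
        # Stop only once the mandatory partitions are done AND enough rows survived.
        if scanned >= minimum_nprobes and len(hits) >= limit:
            break
        hits.extend(row for row in partition if keep(row))
        scanned += 1
    return hits, scanned


# 512 row ids spread over 64 partitions of 8 rows each (illustrative data only).
partitions = [list(range(p * 8, (p + 1) * 8)) for p in range(64)]

# Broad filter: the mandatory 10 partitions already yield enough rows.
_, broad_scanned = adaptive_partition_scan(partitions, lambda r: True, 5, 10, 50)

# Narrow filter: too few survivors, so the scan extends up to the cap of 50.
_, narrow_scanned = adaptive_partition_scan(partitions, lambda r: r % 100 == 0, 5, 10, 50)

print(broad_scanned, narrow_scanned)
```

In this model, passing equal values for `minimum_nprobes` and `maximum_nprobes` fixes the partition count, which mirrors the behavior the new `<Note>` describes for `nprobes(n)`.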
1 parent 045a80b commit 7d261e7

3 files changed

Lines changed: 179 additions & 11 deletions

docs/indexing/vector-index.mdx

Lines changed: 73 additions & 11 deletions
@@ -16,6 +16,10 @@ import {
   PyVectorIndexBinaryBuildIndex as VectorIndexBinaryBuildIndex,
   PyVectorIndexBinarySearch as VectorIndexBinarySearch,
   PyVectorIndexCheckStatus as VectorIndexCheckStatus,
+  PyVectorIndexNprobes as VectorIndexNprobes,
+  PyVectorIndexDistanceRange as VectorIndexDistanceRange,
+  PyVectorIndexBypassRecall as VectorIndexBypassRecall,
+  PyVectorIndexCustomName as VectorIndexCustomName,
 } from '/snippets/indexing.mdx';
 
 You can create and manage multiple vector indexes on any Lance dataset. LanceDB offers two kinds of vector indexing algorithms: **Inverted File (IVF)** and **Hierarchical Navigable Small World (HNSW)**.
@@ -144,17 +148,65 @@ Search using a random 1,536-dimensional embedding.
 
 #### Search Configuration
 
-The previous query uses:
+Core knobs available on a vector search call:
 
-- `limit`: number of results to return
-- `nprobes`: number of IVF partitions to scan. LanceDB auto-tunes this by default.
-- `ef`: primarily relevant for HNSW-backed IVF indexes such as `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` (where `k=limit`) and increase up to `10 * k` for higher recall.
-- `nprobes` by index type:
-  - `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`: usually keep auto-tuned `nprobes`, then tune `ef` first. For filtered search (`where(...)`), expect higher latency variance.
-  - `IVF_RQ`: keep auto-tuned `nprobes`; increase only when recall is insufficient.
-  - `IVF_PQ`: keep auto-tuned `nprobes`; increase when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`.
-- `refine_factor`: reads additional candidates and reranks in memory
-- `.to_pandas()`: converts the results to a pandas DataFrame
+| Parameter | Description |
+| :--- | :--- |
+| `limit` | Number of results to return (`k`). |
+| `nprobes` | Shorthand that sets both `minimum_nprobes` and `maximum_nprobes` to the same value. LanceDB auto-tunes this by default. |
+| `minimum_nprobes` | Partitions that are *always* scanned. Higher values raise recall at the cost of latency. |
+| `maximum_nprobes` | Upper bound on partitions scanned. The partitions above `minimum_nprobes` are only searched if the initial pass does not return enough results — useful for narrow filters. Set to `0` to remove the cap. |
+| `ef` | HNSW search-time exploration factor. Relevant for `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` and increase up to `10 * k` for higher recall. |
+| `refine_factor` | Reads additional candidates and reranks them in memory to recover recall lost to quantization. |
+
+<Note>
+**Filtered queries and adaptive nprobes.** When a `where(...)` filter is active, LanceDB starts by scanning `minimum_nprobes` partitions and only extends toward `maximum_nprobes` if fewer than `limit` rows survive the filter. Setting `minimum_nprobes == maximum_nprobes` (or calling `nprobes(n)`) disables this adaptive behavior and fixes the partition count.
+</Note>
+
+<CodeGroup>
+<CodeBlock filename="Python" language="Python" icon="python">
+{VectorIndexNprobes}
+</CodeBlock>
+</CodeGroup>
+
+Recommended `nprobes` behavior by index type:
+
+| Index type | Guidance |
+| :--- | :--- |
+| `IVF_HNSW_FLAT`, `IVF_HNSW_SQ` | Keep the auto-tuned `nprobes`, then tune `ef` first. Expect higher latency variance under filtered search. |
+| `IVF_RQ` | Keep auto-tuned `nprobes`; raise only when recall is insufficient. |
+| `IVF_PQ` | Keep auto-tuned `nprobes`; raise when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`. |
+
+#### Advanced Search Controls
+
+These controls are useful for thresholded retrieval, recall measurement, and working around index-level metric constraints.
+
+| Method | Description |
+| :--- | :--- |
+| `distance_range(lower_bound, upper_bound)` | Return only rows whose distance falls within `[lower_bound, upper_bound)`. Either bound is optional. Useful for near-duplicate detection or "close-enough" matching. |
+| `bypass_vector_index()` | Skip the ANN index and perform an exhaustive (flat) scan. Primary uses: (1) compute ground-truth results to measure ANN recall@k, and (2) query with a metric the index was not built for (e.g., a non-cosine query on a multivector column). |
+
+**Thresholding with `distance_range`:**
+
+<CodeGroup>
+<CodeBlock filename="Python" language="Python" icon="python">
+{VectorIndexDistanceRange}
+</CodeBlock>
+</CodeGroup>
+
+**Measuring recall with `bypass_vector_index`:**
+
+Compare ANN results against a flat-scan ground truth to compute recall@k. This is the standard way to pick `nprobes` for your workload.
+
+<CodeGroup>
+<CodeBlock filename="Python" language="Python" icon="python">
+{VectorIndexBypassRecall}
+</CodeBlock>
+</CodeGroup>
+
+<Warning>
+Flat search is $O(n)$ — reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
+</Warning>
 
 ## Example: Construct an HNSW Index
 
@@ -238,11 +290,21 @@ Navigate to your table page - the "Index" column shows index status. It remains
 
 ### Option 2: Use the API
 
-Use `list_indices()` and `index_stats()` to check index status. The index name is formed by appending "\_idx" to the column name. Note that `list_indices()` only returns information after the index is fully built.
+Use `list_indices()` and `index_stats()` to check index status. **By default**, the index name is formed by appending `_idx` to the column name (e.g., a `keywords_embeddings` column produces `keywords_embeddings_idx`). Note that `list_indices()` only returns information after the index is fully built.
 To wait until all data is fully indexed, you can specify the `wait_timeout` parameter on `create_index()` or call `wait_for_index()` on the table.
 
 <CodeGroup>
 <CodeBlock filename="Python" language="Python" icon="python">
 {VectorIndexCheckStatus}
 </CodeBlock>
 </CodeGroup>
+
+#### Custom Index Names
+
+The `{column}_idx` suffix is a default convention, not the only supported naming path. Pass `name=...` to `create_index()` to override it — useful when you want to manage multiple indexes on the same column (for example, side-by-side `IVF_PQ` and `IVF_HNSW_SQ` builds) or when you script index replacement by name. Once set, `list_indices()`, `index_stats(name)`, and `wait_for_index([name])` all reference the custom name.
+
+<CodeGroup>
+<CodeBlock filename="Python" language="Python" icon="python">
+{VectorIndexCustomName}
+</CodeBlock>
+</CodeGroup>

docs/snippets/indexing.mdx

Lines changed: 8 additions & 0 deletions
@@ -44,10 +44,18 @@ export const PyVectorIndexBuildHnsw = "table.create_index(index_type=\"IVF_HNSW_
 
 export const PyVectorIndexBuildIvf = "table_name = \"vector-index-build-ivf\"\ntable = db.open_table(table_name)\ntable.create_index(\n metric=\"cosine\",\n vector_column_name=\"keywords_embeddings\",\n)\n";
 
+export const PyVectorIndexBypassRecall = "query = np.random.random(128)\nk = 10\n\n# Ground truth: flat (exhaustive) scan, ignoring the ANN index.\ntruth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()[\"id\"])\n\n# ANN results with the current nprobes setting.\nann = set(table.search(query).nprobes(20).limit(k).to_pandas()[\"id\"])\n\nrecall_at_k = len(truth & ann) / k\n";
+
 export const PyVectorIndexCheckStatus = "index_name = \"keywords_embeddings_idx\"\ntable.wait_for_index([index_name])\nprint(table.index_stats(index_name))\n";
 
 export const PyVectorIndexConfigureIvf = "table.create_index(metric=\"l2\", num_partitions=16, num_sub_vectors=4)\n";
 
+export const PyVectorIndexCustomName = "# Override the default `{column}_idx` convention by passing `name=...`.\ntable.create_index(\n metric=\"cosine\",\n vector_column_name=\"keywords_embeddings\",\n name=\"my_custom_index\",\n)\ntable.wait_for_index([\"my_custom_index\"])\nprint(table.index_stats(\"my_custom_index\"))\n";
+
+export const PyVectorIndexDistanceRange = "# Only return results whose distance falls within [0.0, 0.5).\n# Useful for near-duplicate detection or thresholded similarity search.\n(\n table.search(np.random.random(128))\n .distance_range(lower_bound=0.0, upper_bound=0.5)\n .limit(10)\n .to_pandas()\n)\n";
+
+export const PyVectorIndexNprobes = "# Always scan 10 partitions; scan up to 50 only if the initial pass\n# returns fewer than `limit` results (common with narrow filters).\n(\n table.search(np.random.random(128))\n .minimum_nprobes(10)\n .maximum_nprobes(50)\n .where(\"id > 100\")\n .limit(5)\n .to_pandas()\n)\n";
+
 export const PyVectorIndexQueryHnsw = "tbl = table\ntbl.search(np.random.random((16))).limit(2).to_pandas()\n";
 
 export const PyVectorIndexQueryIvf = "tbl = table\ntbl.search(np.random.random((1536))).limit(2).nprobes(20).refine_factor(\n 10\n).to_pandas()\n";

tests/py/test_indexing.py

Lines changed: 98 additions & 0 deletions
@@ -98,6 +98,104 @@ def test_vector_index_query_ivf(tmp_db):
     assert len(df) == 2
 
 
+def test_vector_index_nprobes(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(512)
+    ]
+    table = tmp_db.create_table("vector_index_nprobes", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_nprobes]
+    # Always scan 10 partitions; scan up to 50 only if the initial pass
+    # returns fewer than `limit` results (common with narrow filters).
+    (
+        table.search(np.random.random(128))
+        .minimum_nprobes(10)
+        .maximum_nprobes(50)
+        .where("id > 100")
+        .limit(5)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_nprobes]
+
+
+def test_vector_index_distance_range(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_distance_range", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_distance_range]
+    # Only return results whose distance falls within [0.0, 0.5).
+    # Useful for near-duplicate detection or thresholded similarity search.
+    (
+        table.search(np.random.random(128))
+        .distance_range(lower_bound=0.0, upper_bound=0.5)
+        .limit(10)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_distance_range]
+
+
+def test_vector_index_bypass_recall(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_bypass_recall", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_bypass_recall]
+    query = np.random.random(128)
+    k = 10
+
+    # Ground truth: flat (exhaustive) scan, ignoring the ANN index.
+    truth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()["id"])
+
+    # ANN results with the current nprobes setting.
+    ann = set(table.search(query).nprobes(20).limit(k).to_pandas()["id"])
+
+    recall_at_k = len(truth & ann) / k
+    # --8<-- [end:vector_index_bypass_recall]
+    assert 0.0 <= recall_at_k <= 1.0
+
+
+def test_vector_index_custom_name(tmp_db):
+    table = tmp_db.create_table(
+        "vector_index_custom_name",
+        _make_vector_rows(512, 8, column="keywords_embeddings"),
+        mode="overwrite",
+    )
+
+    # --8<-- [start:vector_index_custom_name]
+    # Override the default `{column}_idx` convention by passing `name=...`.
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+        name="my_custom_index",
+    )
+    table.wait_for_index(["my_custom_index"])
+    print(table.index_stats("my_custom_index"))
+    # --8<-- [end:vector_index_custom_name]
+
+    assert table.index_stats("my_custom_index")
+
+
 def test_vector_index_hnsw(tmp_db):
     table = tmp_db.create_table(
         "vector_index_hnsw",
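
The recall@k arithmetic exercised by `test_vector_index_bypass_recall` can also be checked without a database. The following is a minimal standard-library sketch, assuming cosine distance and synthetic data; `cosine_distance`, `top_k`, and `recall_at_k` are hypothetical helpers introduced here, and the "approximate" result is simulated by searching only half of the rows rather than by a real ANN index.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(rows, query, k):
    """Exhaustive scan: ids of the k rows nearest to `query` (rows maps id -> vector)."""
    return sorted(rows, key=lambda row_id: cosine_distance(rows[row_id], query))[:k]


def recall_at_k(truth_ids, ann_ids, k):
    """Fraction of the true top-k ids that the approximate result recovered."""
    return len(set(truth_ids) & set(ann_ids)) / k


rng = random.Random(0)  # fixed seed so repeated runs use the same synthetic data
rows = {i: [rng.random() for _ in range(16)] for i in range(256)}
query = [rng.random() for _ in range(16)]
k = 10

# Ground truth: exhaustive scan over every row.
truth = top_k(rows, query, k)

# Stand-in for an ANN result: an exact scan restricted to the even-numbered rows.
sampled = {i: v for i, v in rows.items() if i % 2 == 0}
approx = top_k(sampled, query, k)

recall = recall_at_k(truth, approx, k)
print(f"recall@{k} = {recall:.2f}")
```

Averaging this value over many sampled queries, rather than relying on a single query, gives a steadier estimate when comparing `nprobes` settings.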
