Document fine-grained IVF search controls and custom index names (#220)

prrao87 · claude · web-flow · commit 7d261e76ab07 · 2026-04-21T15:01:11.000+08:00
* Document fine-grained nprobes, distance_range, bypass_vector_index, and custom index names

- Replace Search Configuration bullets with markdown tables covering
  core knobs and per-index-type nprobes guidance
- Document minimum_nprobes / maximum_nprobes with a note on adaptive
  partition scanning under filtered queries
- Add Advanced Search Controls subsection covering distance_range()
  for thresholded retrieval and bypass_vector_index() for recall
  measurement against flat-scan ground truth
- Document that vector indexes support custom names via name=...,
  clarifying that the _idx suffix is a default convention
- Add four runnable pytest snippet tests backing the new examples
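
For reviewers, the recall measurement mentioned above can be sketched without LanceDB at all. This is a minimal illustration, not code from this change: `recall_at_k` and `mean_recall` are hypothetical helper names, and the id lists stand in for the `id` columns returned by the flat-scan and ANN queries. The sketch divides by the number of ground-truth ids rather than by `k`, an assumption that only matters when the flat scan returns fewer than `k` rows.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(truth_ids: Iterable[Hashable], ann_ids: Iterable[Hashable]) -> float:
    """Fraction of ground-truth ids that the ANN result also contains."""
    truth = set(truth_ids)
    if not truth:
        raise ValueError("ground truth is empty; recall is undefined")
    return len(truth & set(ann_ids)) / len(truth)


def mean_recall(pairs: Sequence[Tuple[Iterable[Hashable], Iterable[Hashable]]]) -> float:
    """Average recall over several (truth_ids, ann_ids) pairs, one per query."""
    if not pairs:
        raise ValueError("need at least one query to average over")
    scores = [recall_at_k(truth, ann) for truth, ann in pairs]
    return sum(scores) / len(scores)


# Hypothetical ids: the ANN search found 3 of the 4 true neighbours.
single = recall_at_k([1, 2, 3, 4], [2, 3, 4, 9])

# Two sampled queries: one perfect match, one with a single miss.
averaged = mean_recall([([1, 2], [1, 2]), ([1, 2, 3, 4], [2, 3, 4, 9])])
```

Averaging over a sample of queries gives a steadier estimate than the single-query comparison in the snippet, since one random query can land on an unusually easy or hard region of the index.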

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update snippets

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/indexing/vector-index.mdx b/docs/indexing/vector-index.mdx
@@ -16,6 +16,10 @@ import {
     PyVectorIndexBinaryBuildIndex as VectorIndexBinaryBuildIndex,
     PyVectorIndexBinarySearch as VectorIndexBinarySearch,
     PyVectorIndexCheckStatus as VectorIndexCheckStatus,
+    PyVectorIndexNprobes as VectorIndexNprobes,
+    PyVectorIndexDistanceRange as VectorIndexDistanceRange,
+    PyVectorIndexBypassRecall as VectorIndexBypassRecall,
+    PyVectorIndexCustomName as VectorIndexCustomName,
 } from '/snippets/indexing.mdx';
 
 You can create and manage multiple vector indexes on any Lance dataset. LanceDB offers two kinds of vector indexing algorithms: **Inverted File (IVF)** and **Hierarchical Navigable Small World (HNSW)**.
@@ -144,17 +148,65 @@ Search using a random 1,536-dimensional embedding.
 
 #### Search Configuration
 
-The previous query uses:
+Core knobs available on a vector search call:
 
-- `limit`: number of results to return
-- `nprobes`: number of IVF partitions to scan. LanceDB auto-tunes this by default.
-- `ef`: primarily relevant for HNSW-backed IVF indexes such as `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` (where `k=limit`) and increase up to `10 * k` for higher recall.
-- `nprobes` by index type:
-    - `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`: usually keep auto-tuned `nprobes`, then tune `ef` first. For filtered search (`where(...)`), expect higher latency variance.
-    - `IVF_RQ`: keep auto-tuned `nprobes`; increase only when recall is insufficient.
-    - `IVF_PQ`: keep auto-tuned `nprobes`; increase when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`.
-- `refine_factor`: reads additional candidates and reranks in memory
-- `.to_pandas()`: converts the results to a pandas DataFrame
+| Parameter | Description |
+| :--- | :--- |
+| `limit` | Number of results to return (`k`). |
+| `nprobes` | Shorthand that sets both `minimum_nprobes` and `maximum_nprobes` to the same value. LanceDB auto-tunes this by default. |
+| `minimum_nprobes` | Partitions that are *always* scanned. Higher values raise recall at the cost of latency. |
+| `maximum_nprobes` | Upper bound on partitions scanned. Partitions beyond `minimum_nprobes` are searched only if the initial pass returns fewer than `limit` results, which is useful for narrow filters. Set to `0` to remove the cap. |
+| `ef` | HNSW search-time exploration factor. Relevant for `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` and increase up to `10 * k` for higher recall. |
+| `refine_factor` | Reads additional candidates and reranks them in memory to recover recall lost to quantization. |
+
+<Note>
+**Filtered queries and adaptive nprobes.** When a `where(...)` filter is active, LanceDB starts by scanning `minimum_nprobes` partitions and only extends toward `maximum_nprobes` if fewer than `limit` rows survive the filter. Setting `minimum_nprobes == maximum_nprobes` (or calling `nprobes(n)`) disables this adaptive behavior and fixes the partition count.
+</Note>
+
+<CodeGroup>
+    <CodeBlock filename="Python" language="Python" icon="python">
+    {VectorIndexNprobes}
+    </CodeBlock>
+</CodeGroup>
+
+Recommended `nprobes` behavior by index type:
+
+| Index type | Guidance |
+| :--- | :--- |
+| `IVF_HNSW_FLAT`, `IVF_HNSW_SQ` | Keep the auto-tuned `nprobes`, then tune `ef` first. Expect higher latency variance under filtered search. |
+| `IVF_RQ` | Keep auto-tuned `nprobes`; raise only when recall is insufficient. |
+| `IVF_PQ` | Keep auto-tuned `nprobes`; raise when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`. |
+
+#### Advanced Search Controls
+
+These controls are useful for thresholded retrieval, recall measurement, and working around index-level metric constraints.
+
+| Method | Description |
+| :--- | :--- |
+| `distance_range(lower_bound, upper_bound)` | Return only rows whose distance falls within `[lower_bound, upper_bound)`. Either bound is optional. Useful for near-duplicate detection or "close-enough" matching. |
+| `bypass_vector_index()` | Skip the ANN index and perform an exhaustive (flat) scan. Primary uses: (1) compute ground-truth results to measure ANN recall@k, and (2) query with a metric the index was not built for (e.g., a non-cosine query on a multivector column). |
+
+**Thresholding with `distance_range`:**
+
+<CodeGroup>
+    <CodeBlock filename="Python" language="Python" icon="python">
+    {VectorIndexDistanceRange}
+    </CodeBlock>
+</CodeGroup>
+
+**Measuring recall with `bypass_vector_index`:**
+
+Compare ANN results against a flat-scan ground truth to compute recall@k. Repeating the comparison across candidate `nprobes` values shows which setting meets your recall target.
+
+<CodeGroup>
+    <CodeBlock filename="Python" language="Python" icon="python">
+    {VectorIndexBypassRecall}
+    </CodeBlock>
+</CodeGroup>
+
+<Warning>
+Flat search is $O(n)$ in the number of rows, so reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
+</Warning>
 
 ## Example: Construct an HNSW Index
 
@@ -238,11 +290,21 @@ Navigate to your table page - the "Index" column shows index status. It remains
 
 ### Option 2: Use the API
 
-Use `list_indices()` and `index_stats()` to check index status. The index name is formed by appending "\_idx" to the column name. Note that `list_indices()` only returns information after the index is fully built.
+Use `list_indices()` and `index_stats()` to check index status. **By default**, the index name is formed by appending `_idx` to the column name (e.g., a `keywords_embeddings` column produces `keywords_embeddings_idx`). Note that `list_indices()` only returns information after the index is fully built.
 To wait until all data is fully indexed, you can specify the `wait_timeout` parameter on `create_index()` or call `wait_for_index()` on the table.
 
 <CodeGroup>
     <CodeBlock filename="Python" language="Python" icon="python">
     {VectorIndexCheckStatus}
     </CodeBlock>
 </CodeGroup>
+
+#### Custom Index Names
+
+The `{column}_idx` name is a default convention, not a requirement. Pass `name=...` to `create_index()` to override it. This is useful when you manage multiple indexes on the same column (for example, side-by-side `IVF_PQ` and `IVF_HNSW_SQ` builds) or script index replacement by name. Once set, `list_indices()` reports the custom name, and `index_stats(name)` and `wait_for_index([name])` accept it.
+
+<CodeGroup>
+    <CodeBlock filename="Python" language="Python" icon="python">
+    {VectorIndexCustomName}
+    </CodeBlock>
+</CodeGroup>
diff --git a/docs/snippets/indexing.mdx b/docs/snippets/indexing.mdx
@@ -44,10 +44,18 @@ export const PyVectorIndexBuildHnsw = "table.create_index(index_type=\"IVF_HNSW_
 
 export const PyVectorIndexBuildIvf = "table_name = \"vector-index-build-ivf\"\ntable = db.open_table(table_name)\ntable.create_index(\n    metric=\"cosine\",\n    vector_column_name=\"keywords_embeddings\",\n)\n";
 
+export const PyVectorIndexBypassRecall = "query = np.random.random(128)\nk = 10\n\n# Ground truth: flat (exhaustive) scan, ignoring the ANN index.\ntruth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()[\"id\"])\n\n# ANN results with the current nprobes setting.\nann = set(table.search(query).nprobes(20).limit(k).to_pandas()[\"id\"])\n\nrecall_at_k = len(truth & ann) / k\n";
+
 export const PyVectorIndexCheckStatus = "index_name = \"keywords_embeddings_idx\"\ntable.wait_for_index([index_name])\nprint(table.index_stats(index_name))\n";
 
 export const PyVectorIndexConfigureIvf = "table.create_index(metric=\"l2\", num_partitions=16, num_sub_vectors=4)\n";
 
+export const PyVectorIndexCustomName = "# Override the default `{column}_idx` convention by passing `name=...`.\ntable.create_index(\n    metric=\"cosine\",\n    vector_column_name=\"keywords_embeddings\",\n    name=\"my_custom_index\",\n)\ntable.wait_for_index([\"my_custom_index\"])\nprint(table.index_stats(\"my_custom_index\"))\n";
+
+export const PyVectorIndexDistanceRange = "# Only return results whose distance falls within [0.0, 0.5).\n# Useful for near-duplicate detection or thresholded similarity search.\n(\n    table.search(np.random.random(128))\n    .distance_range(lower_bound=0.0, upper_bound=0.5)\n    .limit(10)\n    .to_pandas()\n)\n";
+
+export const PyVectorIndexNprobes = "# Always scan 10 partitions; scan up to 50 only if the initial pass\n# returns fewer than `limit` results (common with narrow filters).\n(\n    table.search(np.random.random(128))\n    .minimum_nprobes(10)\n    .maximum_nprobes(50)\n    .where(\"id > 100\")\n    .limit(5)\n    .to_pandas()\n)\n";
+
 export const PyVectorIndexQueryHnsw = "tbl = table\ntbl.search(np.random.random((16))).limit(2).to_pandas()\n";
 
 export const PyVectorIndexQueryIvf = "tbl = table\ntbl.search(np.random.random((1536))).limit(2).nprobes(20).refine_factor(\n    10\n).to_pandas()\n";
diff --git a/tests/py/test_indexing.py b/tests/py/test_indexing.py
@@ -98,6 +98,104 @@ def test_vector_index_query_ivf(tmp_db):
     assert len(df) == 2
 
 
+def test_vector_index_nprobes(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(512)
+    ]
+    table = tmp_db.create_table("vector_index_nprobes", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_nprobes]
+    # Always scan 10 partitions; scan up to 50 only if the initial pass
+    # returns fewer than `limit` results (common with narrow filters).
+    (
+        table.search(np.random.random(128))
+        .minimum_nprobes(10)
+        .maximum_nprobes(50)
+        .where("id > 100")
+        .limit(5)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_nprobes]
+
+
+def test_vector_index_distance_range(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_distance_range", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_distance_range]
+    # Only return results whose distance falls within [0.0, 0.5).
+    # Useful for near-duplicate detection or thresholded similarity search.
+    (
+        table.search(np.random.random(128))
+        .distance_range(lower_bound=0.0, upper_bound=0.5)
+        .limit(10)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_distance_range]
+
+
+def test_vector_index_bypass_recall(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_bypass_recall", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_bypass_recall]
+    query = np.random.random(128)
+    k = 10
+
+    # Ground truth: flat (exhaustive) scan, ignoring the ANN index.
+    truth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()["id"])
+
+    # ANN results with the current nprobes setting.
+    ann = set(table.search(query).nprobes(20).limit(k).to_pandas()["id"])
+
+    recall_at_k = len(truth & ann) / k
+    # --8<-- [end:vector_index_bypass_recall]
+    assert 0.0 <= recall_at_k <= 1.0
+
+
+def test_vector_index_custom_name(tmp_db):
+    table = tmp_db.create_table(
+        "vector_index_custom_name",
+        _make_vector_rows(512, 8, column="keywords_embeddings"),
+        mode="overwrite",
+    )
+
+    # --8<-- [start:vector_index_custom_name]
+    # Override the default `{column}_idx` convention by passing `name=...`.
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+        name="my_custom_index",
+    )
+    table.wait_for_index(["my_custom_index"])
+    print(table.index_stats("my_custom_index"))
+    # --8<-- [end:vector_index_custom_name]
+
+    assert table.index_stats("my_custom_index")
+
+
 def test_vector_index_hnsw(tmp_db):
     table = tmp_db.create_table(
         "vector_index_hnsw",