diff --git a/docs/indexing/vector-index.mdx b/docs/indexing/vector-index.mdx
index 8ee6a48..c3802fc 100644
--- a/docs/indexing/vector-index.mdx
+++ b/docs/indexing/vector-index.mdx
@@ -16,6 +16,10 @@ import {
   PyVectorIndexBinaryBuildIndex as VectorIndexBinaryBuildIndex,
   PyVectorIndexBinarySearch as VectorIndexBinarySearch,
   PyVectorIndexCheckStatus as VectorIndexCheckStatus,
+  PyVectorIndexNprobes as VectorIndexNprobes,
+  PyVectorIndexDistanceRange as VectorIndexDistanceRange,
+  PyVectorIndexBypassRecall as VectorIndexBypassRecall,
+  PyVectorIndexCustomName as VectorIndexCustomName,
 } from '/snippets/indexing.mdx';
 
 You can create and manage multiple vector indexes on any Lance dataset. LanceDB offers two kinds of vector indexing algorithms: **Inverted File (IVF)** and **Hierarchical Navigable Small World (HNSW)**.
@@ -144,17 +148,65 @@ Search using a random 1,536-dimensional embedding.
 
 #### Search Configuration
 
-The previous query uses:
+Core knobs available on a vector search call:
 
-- `limit`: number of results to return
-- `nprobes`: number of IVF partitions to scan. LanceDB auto-tunes this by default.
-- `ef`: primarily relevant for HNSW-backed IVF indexes such as `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` (where `k=limit`) and increase up to `10 * k` for higher recall.
-- `nprobes` by index type:
-  - `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`: usually keep auto-tuned `nprobes`, then tune `ef` first. For filtered search (`where(...)`), expect higher latency variance.
-  - `IVF_RQ`: keep auto-tuned `nprobes`; increase only when recall is insufficient.
-  - `IVF_PQ`: keep auto-tuned `nprobes`; increase when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`.
-- `refine_factor`: reads additional candidates and reranks in memory
-- `.to_pandas()`: converts the results to a pandas DataFrame
+| Parameter | Description |
+| :--- | :--- |
+| `limit` | Number of results to return (`k`). |
+| `nprobes` | Shorthand that sets both `minimum_nprobes` and `maximum_nprobes` to the same value. LanceDB auto-tunes this by default. |
+| `minimum_nprobes` | Partitions that are *always* scanned. Higher values raise recall at the cost of latency. |
+| `maximum_nprobes` | Upper bound on partitions scanned. The partitions above `minimum_nprobes` are only searched if the initial pass does not return enough results — useful for narrow filters. Set to `0` to remove the cap. |
+| `ef` | HNSW search-time exploration factor. Relevant for `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` (where `k=limit`) and increase up to `10 * k` for higher recall. |
+| `refine_factor` | Reads additional candidates and reranks them in memory to recover recall lost to quantization. |
+
+**Filtered queries and adaptive nprobes.** When a `where(...)` filter is active, LanceDB starts by scanning `minimum_nprobes` partitions and only extends toward `maximum_nprobes` if fewer than `limit` rows survive the filter. Setting `minimum_nprobes == maximum_nprobes` (or calling `nprobes(n)`) disables this adaptive behavior and fixes the partition count.
+
+  {VectorIndexNprobes}
+
+Recommended `nprobes` behavior by index type:
+
+| Index type | Guidance |
+| :--- | :--- |
+| `IVF_HNSW_FLAT`, `IVF_HNSW_SQ` | Keep the auto-tuned `nprobes`, then tune `ef` first. Expect higher latency variance under filtered search. |
+| `IVF_RQ` | Keep auto-tuned `nprobes`; raise only when recall is insufficient. |
+| `IVF_PQ` | Keep auto-tuned `nprobes`; raise when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`. |
+
+#### Advanced Search Controls
+
+These controls are useful for thresholded retrieval, recall measurement, and working around index-level metric constraints.
+
+| Method | Description |
+| :--- | :--- |
+| `distance_range(lower_bound, upper_bound)` | Return only rows whose distance falls within `[lower_bound, upper_bound)`. Either bound is optional. Useful for near-duplicate detection or "close-enough" matching. |
+| `bypass_vector_index()` | Skip the ANN index and perform an exhaustive (flat) scan. Primary uses: (1) compute ground-truth results to measure ANN recall@k, and (2) query with a metric the index was not built for (e.g., a non-cosine query on a multivector column). |
+
+**Thresholding with `distance_range`:**
+
+  {VectorIndexDistanceRange}
+
+**Measuring recall with `bypass_vector_index`:**
+
+Compare ANN results against a flat-scan ground truth to compute recall@k. This is the standard way to pick `nprobes` for your workload.
+
+  {VectorIndexBypassRecall}
+
+Flat search is $O(n)$ — reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
+
 
 ## Example: Construct an HNSW Index
@@ -238,7 +290,7 @@ Navigate to your table page - the "Index" column shows index status. It remains
 
 ### Option 2: Use the API
 
-Use `list_indices()` and `index_stats()` to check index status. The index name is formed by appending "\_idx" to the column name. Note that `list_indices()` only returns information after the index is fully built.
+Use `list_indices()` and `index_stats()` to check index status. **By default**, the index name is formed by appending `_idx` to the column name (e.g., a `keywords_embeddings` column produces `keywords_embeddings_idx`). Note that `list_indices()` only returns information after the index is fully built.
 
 To wait until all data is fully indexed, you can specify the `wait_timeout` parameter on `create_index()` or call `wait_for_index()` on the table.
@@ -246,3 +298,13 @@ To wait until all data is fully indexed, you can specify the `wait_timeout` para
 
   {VectorIndexCheckStatus}
 
+
+#### Custom Index Names
+
+The `{column}_idx` suffix is a default convention, not the only supported naming path. Pass `name=...` to `create_index()` to override it — useful when you want to manage multiple indexes on the same column (for example, side-by-side `IVF_PQ` and `IVF_HNSW_SQ` builds) or when you script index replacement by name. Once set, `list_indices()`, `index_stats(name)`, and `wait_for_index([name])` all reference the custom name.
+
+
+  {VectorIndexCustomName}
+
+
diff --git a/docs/snippets/indexing.mdx b/docs/snippets/indexing.mdx
index 14ea86a..7ed3df2 100644
--- a/docs/snippets/indexing.mdx
+++ b/docs/snippets/indexing.mdx
@@ -44,10 +44,18 @@ export const PyVectorIndexBuildHnsw = "table.create_index(index_type=\"IVF_HNSW_
 
 export const PyVectorIndexBuildIvf = "table_name = \"vector-index-build-ivf\"\ntable = db.open_table(table_name)\ntable.create_index(\n    metric=\"cosine\",\n    vector_column_name=\"keywords_embeddings\",\n)\n";
 
+export const PyVectorIndexBypassRecall = "query = np.random.random(128)\nk = 10\n\n# Ground truth: flat (exhaustive) scan, ignoring the ANN index.\ntruth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()[\"id\"])\n\n# ANN results with the current nprobes setting.\nann = set(table.search(query).nprobes(20).limit(k).to_pandas()[\"id\"])\n\nrecall_at_k = len(truth & ann) / k\n";
+
 export const PyVectorIndexCheckStatus = "index_name = \"keywords_embeddings_idx\"\ntable.wait_for_index([index_name])\nprint(table.index_stats(index_name))\n";
 
 export const PyVectorIndexConfigureIvf = "table.create_index(metric=\"l2\", num_partitions=16, num_sub_vectors=4)\n";
 
+export const PyVectorIndexCustomName = "# Override the default `{column}_idx` convention by passing `name=...`.\ntable.create_index(\n    metric=\"cosine\",\n    vector_column_name=\"keywords_embeddings\",\n    name=\"my_custom_index\",\n)\ntable.wait_for_index([\"my_custom_index\"])\nprint(table.index_stats(\"my_custom_index\"))\n";
+
+export const PyVectorIndexDistanceRange = "# Only return results whose distance falls within [0.0, 0.5).\n# Useful for near-duplicate detection or thresholded similarity search.\n(\n    table.search(np.random.random(128))\n    .distance_range(lower_bound=0.0, upper_bound=0.5)\n    .limit(10)\n    .to_pandas()\n)\n";
+
+export const PyVectorIndexNprobes = "# Always scan 10 partitions; scan up to 50 only if the initial pass\n# returns fewer than `limit` results (common with narrow filters).\n(\n    table.search(np.random.random(128))\n    .minimum_nprobes(10)\n    .maximum_nprobes(50)\n    .where(\"id > 100\")\n    .limit(5)\n    .to_pandas()\n)\n";
+
 export const PyVectorIndexQueryHnsw = "tbl = table\ntbl.search(np.random.random((16))).limit(2).to_pandas()\n";
 
 export const PyVectorIndexQueryIvf = "tbl = table\ntbl.search(np.random.random((1536))).limit(2).nprobes(20).refine_factor(\n    10\n).to_pandas()\n";
diff --git a/tests/py/test_indexing.py b/tests/py/test_indexing.py
index ee1f7c3..34c12a6 100644
--- a/tests/py/test_indexing.py
+++ b/tests/py/test_indexing.py
@@ -98,6 +98,104 @@ def test_vector_index_query_ivf(tmp_db):
     assert len(df) == 2
 
 
+def test_vector_index_nprobes(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(512)
+    ]
+    table = tmp_db.create_table("vector_index_nprobes", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_nprobes]
+    # Always scan 10 partitions; scan up to 50 only if the initial pass
+    # returns fewer than `limit` results (common with narrow filters).
+    (
+        table.search(np.random.random(128))
+        .minimum_nprobes(10)
+        .maximum_nprobes(50)
+        .where("id > 100")
+        .limit(5)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_nprobes]
+
+
+def test_vector_index_distance_range(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_distance_range", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_distance_range]
+    # Only return results whose distance falls within [0.0, 0.5).
+    # Useful for near-duplicate detection or thresholded similarity search.
+    (
+        table.search(np.random.random(128))
+        .distance_range(lower_bound=0.0, upper_bound=0.5)
+        .limit(10)
+        .to_pandas()
+    )
+    # --8<-- [end:vector_index_distance_range]
+
+
+def test_vector_index_bypass_recall(tmp_db):
+    dim = 128
+    data = [
+        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
+        for i in range(256)
+    ]
+    table = tmp_db.create_table("vector_index_bypass_recall", data, mode="overwrite")
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+    )
+
+    # --8<-- [start:vector_index_bypass_recall]
+    query = np.random.random(128)
+    k = 10
+
+    # Ground truth: flat (exhaustive) scan, ignoring the ANN index.
+    truth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()["id"])
+
+    # ANN results with the current nprobes setting.
+    ann = set(table.search(query).nprobes(20).limit(k).to_pandas()["id"])
+
+    recall_at_k = len(truth & ann) / k
+    # --8<-- [end:vector_index_bypass_recall]
+    assert 0.0 <= recall_at_k <= 1.0
+
+
+def test_vector_index_custom_name(tmp_db):
+    table = tmp_db.create_table(
+        "vector_index_custom_name",
+        _make_vector_rows(512, 8, column="keywords_embeddings"),
+        mode="overwrite",
+    )
+
+    # --8<-- [start:vector_index_custom_name]
+    # Override the default `{column}_idx` convention by passing `name=...`.
+    table.create_index(
+        metric="cosine",
+        vector_column_name="keywords_embeddings",
+        name="my_custom_index",
+    )
+    table.wait_for_index(["my_custom_index"])
+    print(table.index_stats("my_custom_index"))
+    # --8<-- [end:vector_index_custom_name]
+
+    assert table.index_stats("my_custom_index")
+
+
 def test_vector_index_hnsw(tmp_db):
     table = tmp_db.create_table(
         "vector_index_hnsw",