Skip to content
Merged
84 changes: 73 additions & 11 deletions docs/indexing/vector-index.mdx
@@ -16,6 +16,10 @@ import {
PyVectorIndexBinaryBuildIndex as VectorIndexBinaryBuildIndex,
PyVectorIndexBinarySearch as VectorIndexBinarySearch,
PyVectorIndexCheckStatus as VectorIndexCheckStatus,
PyVectorIndexNprobes as VectorIndexNprobes,
PyVectorIndexDistanceRange as VectorIndexDistanceRange,
PyVectorIndexBypassRecall as VectorIndexBypassRecall,
PyVectorIndexCustomName as VectorIndexCustomName,
} from '/snippets/indexing.mdx';

You can create and manage multiple vector indexes on any Lance dataset. LanceDB offers two kinds of vector indexing algorithms: **Inverted File (IVF)** and **Hierarchical Navigable Small World (HNSW)**.
@@ -144,17 +148,65 @@ Search using a random 1,536-dimensional embedding.

#### Search Configuration

The previous query uses:
Core knobs available on a vector search call:

- `limit`: number of results to return
- `nprobes`: number of IVF partitions to scan. LanceDB auto-tunes this by default.
- `ef`: primarily relevant for HNSW-backed IVF indexes such as `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` (where `k=limit`) and increase up to `10 * k` for higher recall.
- `nprobes` by index type:
  - `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`: usually keep auto-tuned `nprobes`, then tune `ef` first. For filtered search (`where(...)`), expect higher latency variance.
  - `IVF_RQ`: keep auto-tuned `nprobes`; increase only when recall is insufficient.
  - `IVF_PQ`: keep auto-tuned `nprobes`; increase when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`.
- `refine_factor`: reads additional candidates and reranks in memory
- `.to_pandas()`: converts the results to a pandas DataFrame
| Parameter | Description |
| :--- | :--- |
| `limit` | Number of results to return (`k`). |
| `nprobes` | Shorthand that sets both `minimum_nprobes` and `maximum_nprobes` to the same value. LanceDB auto-tunes this by default. |
| `minimum_nprobes` | Partitions that are *always* scanned. Higher values raise recall at the cost of latency. |
| `maximum_nprobes` | Upper bound on partitions scanned. Partitions beyond `minimum_nprobes` are searched only if the initial pass does not return enough results, which helps with narrow filters. Set to `0` to remove the cap. |
| `ef` | HNSW search-time exploration factor. Relevant for `IVF_HNSW_FLAT` and `IVF_HNSW_SQ`; start around `1.5 * k` and increase up to `10 * k` for higher recall. |
| `refine_factor` | Reads additional candidates and reranks them in memory to recover recall lost to quantization. |

<Note>
**Filtered queries and adaptive nprobes.** When a `where(...)` filter is active, LanceDB starts by scanning `minimum_nprobes` partitions and only extends toward `maximum_nprobes` if fewer than `limit` rows survive the filter. Setting `minimum_nprobes == maximum_nprobes` (or calling `nprobes(n)`) disables this adaptive behavior and fixes the partition count.
</Note>
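
The adaptive behavior described in the note can be modeled in a few lines of plain Python. This is a conceptual sketch only, not LanceDB's implementation: the function `adaptive_probe`, the `keep` predicate, and the list-of-lists partition layout are all hypothetical, partitions are assumed to be pre-sorted nearest-first, and distance ranking of the surviving rows is ignored.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def adaptive_probe(
    partitions: Sequence[Sequence[Dict]],
    keep: Callable[[Dict], bool],
    limit: int,
    minimum_nprobes: int,
    maximum_nprobes: int,
) -> Tuple[List[Dict], int]:
    """Toy model of adaptive probing. Returns (rows kept, partitions scanned).

    `partitions` is assumed to be ordered nearest-first relative to the query.
    """
    total = len(partitions)
    # In this sketch, maximum_nprobes == 0 means "no cap".
    cap = total if maximum_nprobes == 0 else min(maximum_nprobes, total)
    floor = min(minimum_nprobes, cap)

    hits: List[Dict] = []
    scanned = 0
    for part in partitions[:cap]:
        # Past the guaranteed minimum, stop once enough rows survived the filter.
        if scanned >= floor and len(hits) >= limit:
            break
        hits.extend(row for row in part if keep(row))
        scanned += 1
    return hits[:limit], scanned


# Six partitions of ten rows each, ids 0..59.
parts = [[{"id": p * 10 + j} for j in range(10)] for p in range(6)]

# Broad filter: the minimum pass already yields enough rows.
_, scanned_broad = adaptive_probe(parts, lambda r: True, 5, 2, 5)
# Narrow filter: matches live only in the fifth partition, so probing extends to the cap.
rows_narrow, scanned_narrow = adaptive_probe(parts, lambda r: r["id"] >= 45, 5, 2, 5)
```

With equal minimum and maximum values the loop always scans exactly that many partitions, which corresponds to the fixed-count behavior of `nprobes(n)`.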

<CodeGroup>
<CodeBlock filename="Python" language="Python" icon="python">
{VectorIndexNprobes}
</CodeBlock>
</CodeGroup>

Recommended `nprobes` behavior by index type:

| Index type | Guidance |
| :--- | :--- |
| `IVF_HNSW_FLAT`, `IVF_HNSW_SQ` | Keep the auto-tuned `nprobes`, then tune `ef` first. Expect higher latency variance under filtered search. |
| `IVF_RQ` | Keep auto-tuned `nprobes`; raise only when recall is insufficient. |
| `IVF_PQ` | Keep auto-tuned `nprobes`; raise when recall is insufficient. Often preferred over `IVF_RQ` when `dimension <= 256`. |
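
For HNSW-backed indexes, the "tune `ef` first" guidance can be turned into a small sweep. The helper below is a hypothetical convenience, not part of LanceDB: it only generates candidate `ef` values between roughly `1.5 * k` and `10 * k`, spaced geometrically, which you could then pass to a search call and score for recall and latency.

```python
import math
from typing import List


def ef_candidates(k: int, steps: int = 4) -> List[int]:
    """Return increasing `ef` values from ceil(1.5 * k) up to 10 * k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    low = math.ceil(1.5 * k)
    high = 10 * k
    # Geometric spacing puts more candidates near the cheap end of the range.
    ratio = (high / low) ** (1 / (steps - 1))
    values = [round(low * ratio**i) for i in range(steps)]
    values[-1] = high  # Pin the endpoint against float rounding.
    return sorted(set(values))


candidates = ef_candidates(k=10)
```

Geometric spacing is a design choice of this sketch; a linear sweep works equally well if you prefer evenly spaced values.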

#### Advanced Search Controls

These controls are useful for thresholded retrieval, recall measurement, and working around index-level metric constraints.

| Method | Description |
| :--- | :--- |
| `distance_range(lower_bound, upper_bound)` | Return only rows whose distance falls within `[lower_bound, upper_bound)`. Either bound is optional. Useful for near-duplicate detection or "close-enough" matching. |
| `bypass_vector_index()` | Skip the ANN index and perform an exhaustive (flat) scan. Primary uses: (1) compute ground-truth results to measure ANN recall@k, and (2) query with a metric the index was not built for (e.g., a non-cosine query on a multivector column). |

**Thresholding with `distance_range`:**

<CodeGroup>
<CodeBlock filename="Python" language="Python" icon="python">
{VectorIndexDistanceRange}
</CodeBlock>
</CodeGroup>
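
The half-open interval `[lower_bound, upper_bound)` with optional bounds can be spelled out in plain Python. This is an illustrative sketch of the semantics stated in the table above, with hypothetical helper names and `(id, distance)` tuples standing in for result rows; it is not how LanceDB applies the filter internally.

```python
from typing import Iterable, List, Optional, Tuple


def in_distance_range(
    distance: float,
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> bool:
    """True if distance lies in [lower_bound, upper_bound); a None bound is open."""
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


def filter_by_distance(
    rows: Iterable[Tuple[int, float]],
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> List[Tuple[int, float]]:
    """Keep only (id, distance) pairs whose distance falls inside the range."""
    return [
        (row_id, dist)
        for row_id, dist in rows
        if in_distance_range(dist, lower_bound, upper_bound)
    ]


rows = [(1, 0.0), (2, 0.25), (3, 0.5), (4, 0.9)]
near = filter_by_distance(rows, lower_bound=0.0, upper_bound=0.5)
far = filter_by_distance(rows, lower_bound=0.5)
```

Note that the row at distance exactly `0.5` is excluded from `near` and included in `far`, because the upper bound is exclusive and the lower bound is inclusive.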

**Measuring recall with `bypass_vector_index`:**

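Compare ANN results against a flat-scan ground truth to compute recall@k, the fraction of the true top-`k` rows that the ANN query also returns. Repeating the comparison at several `nprobes` values is a practical way to pick a setting for your workload.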

<CodeGroup>
<CodeBlock filename="Python" language="Python" icon="python">
{VectorIndexBypassRecall}
</CodeBlock>
</CodeGroup>

<Warning>
Flat search is $O(n)$ in the number of rows, so reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
</Warning>
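
A single random query gives a noisy recall estimate, so it can help to average over a sample of queries. The helpers below are a minimal sketch in plain Python that operate on ID lists you have already collected (for example, from the flat and ANN searches described above). The names `recall_at_k` and `mean_recall` are hypothetical, and dividing by the number of ground-truth rows actually returned, not by `k`, is an assumption of this sketch meant to avoid under-reporting on small or heavily filtered tables.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(
    truth_ids: Sequence[Hashable],
    ann_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the ground-truth top-k that also appears in the ANN top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(truth_ids[:k])
    ann = set(ann_ids[:k])
    if not truth:
        # Convention for this sketch: nothing to find means nothing was missed.
        return 1.0
    return len(truth & ann) / len(truth)


def mean_recall(
    pairs: Iterable[Tuple[Sequence[Hashable], Sequence[Hashable]]],
    k: int,
) -> float:
    """Average recall@k over (truth_ids, ann_ids) pairs, one pair per query."""
    scores = [recall_at_k(truth, ann, k) for truth, ann in pairs]
    if not scores:
        raise ValueError("at least one query is required")
    return sum(scores) / len(scores)


half = recall_at_k([1, 2, 3, 4], [1, 2, 9, 8], k=4)
average = mean_recall([([1, 2], [1, 2]), ([1, 2], [3, 4])], k=2)
```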

## Example: Construct an HNSW Index

@@ -238,11 +290,21 @@ Navigate to your table page - the "Index" column shows index status. It remains

### Option 2: Use the API

Use `list_indices()` and `index_stats()` to check index status. The index name is formed by appending "\_idx" to the column name. Note that `list_indices()` only returns information after the index is fully built.
Use `list_indices()` and `index_stats()` to check index status. **By default**, the index name is formed by appending `_idx` to the column name (e.g., a `keywords_embeddings` column produces `keywords_embeddings_idx`). Note that `list_indices()` only returns information after the index is fully built.
To wait until all data is fully indexed, you can specify the `wait_timeout` parameter on `create_index()` or call `wait_for_index()` on the table.

<CodeGroup>
<CodeBlock filename="Python" language="Python" icon="python">
{VectorIndexCheckStatus}
</CodeBlock>
</CodeGroup>

#### Custom Index Names

The `{column}_idx` suffix is a default convention, not a requirement. Pass `name=...` to `create_index()` to override it. This is useful when you manage multiple indexes on the same column (for example, side-by-side `IVF_PQ` and `IVF_HNSW_SQ` builds) or when you script index replacement by name. Once a custom name is set, `list_indices()` reports it, and `index_stats(name)` and `wait_for_index([name])` expect it.

<CodeGroup>
<CodeBlock filename="Python" language="Python" icon="python">
{VectorIndexCustomName}
</CodeBlock>
</CodeGroup>
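
Scripts that handle both default and custom names may want one place that decides which name to look up. The helper below is a hypothetical utility, not a LanceDB API; it only encodes the `{column}_idx` default described above and falls back to it when no custom name is given.

```python
from typing import Optional


def resolve_index_name(column: str, name: Optional[str] = None) -> str:
    """Return the custom index name if one was given, else the `{column}_idx` default."""
    if not column:
        raise ValueError("column must be a non-empty string")
    return name if name else f"{column}_idx"


default_name = resolve_index_name("keywords_embeddings")
custom_name = resolve_index_name("keywords_embeddings", name="my_custom_index")
```

The resolved string can then be passed to calls such as `index_stats(...)` or `wait_for_index([...])`.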
8 changes: 8 additions & 0 deletions docs/snippets/indexing.mdx
@@ -44,10 +44,18 @@ export const PyVectorIndexBuildHnsw = "table.create_index(index_type=\"IVF_HNSW_

export const PyVectorIndexBuildIvf = "table_name = \"vector-index-build-ivf\"\ntable = db.open_table(table_name)\ntable.create_index(\n metric=\"cosine\",\n vector_column_name=\"keywords_embeddings\",\n)\n";

export const PyVectorIndexBypassRecall = "query = np.random.random(128)\nk = 10\n\n# Ground truth: flat (exhaustive) scan, ignoring the ANN index.\ntruth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()[\"id\"])\n\n# ANN results with the current nprobes setting.\nann = set(table.search(query).nprobes(20).limit(k).to_pandas()[\"id\"])\n\nrecall_at_k = len(truth & ann) / k\n";

export const PyVectorIndexCheckStatus = "index_name = \"keywords_embeddings_idx\"\ntable.wait_for_index([index_name])\nprint(table.index_stats(index_name))\n";

export const PyVectorIndexConfigureIvf = "table.create_index(metric=\"l2\", num_partitions=16, num_sub_vectors=4)\n";

export const PyVectorIndexCustomName = "# Override the default `{column}_idx` convention by passing `name=...`.\ntable.create_index(\n metric=\"cosine\",\n vector_column_name=\"keywords_embeddings\",\n name=\"my_custom_index\",\n)\ntable.wait_for_index([\"my_custom_index\"])\nprint(table.index_stats(\"my_custom_index\"))\n";

export const PyVectorIndexDistanceRange = "# Only return results whose distance falls within [0.0, 0.5).\n# Useful for near-duplicate detection or thresholded similarity search.\n(\n table.search(np.random.random(128))\n .distance_range(lower_bound=0.0, upper_bound=0.5)\n .limit(10)\n .to_pandas()\n)\n";

export const PyVectorIndexNprobes = "# Always scan 10 partitions; scan up to 50 only if the initial pass\n# returns fewer than `limit` results (common with narrow filters).\n(\n table.search(np.random.random(128))\n .minimum_nprobes(10)\n .maximum_nprobes(50)\n .where(\"id > 100\")\n .limit(5)\n .to_pandas()\n)\n";

export const PyVectorIndexQueryHnsw = "tbl = table\ntbl.search(np.random.random((16))).limit(2).to_pandas()\n";

export const PyVectorIndexQueryIvf = "tbl = table\ntbl.search(np.random.random((1536))).limit(2).nprobes(20).refine_factor(\n 10\n).to_pandas()\n";
98 changes: 98 additions & 0 deletions tests/py/test_indexing.py
@@ -98,6 +98,104 @@ def test_vector_index_query_ivf(tmp_db):
    assert len(df) == 2


def test_vector_index_nprobes(tmp_db):
    dim = 128
    data = [
        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
        for i in range(512)
    ]
    table = tmp_db.create_table("vector_index_nprobes", data, mode="overwrite")
    table.create_index(
        metric="cosine",
        vector_column_name="keywords_embeddings",
    )

    # --8<-- [start:vector_index_nprobes]
    # Always scan 10 partitions; scan up to 50 only if the initial pass
    # returns fewer than `limit` results (common with narrow filters).
    (
        table.search(np.random.random(128))
        .minimum_nprobes(10)
        .maximum_nprobes(50)
        .where("id > 100")
        .limit(5)
        .to_pandas()
    )
    # --8<-- [end:vector_index_nprobes]


def test_vector_index_distance_range(tmp_db):
    dim = 128
    data = [
        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
        for i in range(256)
    ]
    table = tmp_db.create_table("vector_index_distance_range", data, mode="overwrite")
    table.create_index(
        metric="cosine",
        vector_column_name="keywords_embeddings",
    )

    # --8<-- [start:vector_index_distance_range]
    # Only return results whose distance falls within [0.0, 0.5).
    # Useful for near-duplicate detection or thresholded similarity search.
    (
        table.search(np.random.random(128))
        .distance_range(lower_bound=0.0, upper_bound=0.5)
        .limit(10)
        .to_pandas()
    )
    # --8<-- [end:vector_index_distance_range]


def test_vector_index_bypass_recall(tmp_db):
    dim = 128
    data = [
        {"id": i, "keywords_embeddings": np.random.random(dim).tolist()}
        for i in range(256)
    ]
    table = tmp_db.create_table("vector_index_bypass_recall", data, mode="overwrite")
    table.create_index(
        metric="cosine",
        vector_column_name="keywords_embeddings",
    )

    # --8<-- [start:vector_index_bypass_recall]
    query = np.random.random(128)
    k = 10

    # Ground truth: flat (exhaustive) scan, ignoring the ANN index.
    truth = set(table.search(query).bypass_vector_index().limit(k).to_pandas()["id"])

    # ANN results with the current nprobes setting.
    ann = set(table.search(query).nprobes(20).limit(k).to_pandas()["id"])

    recall_at_k = len(truth & ann) / k
    # --8<-- [end:vector_index_bypass_recall]
    assert 0.0 <= recall_at_k <= 1.0


def test_vector_index_custom_name(tmp_db):
    table = tmp_db.create_table(
        "vector_index_custom_name",
        _make_vector_rows(512, 8, column="keywords_embeddings"),
        mode="overwrite",
    )

    # --8<-- [start:vector_index_custom_name]
    # Override the default `{column}_idx` convention by passing `name=...`.
    table.create_index(
        metric="cosine",
        vector_column_name="keywords_embeddings",
        name="my_custom_index",
    )
    table.wait_for_index(["my_custom_index"])
    print(table.index_stats("my_custom_index"))
    # --8<-- [end:vector_index_custom_name]

    assert table.index_stats("my_custom_index")


def test_vector_index_hnsw(tmp_db):
    table = tmp_db.create_table(
        "vector_index_hnsw",