Commit 2d0d5da ("perf guide", parent 6aba63f)
2 files changed: 229 additions & 0 deletions

docs/docs.json: 1 addition & 0 deletions

```diff
@@ -35,6 +35,7 @@
       "group": "Get started",
       "pages": [
         "quickstart",
+        "performance",
         {
           "group": "What is LanceDB?",
           "pages": [
```

docs/performance.mdx: 228 additions & 0 deletions

---
title: "Performance Tuning Guide"
sidebarTitle: "Performance"
description: "The short list of things to get right (and the things to avoid) when running LanceDB."
icon: "gauge-high"
keywords: ["performance", "tuning", "best practices", "optimization", "ingestion", "indexing", "vector search", "filtering", "compaction", "oss", "enterprise"]
---
# LanceDB OSS

## Common pitfalls

Here are some common patterns to avoid when dealing with large LanceDB datasets:
1. **Materializing the whole table in memory.** Calling `table.to_pandas()` / `table.to_arrow()` etc. loads the entire table into memory. Filter, project, or stream instead. See [Querying](#querying).
2. **Calling `add()` once per row or for small batches.** Each call commits a new version. Use bulk ingestion when loading from an existing source, or use iterators for better ingestion performance without creating too many versions or fragments. See [Ingestion](#ingestion).
3. **No index on a column you filter by.** A `where(...)` predicate on a large table performs far better with a scalar index than with a brute-force scan. Similarly, it is recommended to create a vector index once the table grows beyond a few hundred thousand rows. See [Indexing](#indexing).
4. **Not running `optimize()` on fragmented tables.** Without it, queries on unindexed rows fall back to flat search, soft-deleted rows aren't physically removed, and old-version files accumulate on disk. See [Maintenance](#maintenance).
5. **Wrong distance metric for the embedding model.** Once an index is built the metric is fixed; mismatching it degrades results without an obvious error. See [Indexing](#indexing).
6. **Updating indexed columns without rebuilding.** Updates move rows out of the vector index. They are still searchable but no longer benefit from the index. See [Maintenance](#maintenance).
7. **`fork()` with multiprocessing.** Use `spawn`. LanceDB is multi-threaded internally, and `fork` plus a multi-threaded process is unsafe.

The rest of this section expands on each.
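Pitfall 7 in practice: a minimal stdlib sketch of requesting the `spawn` start method instead of relying on the platform default (`fork` on Linux). The `worker` function and `shards` are hypothetical; each worker process should open its own LanceDB connection.

```python
import multiprocessing as mp

# Request the "spawn" start method explicitly; "fork" copies a multi-threaded
# process, which is unsafe with LanceDB's internal thread pool.
ctx = mp.get_context("spawn")

def worker(shard):
    # hypothetical worker body: open a fresh LanceDB connection inside the
    # child process, then ingest one shard
    ...

# usage (the __main__ guard is required with spawn):
# if __name__ == "__main__":
#     with ctx.Pool(4) as pool:
#         pool.map(worker, shards)
```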
## Ingestion

The first rule is to **batch your writes**. Every `add()` call commits a new version and a new fragment, so per-row `add()` in a loop is a common cause of slow ingestion. Past that, LanceDB supports two ingestion modes that solve different problems. Pick by what your data source looks like.

### Bulk ingestion: for data you already have

Pass an Arrow Table, a Pandas DataFrame, or `pyarrow.dataset(...)` (best for large file-backed loads). LanceDB sees the total size up front and parallelizes the write across workers automatically.

```python Python icon="python"
import pyarrow.dataset as ds

table.add(arrow_table)                            # in-memory Arrow
table.add(df)                                     # pandas DataFrame
table.add(ds.dataset("data/", format="parquet"))  # streams from disk, still parallelized
```
This is the right path for initial loads from existing files, ETL outputs, or anything already materialized. `pyarrow.dataset(...)` is a good fit for big file-based loads: it streams Parquet/CSV from disk without loading them into memory, and still preserves auto-parallelism. Use it instead of reading files into a DataFrame first.

For very large initial loads, create the table empty (with a schema) and then call `add()`. Passing the dataset directly to `create_table(name, data)` skips the auto-parallel write path.
### Iterator ingestion: for streaming or computed-on-the-fly data

Pass an iterator (or generator) of `pyarrow.RecordBatch`. You control chunk size, memory stays bounded, and LanceDB can process unbounded sources.

```python Python icon="python"
import pyarrow as pa

def stream():
    for raw in source:                       # queue, HTTP, Kafka, etc.
        vectors = model.encode(raw["text"])  # compute on the fly
        yield pa.RecordBatch.from_pydict({**raw, "vector": vectors})

table.add(stream())
```
Use this when data arrives over time, when you compute rows on the fly (embedding pipelines), or when the dataset doesn't fit in memory and isn't a Parquet/CSV file LanceDB can open directly.

The trade-off: streaming inputs don't currently expose the parallelism knob, so per-batch throughput is lower than bulk mode. Pick chunk sizes large enough that per-batch overhead amortizes; a few thousand to tens of thousands of rows is a reasonable range. **Don't yield single-row batches**, since that loses the benefit of the iterator path and recreates the per-row `add()` problem.
### Concurrent writers on S3 need a commit lock

Plain S3 has no atomic put-if-absent. Use `s3+ddb://` with a DynamoDB table:

```python Python icon="python"
import lancedb

db = lancedb.connect("s3+ddb://bucket/path?ddbTableName=lance_commits")
```

Concurrent reads scale freely; concurrent writers retry commits a finite number of times, so don't fan out hundreds of writers against the same table.

For more, see [Create a Table](/tables/create) and [Storage configuration](/storage/configuration).
## Indexing

Two kinds of indexes are independent and complementary. Most workloads use both.

### Vector indexes

A vector index is not strictly required below ~100K vectors; a disk-based brute-force scan is fast enough at that scale. Past that, build one. Pick by what your workload looks like:

| Index | When to use |
|-------|-------------|
| `IVF_PQ` | Common starting point, especially when most queries carry a `where(...)` filter. Higher accuracy than RQ at small dimensions (≤ 256). This is also the index Enterprise builds automatically. |
| `IVF_RQ` | Maximum compression on high-dimensional vectors, faster builds than PQ, and good behavior under filters. |
| `IVF_HNSW_SQ` | Best recall/latency trade-off for unfiltered search. Can show higher latency variance under selective `where(...)` filters, so prefer `IVF_PQ` or `IVF_RQ` if most queries are filtered. |
| `IVF_FLAT` | Required for binary vectors with `hamming`. Highest recall, no compression. |

The distance metric is fixed once an index is built. Match it to how the embedding model was trained (`cosine` for most general-purpose embeddings; `dot` for already-normalized vectors; `l2` for Euclidean-trained models; `hamming` for binary).

For parameter tuning (`num_partitions`, `num_sub_vectors`, `ef_construction`), see [Vector Indexing](/indexing/vector-index).
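A quick sanity check on the metric rule above: `dot` only carries the same ranking information as `cosine` when vectors are unit-normalized. A pure-Python illustration (not LanceDB API):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_sim(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [6.0, 8.0]
an, bn = normalize(a), normalize(b)
# After normalization the raw inner product equals cosine similarity, so
# dot-based ranking matches cosine-based ranking; on unnormalized vectors
# it is skewed by vector magnitude instead.
```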
### Scalar indexes

Every column you filter on should have one. Without a scalar index, `where(...)` and `merge_insert` join keys do a full column scan.

| Index | Best for |
|-------|----------|
| `BTREE` (default) | Numeric, string, temporal columns with mostly distinct values; range queries |
| `BITMAP` | Boolean and low-cardinality columns (< ~1,000 distinct values) |
| `LABEL_LIST` | `List<T>` columns queried with `array_has_any` / `array_has_all` |

For more, see [Scalar Indexing](/indexing/scalar-index).
### Full-text search

Defaults are fine. Phrase queries require `with_position=True` and `remove_stop_words=False`, which significantly increases index size and indexing time. Leave them off unless you need phrase matching. See [FTS Indexing](/indexing/fts-index).
## Querying

Three things to get right on every query.

**Always use `select()` and `limit()`.** Returned columns drive I/O; `limit()` bounds work and prevents flooding the client.

```python Python icon="python"
table.search(query_emb).select(["id", "title"]).limit(20)
```

**Use pre-filter (the default) unless you have a reason not to.** Pre-filter applies `where` before vector search, so results always satisfy the predicate. Post-filter (`prefilter=False`) is cheaper but may return fewer than `limit` rows; only use it when correctness allows. See [Filtering](/search/filtering).
**Tune for recall with one knob, not many.** If results are wrong (low recall, missing the right answer):

- For quantized indexes (PQ, RQ, SQ): use `refine_factor` to pull extra candidates and re-score on full vectors. Distances on quantized indexes are computed on compressed vectors and are approximate without it.
- For HNSW-backed indexes: use `ef` at search time. Start at `1.5 × k`, raise toward `10 × k` if recall is short.
- For IVF candidate breadth: `nprobes` is auto-tuned. Only raise it manually when a selective pre-filter leaves too few neighbors.

```python Python icon="python"
table.search(emb).refine_factor(20).limit(10)  # quantized recall recovery
```
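What `refine_factor` does can be sketched in a few lines: oversample candidates by the approximate (quantized) distance, then re-rank that pool with exact distances. Here `approx_dist` and `exact_dist` are stand-ins for the compressed and full-vector distance functions, not LanceDB API:

```python
import heapq

def refine(query, candidates, approx_dist, exact_dist, k, refine_factor):
    """Oversample k * refine_factor by approximate distance, re-rank exactly."""
    pool = heapq.nsmallest(k * refine_factor, candidates,
                           key=lambda c: approx_dist(query, c))
    return heapq.nsmallest(k, pool, key=lambda c: exact_dist(query, c))
```

With `refine_factor=1` the result is whatever the coarse distances say; a larger factor lets the exact re-scoring recover neighbors the quantized ordering misranked.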
For hybrid search (vector + FTS), reranking is required because the score scales aren't directly comparable. The default `RRFReranker` is a strong starting point. See [Hybrid Search](/search/hybrid-search).

### Iterating the whole dataset

When you genuinely need every row (training data export, batch processing, dataset migration), don't materialize with `to_pandas()` / `to_arrow()`. Drop to the underlying Lance dataset and stream:
```python Python icon="python"
ds = table.to_lance()
for batch in ds.to_batches(columns=["id", "text"], batch_size=10000):
    process(batch)
```

For more control (filters, fragment-level parallelism), use a scanner explicitly:

```python Python icon="python"
scanner = table.to_lance().scanner(
    columns=["id", "text"],
    filter="created_at > '2025-01-01'",
    batch_size=10000,
)
for batch in scanner.to_batches():
    process(batch)
```

Memory stays bounded regardless of table size. On Enterprise, `to_lance()` is not available on `RemoteTable`; iterate via scoped query builders (`table.search(...)`, `table.query(...)`) instead.
## Maintenance
161+
162+
In OSS, you own the lifecycle. One call covers most of it:
163+
164+
```python Python icon="python"
165+
from datetime import timedelta
166+
table.optimize() # default 7-day retention
167+
table.optimize(cleanup_older_than=timedelta(days=1)) # reclaim space sooner
168+
```
169+
170+
`optimize()` performs three maintenance operations:
171+
172+
1. **Compaction**: merges small fragments into larger ones to improve read performance.
173+
2. **Pruning/cleanup**: removes files from versions older than `cleanup_older_than` (7 days by default).
174+
3. **Index update**: adds newly-ingested data to existing indexes.
175+
176+
Run it after large writes, on a schedule, or both. Without it, queries on unindexed rows fall back to flat search, deleted rows continue to occupy storage, and the more data you add between optimizations, the more noticeable the latency impact becomes.
177+
178+
A few things to remember:
179+
180+
- Updates **move rows out of** the vector index. After large updates, rebuild it.
181+
- Deletes are soft. `optimize()` is what physically reclaims space.
182+
- `merge_insert` joins on a key. That join column needs a scalar index.
183+
184+
See [Reindexing](/indexing/reindexing) and [Versioning](/tables/versioning) for more.
185+
## Diagnostics

Two tools, in this order:

```python Python icon="python"
print(table.search(emb).where("year > 2000").limit(10).analyze_plan())
print(table.index_stats("vector_idx"))  # num_unindexed_rows should be ~0
```

In `analyze_plan`, look for:

| Symptom | Likely cause |
|---------|--------------|
| `LanceScan` with high `bytes_read` / `iops` | Missing index, no `select()`, or uncompacted dataset |
| `KNNVectorDistance` over millions of rows | No vector index, or bypassed |
| Many small `output_batches` | Fragmented data; run `optimize()` |

For a worked example, see [Optimize Query Performance](/search/optimize-queries).
# LanceDB Enterprise

<Note>
Enterprise-specific performance guidance is coming soon. For benchmark methodology and reference latency numbers, see [Performance Characteristics](/enterprise/performance).
</Note>

---

## Where to go next

<Columns cols={2}>
  <Card title="Optimize Query Performance" icon="gauge" href="/search/optimize-queries">
    Read execution plans, find the bottleneck.
  </Card>
  <Card title="Vector Indexing" icon="layer-group" href="/indexing/vector-index">
    Index types, parameters, and tuning in depth.
  </Card>
  <Card title="Filtering" icon="filter" href="/search/filtering">
    Pre- vs post-filter, scalar indexes, list columns.
  </Card>
  <Card title="Enterprise Performance" icon="server" href="/enterprise/performance">
    Benchmark methodology and reference latency numbers.
  </Card>
</Columns>
