Merged
23 changes: 23 additions & 0 deletions docs/search/full-text-search.mdx
@@ -8,6 +8,7 @@ icon: "book"
import {
PyFtsPrefiltering,
PyFtsPostfiltering,
PyFtsIncrementalIndex,
} from '/snippets/search.mdx'

LanceDB provides support for Full-Text Search via Lance, allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
@@ -129,6 +130,28 @@ If you want to specify which columns to search use `fts_columns="text"`
LanceDB automatically searches on the existing FTS index if the input to the search is of type `str`. If you provide a vector as input, LanceDB will search the ANN index instead.
</Note>

### Keeping the index up to date

Rows you add after building an FTS index aren't part of the index until you optimize the table. Until then, queries fall back to a flat scan over the unindexed fragments to keep results complete, which slows them down as the unindexed tail grows. Call `table.optimize()` to fold new rows into the existing index — it's the same operation used for vector indexes:

<CodeGroup>
<CodeBlock filename="Python" language="python" icon="python">
{PyFtsIncrementalIndex}
</CodeBlock>

```typescript TypeScript icon="square-js"
await tbl.add([{ vector: [3.1, 4.1], text: "Frodo was a happy puppy" }]);
await tbl.optimize();
```

```rust Rust icon="rust"
tbl.add(new_data).execute().await?;
tbl.optimize(OptimizeAction::All).await?;
```
</CodeGroup>

A useful rule of thumb is to call `optimize()` after roughly 100,000 row changes or 20 data-modification operations, whichever comes first. For tables with continuous ingest, schedule it on a cadence that keeps `num_unindexed_rows` (from `table.index_stats(...)`) close to zero. If you want to skip the flat scan over unindexed rows entirely — for example, on a hot read path where stale results are acceptable — call `.fast_search()` on the query so the search returns only indexed results.
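The rule of thumb above can be written down as a tiny helper you might call from an ingest loop. The constants mirror the heuristics in the prose (they're guidelines, not hard limits), and the `text_idx` index name and `ops_since_optimize` counter in the comment are assumptions for illustration:

```python
# Rule-of-thumb heuristic from above; the constants are guidelines, not hard limits.
def should_optimize(rows_changed: int, write_ops: int) -> bool:
    """Return True once ~100k row changes or 20 write operations have accumulated."""
    return rows_changed >= 100_000 or write_ops >= 20

# In an ingest loop this plugs in roughly like (names assumed):
#   stats = table.index_stats("text_idx")   # exposes num_unindexed_rows
#   if should_optimize(stats.num_unindexed_rows, ops_since_optimize):
#       table.optimize()
```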

## Advanced Usage

### Tokenize Table Data
98 changes: 98 additions & 0 deletions docs/search/hybrid-search.mdx
@@ -240,6 +240,104 @@ text_query = "flower moon"
```
</CodeGroup>

## Query controls

Hybrid queries inherit the same builder API as vector and FTS queries, so the same knobs for filtering, distance bounds, and row identity apply. These compose with `.rerank(...)` and the explicit `.vector()` / `.text()` form shown above.

### Returning row IDs

Pass `with_row_id(True)` (Python) or `withRowId()` (TypeScript) to include the internal `_rowid` column in the results. This is useful for joining hybrid results back to a primary table, or for deduping across multiple queries:

<CodeGroup>
```python Python icon="python"
results = (
table.search("flower moon", query_type="hybrid")
.with_row_id(True)
.limit(10)
.to_pandas()
)
# results now contains a `_rowid` column alongside `_relevance_score`
```

```typescript TypeScript icon="square-js"
const results = await table
.query()
.fullTextSearch("flower moon")
.nearestTo(queryVector)
.withRowId()
.limit(10)
.toArray();
```
</CodeGroup>

### Bounding vector distance

`distance_range(lower, upper)` (Python) and `distanceRange(lower, upper)` (TypeScript) constrain the vector half of the hybrid query to the half-open interval `[lower, upper)`. This is helpful when you want to cap how far semantic candidates can drift from the query vector before reranking:

<CodeGroup>
```python Python icon="python"
results = (
table.search("flower moon", query_type="hybrid")
.distance_range(lower_bound=0.0, upper_bound=0.4)
.limit(10)
.to_pandas()
)
```

```typescript TypeScript icon="square-js"
const results = await table
.query()
.fullTextSearch("flower moon")
.nearestTo(queryVector)
.distanceRange(0.0, 0.4)
.limit(10)
.toArray();
```
</CodeGroup>

Either bound can be omitted to leave that side unbounded.
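The half-open `[lower, upper)` semantics, including an omitted bound, behave like this small predicate. This is purely illustrative; the real filtering happens inside the query engine:

```python
def in_distance_range(distance, lower=None, upper=None):
    # Mirrors distance_range's half-open interval [lower, upper);
    # a bound of None leaves that side unbounded.
    if lower is not None and distance < lower:
        return False
    if upper is not None and distance >= upper:
        return False
    return True
```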

### Prefilter vs. postfilter

When the query carries a metadata filter via `where(...)`, you can choose whether the filter runs before or after the vector and FTS sub-queries. **Prefiltering** (the default) applies `where` to the candidate set before scoring, which is usually what you want — it shrinks the working set and benefits from any scalar indexes on the filter columns. **Postfiltering** runs the filter on the already-ranked top-k from each sub-query; this can be faster when the filter is non-selective or unindexed, but it may return fewer than `limit` rows because some of the top-k may be filtered out.

<CodeGroup>
```python Python icon="python"
# Prefilter (default): filter applied before scoring
table.search("flower moon", query_type="hybrid") \
.where("category = 'film'", prefilter=True) \
.limit(10) \
.to_pandas()

# Postfilter: filter applied after the sub-queries return top-k
table.search("flower moon", query_type="hybrid") \
.where("category = 'film'", prefilter=False) \
.limit(10) \
.to_pandas()
```

```typescript TypeScript icon="square-js"
// Prefilter (default): just call .where(...)
await table.query()
.fullTextSearch("flower moon")
.nearestTo(queryVector)
.where("category = 'film'")
.limit(10)
.toArray();

// Postfilter: chain .postfilter() after .where(...)
await table.query()
.fullTextSearch("flower moon")
.nearestTo(queryVector)
.where("category = 'film'")
.postfilter()
.limit(10)
.toArray();
```
</CodeGroup>

The choice gets baked into both sub-queries, so the vector and FTS halves see the filter applied the same way. Use [`explain_plan`](/search/optimize-queries#analyzing-non-vector-queries) on a hybrid query to see whether the filter pushed into the scan or ran as a separate `FilterExec` step.

## More on Reranking

You can perform hybrid search in LanceDB by combining the results of semantic and full-text search via a reranking algorithm of your choice. LanceDB comes with [**built-in rerankers**](https://lancedb.github.io/lancedb/reranking/) and you can implement your own **custom reranker** as well.
17 changes: 14 additions & 3 deletions docs/search/multivector-search.mdx
@@ -123,6 +123,15 @@ tbl.create_index(metric="cosine", vector_column_name="vector")
```
</CodeGroup>

<Info>
**Indexing matters more for multivector tables than for single-vector ones.**

A brute-force scan over a multivector column has to compare every query vector to every document vector in every row, so the cost grows with both the row count and the number of vectors per row.
In LanceDB OSS, an unindexed query simply runs to completion, so a large unindexed multivector table can stall a process for a long time before returning results.

On LanceDB Enterprise, the brute-force KNN safety check applies a stricter row threshold to multivector columns — roughly 10× lower than for single-vector columns. So an unindexed multivector table will start being rejected with a "vector search would use brute-force KNN" error well before a comparable single-vector table would. Build the index before you start hitting it from production traffic, even if the dataset is small enough that you'd skip indexing for a single-vector workload.
</Info>
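To see why the cost blows up, note that the comparison count is a straight product of three factors. A sketch of the arithmetic, with hypothetical numbers:

```python
def brute_force_comparisons(n_rows: int, doc_vectors_per_row: int, query_vectors: int) -> int:
    # Every query vector is compared against every document vector in every row.
    return n_rows * doc_vectors_per_row * query_vectors

# 100k rows with 1,000 vectors each, queried with 32 query-token vectors:
# 100_000 * 1_000 * 32 = 3.2 billion distance computations per search.
```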

### 6. Query a Single Vector

When searching with a single query vector, it will be compared against all vectors in each document, and the similarity scores will be aggregated to find the most relevant documents.
@@ -200,9 +209,11 @@ for d, emb in zip(docs, doc_embs):

tbl = db.create_table("docs", data=rows, schema=schema, mode="overwrite")

# 4) If your dataset is large, build an index + query using a query matrix
# For small datasets < 100k records, you can skip indexing
# tbl.create_index(vector_column_name="mv", metric="cosine")
# 4) Build an index + query using a query matrix.
# Multivector brute-force scales with rows × vectors-per-row, so build the index
# at much smaller dataset sizes than you would for single-vector search — and
# always before exposing the table to remote traffic.
tbl.create_index(vector_column_name="mv", metric="cosine")

query = "Tell me about ramen in Japan"
q_emb = np.asarray(model.encode([query], is_query=True)[0], dtype=np.float32)  # (Tq, dim)
50 changes: 50 additions & 0 deletions docs/search/optimize-queries.mdx
@@ -386,3 +386,53 @@ For vector search performance:

- Create ANN index on your vector column(s) as described in the [index guide](/indexing/vector-index/)
- If you often filter by metadata, create [scalar indices](/indexing/scalar-index/) on those columns

## Analyzing non-vector queries

`explain_plan` and `analyze_plan` aren't vector-specific — they're available on every query builder, including FTS and hybrid. The most common reason to look at the plan for a non-vector query is to confirm whether your `where` clause pushed into the scan (good) or ran as a separate `FilterExec` step on top of the search results (often slower, and a hint that the filter column needs a scalar index).

### FTS queries

<CodeGroup>
```python Python icon="python"
plan = (
table.search("puppy", query_type="fts")
.where("category = 'animals'", prefilter=True)
.limit(10)
.explain_plan(True)
)
print(plan)
```

```typescript TypeScript icon="square-js"
const plan = await table
.query()
.fullTextSearch("puppy")
.where("category = 'animals'")
.limit(10)
.explainPlan(true);
```
</CodeGroup>

In an indexed FTS plan you should see a `MatchQuery` (or other FTS execution node) reading from the inverted index, with the metadata filter pushed down. If the plan shows a `LanceScan` followed by `FilterExec` over the entire text column, the FTS index either isn't covering the column or the filter isn't using a scalar index — both worth investigating.
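If you capture the plan string in a script, a quick text check can flag the slow pattern. This is a heuristic over the plan text, not an official API, and the sample plan strings are made up for illustration:

```python
def filter_pushed_down(plan: str) -> bool:
    # Heuristic: a FilterExec node means the metadata filter ran as a
    # separate post-search step instead of being pushed into the scan/index.
    return "FilterExec" not in plan

indexed_plan = "MatchQuery: query=puppy\n  ScalarIndexQuery: category = 'animals'"
slow_plan = "FilterExec: category = 'animals'\n  LanceScan: ..."
```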

### Hybrid queries

For hybrid queries, `explain_plan` returns the reranker label followed by the vector and FTS sub-plans, indented for readability:

<CodeGroup>
```python Python icon="python"
plan = (
table.search("flower moon", query_type="hybrid")
.where("category = 'film'", prefilter=True)
.limit(10)
.explain_plan(True)
)
print(plan)
# RRFReranker(...)
# <vector sub-plan>
# <FTS sub-plan>
```
</CodeGroup>

`analyze_plan` does the same, but executes both sub-queries and labels them as `Vector Search Plan:` and `FTS Search Plan:` in the output. This is the easiest way to see whether the filter pushed into both halves uniformly, and which half is dominating latency.
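If you capture `analyze_plan` output in a script, the two labeled halves can be split apart for separate inspection. A sketch that assumes the `Vector Search Plan:` / `FTS Search Plan:` labels appear in that order, with a made-up sample plan:

```python
def split_hybrid_plan(output: str):
    """Split analyze_plan output into its labeled vector and FTS halves."""
    vec_label, fts_label = "Vector Search Plan:", "FTS Search Plan:"
    vec_start = output.index(vec_label)
    fts_start = output.index(fts_label)
    return output[vec_start:fts_start].strip(), output[fts_start:].strip()

sample = "Vector Search Plan:\n  ANNIvfPartition ...\nFTS Search Plan:\n  MatchQuery ..."
vector_half, fts_half = split_hybrid_plan(sample)
```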
40 changes: 39 additions & 1 deletion docs/search/sql/index.mdx
@@ -114,4 +114,42 @@ interface FlightRecord {
const flights = (await result.collectToObjects()) as FlightRecord[];
console.log(flights);
```
</CodeGroup>

### Inspecting query plans

The SQL endpoint runs queries through DataFusion, which means DataFusion's `EXPLAIN` family of statements is available unchanged. They're the SQL counterpart of the Python/TypeScript [`explain_plan` and `analyze_plan` methods](/search/optimize-queries) and are useful for the same things: confirming index usage, checking filter pushdown, and finding the slow operator in a query that's underperforming.

| Statement | What it returns |
| :----------------------- | :---------------------------------------------------------------------------------- |
| `EXPLAIN <query>` | Logical and physical plan, without executing the query. |
| `EXPLAIN ANALYZE <query>`| Executes the query and annotates each operator with runtime metrics (rows, timing). |
| `EXPLAIN VERBOSE <query>`| Adds intermediate optimizer plans on top of `EXPLAIN`. |

Run them through the same client you use for regular queries — the result is a small Arrow table with `plan_type` and `plan` columns:

<CodeGroup>
```python Python icon="python"
plan = run_query(
"EXPLAIN ANALYZE SELECT origin, destination "
"FROM flights WHERE origin = 'SFO' LIMIT 100"
)
for row in plan.to_pylist():
print(row["plan_type"])
print(row["plan"])
print()
```

```typescript TypeScript icon="square-js"
const plan = await client.query(
"EXPLAIN ANALYZE SELECT origin, destination " +
"FROM flights WHERE origin = 'SFO' LIMIT 100"
);
for (const row of (await plan.collectToObjects()) as Array<{ plan_type: string; plan: string }>) {
console.log(row.plan_type);
console.log(row.plan);
}
```
</CodeGroup>

The operators that show up in the SQL plan are the same ones documented on the [Optimize Query Performance](/search/optimize-queries) page (`LanceScan`, `ScalarIndexQuery`, `KNNVectorDistance`, `ANNIvfPartition`, and so on), so the same reasoning about index coverage and filter pushdown applies — just read the plan from a SQL client instead of a query builder.
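When eyeballing long `EXPLAIN ANALYZE` output, it can help to pull out just the operator names. A throwaway helper over the plan text; the operator list comes from the page linked above and isn't exhaustive, and the sample plan string is made up:

```python
KNOWN_OPERATORS = ("LanceScan", "ScalarIndexQuery", "KNNVectorDistance", "ANNIvfPartition", "FilterExec")

def operators_in(plan: str):
    # Collect known operator names appearing anywhere in the plan text.
    return [op for op in KNOWN_OPERATORS if op in plan]

plan_text = "ProjectionExec\n  FilterExec: origin = 'SFO'\n    LanceScan: flights"
```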
104 changes: 104 additions & 0 deletions workflows/docs-audit/skills/docs-writer/SKILL.md
@@ -0,0 +1,104 @@
---
name: docs-writer
description: Use when closing documentation gaps surfaced by a docs-audit run (e.g., workflows/docs-audit/artifacts/runs/<run-id>/report.md) by editing pages in this Mintlify site. The skill grounds every claim in source code from caller-named repos and prevents fabricated APIs, parameters, or behaviors.
---

# Docs writer (audit-driven)

You've been given a list of documentation gaps — typically from a `report.md` produced by a docs-audit run — and your job is to close them by editing pages under `docs/`. The work is mechanical, but two failure modes can ruin it:

1. **Fabrication**: writing about parameters, methods, or defaults that don't exist in the actual code.
2. **Drift**: leaving the prose updated but runnable snippets stale, so the page contradicts itself on the next regen.

Everything below exists to prevent those two failures.

## Workflow

### 1. Read the gap report

Start with the artifact the user pointed you at. For each gap, capture:

- **Page**: which docs page is affected (path under `docs/`).
- **Surface**: the specific behavior, parameter, or API the report says is missing or wrong.
- **Repos**: any source repos the report or user names. These are the source of truth — don't substitute or guess.

If a gap is vague, ask the user before writing. Vague gaps produce filler.

### 2. Inspect the source code

Locate the relevant repos before writing a single sentence. The user will name them; sibling checkouts are usually under `../` from this repo, but confirm with `ls ../` rather than assuming. If you can't find a named repo locally, ask the user for the path before proceeding — never substitute a different repo or fall back to memory.

Whatever repo the user names, that repo wins over your prior knowledge or training data.

For every claim you're about to write:

- **Grep for the symbol**: confirm the parameter, method, or flag exists by name.
- **Read the surrounding code**: understand the actual behavior, not just the signature. Defaults, error paths, and edge cases all matter.
- **Cite paths and line numbers** in the response so the user can audit your check.

If the source code disagrees with the gap report, trust the source and flag the discrepancy back to the user — it might be a real bug, or the report may be stale.

### 3. Draft the doc updates

Edit the affected MDX pages directly. Keep the change scoped to the gap; don't sweep in unrelated improvements unless the user asked for them.

For prose:

- **Placement**: put new sections where readers will encounter the concept naturally, not in the next empty slot at the bottom of the page.
- **Depth and tone**: match the heading depth and voice of surrounding sections.
- **Cross-links**: link to related pages with anchor links when it helps the reader, without spraying too many links and making the prose look cluttered.

For code examples:

- **Short illustrative one-liners**: inline code blocks are fine.
- **Anything canonical or multi-line**: follow the test → snippet → MDX pipeline described in `skills/docs-writer/SKILL.md`. Add a runnable test, mark it with `--8<-- [start:name]` / `[end:name]`, run `make snippets`, and import the generated `{Py|Ts|Rs}{TitleCase}` export.
- **Consistency**: if the page already uses snippet imports, don't introduce inline blocks (or vice versa).

### 4. Verify before reporting back

- Run the relevant test suite if you added tests (`pytest tests/py/...`, `npx jest ...`, `cargo run --example ...`).
- Run `make snippets` if test sources changed, and confirm `git status` shows the regenerated MDX so it lands in the same commit.
- Re-read each new sentence against the code one more time. If a claim isn't directly supported by something you grepped, cut it.

## Style rules

Apply these to every piece of prose you write or edit through this skill.

- **Bullet separators**: use colons, not em dashes. Write `Placement: put new sections where...`, not `Placement — put new sections where...`.
- **Contractions**: write `it's`, `don't`, `you'll`. Skip the formal register; the docs are technical, but they're not a legal document.
- **Subheader case**: sentence case only — capitalize just the first word and any proper nouns. Use `Tag-based versioning`, not `Tag-Based Versioning`.
- **Tone**: technical but approachable. Write primarily for engineers, but define jargon on first use and ground abstract ideas in concrete examples. Don't assume the reader has the same context you have from reading the source.

## Anti-fabrication checklist

Before finalizing any doc update, walk this list:

- Every parameter named in prose was found via grep in the source repo.
- Every method call shown in a code example compiles or runs (in a test, ideally).
- Every claimed default value was checked in the code, not assumed.
- Every cross-language parity claim (e.g., "Python and TypeScript both expose X") was checked in each binding's source.
- Every "this is exempt", "this is preserved", or "this is automatic" claim has a code or comment citation.

If an item fails, either fix the doc or surface the question to the user. Never paper over a gap with confident-sounding prose.

## Output contract

When using this skill, the final result should:

- Update the affected MDX pages so the identified gaps are genuinely closed.
- Cite source-code paths and line numbers for non-trivial claims, both in the PR description and in the response back to the user.
- Flag any gap that turned out to be inaccurate, ambiguous, or out of scope, with reasoning.
- Leave the snippet pipeline coherent if code examples were touched: tests pass, `docs/snippets/` regenerated, regenerated files staged alongside the test changes.

## Request template

A typical invocation looks like:

```text
Use workflows/docs-audit/skills/docs-writer/SKILL.md to close the gaps in
workflows/docs-audit/artifacts/runs/<run-id>/report.md. Cross-check each claim
against <repo-path-1> and <repo-path-2>, and update the affected pages under
docs/.
```

The user supplies the repo paths in the invocation. Don't infer them; if you're unsure which source repos to use as a grounding reference, confirm with the user.