22 changes: 22 additions & 0 deletions website/docs/components/embeddings/index.md
@@ -303,6 +303,28 @@ datasets:
row_id: id
```

### Multi-Vector Embeddings

When the source column is `List<Utf8>` (or `LargeList<Utf8>`), Spice embeds each list element independently and produces a `List<FixedSizeList<Float32, N>>` column. This is the multi-vector (column-of-vectors) mode, useful for rows that carry several independent pieces of text such as tags, section headings, or historical queries.

```yaml
datasets:
  - from: file:products.parquet
    name: products
    acceleration:
      enabled: true
    columns:
      - name: tags # List<Utf8>
        embeddings:
          - from: local_embedding_model
            aggregation: max
            max_elements_per_row: 64
```

The `aggregation` field controls how per-element similarities are combined into a per-row score during vector search. `max` (default) is ColBERT-style `MaxSim`; `mean` and `sum` are also supported. The `max_elements_per_row` field caps how many list elements are embedded per row (default `32`, hard limit `1024`). Multi-vector columns also support [ColBERT-style late-interaction search](../features/search/multi-vector#late-interaction-multi-query-search) via an array of query strings.

See [Multi-Vector Search](../features/search/multi-vector) for query usage and [`columns[*].embeddings[*]`](../reference/spicepod/datasets#columnsembeddings) for the full field reference.

import DocCardList from '@theme/DocCardList';

<DocCardList />
22 changes: 22 additions & 0 deletions website/docs/features/search/index.md
@@ -22,6 +22,7 @@ Spice provides comprehensive search capabilities enabling developers to query da
Spice supports multiple search methods:

- **Vector Search**: Semantic search using embeddings to retrieve data by meaning and similarity.
- **Multi-Vector Search**: Search over columns of vectors, including ColBERT-style late-interaction queries.
- **Full-Text Search**: Keyword-driven search optimized for text data retrieval.
- **Hybrid Search**: Combine multiple search methods using Reciprocal Rank Fusion (RRF) for improved relevance.
- **SQL Search**: Traditional SQL queries for precise and structured searches.
@@ -52,6 +53,27 @@ LIMIT 5

For complete SQL UDTF specifications, see [Vector-Based Search SQL UDTF](search/vector-search#sql-udtf).

### Multi-Vector Search

Multi-vector search operates on columns that store many vectors per row, such as per-tag or per-section embeddings. It also supports ColBERT-style late-interaction queries where the query itself is an array of strings.

**Requirements:**

- A list-typed source column (`List<Utf8>` or `LargeList<Utf8>`) with an embedding configured; the `aggregation` setting is optional and defaults to `max`

**Getting Started:**

- [Multi-Vector Search Docs](search/multi-vector)

**Example SQL Multi-Vector Search:**

```sql
SELECT product_id, name, score
FROM vector_search(products, ['hiking', 'waterproof'], tags)
ORDER BY score DESC
LIMIT 10
```

### Full-Text Search

Full-text search efficiently retrieves records matching specific keywords.
110 changes: 110 additions & 0 deletions website/docs/features/search/multi-vector.md
@@ -0,0 +1,110 @@
---
title: 'Multi-Vector Search'
sidebar_label: 'Multi-Vector Search'
description: 'Embed list-of-strings columns as a column of vectors and use ColBERT-style late-interaction search in Spice.'
sidebar_position: 3
tags:
- search
- embeddings
- models
---

A multi-vector column stores many embedding vectors per row rather than a single vector. Spice produces a multi-vector column by embedding each element of a `List<Utf8>` source column independently, yielding a `List<FixedSizeList<Float32, N>>` embedding column.

Multi-vector embeddings are useful when a single row has several distinct pieces of text — for example, a product with many tags, a paper with multiple titles and section headings, or a user with a set of historical queries. Each element is embedded and scored separately, and per-row results are produced by aggregating the per-element similarities.

## How Multi-Vector Differs from Chunking

Chunking splits one long string (such as a document body) into pieces and embeds each piece. Multi-vector starts from a column that is already a list of independent strings and embeds each list element as-is.

| Source column type | Embedding mode | Produced embedding type |
| ------------------ | ---------------------- | --------------------------------- |
| `Utf8` | Scalar (default) | `FixedSizeList<Float32, N>` |
| `Utf8` + chunking | Chunked | `List<FixedSizeList<Float32, N>>` |
| `List<Utf8>` | Multi-vector (default) | `List<FixedSizeList<Float32, N>>` |

Multi-vector and chunked columns share the same Arrow type, but the per-element offsets column (`<column>_offsets`) is only produced for chunked columns.
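
The difference between the two modes can be illustrated with a minimal Python sketch. It is not Spice's implementation: `toy_embed`, `embed_chunked`, and `embed_multi_vector` are hypothetical names, the 3-dimensional "embedding" is a stand-in for a real model, and fixed-size character windows are an assumed chunking strategy chosen only for illustration.

```python
from typing import List

Vector = List[float]


def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for an embedding model (N = 3 here).
    return [float(len(text)), float(text.count(" ")), 1.0]


def embed_chunked(document: str, chunk_size: int) -> List[Vector]:
    # Chunked mode: one long string is split, then each piece is embedded.
    # Fixed-size character windows are an assumption for illustration only.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [toy_embed(chunk) for chunk in chunks]


def embed_multi_vector(elements: List[str]) -> List[Vector]:
    # Multi-vector mode: the column is already a list; embed each element as-is.
    return [toy_embed(element) for element in elements]
```

Both functions return a list of fixed-size vectors per row, which mirrors the shared `List<FixedSizeList<Float32, N>>` type; only the chunked path has split positions that an offsets column would need to record.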

## Configuring a Multi-Vector Column

Define an embedding on a `List<Utf8>` column the same way as on a scalar string column. Spice detects the list type and embeds each element independently.

```yaml
datasets:
  - from: file:products.parquet
    name: products
    acceleration:
      enabled: true
    columns:
      - name: tags # List<Utf8>
        embeddings:
          - from: local_embedding_model
            aggregation: max
            max_elements_per_row: 64

embeddings:
  - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    name: local_embedding_model
```

### Aggregation Strategies

When a multi-vector column is queried with a single query string, each element's similarity to the query is computed, and the per-row score is the aggregate of those similarities.

| `aggregation` | Description |
| ------------- | ---------------------------------------------------------------------------------- |
| `max` | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
| `mean` | Average similarity across elements. Favors rows where most elements are relevant. |
| `sum` | Sum of similarities. Biases toward rows with many matching elements. |
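
The three strategies can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Spice's code: `cosine` and `row_score` are hypothetical helper names, cosine similarity is assumed as the per-element metric (as in the late-interaction formula on this page), and returning `0.0` for a row with no elements is an assumption.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def row_score(query: Sequence[float],
              elements: List[Sequence[float]],
              aggregation: str = "max") -> float:
    # Score one row: compare the query to every element, then aggregate.
    sims = [cosine(query, element) for element in elements]
    if not sims:
        return 0.0  # assumption: a row with no elements scores zero
    if aggregation == "max":
        return max(sims)
    if aggregation == "mean":
        return sum(sims) / len(sims)
    if aggregation == "sum":
        return sum(sims)
    raise ValueError(f"unknown aggregation: {aggregation}")
```

For a query `[1, 0]` and elements `[[1, 0], [0, 1]]`, the per-element similarities are `1.0` and `0.0`, so `max` and `sum` both give `1.0` while `mean` gives `0.5`, which shows how `mean` penalizes rows that contain unrelated elements.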

### Element Caps

Multi-vector columns default to embedding the first 32 elements per row. Change the cap with `max_elements_per_row` (hard-capped at `1024`). Excess elements are dropped with a warning log so that rows with very long lists do not drive up embedding cost.
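
The capping behavior described above can be sketched as follows. The function name `cap_elements` and the logger name are hypothetical, and clamping a configured value above `1024` down to the hard limit is an assumption; Spice may instead reject such a configuration.

```python
import logging
from typing import List

logger = logging.getLogger("multi_vector_sketch")  # hypothetical logger name

DEFAULT_MAX_ELEMENTS = 32  # documented default
HARD_LIMIT = 1024          # documented hard cap


def cap_elements(elements: List[str],
                 max_elements_per_row: int = DEFAULT_MAX_ELEMENTS) -> List[str]:
    # Assumption: a configured value above the hard limit is clamped to it.
    cap = min(max_elements_per_row, HARD_LIMIT)
    if len(elements) > cap:
        logger.warning("dropping %d excess elements (cap=%d)",
                       len(elements) - cap, cap)
    # Keep only the first `cap` elements, preserving their order.
    return elements[:cap]
```

Because only the first elements are kept, the order of the source list determines which elements are embedded when a row exceeds the cap.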

## Querying with `vector_search`

A multi-vector column is queried with the standard `vector_search` UDTF. The configured `aggregation` is applied automatically.

```sql
SELECT product_id, name, score
FROM vector_search(products, 'travel accessories', tags)
ORDER BY score DESC
LIMIT 10;
```

## Late-Interaction (Multi-Query) Search

Multi-vector columns also support ColBERT-style late-interaction search, where the query itself is an array of strings. Each query is embedded independently, the best-matching element is selected for each query (`MaxSim`), and the per-row score is the sum of those maxima across queries. With `Q` the set of query embeddings and `d` the set of element embeddings in a row:

```text
score(d) = Σ_{q ∈ Q} max_{e ∈ d} cos(q, e)
```

```sql
SELECT product_id, name, score
FROM vector_search(
  products,
  ['hiking', 'waterproof', 'lightweight'],
  tags
)
ORDER BY score DESC
LIMIT 10;
```

Late-interaction search is supported only on multi-vector columns; passing an array of queries to a scalar or chunked column returns an error. At most 32 query strings are accepted per call.
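
The late-interaction score can be written out as a short Python sketch. It assumes the queries and elements are already embedded; `cosine` and `late_interaction_score` are hypothetical names, and both raising `ValueError` above 32 queries and scoring an empty row as `0.0` are assumptions about edge cases the text does not spell out.

```python
import math
from typing import List, Sequence

MAX_QUERIES = 32  # documented per-call limit


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def late_interaction_score(queries: List[Sequence[float]],
                           elements: List[Sequence[float]]) -> float:
    # score(d) = sum over queries q of max over elements e of cos(q, e)
    if len(queries) > MAX_QUERIES:
        # Assumption: exceeding the limit is reported as an error.
        raise ValueError(f"at most {MAX_QUERIES} queries are accepted")
    if not elements:
        return 0.0  # assumption: a row with no elements scores zero
    return sum(max(cosine(q, e) for e in elements) for q in queries)
```

A row whose elements cover both queries `[1, 0]` and `[0, 1]` scores `2.0`, while a row containing only `[1, 0]` scores `1.0`, so rows that match more of the query strings rank higher.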

## Passthrough Multi-Vector Columns

Datasets that already contain multi-vector columns can be used directly when their schema matches the conventions in [Vector-Based Search](vector-search#using-existing-embeddings):

- Column name: `<original_column>_embedding`
- Type: `List<FixedSizeList<Float32 or Float64, N>>`
- No offsets column (that is only required for chunked scalar columns)

Declare the underlying column's embedding in `spicepod.yaml` so that Spice knows which embedding model the existing vectors came from.

## Limitations

- Multi-vector embeddings require the source column to be `List<Utf8>` or `LargeList<Utf8>`.
- Late-interaction search accepts at most 32 query strings per call.
- Multi-vector columns cannot currently be stored in an external vector engine; use a [data accelerator](../../components/data-accelerators) with `acceleration.enabled: true` to cache embeddings.
2 changes: 2 additions & 0 deletions website/docs/features/search/vector-search.md
@@ -18,6 +18,8 @@ Vector search uses embeddings (numerical representations of text or data) to fin
- Retrieval-augmented generation (RAG) applications
- Recommendation systems

For embedding columns that contain many vectors per row (for example, one vector per tag or per section), see [Multi-Vector Search](multi-vector).

## Embedding Models

Spice supports two types of embedding providers:
42 changes: 37 additions & 5 deletions website/docs/reference/spicepod/datasets.md
@@ -806,6 +806,38 @@ columns:
vector_size: 1024
```

## `columns[*].embeddings[*].aggregation`

Optional. For multi-vector columns (`List<Utf8>` source), the strategy used to combine per-element similarities into a single per-row score during vector search. Only meaningful when the underlying column is list-typed.

| Value | Description |
| ------ | ---------------------------------------------------------------------------------- |
| `max` | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
| `mean` | Average similarity across elements. |
| `sum` | Sum of similarities across elements. |

See [Multi-Vector Search](../../features/search/multi-vector) for details.

```yaml
columns:
  - name: tags
    embeddings:
      - from: local_embedding_model
        aggregation: max
```

## `columns[*].embeddings[*].max_elements_per_row`

Optional. For multi-vector columns, the maximum number of list elements embedded per row. Defaults to `32`; hard-capped at `1024`. Excess elements are dropped with a warning log.

```yaml
columns:
  - name: tags
    embeddings:
      - from: local_embedding_model
        max_elements_per_row: 128
```

## `columns[*].full_text_search` {#columns-search-full-text}

## `columns[*].full_text_search.enabled`
@@ -910,11 +942,11 @@ The `metadata` field serves two purposes:

2. **File metadata columns** — For [file-based connectors](../../components/data-connectors/#metadata-columns) (S3, ABFS, File, FTP, SFTP, SMB, NFS, HTTP/HTTPS), the following reserved keys enable virtual columns that expose per-file object store metadata in query results:

| Key | Value | Column Type | Description |
| ---------------- | --------- | ---------------------- | ------------------------------- |
| `_location` | `enabled` | `Utf8` | Full URI of the source file |
| `_last_modified` | `enabled` | `Timestamp(µs, "UTC")` | When the file was last modified |
| `_size` | `enabled` | `UInt64` | File size in bytes |

```yaml
datasets:
12 changes: 12 additions & 0 deletions website/docs/reference/sql/search.md
@@ -13,6 +13,7 @@ This section documents search capabilities in Spice SQL, including vector search
- [Vector Search (`vector_search`)](#vector-search-vector_search)
- [Usage](#usage)
- [Example](#example)
- [Multi-Query (Late-Interaction) Form](#multi-query-late-interaction-form)
- [Full-Text Search (`text_search`)](#full-text-search-text_search)
- [Usage](#usage-1)
- [Example](#example-1)
@@ -61,6 +62,17 @@ LIMIT 2;

See [Vector-Based Search](../../features/search/vector-search) for configuration and advanced usage.

### Multi-Query (Late-Interaction) Form

When the target column is a [multi-vector column](../../features/search/multi-vector), `vector_search` also accepts an array of query strings. Each query is embedded independently and the per-row score is `Σ_q max_e cos(q, e)` — ColBERT-style late interaction. Passing an array to a scalar or chunked column returns an error. At most 32 query strings are accepted per call.

```sql
SELECT product_id, name, score
FROM vector_search(products, ['hiking', 'waterproof', 'lightweight'], tags)
ORDER BY score DESC
LIMIT 10;
```

---

## Full-Text Search (`text_search`)