Commit 093e946

Document Elasticsearch data connector and vector engine
1 parent 61c920c commit 093e946

6 files changed

Lines changed: 319 additions & 4 deletions

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: 'Elasticsearch Data Connector'
sidebar_label: 'Elasticsearch Data Connector'
description: 'Query Elasticsearch indexes as SQL tables in Spice, including kNN vector search, full-text search, and hybrid search.'
tags:
  - data-connectors
  - elasticsearch
  - search
---

The Elasticsearch Data Connector exposes Elasticsearch indexes as SQL tables in Spice. Index mappings are translated to Arrow schemas so that documents can be queried with federated SQL alongside data from other connectors. The connector also bridges Elasticsearch's native kNN and full-text search into Spice, enabling [hybrid search](../../features/search) through the standard `vector_search`, `text_search`, and `rrf` UDTFs.

```yaml
datasets:
  - from: elasticsearch:products
    name: products
    params:
      elasticsearch_endpoint: https://localhost:9200
      elasticsearch_user: ${secrets:es_user}
      elasticsearch_pass: ${secrets:es_pass}
```

:::note[Build requirement]
The Elasticsearch connector is behind the `elasticsearch` Cargo feature and is not compiled into the default distribution. Build Spice from source with `cargo build --release --features elasticsearch`, or use a distribution that enables it.
:::

## Configuration

### `from`

The `from` field takes the form `elasticsearch:{index_name}` where `index_name` is the Elasticsearch index to query.

```yaml
datasets:
  - from: elasticsearch:products
    name: products
```

Dot-separated paths may be used to refer to nested fields in query results (e.g. `address.city`); the connector flattens object mappings into Arrow columns using that convention.

### `name`

The dataset name used as the table name within Spice. The dataset name cannot be a [reserved keyword](../../reference/spicepod/keywords).

### `params`

The Elasticsearch connector accepts the following `params`. Use the [secret replacement syntax](../secret-stores) to load credentials from a secret store.

| Parameter Name           | Description                                   | Required | Default |
| ------------------------ | --------------------------------------------- | -------- | ------- |
| `elasticsearch_endpoint` | Cluster URL (e.g., `https://localhost:9200`). | Yes      | -       |
| `elasticsearch_user`     | Username for HTTP basic authentication.       | No       | -       |
| `elasticsearch_pass`     | Password for HTTP basic authentication.       | No       | -       |

## Types

The connector derives an Arrow schema from each index's mapping via `GET /<index>/_mapping`. Elasticsearch field types map to Arrow as follows:

| Elasticsearch Field Type | Arrow Type | Notes |
| ------------------------ | ---------- | ----- |
| `text`, `keyword`, `wildcard`, `constant_keyword`, `match_only_text` | `Utf8` | |
| `long` | `Int64` | |
| `integer` | `Int32` | |
| `short` | `Int16` | |
| `byte` | `Int8` | |
| `double` | `Float64` | |
| `float`, `half_float`, `scaled_float` | `Float32` | |
| `boolean` | `Boolean` | |
| `date`, `date_nanos` | `Utf8` | ES dates are flexibly formatted; preserved as strings. |
| `binary` | `Utf8` | Base64-encoded in the JSON response. |
| `ip` | `Utf8` | |
| `dense_vector` (with `dims`) | `FixedSizeList<Float32, dims>` | Required `dims` field must fit in `i32`. |
| `dense_vector` (missing `dims`) | `Utf8` | Falls back to raw JSON when dims cannot be resolved. |
| `object`, `nested` | `Utf8` | Serialized JSON. |
| Any other mapping type | `Utf8` | Fallback: the raw JSON value is preserved as a string. |

Nested `object` fields are flattened by concatenating field names with dots (e.g. `address.city`). `nested` fields are preserved as JSON strings because per-document ordering must be retained.

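
The mapping rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the connector's implementation: Arrow types are represented as plain strings, the helper names `arrow_type_for` and `flatten_mapping` are hypothetical, and `object` fields with sub-properties are flattened as described in the preceding paragraph.

```python
# Illustrative only: Arrow types are shown as strings, and the helper names
# below are hypothetical, not part of Spice.
SCALAR_TYPES = {
    "text": "Utf8", "keyword": "Utf8", "wildcard": "Utf8",
    "constant_keyword": "Utf8", "match_only_text": "Utf8",
    "long": "Int64", "integer": "Int32", "short": "Int16", "byte": "Int8",
    "double": "Float64",
    "float": "Float32", "half_float": "Float32", "scaled_float": "Float32",
    "boolean": "Boolean",
    "date": "Utf8", "date_nanos": "Utf8", "binary": "Utf8", "ip": "Utf8",
}


def arrow_type_for(field: dict) -> str:
    """Return the Arrow type name for one Elasticsearch field definition."""
    es_type = field.get("type", "object")
    if es_type == "dense_vector":
        dims = field.get("dims")
        if isinstance(dims, int) and 0 < dims <= 2**31 - 1:
            return f"FixedSizeList<Float32, {dims}>"
        return "Utf8"  # no usable dims: fall back to raw JSON
    # `nested` and any unknown type fall back to a JSON string.
    return SCALAR_TYPES.get(es_type, "Utf8")


def flatten_mapping(properties: dict, prefix: str = "") -> dict:
    """Flatten `object` sub-fields into dotted column names."""
    columns = {}
    for name, field in properties.items():
        path = f"{prefix}{name}"
        is_object = field.get("type", "object") == "object" and "properties" in field
        if is_object:
            columns.update(flatten_mapping(field["properties"], prefix=f"{path}."))
        else:
            columns[path] = arrow_type_for(field)
    return columns


# A made-up mapping, shaped like the `properties` object of a mapping response.
mapping = {
    "name": {"type": "text"},
    "price": {"type": "double"},
    "address": {"properties": {"city": {"type": "keyword"}}},
    "embedding": {"type": "dense_vector", "dims": 384},
    "reviews": {"type": "nested", "properties": {"stars": {"type": "byte"}}},
}
print(flatten_mapping(mapping))
```

In this sketch `address.city` becomes its own `Utf8` column, while `reviews` stays a single JSON string column because it is `nested`.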
## Querying

After registering a dataset, query it like any other Spice table:

```sql
SELECT name, price
FROM products
WHERE price > 100
ORDER BY price DESC
LIMIT 10;
```

### Vector and Full-Text Search

When an index contains a `dense_vector` field, the Elasticsearch connector wires it into Spice's search pipeline. This enables:

- **Vector similarity search** via [`vector_search`](../../reference/sql/search#vector-search-vector_search), executed natively as an Elasticsearch kNN query.
- **Full-text search** via [`text_search`](../../reference/sql/search#full-text-search-text_search), executed using Elasticsearch's native BM25 ranking.
- **Hybrid search** via [`rrf`](../../reference/sql/search#reciprocal-rank-fusion-rrf), combining both with Reciprocal Rank Fusion.

These operations run against the Elasticsearch cluster directly rather than ingesting vectors into an accelerator, keeping indexing and search colocated in Elasticsearch.

Example:

```sql
-- kNN vector search against Elasticsearch
SELECT product_id, name, score
FROM vector_search(products, 'wireless noise cancelling headphones')
ORDER BY score DESC
LIMIT 10;

-- BM25 full-text search
SELECT product_id, name, score
FROM text_search(products, 'headphones waterproof', description)
ORDER BY score DESC
LIMIT 10;

-- Hybrid search via RRF
SELECT product_id, name, fused_score
FROM rrf(
    vector_search(products, 'wireless noise cancelling headphones'),
    text_search(products, 'headphones waterproof', description),
    join_key => 'product_id'
)
ORDER BY fused_score DESC
LIMIT 10;
```

See [Search Functionality](../../features/search) for the full search feature guide.

## Authentication

The connector uses HTTP basic authentication when `elasticsearch_user` and `elasticsearch_pass` are provided. For production deployments, store credentials in a [secret store](../secret-stores) and reference them with `${secrets:...}` rather than hard-coding them in `spicepod.yaml`.

TLS is enabled automatically for `https://` endpoints.

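
For reference, HTTP basic authentication amounts to a single `Authorization` header carrying the Base64-encoded `user:password` pair (RFC 7617). The sketch below shows that encoding only; `basic_auth_header` is a hypothetical helper and the credentials are placeholders, and none of this is meant to describe Spice's internal HTTP client.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build the Authorization header for HTTP basic authentication (RFC 7617)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration; real values belong in a secret store.
headers = basic_auth_header("elastic", "example-password")
print(headers)
```

Because Base64 is an encoding rather than encryption, the header is only protected in transit when the endpoint uses `https://`.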
## Limitations

- Nested object fields are exposed as JSON strings rather than structured columns.
- `date` and `date_nanos` fields are preserved as strings because Elasticsearch accepts heterogeneous date formats; cast to a timestamp in SQL when chronological comparison is required.
- `dense_vector` fields without a declared `dims` value fall back to `Utf8` and are not usable as a vector column.
- Pushdown of SQL predicates to Elasticsearch query DSL is limited; complex filter expressions are evaluated locally by DataFusion after fetching results.

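
The `date` limitation above is easier to see with a concrete case. The following sketch assumes a field using Elasticsearch's default date format, which accepts both ISO 8601 strings and epoch milliseconds; `parse_es_date` is a hypothetical helper that handles only those two encodings, and treating naive values as UTC is an assumption made for the example.

```python
from datetime import datetime, timezone


def parse_es_date(value: str) -> datetime:
    """Parse two common Elasticsearch date encodings into a UTC datetime."""
    if value.isdigit():
        # epoch_millis, e.g. "1705312800000"
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
    # ISO 8601, e.g. "2024-01-15T10:00:00Z"
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return parsed.astimezone(timezone.utc)


iso_value = "2024-01-15T10:00:00Z"
millis_value = "1705312800000"

# As strings the two values differ, although they denote the same instant.
print(iso_value == millis_value)
print(parse_es_date(iso_value) == parse_es_date(millis_value))
```

Comparing the raw strings would treat these two documents as different points in time, which is why normalizing to a timestamp before comparing matters.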
Elasticsearch can also be configured as a [Vector Engine](../vectors/elasticsearch) for datasets sourced from other connectors (storing Spice-managed embeddings in Elasticsearch rather than querying an existing index).

website/docs/components/data-connectors/index.md

Lines changed: 2 additions & 1 deletion
@@ -73,7 +73,7 @@ Supported Data Connectors include:
 | `ducklake` | DuckLake | Alpha | Parquet |
 | `scylladb` | ScyllaDB | Alpha | CQL, Alternator (DynamoDB) |
 | `adbc` | [ADBC][adbc] | Alpha | Arrow (ADBC) |
-| `elasticsearch` | ElasticSearch | Roadmap | |
+| `elasticsearch` | [Elasticsearch][elasticsearch] | Alpha | Elasticsearch REST |

 [databricks]: https://github.com/spiceai/cookbook/tree/trunk/databricks#readme
 [spark]: https://spark.apache.org/docs/latest/spark-connect-overview.html
@@ -85,6 +85,7 @@ Supported Data Connectors include:
 [glue]: https://github.com/spiceai/cookbook/tree/trunk/glue#readme
 [adbc]: https://arrow.apache.org/adbc/
 [ODPIC]: https://oracle.github.io/odpi/
+[elasticsearch]: ./elasticsearch.md

 ## File Formats

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
---
title: 'Elasticsearch Vector Engine'
sidebar_label: 'Elasticsearch'
description: 'Use Elasticsearch as a vector engine in Spice for kNN vector search, full-text search, and hybrid search.'
sidebar_position: 2
pagination_next: null
---

Elasticsearch can be used as a vector engine in Spice to store embeddings and execute kNN similarity search, full-text search (BM25), and hybrid search (RRF) natively in the Elasticsearch cluster. This is useful when Elasticsearch is already the system of record for a workload, or when the operational characteristics of a managed Elasticsearch cluster (replication, sharding, snapshots) are preferred over a dedicated vector store.

Unlike the [Elasticsearch Data Connector](../data-connectors/elasticsearch), which reads an existing Elasticsearch index as a Spice dataset, the Elasticsearch vector engine accepts data from any Spice data connector, generates embeddings using the configured embedding model, and writes vectors (and source fields) to an Elasticsearch index that Spice manages.

```yaml
datasets:
  - from: file:products.parquet
    name: products
    acceleration:
      enabled: true
    vectors:
      enabled: true
      engine: elasticsearch
      params:
        elasticsearch_endpoint: https://localhost:9200
        elasticsearch_user: ${secrets:es_user}
        elasticsearch_pass: ${secrets:es_pass}
        elasticsearch_index: products-embeddings
    columns:
      - name: description
        embeddings:
          - from: bedrock_titan

embeddings:
  - from: bedrock:amazon.titan-embed-text-v2:0
    name: bedrock_titan
    params:
      aws_region: us-east-2
      dimensions: '1024'
```

:::note[Build requirement]
The Elasticsearch vector engine requires the `elasticsearch` Cargo feature, which is not enabled in the default distribution. Build Spice from source with `cargo build --release --features elasticsearch`, or use a distribution that enables it.
:::

## Parameters

| Parameter | Description | Example Value |
| --------- | ----------- | ------------- |
| `elasticsearch_endpoint` | Required. Cluster URL. | `https://localhost:9200` |
| `elasticsearch_user` | Optional. Username for HTTP basic authentication. | `${secrets:es_user}` |
| `elasticsearch_pass` | Optional. Password for HTTP basic authentication. | `${secrets:es_pass}` |
| `elasticsearch_index` | Optional. Index used to store vectors. Defaults to a sanitized `{dataset}-{column}-{model}` value. | `products-embeddings` |
| `elasticsearch_vector_field` | Optional. Name of the `dense_vector` field in Elasticsearch. Defaults to `{column}_embedding`. | `description_embedding` |

## Overview

When configured as a vector engine, Spice:

1. Reads data from the underlying connector (for example, Parquet on disk or a federated SQL source).
2. Computes embeddings on the configured column using the attached embedding model.
3. Writes vectors and source fields to the configured Elasticsearch index, provisioning the index mapping when needed (`dense_vector` of the correct dimension plus text fields for full-text search).
4. At query time, routes `vector_search`, `text_search`, and `rrf` against the Elasticsearch index using native kNN and BM25 queries.

Source fields on the dataset are indexed as `text` in Elasticsearch so they can be used as full-text search targets. Primary key columns are indexed as `keyword` and included in kNN results so that matches can be joined back to the Spice base table when additional columns are requested.

:::warning[Limitations]

- A dataset or view must be accelerated (`datasets[].acceleration.enabled: true`) so that the vector engine receives the data it needs to ingest. See [`acceleration.enabled`](../../reference/spicepod/datasets#accelerationenabled).
- The dataset must have a resolvable primary key, either via the underlying schema or an explicit [`row_id`](../../reference/spicepod/datasets#columnsembeddingsrow_id).
- Elasticsearch kNN search is approximate (ANN): results are very likely, but not guaranteed, to be the true nearest neighbors.

:::

## Configuration

### Embedding Models

Any embedding model supported by Spice can be used to produce the vectors written to Elasticsearch, including local models via [Hugging Face](../embeddings/huggingface) and hosted models via [OpenAI](../embeddings/openai), [Bedrock](../embeddings/bedrock), and others. The vector dimension is inferred from the embedding model and used to provision the Elasticsearch `dense_vector` field.

```yaml
embeddings:
  - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    name: local_embedding_model
```

### Primary Keys

Spice requires a primary key to round-trip matches between Elasticsearch and the base dataset. If the source dataset does not carry primary key metadata, specify it on the column embedding:

```yaml
columns:
  - name: description
    embeddings:
      - from: local_embedding_model
        row_id: product_id
```

### Custom Index and Vector Field Names

By default, the index name is a sanitized `{dataset}-{column}-{model}` and the vector field is `{column}_embedding`. Override either with `elasticsearch_index` and `elasticsearch_vector_field`:

```yaml
vectors:
  enabled: true
  engine: elasticsearch
  params:
    elasticsearch_endpoint: https://localhost:9200
    elasticsearch_index: products-vectors-v2
    elasticsearch_vector_field: desc_vec
```

## Querying

Vector, full-text, and hybrid search use the standard Spice UDTFs. When the dataset is backed by the Elasticsearch vector engine, these UDTFs compile to native Elasticsearch queries rather than local computation.

### Vector Search

```sql
SELECT product_id, name, score
FROM vector_search(products, 'wireless noise cancelling headphones')
ORDER BY score DESC
LIMIT 10;
```

The query text is embedded with the configured embedding model and sent to Elasticsearch as a kNN query. By default the number of candidates considered by Elasticsearch is twice the requested `k`.

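
As a rough picture of what such a request looks like, the sketch below builds a search body using the `knn` option of the Elasticsearch `_search` API, with `num_candidates` set to twice `k`. The exact request Spice issues is not shown in this documentation, so treat the shape as an assumption; `build_knn_body` is a hypothetical helper, the field name follows the `{column}_embedding` default, and the three-element vector is a placeholder for a real embedding.

```python
import json


def build_knn_body(field: str, query_vector: list, k: int) -> dict:
    """Build an Elasticsearch kNN search body with num_candidates = 2 * k."""
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 2 * k,
        },
        "size": k,
    }


# Placeholder vector; a real query vector has the embedding model's dimension.
body = build_knn_body("description_embedding", [0.12, -0.03, 0.88], k=10)
print(json.dumps(body, indent=2))
```

A larger `num_candidates` generally improves recall at the cost of latency, since more candidates are examined per shard before the top `k` are returned.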
### Full-Text Search

Any `Utf8`/`LargeUtf8` column on the dataset is available as a full-text search target:

```sql
SELECT product_id, name, score
FROM text_search(products, 'bluetooth waterproof', description)
ORDER BY score DESC
LIMIT 10;
```

### Hybrid Search (RRF)

Combine vector and full-text results with [Reciprocal Rank Fusion](../../reference/sql/search#reciprocal-rank-fusion-rrf):

```sql
SELECT product_id, name, fused_score
FROM rrf(
    vector_search(products, 'wireless noise cancelling headphones'),
    text_search(products, 'bluetooth waterproof', description),
    join_key => 'product_id'
)
ORDER BY fused_score DESC
LIMIT 10;
```

Advanced RRF options (per-query `rank_weight`, recency decay, and a custom smoothing constant `k`) work identically regardless of the underlying vector engine. See [RRF](../../reference/sql/search#reciprocal-rank-fusion-rrf) for the full reference.

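
Reciprocal Rank Fusion itself is compact enough to sketch. In its textbook form, each result's fused score is the sum over input rankings of `weight / (k + rank)`. The code below implements that textbook formula only: the default `k = 60` is the conventional value from the RRF literature, modeling `rank_weight` as a per-ranking multiplier is an assumption, and Spice's actual defaults and recency decay are not reproduced here.

```python
from collections import defaultdict


def rrf(rankings, k=60.0, weights=None):
    """Fuse ranked lists of keys: score(d) = sum_i w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, key in enumerate(ranking, start=1):
            scores[key] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up product_id values, best match first in each list.
vector_hits = ["p1", "p2", "p3"]
text_hits = ["p3", "p1", "p4"]

for product_id, fused_score in rrf([vector_hits, text_hits]):
    print(product_id, round(fused_score, 5))
```

Items that appear in both lists (`p1`, `p3`) outrank items found by only one search, which is the behavior that makes RRF useful for combining kNN and BM25 results without normalizing their raw scores.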
## Authentication

When `elasticsearch_user` and `elasticsearch_pass` are provided, the vector engine uses HTTP basic authentication. Prefer storing credentials in a [secret store](../secret-stores) and referencing them with `${secrets:...}`. TLS is enabled automatically for `https://` endpoints.

## Comparison with the Data Connector

| Use case | Use |
| -------- | --- |
| Query an existing Elasticsearch index (with or without `dense_vector`). | [Elasticsearch Data Connector](../data-connectors/elasticsearch). |
| Ingest data from another source and have Spice manage vectors in ES. | Elasticsearch Vector Engine (this page). |

Both paths surface `vector_search`, `text_search`, and `rrf`; pick the one that matches which system owns the data.

website/docs/components/vectors/index.md

Lines changed: 5 additions & 3 deletions
@@ -26,11 +26,13 @@ For the complete reference specification see [datasets](../../reference/spicepod

 Supported Vector engines:

-| Name | Description |
-| ------------------------- | -------------- |
-| [`s3_vectors`][s3vectors] | AWS S3 vectors |
+| Name | Description |
+| ------------------------------- | ----------------- |
+| [`s3_vectors`][s3vectors] | AWS S3 vectors |
+| [`elasticsearch`][elasticsearch] | Elasticsearch |

 [s3vectors]: /docs/components/vectors/s3_vectors.md
+[elasticsearch]: /docs/components/vectors/elasticsearch.md

 :::warning[Limitations]

website/docs/reference/spicepod/datasets.md

Lines changed: 1 addition & 0 deletions
@@ -973,6 +973,7 @@ Enable or disable vector storage, defaults to `true`.
 The vector engine to use. The following engines are supported:

 - [`s3_vectors`](../../components/vectors/s3_vectors) - Vectors are created and indexed into [Amazon S3 Vectors](https://aws.amazon.com/s3/features/vectors/).
+- [`elasticsearch`](../../components/vectors/elasticsearch) - Vectors are created and indexed into an [Elasticsearch](https://www.elastic.co/) cluster.

 ## `vectors.params`

website/docs/tags.yml

Lines changed: 4 additions & 0 deletions
@@ -170,6 +170,10 @@ dynamodb:
   label: 'DynamoDB'
   permalink: '/dynamodb'
   description: 'Amazon DynamoDB NoSQL database integration.'
+elasticsearch:
+  label: 'Elasticsearch'
+  permalink: '/elasticsearch'
+  description: 'Elasticsearch data connector and vector engine integration.'
 embeddings:
   label: 'Embeddings'
   permalink: '/embeddings'
