Commit 1163bb2

claudespice authored and lukekim committed
docs: Add Elasticsearch as a full-text search engine option
Document the new dataset-level `full_text_search` block that allows Elasticsearch to be used as the BM25 FTS engine, including connection parameters, ingestion tuning controls, YAML anchor reuse, and combining with the Elasticsearch vector engine. Also adds the new ingestion tuning parameters to the Elasticsearch vector engine page.
1 parent 3ebf55e commit 1163bb2

3 files changed

Lines changed: 183 additions & 1 deletion

website/docs/components/vectors/elasticsearch.md

Lines changed: 7 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -58,6 +58,13 @@ The Elasticsearch vector engine is available in the Spice [Enterprise edition](h
5858
| `elasticsearch_max_retries` | Optional. Maximum retry attempts for transient Elasticsearch errors (HTTP 429 / 5xx). Default: `3`. | `3` |
5959
| `elasticsearch_retry_initial_backoff` | Optional. Initial backoff duration between retries, in time unit format. Default: `200ms`. | `200ms` |
6060
| `elasticsearch_batch_write_rows` | Optional. Maximum rows per Elasticsearch `_bulk` request. Controls memory usage and payload size during writes. Default: `1000`. | `1000` |
61+
| `elasticsearch_index_settings` | Optional. JSON object passed as Elasticsearch index settings when creating the index. Existing indexes are not recreated. | `{"index":{"codec":"best_compression"}}` |
62+
| `elasticsearch_number_of_shards` | Optional. ES `number_of_shards` index setting, applied at index creation only. | `1` |
63+
| `elasticsearch_number_of_replicas` | Optional. ES `number_of_replicas` index setting, applied at index creation only. | `0` |
64+
| `elasticsearch_refresh_interval` | Optional. ES `refresh_interval` index setting, applied at index creation only. | `1s` |
65+
| `elasticsearch_bulk_load_refresh_interval` | Optional. Temporary `refresh_interval` during bulk writes, restored afterward. Set to `-1` to disable refresh during loading. | `-1` |
66+
| `elasticsearch_force_merge_after_write` | Optional. Run `_forcemerge` after full/append writes. Default: `false`. | `true` |
67+
| `elasticsearch_force_merge_segments` | Optional. Max segments for `_forcemerge`. Setting this also enables force merge. Default when force merge enabled: `1`. | `1` |
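
As a rough illustration of how the new tuning parameters might be combined for a bulk-load-heavy dataset, the sketch below places them under `vectors.params` next to the connection settings. The dataset, file path, and index name are placeholders, and the values are copied from the example column of the table above rather than being recommendations.

```yaml
datasets:
  - from: file:./articles.parquet        # placeholder source
    name: articles
    vectors:
      enabled: true
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-vectors   # placeholder index name
        # Applied at index creation only; existing indexes are not recreated.
        elasticsearch_number_of_shards: 1
        elasticsearch_number_of_replicas: 0
        elasticsearch_refresh_interval: 1s
        # Temporary refresh_interval during bulk writes, restored afterward.
        elasticsearch_bulk_load_refresh_interval: -1
        # Run _forcemerge after full/append writes, down to one segment.
        elasticsearch_force_merge_after_write: true
        elasticsearch_force_merge_segments: 1
```

Because the shard, replica, and refresh settings apply only when the index is created, changing them later for an index that already exists would presumably have no effect.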
6168

6269
:::warning[Not yet supported]
6370
The Elasticsearch vector engine does **not** currently support:

website/docs/features/search/full-text.md

Lines changed: 141 additions & 1 deletion
@@ -15,6 +15,17 @@ Spice provides full-text search functionality with BM25 scoring. This search met
1515

1616
Datasets can be augmented with a full-text search index that enables efficient search. Dataset columns are included in the full-text index based on the column configuration.
1717

18+
## Engines
19+
20+
Spice supports two full-text search engines:
21+
22+
| Engine | Description |
| --- | --- |
| **Tantivy** (default) | Built-in, in-process BM25 engine. No external dependencies. |
| **Elasticsearch** | Delegates BM25 indexing and search to an external Elasticsearch cluster. Useful when Elasticsearch is already part of the infrastructure or when its operational characteristics (sharding, replication, snapshots) are preferred. |
26+
27+
When no engine is specified, Tantivy is used automatically.
28+
1829
## Enabling Full-Text Search
1930

2031
To enable full-text search, configure your dataset columns within your dataset definition as follows:
@@ -38,7 +49,136 @@ datasets:
3849
enabled: true
3950
```
4051
41-
In this example, full-text search indexing is enabled on both the `title` and `body` columns. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
52+
In this example, full-text search indexing is enabled on both the `title` and `body` columns using the default Tantivy engine. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
53+
54+
## Using Elasticsearch as the FTS Engine
55+
56+
To use Elasticsearch instead of the built-in Tantivy engine, add a dataset-level `full_text_search` block with `engine: elasticsearch` and the connection parameters:
57+
58+
```yaml
datasets:
  - from: file:./articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: arrow
    full_text_search:
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_user: ${secrets:ES_USER}
        elasticsearch_pass: ${secrets:ES_PASS}
        elasticsearch_index: articles-fts
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true
          row_id:
            - id
```
84+
85+
The dataset-level `full_text_search` block selects the engine and provides connection parameters. Column-level `full_text_search.enabled` controls which columns are indexed.
86+
87+
:::note[Enterprise edition]
88+
The Elasticsearch full-text search engine is available in the Spice [Enterprise edition](https://docs.spice.ai/docs/enterprise/getting-started/distributions).
89+
:::
90+
91+
### Elasticsearch FTS Parameters
92+
93+
| Parameter | Description | Example |
| --- | --- | --- |
| `elasticsearch_endpoint` | Required. Elasticsearch cluster URL. | `http://localhost:9200` |
| `elasticsearch_user` | Optional. Username for HTTP basic authentication. | `${secrets:ES_USER}` |
| `elasticsearch_pass` | Optional. Password for HTTP basic authentication. | `${secrets:ES_PASS}` |
| `elasticsearch_index` | Optional. ES index name for FTS documents. Defaults to the dataset name. | `articles-fts` |
| `client_timeout` | Optional. Total HTTP request timeout. Default: `30s`. | `30s` |
| `connect_timeout` | Optional. HTTP connect timeout. Default: `10s`. | `10s` |
101+
102+
### Elasticsearch Ingestion Tuning
103+
104+
Optional parameters to control Elasticsearch index creation and write behavior:
105+
106+
| Parameter | Description | Default |
| --- | --- | --- |
| `number_of_shards` | ES `number_of_shards` index setting (applied at index creation). | ES default |
| `number_of_replicas` | ES `number_of_replicas` index setting (applied at index creation). | ES default |
| `refresh_interval` | ES `refresh_interval` index setting (applied at index creation). | ES default |
| `bulk_load_refresh_interval` | Temporary `refresh_interval` during bulk writes. Set to `-1` to disable refresh during loading. | Not set |
| `force_merge_after_write` | Run `_forcemerge` after full/append writes. | `false` |
| `force_merge_segments` | Max segments for `_forcemerge`. Setting this also enables force merge. | `1` (when force merge enabled) |
| `batch_write_rows` | Max rows per `_bulk` request. | `1000` |
| `index_settings` | JSON object passed as ES index settings at creation. | Not set |
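
A minimal sketch of the tuning parameters in use, assuming they sit in the same dataset-level `full_text_search.params` map as the connection parameters (the dataset reference describes `params` as holding both connection and tuning settings). Parameter names are written exactly as listed in the table above. The values are illustrative, not recommendations.

```yaml
datasets:
  - from: file:./articles.parquet
    name: articles
    full_text_search:
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-fts
        # Index settings, applied at index creation.
        number_of_shards: 1
        number_of_replicas: 0
        # Disable refresh while bulk loading.
        bulk_load_refresh_interval: -1
        # Max rows per _bulk request (1000 is the documented default).
        batch_write_rows: 1000
        # Setting a segment count also enables force merge,
        # so force_merge_after_write is not needed here.
        force_merge_segments: 1
    columns:
      - name: body
        full_text_search:
          enabled: true
          row_id:
            - id
```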
116+
117+
### YAML Anchor Reuse
118+
119+
When multiple datasets or columns share the same Elasticsearch connection, use YAML anchors to avoid repeating config:
120+
121+
```yaml
x-elasticsearch-fts: &elasticsearch_fts
  enabled: true
  engine: elasticsearch
  params:
    elasticsearch_endpoint: http://localhost:9200
    elasticsearch_user: ${secrets:ES_USER}
    elasticsearch_pass: ${secrets:ES_PASS}

datasets:
  - from: file:./articles.parquet
    name: articles
    acceleration:
      enabled: true
    full_text_search:
      <<: *elasticsearch_fts
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-fts
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
```
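
Under standard YAML merge-key semantics, `<<` merges only top-level keys, so a `params` map written next to the merge replaces the anchor's `params` map entirely. In the example above, the dataset would therefore end up without `elasticsearch_user` and `elasticsearch_pass`. One possible way to keep the shared credentials is to anchor the `params` map itself, as sketched below. The `x-elasticsearch-params` key and the `elasticsearch_params` anchor name are hypothetical, and the sketch assumes Spice applies plain YAML merging with no deeper merge of its own.

```yaml
# Hypothetical anchor holding only the shared connection parameters.
x-elasticsearch-params: &elasticsearch_params
  elasticsearch_endpoint: http://localhost:9200
  elasticsearch_user: ${secrets:ES_USER}
  elasticsearch_pass: ${secrets:ES_PASS}

datasets:
  - from: file:./articles.parquet
    name: articles
    acceleration:
      enabled: true
    full_text_search:
      enabled: true
      engine: elasticsearch
      params:
        # Merge the shared keys, then add the per-dataset index name.
        <<: *elasticsearch_params
        elasticsearch_index: articles-fts
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
```

Merging at the `params` level means each dataset only has to state what differs, here the index name.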
147+
148+
### Combining with the Elasticsearch Vector Engine
149+
150+
Elasticsearch can serve as both the vector engine and the FTS engine for the same dataset. Configure `vectors` and `full_text_search` independently:
151+
152+
```yaml
datasets:
  - from: file:./articles.parquet
    name: articles
    acceleration:
      enabled: true
    vectors:
      enabled: true
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-vectors
    full_text_search:
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-fts
    columns:
      - name: body
        embeddings:
          - from: my_embedding_model
            row_id:
              - id
        full_text_search:
          enabled: true
          row_id:
            - id
```
180+
181+
Use [`rrf()`](../../reference/sql/search#reciprocal-rank-fusion-rrf) to combine vector and full-text results with hybrid search.
42182

43183
## Searching with the HTTP API
44184

website/docs/reference/spicepod/datasets.md

Lines changed: 35 additions & 0 deletions
@@ -1026,6 +1026,41 @@ The `metadata` field serves two purposes:
10261026

10271027
If a data file already contains a column with the same name as a metadata column, the metadata column is not added.
10281028

1029+
## `full_text_search` {#dataset-full-text-search}
1030+
1031+
Optional. Dataset-level full-text search engine configuration. When absent, the built-in Tantivy in-process engine is used (controlled by column-level [`columns[*].full_text_search`](#columns-search-full-text) settings).
1032+
1033+
## `full_text_search.enabled`
1034+
1035+
Optional. Enables or disables the dataset-level FTS engine. Defaults to `true`.
1036+
1037+
## `full_text_search.engine`
1038+
1039+
The dataset-level full-text search engine. Currently the only accepted value is `elasticsearch`. When `engine` is absent, the built-in Tantivy engine is used.
1040+
1041+
## `full_text_search.params`
1042+
1043+
Optional. Engine-specific connection and tuning parameters. See [Full-Text Search — Elasticsearch](../../features/search/full-text#using-elasticsearch-as-the-fts-engine) for available parameters.
1044+
1045+
```yaml
datasets:
  - from: file:./articles.parquet
    name: articles
    acceleration:
      enabled: true
    full_text_search:
      engine: elasticsearch
      params:
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_index: articles-fts
    columns:
      - name: body
        full_text_search:
          enabled: true
          row_id:
            - id
```
1063+
10291064
## `vectors`
10301065

10311066
## `vectors.enabled`
