docs: Add Elasticsearch as a full-text search engine option

claudespice · claudespice · commit 6234d9f5b7f5 · 2026-05-07T03:15:41.000-07:00
Document the new dataset-level `full_text_search` block that allows
Elasticsearch to be used as the BM25 FTS engine, including connection
parameters, ingestion tuning controls, YAML anchor reuse, and
combining with the Elasticsearch vector engine. Also add the new
ingestion tuning parameters to the Elasticsearch vector engine page.
diff --git a/website/docs/components/vectors/elasticsearch.md b/website/docs/components/vectors/elasticsearch.md
@@ -58,6 +58,13 @@ The Elasticsearch vector engine is available in the Spice [Enterprise edition](h
 | `elasticsearch_max_retries` | Optional. Maximum retry attempts for transient Elasticsearch errors (HTTP 429 / 5xx). Default: `3`.              | `3`                           |
 | `elasticsearch_retry_initial_backoff` | Optional. Initial backoff duration between retries, in time unit format. Default: `200ms`.                | `200ms`                       |
 | `elasticsearch_batch_write_rows` | Optional. Maximum rows per Elasticsearch `_bulk` request. Controls memory usage and payload size during writes. Default: `1000`. | `1000`       |
+| `elasticsearch_index_settings` | Optional. JSON object passed as Elasticsearch index settings when creating the index. Existing indexes are not recreated. | `{"index":{"codec":"best_compression"}}` |
+| `elasticsearch_number_of_shards` | Optional. ES `number_of_shards` index setting, applied at index creation only. | `1` |
+| `elasticsearch_number_of_replicas` | Optional. ES `number_of_replicas` index setting, applied at index creation only. | `0` |
+| `elasticsearch_refresh_interval` | Optional. ES `refresh_interval` index setting, applied at index creation only. | `1s` |
+| `elasticsearch_bulk_load_refresh_interval` | Optional. Temporary `refresh_interval` during bulk writes, restored afterward. Set to `-1` to disable refresh during loading. | `-1` |
+| `elasticsearch_force_merge_after_write` | Optional. Run `_forcemerge` after full/append writes. Default: `false`. | `true` |
+| `elasticsearch_force_merge_segments` | Optional. Max segments for `_forcemerge`. Setting this also enables force merge. Default when force merge enabled: `1`. | `1` |
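+
+As a rough illustration of how the tuning parameters above might be combined for a one-time bulk load, consider the sketch below. The dataset, index name, and values are illustrative assumptions rather than recommendations, and writing the values as unquoted scalars is also an assumption.
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    vectors:
+      enabled: true
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_index: articles-vectors # hypothetical index name
+        # Applied only when the index is created.
+        elasticsearch_number_of_shards: 1
+        elasticsearch_number_of_replicas: 0
+        # Disable refresh while bulk writes run; the interval is restored afterward.
+        elasticsearch_bulk_load_refresh_interval: -1
+        # Run _forcemerge after full/append writes, down to one segment.
+        elasticsearch_force_merge_after_write: true
+        elasticsearch_force_merge_segments: 1
+```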
 
 :::warning[Not yet supported]
 The Elasticsearch vector engine does **not** currently support:
diff --git a/website/docs/features/search/full-text.md b/website/docs/features/search/full-text.md
@@ -15,6 +15,17 @@ Spice provides full-text search functionality with BM25 scoring. This search met
 
 Datasets can be augmented with a full-text search index that enables efficient search. Dataset columns are included in the full-text index based on the column configuration.
 
+## Engines
+
+Spice supports two full-text search engines:
+
+| Engine | Description |
+| --- | --- |
+| **Tantivy** (default) | Built-in, in-process BM25 engine. No external dependencies. |
+| **Elasticsearch** | Delegates BM25 indexing and search to an external Elasticsearch cluster. Useful when Elasticsearch is already part of the infrastructure or when its operational characteristics (sharding, replication, snapshots) are preferred. |
+
+When no engine is specified, Tantivy is used automatically.
+
 ## Enabling Full-Text Search
 
 To enable full-text search, configure your dataset columns within your dataset definition as follows:
@@ -38,7 +49,136 @@ datasets:
           enabled: true
 ```
 
-In this example, full-text search indexing is enabled on both the `title` and `body` columns. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
+In this example, full-text search indexing is enabled on both the `title` and `body` columns using the default Tantivy engine. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
+
+## Using Elasticsearch as the FTS Engine
+
+To use Elasticsearch instead of the built-in Tantivy engine, add a dataset-level `full_text_search` block with `engine: elasticsearch` and the connection parameters:
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    acceleration:
+      enabled: true
+      engine: arrow
+    full_text_search:
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_user: ${secrets:ES_USER}
+        elasticsearch_pass: ${secrets:ES_PASS}
+        elasticsearch_index: articles-fts
+    columns:
+      - name: title
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+      - name: body
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+
+The dataset-level `full_text_search` block selects the engine and provides connection parameters. Column-level `full_text_search.enabled` controls which columns are indexed.
+
+:::note[Enterprise edition]
+The Elasticsearch full-text search engine is available in the Spice [Enterprise edition](https://docs.spice.ai/docs/enterprise/getting-started/distributions).
+:::
+
+### Elasticsearch FTS Parameters
+
+| Parameter | Description | Example |
+| --- | --- | --- |
+| `elasticsearch_endpoint` | Required. Elasticsearch cluster URL. | `http://localhost:9200` |
+| `elasticsearch_user` | Optional. Username for HTTP basic authentication. | `${secrets:ES_USER}` |
+| `elasticsearch_pass` | Optional. Password for HTTP basic authentication. | `${secrets:ES_PASS}` |
+| `elasticsearch_index` | Optional. ES index name for FTS documents. Defaults to the dataset name. | `articles-fts` |
+| `client_timeout` | Optional. Total HTTP request timeout. Default: `30s`. | `30s` |
+| `connect_timeout` | Optional. HTTP connect timeout. Default: `10s`. | `10s` |
+
+### Elasticsearch Ingestion Tuning
+
+Optional parameters to control Elasticsearch index creation and write behavior:
+
+| Parameter | Description | Default |
+| --- | --- | --- |
+| `number_of_shards` | ES `number_of_shards` index setting (applied at index creation). | ES default |
+| `number_of_replicas` | ES `number_of_replicas` index setting (applied at index creation). | ES default |
+| `refresh_interval` | ES `refresh_interval` index setting (applied at index creation). | ES default |
+| `bulk_load_refresh_interval` | Temporary `refresh_interval` during bulk writes. Set to `-1` to disable refresh during loading. | Not set |
+| `force_merge_after_write` | Run `_forcemerge` after full/append writes. | `false` |
+| `force_merge_segments` | Max segments for `_forcemerge`. Setting this also enables force merge. | `1` (when force merge enabled) |
+| `batch_write_rows` | Max rows per `_bulk` request. | `1000` |
+| `index_settings` | JSON object passed as ES index settings at creation. | Not set |
+
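+The following is a minimal sketch of an FTS configuration tuned for bulk loading. Parameter names are copied from the table above as written; the dataset, index name, and values are illustrative assumptions, as is the use of unquoted scalars.
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    full_text_search:
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_index: articles-fts
+        # Applied only when the index is created.
+        number_of_replicas: 0
+        # Disable refresh during bulk writes.
+        bulk_load_refresh_interval: -1
+        # Run _forcemerge after full/append writes.
+        force_merge_after_write: true
+        # Cap the number of rows per _bulk request.
+        batch_write_rows: 1000
+    columns:
+      - name: body
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+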
+### YAML Anchor Reuse
+
+When multiple datasets or columns share the same Elasticsearch connection, use YAML anchors to avoid repeating config:
+
+```yaml
+x-elasticsearch-fts: &elasticsearch_fts
+  enabled: true
+  engine: elasticsearch
+  # Anchor the params map separately: YAML merge keys (`<<`) are shallow, so a
+  # dataset that declares its own `params` replaces this map unless it merges it.
+  params: &elasticsearch_fts_params
+    elasticsearch_endpoint: http://localhost:9200
+    elasticsearch_user: ${secrets:ES_USER}
+    elasticsearch_pass: ${secrets:ES_PASS}
+
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    acceleration:
+      enabled: true
+    full_text_search:
+      <<: *elasticsearch_fts
+      params:
+        <<: *elasticsearch_fts_params
+        elasticsearch_index: articles-fts
+    columns:
+      - name: title
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+
+### Combining with the Elasticsearch Vector Engine
+
+Elasticsearch can serve as both the vector engine and the FTS engine for the same dataset. Configure `vectors` and `full_text_search` independently:
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    acceleration:
+      enabled: true
+    vectors:
+      enabled: true
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_index: articles-vectors
+    full_text_search:
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_index: articles-fts
+    columns:
+      - name: body
+        embeddings:
+          - from: my_embedding_model
+            row_id:
+              - id
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+
+Use [`rrf()`](../../reference/sql/search#reciprocal-rank-fusion-rrf) to combine vector and full-text results for hybrid search.
 
 ## Searching with the HTTP API
 
diff --git a/website/docs/reference/spicepod/datasets.md b/website/docs/reference/spicepod/datasets.md
@@ -1026,6 +1026,41 @@ The `metadata` field serves two purposes:
 
     If a data file already contains a column with the same name as a metadata column, the metadata column is not added.
 
+## `full_text_search` {#dataset-full-text-search}
+
+Optional. Dataset-level full-text search engine configuration. When absent, the built-in Tantivy in-process engine is used (controlled by column-level [`columns[*].full_text_search`](#columns-search-full-text) settings).
+
+## `full_text_search.enabled`
+
+Optional. Enables or disables the dataset-level FTS engine. Defaults to `true`.
+
+## `full_text_search.engine`
+
+The full-text search engine to use. The only accepted value is currently `elasticsearch`. When `engine` is omitted, the built-in Tantivy engine is used.
+
+## `full_text_search.params`
+
+Optional. Engine-specific connection and tuning parameters. See [Full-Text Search — Elasticsearch](../../features/search/full-text#using-elasticsearch-as-the-fts-engine) for available parameters.
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    acceleration:
+      enabled: true
+    full_text_search:
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_index: articles-fts
+    columns:
+      - name: body
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+
 ## `vectors`
 
 ## `vectors.enabled`