docs: Add Elasticsearch as a full-text search engine option
Document the new dataset-level `full_text_search` block that allows
Elasticsearch to be used as the BM25 FTS engine, including connection
parameters, ingestion tuning controls, YAML anchor reuse, and
combining with the Elasticsearch vector engine. Also adds the new
ingestion tuning parameters to the Elasticsearch vector engine page.
website/docs/components/vectors/elasticsearch.md
Lines changed: 7 additions & 0 deletions
@@ -58,6 +58,13 @@ The Elasticsearch vector engine is available in the Spice [Enterprise edition](h
 | `elasticsearch_max_retries` | Optional. Maximum retry attempts for transient Elasticsearch errors (HTTP 429 / 5xx). Default: `3`. | `3` |
 | `elasticsearch_retry_initial_backoff` | Optional. Initial backoff duration between retries, in time unit format. Default: `200ms`. | `200ms` |
 | `elasticsearch_batch_write_rows` | Optional. Maximum rows per Elasticsearch `_bulk` request. Controls memory usage and payload size during writes. Default: `1000`. | `1000` |
+| `elasticsearch_index_settings` | Optional. JSON object passed as Elasticsearch index settings when creating the index. Existing indexes are not recreated. | `{"index":{"codec":"best_compression"}}` |
+| `elasticsearch_number_of_shards` | Optional. ES `number_of_shards` index setting, applied at index creation only. | `1` |
+| `elasticsearch_number_of_replicas` | Optional. ES `number_of_replicas` index setting, applied at index creation only. | `0` |
+| `elasticsearch_refresh_interval` | Optional. ES `refresh_interval` index setting, applied at index creation only. | `1s` |
+| `elasticsearch_bulk_load_refresh_interval` | Optional. Temporary `refresh_interval` during bulk writes, restored afterward. Set to `-1` to disable refresh during loading. | `-1` |
+| `elasticsearch_force_merge_after_write` | Optional. Run `_forcemerge` after full/append writes. Default: `false`. | `true` |
+| `elasticsearch_force_merge_segments` | Optional. Max segments for `_forcemerge`. Setting this also enables force merge. Default when force merge enabled: `1`. | `1` |
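For review context, here is a minimal sketch of how the added tuning parameters might be combined for a bulk load. Only a `params` map is shown, because the block that encloses it is not visible in this hunk. The values are the example values from the table rows above, not recommendations, and quoting `-1` is a defensive assumption rather than a documented requirement.

```yaml
# Hypothetical fragment: parameter names are taken from the table rows in this
# hunk; the enclosing engine block is not shown in the diff and is omitted.
params:
  elasticsearch_number_of_shards: 1                # applied at index creation only
  elasticsearch_number_of_replicas: 0              # applied at index creation only
  elasticsearch_refresh_interval: 1s               # applied at index creation only
  elasticsearch_bulk_load_refresh_interval: "-1"   # no refresh during bulk writes; restored afterward
  elasticsearch_batch_write_rows: 1000             # max rows per _bulk request (the documented default)
  elasticsearch_force_merge_after_write: true      # run _forcemerge after full/append writes
  elasticsearch_force_merge_segments: 1            # max segments; setting it also enables force merge
```

According to the row descriptions, the shard, replica, and refresh-interval settings are applied only when the index is created, so changing them should not be expected to alter an index that already exists.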
 
 :::warning[Not yet supported]
 The Elasticsearch vector engine does **not** currently support:
website/docs/features/search/full-text.md
Lines changed: 141 additions & 1 deletion
@@ -15,6 +15,17 @@ Spice provides full-text search functionality with BM25 scoring. This search met
 
 Datasets can be augmented with a full-text search index that enables efficient search. Dataset columns are included in the full-text index based on the column configuration.
+|**Elasticsearch**| Delegates BM25 indexing and search to an external Elasticsearch cluster. Useful when Elasticsearch is already part of the infrastructure or when its operational characteristics (sharding, replication, snapshots) are preferred. |
+
+When no engine is specified, Tantivy is used automatically.
+
 ## Enabling Full-Text Search
 
 To enable full-text search, configure your dataset columns within your dataset definition as follows:
@@ -38,7 +49,136 @@ datasets:
           enabled: true
 ```
 
-In this example, full-text search indexing is enabled on both the `title` and `body` columns. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
+In this example, full-text search indexing is enabled on both the `title` and `body` columns using the default Tantivy engine. The `row_id` specifies a unique identifier for referencing search results and retrieving additional data.
+
+## Using Elasticsearch as the FTS Engine
+
+To use Elasticsearch instead of the built-in Tantivy engine, add a dataset-level `full_text_search` block with `engine: elasticsearch` and the connection parameters:
+
+```yaml
+datasets:
+  - from: file:./articles.parquet
+    name: articles
+    acceleration:
+      enabled: true
+      engine: arrow
+    full_text_search:
+      engine: elasticsearch
+      params:
+        elasticsearch_endpoint: http://localhost:9200
+        elasticsearch_user: ${secrets:ES_USER}
+        elasticsearch_pass: ${secrets:ES_PASS}
+        elasticsearch_index: articles-fts
+    columns:
+      - name: title
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+      - name: body
+        full_text_search:
+          enabled: true
+          row_id:
+            - id
+```
+
+The dataset-level `full_text_search` block selects the engine and provides connection parameters. Column-level `full_text_search.enabled` controls which columns are indexed.
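The commit message mentions YAML anchor reuse and combining this block with the Elasticsearch vector engine, but that part of the page is not visible in this view. The following is a minimal sketch of the idea, not the documented configuration. It assumes a dataset-level `vectors` block that accepts the same `engine` and `params` keys, which the lines shown here do not confirm. The anchor name `es_conn` is made up for illustration.

```yaml
datasets:
  - from: file:./articles.parquet
    name: articles
    full_text_search:
      engine: elasticsearch
      # `&es_conn` is a standard YAML anchor that names this params map.
      params: &es_conn
        elasticsearch_endpoint: http://localhost:9200
        elasticsearch_user: ${secrets:ES_USER}
        elasticsearch_pass: ${secrets:ES_PASS}
        elasticsearch_index: articles-fts
    # ASSUMPTION: the `vectors` block layout below is not shown in this diff.
    vectors:
      engine: elasticsearch
      # `*es_conn` is a YAML alias that reuses the map defined above.
      params: *es_conn
    # Column definitions are omitted; they would follow the earlier example.
```

A plain alias reuses the whole map, including `elasticsearch_index`. If the two engines need separate indexes, the index name would have to be set separately instead of being shared through the anchor.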
+
+:::note[Enterprise edition]
+The Elasticsearch full-text search engine is available in the Spice [Enterprise edition](https://docs.spice.ai/docs/enterprise/getting-started/distributions).
website/docs/reference/spicepod/datasets.md
Lines changed: 35 additions & 0 deletions
@@ -1026,6 +1026,41 @@ The `metadata` field serves two purposes:
 
 If a data file already contains a column with the same name as a metadata column, the metadata column is not added.
 
+## `full_text_search` {#dataset-full-text-search}
+
+Optional. Dataset-level full-text search engine configuration. When absent, the built-in Tantivy in-process engine is used (controlled by column-level [`columns[*].full_text_search`](#columns-search-full-text) settings).
+
+## `full_text_search.enabled`
+
+Enable or disable the dataset-level FTS engine, defaults to `true`.
+
+## `full_text_search.engine`
+
+The full-text search engine to use. Currently only `elasticsearch` is supported. When absent, the built-in Tantivy engine is used.
+
+## `full_text_search.params`
+
+Optional. Engine-specific connection and tuning parameters. See [Full-Text Search — Elasticsearch](../../features/search/full-text#using-elasticsearch-as-the-fts-engine) for available parameters.