| 1 | +--- |
| 2 | +title: 'Elasticsearch Vector Engine' |
| 3 | +sidebar_label: 'Elasticsearch' |
| 4 | +description: 'Use Elasticsearch as a vector engine in Spice for kNN vector search, full-text search, and hybrid search.' |
| 5 | +sidebar_position: 2 |
| 6 | +pagination_next: null |
| 7 | +--- |
| 8 | + |
| 9 | +Elasticsearch can be used as a vector engine in Spice to store embeddings and execute kNN similarity search, full-text search (BM25), and hybrid search (RRF) natively in the Elasticsearch cluster. This is useful when Elasticsearch is already the system of record for a workload, or when the operational characteristics of a managed Elasticsearch cluster (replication, sharding, snapshots) are preferred over a dedicated vector store. |
| 10 | + |
| 11 | +Unlike the [Elasticsearch Data Connector](../data-connectors/elasticsearch), which reads an existing Elasticsearch index as a Spice dataset, the Elasticsearch vector engine accepts data from any Spice data connector, generates embeddings using the configured embedding model, and writes vectors (and source fields) to an Elasticsearch index that Spice manages. |
| 12 | + |
| 13 | +```yaml |
| 14 | +datasets: |
| 15 | + - from: file:products.parquet |
| 16 | + name: products |
| 17 | + acceleration: |
| 18 | + enabled: true |
| 19 | + vectors: |
| 20 | + enabled: true |
| 21 | + engine: elasticsearch |
| 22 | + params: |
| 23 | + elasticsearch_endpoint: https://localhost:9200 |
| 24 | + elasticsearch_user: ${secrets:es_user} |
| 25 | + elasticsearch_pass: ${secrets:es_pass} |
| 26 | + elasticsearch_index: products-embeddings |
| 27 | + columns: |
| 28 | + - name: description |
| 29 | + embeddings: |
| 30 | + - from: bedrock_titan |
| 31 | + |
| 32 | +embeddings: |
| 33 | + - from: bedrock:amazon.titan-embed-text-v2:0 |
| 34 | + name: bedrock_titan |
| 35 | + params: |
| 36 | + aws_region: us-east-2 |
| 37 | + dimensions: '1024' |
| 38 | +``` |
| 39 | + |
| 40 | +:::note[Build requirement] |
| 41 | +The Elasticsearch vector engine requires the `elasticsearch` Cargo feature, which is not enabled in the default distribution. Build Spice from source with `cargo build --release --features elasticsearch`, or use a distribution that enables it. |
| 42 | +::: |
| 43 | + |
| 44 | +## Parameters |
| 45 | + |
| 46 | +| Parameter | Description | Example Value | |
| 47 | +| ------------------------ | -------------------------------------------------------------------------------------------------------------------- | ----------------------------- | |
| 48 | +| `elasticsearch_endpoint` | Required. Cluster URL. | `https://localhost:9200` | |
| 49 | +| `elasticsearch_user` | Optional. Username for HTTP basic authentication. | `${secrets:es_user}` | |
| 50 | +| `elasticsearch_pass` | Optional. Password for HTTP basic authentication. | `${secrets:es_pass}` | |
| 51 | +| `elasticsearch_index` | Optional. Index used to store vectors. Defaults to a sanitized `{dataset}-{column}-{model}` value. | `products-embeddings` | |
| 52 | +| `elasticsearch_vector_field` | Optional. Name of the `dense_vector` field in Elasticsearch. Defaults to `{column}_embedding`. | `description_embedding` | |
| 53 | + |
| 54 | +## Overview |
| 55 | + |
| 56 | +When configured as a vector engine, Spice: |
| 57 | + |
| 58 | +1. Reads data from the underlying connector (for example, Parquet on disk or a federated SQL source). |
| 59 | +2. Computes embeddings on the configured column using the attached embedding model. |
| 60 | +3. Writes vectors and source fields to the configured Elasticsearch index, provisioning the index mapping when needed (`dense_vector` of the correct dimension plus text fields for full-text search). |
| 61 | +4. At query time, routes `vector_search`, `text_search`, and `rrf` calls to the Elasticsearch index, where they run as native kNN and BM25 queries. |
| 62 | + |
| 63 | +Source fields on the dataset are indexed as `text` in Elasticsearch so they can be used as full-text search targets. Primary key columns are indexed as `keyword` and included in kNN results so that matches can be joined back to the Spice base table when additional columns are requested. |
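
The field typing described above can be sketched as an Elasticsearch index mapping. This is a minimal illustration, not the mapping Spice actually provisions: `build_index_mapping` is a hypothetical helper, the field names are taken from the examples on this page, and any additional mapping options Spice may set (similarity metric, index options) are not shown.

```python
import json


def build_index_mapping(vector_field, dims, text_fields, key_fields):
    """Return an index-creation body with one dense_vector field,
    `text` fields for full-text search, and `keyword` fields for keys.

    Hypothetical helper for illustration only.
    """
    if dims <= 0:
        raise ValueError("dims must be a positive integer")

    properties = {vector_field: {"type": "dense_vector", "dims": dims}}
    for name in text_fields:
        properties[name] = {"type": "text"}
    # Key fields are added last so a primary key column is mapped as `keyword`.
    for name in key_fields:
        properties[name] = {"type": "keyword"}
    return {"mappings": {"properties": properties}}


mapping = build_index_mapping(
    vector_field="description_embedding",
    dims=1024,
    text_fields=["name", "description"],
    key_fields=["product_id"],
)
print(json.dumps(mapping, indent=2))
```

Mapping the primary key as `keyword` keeps it as an exact, untokenized value, which is what a join back to the base table needs.
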
| 64 | + |
| 65 | +:::warning[Limitations] |
| 66 | + |
| 67 | +- A dataset or view must be accelerated (`datasets[].acceleration.enabled: true`) so that the vector engine receives the data to ingest. See [`acceleration.enabled`](../../reference/spicepod/datasets#accelerationenabled). |
| 68 | +- The dataset must have a resolvable primary key, either via the underlying schema or an explicit [`row_id`](../../reference/spicepod/datasets#columnsembeddingsrow_id). |
| 69 | +- Elasticsearch kNN search is approximate: results are very likely, but not guaranteed, to be the exact nearest neighbors. |
| 70 | + |
| 71 | +::: |
| 72 | + |
| 73 | +## Configuration |
| 74 | + |
| 75 | +### Embedding Models |
| 76 | + |
| 77 | +Any embedding model supported by Spice can be used to produce the vectors written to Elasticsearch, including local models via [Hugging Face](../embeddings/huggingface), hosted models via [OpenAI](../embeddings/openai), [Bedrock](../embeddings/bedrock), and others. The vector dimension is inferred from the embedding model and used to provision the Elasticsearch `dense_vector` field. |
| 78 | + |
| 79 | +```yaml |
| 80 | +embeddings: |
| 81 | + - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2 |
| 82 | + name: local_embedding_model |
| 83 | +``` |
| 84 | + |
| 85 | +### Primary Keys |
| 86 | + |
| 87 | +Spice requires a primary key to round-trip matches between Elasticsearch and the base dataset. If the source dataset does not carry primary key metadata, specify it on the column embedding: |
| 88 | + |
| 89 | +```yaml |
| 90 | +columns: |
| 91 | + - name: description |
| 92 | + embeddings: |
| 93 | + - from: local_embedding_model |
| 94 | + row_id: product_id |
| 95 | +``` |
| 96 | + |
| 97 | +### Custom Index and Vector Field Names |
| 98 | + |
| 99 | +By default, the index name is a sanitized `{dataset}-{column}-{model}` and the vector field is `{column}_embedding`. Override either one with `elasticsearch_index` or `elasticsearch_vector_field`: |
| 100 | + |
| 101 | +```yaml |
| 102 | +vectors: |
| 103 | + enabled: true |
| 104 | + engine: elasticsearch |
| 105 | + params: |
| 106 | + elasticsearch_endpoint: https://localhost:9200 |
| 107 | + elasticsearch_index: products-vectors-v2 |
| 108 | + elasticsearch_vector_field: desc_vec |
| 109 | +``` |
| 110 | + |
| 111 | +## Querying |
| 112 | + |
| 113 | +Vector, full-text, and hybrid search use the standard Spice UDTFs. When the dataset is backed by the Elasticsearch vector engine, these UDTFs compile to native Elasticsearch queries rather than local computation. |
| 114 | + |
| 115 | +### Vector Search |
| 116 | + |
| 117 | +```sql |
| 118 | +SELECT product_id, name, score |
| 119 | +FROM vector_search(products, 'wireless noise cancelling headphones') |
| 120 | +ORDER BY score DESC |
| 121 | +LIMIT 10; |
| 122 | +``` |
| 123 | + |
| 124 | +The query text is embedded with the configured embedding model and sent to Elasticsearch as a kNN query. By default the number of candidates considered by Elasticsearch is twice the requested `k`. |
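
The candidate sizing described above can be sketched as the body of an Elasticsearch kNN search request. This is a hedged sketch, not the request Spice actually sends: `build_knn_query` is a hypothetical helper, and it assumes the `knn` search option of Elasticsearch 8.x with its `field`, `query_vector`, `k`, and `num_candidates` keys.

```python
import json


def build_knn_query(field, query_vector, k, candidate_multiplier=2):
    """Build a `_search` body that uses the top-level `knn` option.

    Hypothetical helper; the multiplier of 2 mirrors the default
    described on this page (candidates = 2 * k).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            # Candidates examined before the top k are returned.
            "num_candidates": k * candidate_multiplier,
        },
        "size": k,
    }


body = build_knn_query("description_embedding", [0.1, 0.2, 0.3], k=10)
print(json.dumps(body, indent=2))
```

A larger candidate pool generally improves recall at the cost of latency, which is the usual trade-off when tuning approximate kNN.
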
| 125 | + |
| 126 | +### Full-Text Search |
| 127 | + |
| 128 | +Any `Utf8`/`LargeUtf8` column on the dataset is available as a full-text search target: |
| 129 | + |
| 130 | +```sql |
| 131 | +SELECT product_id, name, score |
| 132 | +FROM text_search(products, 'bluetooth waterproof', description) |
| 133 | +ORDER BY score DESC |
| 134 | +LIMIT 10; |
| 135 | +``` |
| 136 | + |
| 137 | +### Hybrid Search (RRF) |
| 138 | + |
| 139 | +Combine vector and full-text results with [Reciprocal Rank Fusion](../../reference/sql/search#reciprocal-rank-fusion-rrf): |
| 140 | + |
| 141 | +```sql |
| 142 | +SELECT product_id, name, fused_score |
| 143 | +FROM rrf( |
| 144 | + vector_search(products, 'wireless noise cancelling headphones'), |
| 145 | + text_search(products, 'bluetooth waterproof', description), |
| 146 | + join_key => 'product_id' |
| 147 | +) |
| 148 | +ORDER BY fused_score DESC |
| 149 | +LIMIT 10; |
| 150 | +``` |
| 151 | + |
| 152 | +Advanced RRF options — per-query `rank_weight`, recency decay, and custom smoothing `k` — work identically regardless of the underlying vector engine. See [RRF](../../reference/sql/search#reciprocal-rank-fusion-rrf) for the full reference. |
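
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of keys. It is an illustration of the general technique, not Spice's implementation: `rrf_fuse` is a hypothetical function, the smoothing constant `k=60` is the value commonly used in RRF descriptions rather than a documented Spice default, and applying a per-list weight as a multiplier is an assumption about how `rank_weight` behaves.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60.0, weights=None):
    """Fuse ranked lists of keys: score(key) = sum(weight / (k + rank)).

    Ranks are 1-based. Hypothetical helper for illustration.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("one weight is required per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, key in enumerate(ranking, start=1):
            scores[key] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["p1", "p2", "p3"]  # product_id values ranked by kNN score
text_hits = ["p2", "p4", "p1"]    # product_id values ranked by BM25 score
fused = rrf_fuse([vector_hits, text_hits])
for product_id, fused_score in fused:
    print(product_id, round(fused_score, 6))
```

Because RRF uses only ranks, the kNN and BM25 scores never need to be normalized against each other; a key that appears in both lists simply accumulates two terms.
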
| 153 | + |
| 154 | +## Authentication |
| 155 | + |
| 156 | +When `elasticsearch_user` and `elasticsearch_pass` are provided, the vector engine uses HTTP basic authentication. Prefer storing credentials in a [secret store](../secret-stores) and referencing them with `${secrets:...}`. TLS is enabled automatically for `https://` endpoints. |
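
For reference, HTTP basic authentication encodes the username and password into a single `Authorization` header. The sketch below shows that encoding only; `basic_auth_header` is a hypothetical helper, not a Spice API, and the credentials are placeholders.

```python
import base64


def basic_auth_header(user, password):
    """Return the HTTP basic authentication header for the given credentials.

    Hypothetical helper for illustration.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header("es_user", "es_pass")  # placeholder credentials
print(header)
```

Base64 is an encoding, not encryption, so basic authentication should only be sent over an `https://` endpoint.
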
| 157 | + |
| 158 | +## Comparison with the Data Connector |
| 159 | + |
| 160 | +| Use case | Use | |
| 161 | +| ------------------------------------------------------------------------ | ------------------------------------------------------------------------- | |
| 162 | +| Query an existing Elasticsearch index (with or without `dense_vector`). | [Elasticsearch Data Connector](../data-connectors/elasticsearch). | |
| 163 | +| Ingest data from another source and have Spice manage vectors in ES. | Elasticsearch Vector Engine (this page). | |
| 164 | + |
| 165 | +Both paths surface `vector_search`, `text_search`, and `rrf`; pick the one that matches which system owns the data. |