Commit 21b1f40

fix: Correct Elasticsearch docs for 10k row cap, object type mapping, and timeout (#1662)
1 parent 03b2230 commit 21b1f40

2 files changed

Lines changed: 5 additions & 3 deletions

website/docs/components/data-connectors/elasticsearch/deployment.md

Lines changed: 2 additions & 2 deletions
@@ -50,14 +50,14 @@ Retry tuning is exposed only on the [Elasticsearch Vector Engine](../../vectors/
 | Setting | Default | Behavior |
 | --------------- | ------- | ---------------------------------------------------------------------------- |
 | Connect timeout | `10s` | Maximum time to establish a TCP/TLS connection to the cluster. |
-| Request timeout | `30s` | Maximum time for the entire request/response cycle, including retries. |
+| Request timeout | `30s` | Maximum time for each individual HTTP request. |

 Long-running search responses (very large `LIMIT`, deep pagination, or expensive aggregations) may exceed the default request timeout. Either narrow the query, accelerate the dataset, or use the [vector engine](../../vectors/elasticsearch) `client_timeout` parameter when running the workload through the embedding-write path.

 ## Capacity & Sizing

 - **Throughput**: Bounded by the Elasticsearch cluster's request handling and (for kNN) HNSW search cost. Plan refresh intervals and concurrent query load to stay within the cluster's tested capacity.
-- **Result size**: Each `_search` request returns up to `size` hits. The connector translates `LIMIT` to `size`; very large limits incur higher cluster memory and network cost.
+- **Result size**: Each `_search` request returns up to `size` hits, hard-capped at **10,000** (the Elasticsearch default `index.max_result_window`). The connector translates `LIMIT` to `size` but clamps the value to 10,000; queries without `LIMIT` also default to 10,000. For full-index access, accelerate the dataset into a local engine.
 - **Mapping fetches**: At dataset registration the connector fetches the index mapping once via `GET /<index>/_mapping`. Mapping changes after registration are not picked up until the runtime restarts.
 - **Pagination**: Spice does not currently use Elasticsearch's `search_after` or scroll APIs from the data connector. For full-table scans of very large indexes, prefer accelerating into a local engine.
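
The `LIMIT`-to-`size` clamping described in the updated **Result size** bullet can be sketched roughly as below. This is only an illustration of the documented behavior, not the connector's actual code: `effective_size` and `MAX_RESULT_WINDOW` are hypothetical names, and the sketch assumes the cap is a fixed 10,000 as the text states.

```python
from typing import Optional

# Assumption: a fixed cap equal to the Elasticsearch default
# `index.max_result_window`, as the updated docs describe.
MAX_RESULT_WINDOW = 10_000


def effective_size(limit: Optional[int]) -> int:
    """Return the `size` a `_search` request would carry (hypothetical helper)."""
    if limit is None:
        # No LIMIT in the SQL query: the docs say the cap is used as the default.
        return MAX_RESULT_WINDOW
    if limit < 0:
        raise ValueError("LIMIT must be non-negative")
    # A LIMIT above the cap is clamped, so the query silently returns fewer rows.
    return min(limit, MAX_RESULT_WINDOW)
```

Under these assumptions, `LIMIT 50` maps to `size=50`, while both `LIMIT 250000` and a query with no `LIMIT` map to `size=10000`, which is why the docs recommend acceleration for full-index access.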

website/docs/components/data-connectors/elasticsearch/index.md

Lines changed: 3 additions & 1 deletion
@@ -71,7 +71,8 @@ The connector derives an Arrow schema from each index's mapping via `GET /<index
 | `ip` | `Utf8` | |
 | `dense_vector` (with `dims`) | `FixedSizeList<Float32, dims>` | Required `dims` field must fit in `i32`. |
 | `dense_vector` (missing `dims`) | `Utf8` | Falls back to raw JSON when dims cannot be resolved. |
-| `object`, `nested` | `Utf8` | Serialized JSON. |
+| `object` (with sub-fields) | _(flattened)_ | Expanded into dot-separated columns (e.g. `address.city`). |
+| `object` (no sub-fields), `nested` | `Utf8` | Serialized JSON. |
 | Any other mapping type | `Utf8` | Fallback — the raw JSON value is preserved as a string. |

 Nested `object` fields are flattened by concatenating field names with dots (e.g. `address.city`). `nested` fields are preserved as JSON strings because per-document ordering must be retained.
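
The flattening rule in the corrected table rows can be illustrated with a small sketch over the `properties` section of an Elasticsearch mapping. `flatten_mapping` is a hypothetical helper, not part of the connector, and the `"json"` label is only a stand-in for "serialized JSON in a `Utf8` column". The sketch relies on the fact that Elasticsearch mappings normally omit `"type"` for `object` fields and list their children under `"properties"`.

```python
from typing import Any, Dict


def flatten_mapping(properties: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Map dot-separated column names to a mapping type label (hypothetical helper)."""
    columns: Dict[str, str] = {}
    for name, field in properties.items():
        path = f"{prefix}{name}"
        # Object fields usually carry no explicit "type" in the mapping.
        field_type = field.get("type", "object")
        sub_fields = field.get("properties")
        if field_type == "object" and sub_fields:
            # Object with sub-fields: expand each child into its own column.
            columns.update(flatten_mapping(sub_fields, prefix=f"{path}."))
        elif field_type in ("object", "nested"):
            # Empty objects and `nested` fields stay as one JSON-string column.
            columns[path] = "json"
        else:
            columns[path] = field_type
    return columns


example = {
    "name": {"type": "keyword"},
    "address": {"properties": {"city": {"type": "keyword"}}},
    "tags": {"type": "nested", "properties": {"key": {"type": "keyword"}}},
}
columns = flatten_mapping(example)
```

With this example mapping, `columns` contains `name`, `address.city`, and `tags`; the `nested` field is not expanded, matching the table's second new row.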
@@ -137,6 +138,7 @@ TLS is enabled automatically for `https://` endpoints.
 - Nested object fields are exposed as JSON strings rather than structured columns.
 - `date` and `date_nanos` fields are preserved as strings because Elasticsearch accepts heterogeneous date formats; cast to a timestamp in SQL when numeric comparison is required.
 - `dense_vector` fields without a declared `dims` value fall back to `Utf8` and are not usable as a vector column.
+- Queries return at most **10,000 hits** per scan. The connector translates SQL `LIMIT` to the Elasticsearch `size` parameter, capped at 10,000 (the Elasticsearch default maximum). Queries without `LIMIT` also return at most 10,000 results. For full-index access, accelerate the dataset into a local engine.
 - Pushdown of SQL predicates to Elasticsearch query DSL is limited; complex filter expressions are evaluated locally by DataFusion after fetching results.

 Elasticsearch can also be configured as a [Vector Engine](../vectors/elasticsearch) for datasets sourced from other connectors (storing Spice-managed embeddings in Elasticsearch rather than querying an existing index).
