fix: Correct Elasticsearch docs for 10k row cap, object type mapping, and timeout (#1662)

claudespice · web-flow · commit 21b1f405d0ce · 2026-05-06T09:47:36.000-07:00
diff --git a/website/docs/components/data-connectors/elasticsearch/deployment.md b/website/docs/components/data-connectors/elasticsearch/deployment.md
@@ -50,14 +50,14 @@ Retry tuning is exposed only on the [Elasticsearch Vector Engine](../../vectors/
 | Setting         | Default | Behavior                                                                     |
 | --------------- | ------- | ---------------------------------------------------------------------------- |
 | Connect timeout | `10s`   | Maximum time to establish a TCP/TLS connection to the cluster.               |
-| Request timeout | `30s`   | Maximum time for the entire request/response cycle, including retries.       |
+| Request timeout | `30s`   | Maximum time for each individual HTTP request.                               |
 
 Long-running search responses (very large `LIMIT`, deep pagination, or expensive aggregations) may exceed the default request timeout. Either narrow the query, accelerate the dataset, or use the [vector engine](../../vectors/elasticsearch) `client_timeout` parameter when running the workload through the embedding-write path.
 
 ## Capacity & Sizing
 
 - **Throughput**: Bounded by the Elasticsearch cluster's request handling and (for kNN) HNSW search cost. Plan refresh intervals and concurrent query load to stay within the cluster's tested capacity.
-- **Result size**: Each `_search` request returns up to `size` hits. The connector translates `LIMIT` to `size`; very large limits incur higher cluster memory and network cost.
+- **Result size**: Each `_search` request returns up to `size` hits, hard-capped at **10,000** (the Elasticsearch default `index.max_result_window`). The connector translates `LIMIT` to `size` but clamps the value to 10,000; queries without `LIMIT` also default to 10,000. For full-index access, accelerate the dataset into a local engine.
 - **Mapping fetches**: At dataset registration the connector fetches the index mapping once via `GET /<index>/_mapping`. Mapping changes after registration are not picked up until the runtime restarts.
 - **Pagination**: Spice does not currently use Elasticsearch's `search_after` or scroll APIs from the data connector. For full-table scans of very large indexes, prefer accelerating into a local engine.
 
diff --git a/website/docs/components/data-connectors/elasticsearch/index.md b/website/docs/components/data-connectors/elasticsearch/index.md
@@ -71,7 +71,8 @@ The connector derives an Arrow schema from each index's mapping via `GET /<index
 | `ip`                                                             | `Utf8`                               |                                                             |
 | `dense_vector` (with `dims`)                                     | `FixedSizeList<Float32, dims>`       | Required `dims` field must fit in `i32`.                    |
 | `dense_vector` (missing `dims`)                                  | `Utf8`                               | Falls back to raw JSON when dims cannot be resolved.        |
-| `object`, `nested`                                               | `Utf8`                               | Serialized JSON.                                            |
+| `object` (with sub-fields)                                       | _(flattened)_                        | Expanded into dot-separated columns (e.g. `address.city`).  |
+| `object` (no sub-fields), `nested`                               | `Utf8`                               | Serialized JSON.                                            |
 | Any other mapping type                                           | `Utf8`                               | Fallback — the raw JSON value is preserved as a string.     |
 
 Nested `object` fields are flattened by concatenating field names with dots (e.g. `address.city`). `nested` fields are preserved as JSON strings because per-document ordering must be retained.
@@ -137,6 +138,7 @@ TLS is enabled automatically for `https://` endpoints.
 - Nested object fields are exposed as JSON strings rather than structured columns.
 - `date` and `date_nanos` fields are preserved as strings because Elasticsearch accepts heterogeneous date formats; cast to a timestamp in SQL when numeric comparison is required.
 - `dense_vector` fields without a declared `dims` value fall back to `Utf8` and are not usable as a vector column.
+- Queries return at most **10,000 hits** per scan. The connector translates SQL `LIMIT` to the Elasticsearch `size` parameter, capped at 10,000 (the Elasticsearch default `index.max_result_window`). Queries without `LIMIT` also return at most 10,000 results. For full-index access, accelerate the dataset into a local engine.
 - Pushdown of SQL predicates to Elasticsearch query DSL is limited; complex filter expressions are evaluated locally by DataFusion after fetching results.
 
 Elasticsearch can also be configured as a [Vector Engine](../vectors/elasticsearch) for datasets sourced from other connectors (storing Spice-managed embeddings in Elasticsearch rather than querying an existing index).
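
The `LIMIT`-to-`size` clamping documented in this change can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the helper name `effective_size` and the constant `MAX_RESULT_WINDOW` are hypothetical. The only facts taken from the docs above are the 10,000 cap and the 10,000 default when `LIMIT` is absent.

```python
from typing import Optional

# Assumed constant: mirrors the Elasticsearch default `index.max_result_window`.
MAX_RESULT_WINDOW = 10_000


def effective_size(limit: Optional[int]) -> int:
    """Hypothetical helper: map a SQL LIMIT to an Elasticsearch `size` value.

    - No LIMIT       -> default to the 10,000 cap.
    - LIMIT <= 10000 -> passed through unchanged.
    - LIMIT > 10000  -> clamped to 10,000.
    """
    if limit is None:
        return MAX_RESULT_WINDOW
    return min(limit, MAX_RESULT_WINDOW)


if __name__ == "__main__":
    for requested in (None, 50, 25_000):
        print(requested, "->", effective_size(requested))
```

Under these assumptions, a query with `LIMIT 25000` sees no more rows than one with `LIMIT 10000`, which is why the docs recommend acceleration for full-index access.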