Commit 5b192f4

fix: Correct Elasticsearch pagination — single _search, no PIT/search_after
The docs incorrectly claimed that the connector uses point-in-time (PIT) queries with search_after to page through the entire index. The code actually issues a single _search request capped at 10,000 hits (query_table.rs:178), so queries without LIMIT return at most 10,000 rows, not the entire index.

Verified against spiceai/spiceai at trunk: crates/data_components/src/elasticsearch/query_table.rs:177-183
1 parent d1bf5cf commit 5b192f4

2 files changed

Lines changed: 2 additions & 2 deletions

website/docs/components/data-connectors/elasticsearch/deployment.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ Long-running search responses (very large `LIMIT`, deep pagination, or expensive
 ## Capacity & Sizing
 
 - **Throughput**: Bounded by the Elasticsearch cluster's request handling and (for kNN) HNSW search cost. Plan refresh intervals and concurrent query load to stay within the cluster's tested capacity.
-- **Result size**: The connector pages through results using point-in-time (PIT) queries with `search_after`, fetching up to 10,000 hits per batch (bounded by the Elasticsearch `index.max_result_window` setting). Queries with `LIMIT` fetch only the requested number of rows. Queries without `LIMIT` page through the entire index.
+- **Result size**: The connector issues a single `_search` request per query, returning at most 10,000 hits (bounded by the Elasticsearch `index.max_result_window` setting). Queries with `LIMIT N` fetch `min(N, 10000)` rows. For result sets larger than 10,000, accelerate the dataset.
 - **Mapping fetches**: At dataset registration the connector fetches the index mapping once via `GET /<index>/_mapping`. Mapping changes after registration are not picked up until the runtime restarts.
 
 ## Search Routing

website/docs/components/data-connectors/elasticsearch/index.md

Lines changed: 1 addition & 1 deletion
@@ -138,7 +138,7 @@ TLS is enabled automatically for `https://` endpoints.
 - Nested object fields are exposed as JSON strings rather than structured columns.
 - `date` and `date_nanos` fields are preserved as strings because Elasticsearch accepts heterogeneous date formats; cast to a timestamp in SQL when numeric comparison is required.
 - `dense_vector` fields without a declared `dims` value fall back to `Utf8` and are not usable as a vector column.
-- The connector pages through results in batches of up to 10,000 hits using point-in-time (PIT) queries with `search_after`. Queries with `LIMIT` fetch only the requested number of rows; queries without `LIMIT` page through the entire index. Individual batch size is bounded by the Elasticsearch `index.max_result_window` setting (default 10,000).
+- The connector issues a single `_search` request per query. The result set is capped at 10,000 hits (the Elasticsearch `index.max_result_window` default). Queries with `LIMIT N` fetch `min(N, 10000)` rows; queries without `LIMIT` return at most 10,000 rows. For larger result sets, accelerate the dataset.
 - Pushdown of SQL predicates to Elasticsearch query DSL is limited; complex filter expressions are evaluated locally by DataFusion after fetching results.
 
 Elasticsearch can also be configured as a [Vector Engine](../vectors/elasticsearch) for datasets sourced from other connectors (storing Spice-managed embeddings in Elasticsearch rather than querying an existing index).
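
The row-count rule stated in the corrected lines (`LIMIT N` fetches `min(N, 10000)` rows; no `LIMIT` returns at most 10,000) can be sketched as a small function. This is an illustration only, not the code in query_table.rs: the names `MAX_HITS` and `effective_hits` are hypothetical, and the cap is assumed to be a fixed 10,000.

```rust
/// Assumed cap on hits for a single `_search` request, matching the
/// Elasticsearch `index.max_result_window` default. Hypothetical name.
const MAX_HITS: usize = 10_000;

/// Upper bound on the rows a query can return under the documented rule.
/// `limit` is the SQL `LIMIT` value, or `None` when the query has no `LIMIT`.
fn effective_hits(limit: Option<usize>) -> usize {
    match limit {
        // `LIMIT N` requests min(N, 10000) rows.
        Some(n) => n.min(MAX_HITS),
        // Without `LIMIT`, the single request is still capped.
        None => MAX_HITS,
    }
}
```

Under these assumptions, `effective_hits(Some(50))` gives 50, while both `effective_hits(Some(25_000))` and `effective_hits(None)` give 10,000, which is why the docs recommend accelerating the dataset for larger result sets.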
