fix: Correct Elasticsearch pagination — single _search, no PIT/search_after

claudespice · claudespice · commit 5b192f4cc3fa · 2026-05-09T01:14:59.000-07:00

The docs incorrectly claimed the connector uses point-in-time (PIT)
queries with search_after to page through the entire index. The code
actually issues a single _search request capped at 10,000 hits
(query_table.rs:178). Queries without LIMIT return at most 10,000
rows, not the entire index.
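
A minimal sketch of the row-count behavior described above. This is not
the connector's code: the function names, the MAX_HITS constant, and the
match_all query are hypothetical and only illustrate the min(N, 10000)
rule and the absence of PIT/search_after fields.

```python
from typing import Optional

# Cap stated in this commit; equals the Elasticsearch
# index.max_result_window default. The constant name is hypothetical.
MAX_HITS = 10_000


def effective_size(limit: Optional[int]) -> int:
    """Rows a single _search request asks for, given an optional SQL LIMIT."""
    if limit is None:
        # No LIMIT: the request is still capped, so at most 10,000 rows.
        return MAX_HITS
    return min(limit, MAX_HITS)


def search_body(limit: Optional[int] = None) -> dict:
    """Illustrative body for the one _search request.

    There is no "pit" and no "search_after" key, so nothing pages past
    the first response. The match_all query is an assumption.
    """
    return {"size": effective_size(limit), "query": {"match_all": {}}}
```

Under these assumptions, LIMIT 50 requests 50 hits, while LIMIT 25000 and
a query with no LIMIT both request 10,000.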

Verified against spiceai/spiceai at trunk —
crates/data_components/src/elasticsearch/query_table.rs:177-183
diff --git a/website/docs/components/data-connectors/elasticsearch/deployment.md b/website/docs/components/data-connectors/elasticsearch/deployment.md
@@ -57,7 +57,7 @@ Long-running search responses (very large `LIMIT`, deep pagination, or expensive
 ## Capacity & Sizing
 
 - **Throughput**: Bounded by the Elasticsearch cluster's request handling and (for kNN) HNSW search cost. Plan refresh intervals and concurrent query load to stay within the cluster's tested capacity.
-- **Result size**: The connector pages through results using point-in-time (PIT) queries with `search_after`, fetching up to 10,000 hits per batch (bounded by the Elasticsearch `index.max_result_window` setting). Queries with `LIMIT` fetch only the requested number of rows. Queries without `LIMIT` page through the entire index.
+- **Result size**: The connector issues a single `_search` request per query, returning at most 10,000 hits (bounded by the Elasticsearch `index.max_result_window` setting). Queries with `LIMIT N` fetch `min(N, 10000)` rows. For result sets larger than 10,000, accelerate the dataset.
 - **Mapping fetches**: At dataset registration the connector fetches the index mapping once via `GET /<index>/_mapping`. Mapping changes after registration are not picked up until the runtime restarts.
 
 ## Search Routing
diff --git a/website/docs/components/data-connectors/elasticsearch/index.md b/website/docs/components/data-connectors/elasticsearch/index.md
@@ -138,7 +138,7 @@ TLS is enabled automatically for `https://` endpoints.
 - Nested object fields are exposed as JSON strings rather than structured columns.
 - `date` and `date_nanos` fields are preserved as strings because Elasticsearch accepts heterogeneous date formats; cast to a timestamp in SQL when numeric comparison is required.
 - `dense_vector` fields without a declared `dims` value fall back to `Utf8` and are not usable as a vector column.
-- The connector pages through results in batches of up to 10,000 hits using point-in-time (PIT) queries with `search_after`. Queries with `LIMIT` fetch only the requested number of rows; queries without `LIMIT` page through the entire index. Individual batch size is bounded by the Elasticsearch `index.max_result_window` setting (default 10,000).
+- The connector issues a single `_search` request per query. The result set is capped at 10,000 hits (the Elasticsearch `index.max_result_window` default). Queries with `LIMIT N` fetch `min(N, 10000)` rows; queries without `LIMIT` return at most 10,000 rows. For larger result sets, accelerate the dataset.
 - Pushdown of SQL predicates to Elasticsearch query DSL is limited; complex filter expressions are evaluated locally by DataFusion after fetching results.
 
 Elasticsearch can also be configured as a [Vector Engine](../vectors/elasticsearch) for datasets sourced from other connectors (storing Spice-managed embeddings in Elasticsearch rather than querying an existing index).