[pull] trunk from spiceai:trunk by pull[bot] · Pull Request #750 · TheRakeshPurohit/spiceai

pull · 2026-04-18T09:06:14Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

* Release notes for v2.0.0-rc.3
* add refresh token config example
* Update release notes for v2.0.0-rc.3 with improved descriptions and links for key features
* fix http connector link

---------

Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>

* Use spiceai-macos for PR Lint
* Remove Makefile Targets workflow and integrate make install steps into PR workflow
* Remove make install-odbc step from build-test (we don't build with ODBC)
* fix: override RUST_PROFILE for install targets in CI workflow (#10366)

* ci: skip artifact compression for test binaries/archives

The test archives are already zstd-compressed and test binaries are native executables, so running actions/upload-artifact's zip compression on them adds CI time without meaningful size savings. Set `compression-level: 0` and bump retention to 3 days.

* ci: remove outdated Pittsburgh mirror from apt configuration

…ivy, rand (#10379)

* chore(deps): update candle dependencies to latest revisions
* chore(deps): bump aws-lc-rs 1.15.4 -> 1.16.3 (aws-lc-sys 0.37.1 -> 0.40.0)
* chore(deps): bump spiceai/mistral.rs to 27405ba1
* chore(deps): update tantivy to version 0.26.0 and downgrade windows-sys to 0.60.2
* chore(deps): update rand and windows-sys dependencies to latest versions
* chore(deps): drop unused OpenSSL license allowance after aws-lc-sys 0.40 bump
* chore(deps): update rand usage to RngExt across multiple modules
* Update crates/cache/Cargo.toml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* chore(deps): update rand usage to RngExt across multiple modules
* chore(deps): replace gen_range with random_range for improved randomness in benchmarks

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* build(deps): bump tantivy from 0.25.0 to 0.26.0 Bumps [tantivy](https://github.com/quickwit-oss/tantivy) from 0.25.0 to 0.26.0. - [Release notes](https://github.com/quickwit-oss/tantivy/releases) - [Changelog](https://github.com/quickwit-oss/tantivy/blob/main/CHANGELOG.md) - [Commits](https://github.com/quickwit-oss/tantivy/compare/0.25.0...0.26.0) --- updated-dependencies: - dependency-name: tantivy dependency-version: 0.26.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * chore(deps): update lz4_flex and rustls-webpki to latest compatible versions * chore(deps): update lz4_flex and rustls-webpki (#10378) * Update candle and mistral.rs lock-step pins (#10278) * Update candle and mistral.rs lock-step pins * Update dependencies to use new git revisions for candle packages * Add provenance comments to candle and mistralrs git dependency pins * Bump mistral.rs and text-embeddings-inference pins - mistral.rs: ac7063cd -> 9b4758762d6ebed08a42af7211c616ebc512c557 - text-embeddings-inference: 58b44fbb -> 88b7a84a2c2ad83707555183f8f18dd201897f12 Adapt to download_safetensors now taking Arc<ApiRepo>. * docs: fix status badges in README (#10350) * docs: fix status badges in README - Use markdown badge syntax for all badges. - Fix wrong workflow references: - spiced_docker_nightly.yml -> spiced_docker_dev.yml - build_and_release.yml -> build_nightly.yml - benchmarks.yml -> testoperator_run_bench.yml - Wrap badges in centered div blocks so GitHub renders markdown inside. 
* docs: point unit tests badge to build_and_release on trunk
* docs: align CodeQL badge link filter with image filter
* Migrate secrets to envs (#10354)
* Add limit pushdown and improve sort pushdown for Oracle and MSSQL (#10351)
* Implement sort pushdown support and fix pushdown gaps across providers

Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper execution plans (CayenneAccelerationExec, SchemaCastScanExec, BytesProcessedExec) by delegating to their child plans, and for SQL providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses.

Also fix limit pushdown consistency in wrappers (delegate supports_limit_pushdown/with_fetch/fetch to child plans instead of returning mismatched values), and extend MSSQL filter pushdown to support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and Between expressions.

* fix: enhance sort pushdown error handling and improve filter classification logic
* Address PR review comments: improve sort pushdown correctness

- FlightSQL: Replace filter_map with fallible map in sql() ORDER BY generation to return an error instead of silently dropping non-Column sort expressions. Add InvalidSortExpression error variant.
- MSSQL: Make classify_mssql_filter recursively check time-related expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to prevent time-related filters from being pushed down via compound exprs.
- SchemaCastScanExec: Propagate input ordering through equivalence properties and set maintains_input_order to true, since schema casting preserves row order.
- FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported for non-column, exact for column) and sql() ORDER BY clause generation.

* Remove unsafe ordering propagation from SchemaCastScanExec

Do not copy input ordering into EquivalenceProperties since schema casting can change data types and projected columns, making the input ordering invalid for the output schema.
Retain maintains_input_order=true since row order is preserved. * Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space - Restore ordering propagation in SchemaCastScanExec::new() that was incorrectly removed, fixing SortPreservingMergeExec invariant violations in partition integration tests. - Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL execution plans when order_expr is empty. Build SQL incrementally, appending clauses only when non-empty. - Update oracle test-framework snapshots to match corrected SQL output. * Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341) * Refactor BytesProcessedExec to simplify fetch and pushdown sort methods * Fix schema_cast ordering: remap column indices by name and add tests - Remap sort expression column indices from input to output schema by name, since SchemaCastScanExec may reorder columns relative to input - Only propagate ordering when ordered columns have identical types - Add 3 unit tests: ordering propagated (same types), not propagated (type differs), and indices remapped (reordered columns) - Add branch comment to datafusion-federation git dependency * Update refresh_max_timestamp_df plan snapshot * Update cluster::distributed_cayenne_catalog snapshots * Update duckdb_json_functions snapshots * Update datafusion version * Update datafusion version * Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e * Revert "Update refresh_max_timestamp_df plan snapshot" This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe. * Update distributed_acceleration snapshot * Add limit pushdown and improve sort pushdown for Oracle and MSSQL * Fix Exact->Inexact * Revert "Fix Exact->Inexact" This reverts commit f423db9007ea20de0c55eee3f0a74af465998371. 
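To illustrate the double-space fix described above (build the SQL incrementally and append each clause only when it is non-empty), here is a minimal sketch. `build_select` and its parameters are hypothetical names chosen for this example; they are not the functions used in the Oracle, FlightSQL, or MSSQL execution plans.

```rust
/// Hypothetical helper: assemble a SELECT statement clause by clause.
/// Empty clauses are skipped, so no stray "  " (double space) is left
/// where a missing WHERE or ORDER BY would otherwise have been spliced in.
fn build_select(table: &str, where_clause: &str, order_by: &str) -> String {
    let mut sql = format!("SELECT * FROM {table}");
    if !where_clause.is_empty() {
        sql.push_str(" WHERE ");
        sql.push_str(where_clause);
    }
    if !order_by.is_empty() {
        sql.push_str(" ORDER BY ");
        sql.push_str(order_by);
    }
    sql
}
```

A template such as `format!("SELECT * FROM {t} {w} {o}")` would leave consecutive spaces whenever a clause is empty, which is the kind of output the snapshot update above corrects.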
--------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> * Fix ubuntu mirror configuration (#10359) * Add step to verify apt mirror configuration in GitHub Action * Fix apt mirror substitution to match only archive.ubuntu.com Improve awk regex to avoid greedy matching and add error check to fail if archive.ubuntu.com remains after substitution. * Improve apt mirror substitution check for Pittsburgh mirror The script now verifies that the Pittsburgh mirror is present in ubuntu.sources after substitution, rather than checking for the absence of archive.ubuntu.com, which is intentionally retained as a fallback. This avoids false negatives and ensures the mirror substitution is effective. * Simplify deb822 mirror substitution using sed for archive URIs * Update apt mirror check to use PRIMARY variable Check for the configured primary mirror in ubuntu.sources using the PRIMARY variable instead of a hardcoded hostname. Update error message to include the actual PRIMARY value for clarity. * fix: Increase throughput test default ready_wait from 30s to 300s (fixes #8207) (#10344) The throughput workflow's `ready_wait` input defaulted to 30 seconds, which is insufficient for tests loading data from external sources like MongoDB. The dispatch configs specify adequate timeouts (e.g. 600s for mongodb-arrow), but manual workflow triggers via the GitHub UI used the low default, causing "Spiced instance not ready within 30s" failures. Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * Add auth headers support to OTEL metrics exporter (#10347) * Add auth headers support to OTEL metrics exporter Add a 'headers' field to OtelExporterConfig for sending authentication headers with OTEL metric export requests. 
Headers are applied as HTTP headers for HTTP protocol or gRPC metadata entries for gRPC protocol. Header values support secret replacement syntax (e.g. ${secrets:api_key}). This enables authentication with services like Datadog (DD-API-KEY header) and Grafana Cloud (Authorization header). * Add default value for headers in Spicepod schema and enhance gRPC exporter error handling * Refactor assertions in HTTP and gRPC exporter tests for improved readability * Fix gRPC exporter tests to use tokio runtime * Address review: rename shadowed vars, fix test runtime setup * Address review: pass owned headers, document gRPC key constraint, add tokio runtime to test * Add Clippy expectation for implicit hasher in create_otel_periodic_reader * Address review: document resolved_headers vs config.headers; note gRPC key constraint in schema * Fix YAML string formatting in test for OTEL exporter headers * fix linter warnings --------- Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: ewgenius <hey@ewgenius.me> * fix(github): shrink GraphQL page size on gateway errors; lower comment defaults (#10355) * fix(github): shrink GraphQL page size on gateway errors; lower comment defaults - Lower default `github_max_comments_fetched` from 75 to 25 to reduce worst-case node count per page and keep queries within GitHub's secondary rate limit budget. Cap remains 75. - Reduce PR outer `first:` page size from 100 to 25 when `include_comments` is enabled (review/discussion/all). Without comments the page stays at 100. - Reduce inner `comments(first: ...)` in issues query from 100 to 25 to match. - Add `PullRequestTableArgs::check_node_limit()` that estimates per-page node count and rejects configurations that would exceed GitHub's 500,000 node hard limit with an actionable error. Invoked eagerly from `read_provider` so misconfigurations fail fast rather than at query time with an opaque 502. 
- Graphql client: on 502/503/504 gateway errors, shrink the outer `first:` page size via a reverse-Fibonacci ladder (100,55,34,21,13,8,5,3,2,1) and rewrite the query AST on retry. This lets very large queries against overloaded GitHub endpoints succeed on a subsequent attempt with a smaller payload instead of replaying the same oversized query. Fixes the `spicehq_spiceai.pulls` 502 Bad Gateway errors observed with `include_comments: all` against large repositories. * fix(github): improve pagination handling and node count estimation for pull requests * fix(graphql): improve error handling for page size override locking * fix(graphql): preserve LIMIT 0 semantics, only clamp page_size_override --------- Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> * Relax apt mirror substitution failure to warning in CI action (#10361) * Relax apt mirror substitution failure to warning in CI action * Update .github/actions/configure-apt/action.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat(http): Add OAuth2 refresh-token auth to HTTP connector (#10348) * feat(http): Add OAuth2 refresh-token auth to HTTP connector Adds RFC 6749 §6 refresh-token grant support to the HTTP connector. When configured, the connector exchanges the refresh token at startup, stamps `Authorization: Bearer <access_token>` on every data request, and refreshes in the background before the token expires. Rotated refresh tokens are honored across the lifetime of the process. 
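For readers unfamiliar with the refresh-token grant, the sketch below shows the form fields that RFC 6749 §6 defines for the exchange and how a rotated refresh token replaces the old one. It is an assumption-laden illustration, not the connector's code: `ClientAuth`, `refresh_form`, and `next_refresh_token` are hypothetical names, and URL encoding and the HTTP call itself are left out.

```rust
/// How client credentials are sent to the token endpoint (illustrative stand-in).
enum ClientAuth {
    /// Credentials travel in an `Authorization: Basic` header, not in the form.
    Basic,
    /// Credentials travel in the form body as `client_id` / `client_secret`.
    Body,
}

/// Form fields for a refresh-token request, before URL encoding.
fn refresh_form(
    refresh_token: &str,
    client: Option<(&str, &str)>,
    auth: &ClientAuth,
) -> Vec<(&'static str, String)> {
    let mut form = vec![
        ("grant_type", "refresh_token".to_string()),
        ("refresh_token", refresh_token.to_string()),
    ];
    // Only the body method adds credentials to the form itself.
    if let (ClientAuth::Body, Some((id, secret))) = (auth, client) {
        form.push(("client_id", id.to_string()));
        form.push(("client_secret", secret.to_string()));
    }
    form
}

/// Choose the refresh token for the next exchange: if the token endpoint
/// returned a new one, it replaces the old token; otherwise keep the current one.
fn next_refresh_token(current: String, rotated: Option<String>) -> String {
    rotated.unwrap_or(current)
}
```

Feeding the result of `next_refresh_token` into the following call to `refresh_form` is what "honoring rotated refresh tokens" amounts to in this simplified model.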
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Address review comments on OAuth2 refresh-token auth - Sanitize & cap token endpoint error body (collapse whitespace, 512 byte cap) - Reject non-Bearer `token_type` from the token endpoint - Store preformatted, sensitive `HeaderValue` in the watch channel so every data request is a cheap header clone instead of a new format!() allocation - React to shutdown immediately via `tokio::select!` on `tx.closed()` instead of waiting the full sleep interval - Add `refresh_loop_uses_rotated_refresh_token` test that drives the live background loop and asserts the rotated refresh token is used on the second exchange - Use `Parameters::user_param(...)` so configuration errors surface the prefixed user-facing keys (`http_auth_refresh_token`, etc.) instead of internal names Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Further review feedback on OAuth2 refresh-token auth - Reject blank/whitespace-only http_auth_refresh_token at config time, before any network round-trip to the token endpoint - Classify RefreshTokenAuth errors: InvalidTokenUrl / InsecureTokenUrl / UnsupportedTokenType and 4xx responses from the token endpoint surface as InvalidConfiguration so users see configuration guidance instead of "Cannot connect …"; network / 5xx / parse errors stay as UnableToConnectInternal - Inline Parameters::user_param() directly into format!() calls and rename the Option<SecretString> local to client_credential so CodeQL's cleartext-logging heuristic stops flagging parameter-*name* strings as sensitive - OauthServerState in the integration test now uses tokio::sync::Mutex so the axum handlers do not block a Tokio worker thread - Mock OAuth server no longer swallows serve errors via .unwrap_or_default — .expect() surfaces failures directly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(http): cargo fmt + switch test URL to 
https://example.invalid - Apply rustfmt to auth.rs / https.rs / tests/http/mod.rs to clear the Rust Lint CI failure (cargo fmt --check diffs) - Swap the non-sent test URL in refresh_token_auth_applies_bearer_header to https://example.invalid so CodeQL's rust/non-https-url rule stops flagging a hardcoded http:// in test code. The request is only passed through `.build()` to inspect the Authorization header, never sent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Address clippy pedantic warnings in auth.rs - Wrap `OAuth2` in backticks in three doc-comments to satisfy clippy::doc_markdown - `sanitize_error_body` now takes `&str` instead of `String` (removes clippy::needless_pass_by_value); updated call sites - `MockResponse::{ok, status}` take `&serde_json::Value` (same reason); updated callers in the test module - Replace two `if … { panic!(...) }` blocks inside loop bodies with `assert!(… < deadline, …)` to satisfy clippy::panic_in_result_fn / clippy::manual_assert Lint-only changes; no behavior change. Runtime tests and data_components auth tests continue to pass (9/9 and 10/10 respectively). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): More Copilot-reviewer nits on OAuth2 refresh-token auth - ParameterSpec descriptions now cross-reference the correct user-facing key names: `auth_token_url` points at `http_auth_refresh_token`, and the `http_auth_client_id` / `http_auth_client_secret` descriptions mention each other explicitly - `map_auth_error` narrows `InvalidConfiguration` mapping from "any 4xx" to `matches!(status, 400 | 401 | 403)`. Transient 4xx (408 Request Timeout, 429 Too Many Requests, etc.) 
now route to `UnableToConnectInternal`, consistent with the doc-comment - Test doc-comments in `tests/http/mod.rs` updated to name the exact user- facing keys (`http_auth_refresh_token` + `auth_token_url`) - Clarified the whitespace-sanitize comment in `sanitize_error_body`: whitespace characters are *replaced* with a regular space (runs become runs of spaces); the comment no longer implies run-collapse Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Bounded streaming read of token endpoint error body + test cleanup - Replace unbounded `response.text().await` on the error path with a new `read_bounded_error_body` that streams via `Response::chunk()` and stops after `MAX_ERROR_BODY_BYTES * 2` raw bytes. A hostile token endpoint can no longer force us to buffer a large body just so we can surface the first 512 bytes of it as a diagnostic. - Rework `refresh_loop_uses_rotated_refresh_token` to use `tokio::time:: Instant` (instead of `std::time::Instant`) for its 5s deadline poll, keeping real time so reqwest's `connect_timeout` doesn't race with paused virtual time. Added a comment explaining why `start_paused = true` is unsafe here (verified locally: 5/5 runs at 1.06s each, no flakes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix clippy lints in https connector - Backtick OAuth2 in doc comment (clippy::doc_markdown). - Collapse redundant match-guard into pattern literals (clippy::redundant_guards). 
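The status classification described for `map_auth_error` above can be sketched with pattern literals, the form the `clippy::redundant_guards` fix favors. `ErrorClass` and `classify_token_endpoint_status` are hypothetical names for this example; the real function maps richer error variants.

```rust
/// Simplified stand-in for the two outcomes named in the commit messages.
#[derive(Debug, PartialEq)]
enum ErrorClass {
    /// The user should fix their configuration (bad credentials, bad request).
    InvalidConfiguration,
    /// Transient or server-side problem; reported as a connection failure.
    UnableToConnect,
}

/// Classify an HTTP status returned by the token endpoint.
fn classify_token_endpoint_status(status: u16) -> ErrorClass {
    match status {
        // Only these statuses point at a configuration mistake.
        400 | 401 | 403 => ErrorClass::InvalidConfiguration,
        // 408, 429, 5xx and everything else are treated as connectivity issues.
        _ => ErrorClass::UnableToConnect,
    }
}
```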
* fix(http): Address more reviewer nits on OAuth2 auth module - `current_bearer_value` is now `#[cfg(test)] pub(crate) fn` so downstream callers in production code paths cannot accidentally surface the live bearer token via this API - Rename `sanitize_error_body_collapses_whitespace_and_truncates` to `…_replaces_whitespace_and_truncates` — the implementation *replaces* whitespace characters with spaces (runs preserved as runs of spaces), which the old test name misrepresented as "collapses" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Enforce hard cap on sanitized error body size `sanitize_error_body` previously filled `out` up to `MAX_ERROR_BODY_BYTES` and then unconditionally appended `…<truncated>`, so the returned string could exceed the documented 512-byte cap by the marker's length. Introduce `TRUNCATION_MARKER` and a `CONTENT_BUDGET` constant that reserves room for the marker, so the final returned string (content + marker) is always ≤ `MAX_ERROR_BODY_BYTES`. Add a `debug_assert!` on the post-condition and tighten the unit test to check `out.len()` directly rather than just the content portion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): merge identical match arms in read_bounded_error_body Clippy's `match_same_arms` lint (enabled via `-Dclippy::pedantic` in CI) flagged the `Ok(None)` and `Err(_)` arms both returning `break`. Merged them into a single `Ok(None) | Err(_)` arm with a comment explaining that end-of-body and mid-stream transport errors are both terminal for a best-effort error diagnostic. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(http): Improve error message for insecure token URL and add custom headers access method * fix(https): Refactor conditional check for Authorization custom header * fix(http): Collapse identical match arms in map_auth_error clippy::match_same_arms fired after introducing the TokenEndpointStatus { status: 400 | 401 | 403, .. } arm that maps to the same InvalidConfiguration variant as the URL-validation arms. Merge the patterns into a single arm. * Update crates/runtime/src/dataconnector/https.rs Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * Replace .is_err() and .is_ok() in tests with expect * fix clippy - add backticks * refactor(http): Move bounded error body helpers into resilient_http Address phillipleblanc review feedback: promote read_bounded_error_body and sanitize_error_body from auth.rs into resilient_http as public helpers so other HTTP-based data connectors can reuse them. Generalize the cap to a parameter and expose TRUNCATED_BODY_MARKER as part of the public API. * fix(http): Small-cap sanitize_error_body + drop case-sensitive one_of - `sanitize_error_body` previously returned an empty string when `max_bytes < TRUNCATED_BODY_MARKER.len()` because the content budget underflowed to 0. It now uses the full `max_bytes` as content budget without a marker in that case, matching the documented upper bound. New `sanitize_error_body_small_cap_fills_content_without_marker` test. - Removed `.one_of(&["basic", "body"])` from the `auth_client_auth` ParameterSpec: `Parameters::try_new` enforces `one_of` with exact string matching, which rejected `BASIC` / `BODY` before `ClientAuthMethod::parse` (case-insensitive) could accept them. Parser is the single source of truth for validation. 
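The capping rules described above for `sanitize_error_body` (replace whitespace with spaces, reserve room for the marker, and fall back to content only when the cap is smaller than the marker) can be sketched as follows. This is a reconstruction from the commit messages, not the code in `resilient_http`; in particular the marker text is assumed to be the `…<truncated>` string quoted earlier.

```rust
/// Assumed marker text, taken from the earlier commit message.
const TRUNCATED_BODY_MARKER: &str = "…<truncated>";

/// Return `body` with whitespace replaced by spaces, never longer than `max_bytes`.
fn sanitize_error_body(body: &str, max_bytes: usize) -> String {
    // Each whitespace character becomes one plain space; runs are not collapsed.
    let cleaned: String = body
        .chars()
        .map(|c| if c.is_whitespace() { ' ' } else { c })
        .collect();
    if cleaned.len() <= max_bytes {
        return cleaned;
    }
    // Reserve room for the marker when it fits; otherwise spend the whole cap on content.
    let (budget, marker) = if max_bytes >= TRUNCATED_BODY_MARKER.len() {
        (max_bytes - TRUNCATED_BODY_MARKER.len(), TRUNCATED_BODY_MARKER)
    } else {
        (max_bytes, "")
    };
    let mut out = String::with_capacity(max_bytes);
    for c in cleaned.chars() {
        // Stop before a character that would cross the byte budget,
        // which also keeps the cut on a UTF-8 boundary.
        if out.len() + c.len_utf8() > budget {
            break;
        }
        out.push(c);
    }
    out.push_str(marker);
    debug_assert!(out.len() <= max_bytes);
    out
}
```

With a cap of 512 and a 600-byte ASCII body, this sketch yields 498 bytes of content plus the 14-byte marker, so the result is exactly at the cap rather than over it.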
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> Co-authored-by: ewgenius <hey@ewgenius.me> * Upgrade Rust toolchain to 1.94.1 (#10353) * Upgrade Rust toolchain to 1.94.1 Bump workspace rust-version, rust-toolchain.toml channel, Dockerfiles, and workflow rustup installs from 1.93.1 to 1.94.1. Updates the agent/copilot instructions baseline accordingly. * Add #[expect(clippy::result_large_err)] annotations to multiple TryFrom implementations and update println! to use saturating_sub * Handle order by and sort in PartitionedTableScanRewrite (#9656) * Handle order by and sort in PartitionedTableScanRewrite * formatting * fix(ci): add merge_group trigger to pr-develop and scope trunk merge_group to target branches * PR comments * clean * fix projection, change order of partition rule * PR comments * clippy * fix distributed acceleration testing * fix integ tests * redact flight * some better snapshots * federation-revert * update federation * fix test * linting * remove comment * fix UNION ALL * improvements * support filter predicates in topK pushdown * fix ownership * clippy * Upgrade datafusion-table-providers to d1b911a5 and bump adbc to 0.23 * table provider and federation update * snapshots * comment * snapshots * snapshots * FlightSQLTable cannot federate Extension LogicalPlans * test updates * fix join * preserve table alias * snapshots * fix test * snapshots * dependencies * snapshots * update table-provider * don't update table-providers and federation * snapshots * flightsql doesn't need federation; but physical optimisation to pushdown * Fix clippy * Executor: always clear partition_by setting at start * Fix cluster::distributed_acceleration tests * Update cluster::distributed_cayenne tests snapshots * Fix doc comments * Add 
test_distributed_acceleration_order_by_limit_pushdown integration test * Fix lint --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> * Fix OTEL Exporter (#10363) * Fix OTLP Exporter * Fix * Fix formatting in documentation comments * Pin spiceai candle / TEI forks to merged revs; drop local [patch] overrides (#10362) * Update candle and mistral.rs lock-step pins * Update dependencies to use new git revisions for candle packages * Add provenance comments to candle and mistralrs git dependency pins * Bump mistral.rs and text-embeddings-inference pins - mistral.rs: ac7063cd -> 9b4758762d6ebed08a42af7211c616ebc512c557 - text-embeddings-inference: 58b44fbb -> 88b7a84a2c2ad83707555183f8f18dd201897f12 Adapt to download_safetensors now taking Arc<ApiRepo>. * Remove unnecessary Clippy expectations from tests * Pin spiceai candle/TEI forks to merged revs; drop local [patch] overrides All six fork PRs have merged; bump each git rev to the merge SHA on the fork's `spiceai` branch, and drop the temporary /tmp/... [patch.*] tables that were used while those fixes were in flight. - spiceai/candle-cublaslt -> b74d30e0 (port to candle 0.10.1 / cudarc 0.19) - spiceai/candle-layer-norm -> 62f936a1 (port to candle 0.10.1 / cudarc 0.19) - spiceai/candle-rotary -> a4c4efcd (port to candle 0.10.1 / cudarc 0.19) - spiceai/candle -> c87b9bc5 (run_mha FFI softcap dedup) - spiceai/text-embeddings-inference -> b958dca5 (compute_cap -> cudarc 0.19; bumps sibling crate pins to the revs above) - candle-index-select-cu (crates.io -> spiceai fork) -> 397d7338 (fallback shim for candle 0.10.1 / cudarc 0.19; patched via [patch.crates-io]) Verified with `cargo check --release --features cuda` on CUDA 12.6 / CUDA_COMPUTE_CAP=90: clean finish in 10m52s. 
* Update Cargo.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Implement sort pushdown and fix pushdown gaps across providers (#10337) * Implement sort pushdown support and fix pushdown gaps across providers Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper execution plans (CayenneAccelerationExec, SchemaCastScanExec, BytesProcessedExec) by delegating to their child plans, and for SQL providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses. Also fix limit pushdown consistency in wrappers (delegate supports_limit_pushdown/with_fetch/fetch to child plans instead of returning mismatched values), and extend MSSQL filter pushdown to support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and Between expressions. * fix: enhance sort pushdown error handling and improve filter classification logic * Address PR review comments: improve sort pushdown correctness - FlightSQL: Replace filter_map with fallible map in sql() ORDER BY generation to return an error instead of silently dropping non-Column sort expressions. Add InvalidSortExpression error variant. - MSSQL: Make classify_mssql_filter recursively check time-related expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to prevent time-related filters from being pushed down via compound exprs. - SchemaCastScanExec: Propagate input ordering through equivalence properties and set maintains_input_order to true, since schema casting preserves row order. - FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported for non-column, exact for column) and sql() ORDER BY clause generation. * Remove unsafe ordering propagation from SchemaCastScanExec Do not copy input ordering into EquivalenceProperties since schema casting can change data types and projected columns, making the input ordering invalid for the output schema. Retain maintains_input_order=true since row order is preserved. 
* Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space - Restore ordering propagation in SchemaCastScanExec::new() that was incorrectly removed, fixing SortPreservingMergeExec invariant violations in partition integration tests. - Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL execution plans when order_expr is empty. Build SQL incrementally, appending clauses only when non-empty. - Update oracle test-framework snapshots to match corrected SQL output. * Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341) * Refactor BytesProcessedExec to simplify fetch and pushdown sort methods * Fix schema_cast ordering: remap column indices by name and add tests - Remap sort expression column indices from input to output schema by name, since SchemaCastScanExec may reorder columns relative to input - Only propagate ordering when ordered columns have identical types - Add 3 unit tests: ordering propagated (same types), not propagated (type differs), and indices remapped (reordered columns) - Add branch comment to datafusion-federation git dependency * Update refresh_max_timestamp_df plan snapshot * Update cluster::distributed_cayenne_catalog snapshots * Update duckdb_json_functions snapshots * Update datafusion version * Update datafusion version * Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e * Revert "Update refresh_max_timestamp_df plan snapshot" This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe. 
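The "remap column indices by name" step in the schema_cast ordering fix above can be pictured with a small sketch. `Field` is a simplified stand-in for an Arrow field and `remap_ordering` is a hypothetical name; the real code works on DataFusion sort expressions and equivalence properties.

```rust
/// Simplified stand-in for a schema field: a column name plus a type label.
struct Field {
    name: &'static str,
    data_type: &'static str,
}

/// Translate sort-column indices from the input schema to the output schema.
/// Returns `None` (do not propagate the ordering) if a sort column is missing
/// from the output or its type differs after the cast.
fn remap_ordering(input: &[Field], output: &[Field], sort_cols: &[usize]) -> Option<Vec<usize>> {
    sort_cols
        .iter()
        .map(|&i| {
            let src = input.get(i)?;
            // Look the column up by name, since the output may reorder columns.
            let pos = output.iter().position(|f| f.name == src.name)?;
            // Keep the ordering only when the type is unchanged.
            (output[pos].data_type == src.data_type).then_some(pos)
        })
        .collect()
}
```

The three outcomes mirror the unit tests listed in the commit: ordering propagated when types match, dropped when a type differs, and indices remapped when columns are reordered.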
* Update distributed_acceleration snapshot --------- Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: Jack Eadie <jack@spice.ai> * Merge develop to trunk (2026-04-16) (#10345) * fix: Update test snapshots (#10219) Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> * fix: Update Search integration test snapshots (#10240) * fix: Place search index filters below pre_limit for pushdown (fixes #10149) SearchQueryProvider::scan() was adding the pre_limit to the logical plan BEFORE adding filters. Since DataFusion's PushDownFilter optimizer cannot push filters past a Limit node, filters never reached the underlying search index table provider (e.g. S3VectorsQueryExec). This caused both worse performance (server-side filtering not used) and incorrect results (top-K-then-filter instead of top-K-of-filtered-set). The fix restructures the plan building to add filters BEFORE the pre_limit, allowing DataFusion to push them through the SubqueryAlias into the inner TableScan. * style: cargo fmt * fix: Make filter pushdown test assertions more robust Use `filters.iter().any(...)` instead of `filters[0]` to assert that at least one recorded scan call contains the expected pushed-down predicate. This avoids potential flakiness if DataFusion's physical planning invokes scan() more than once during optimization. Addresses copilot review feedback on PR #10157. * refactor: Replace unit tests with insta snapshot test for filter pushdown Replace the unit tests in SearchQueryProvider with a new VectorSearchSqlFilteredIndexOnly snapshot test case in the megascience integration test suite. The snapshot test exercises the index-only path with a WHERE filter, verifying both correct query results and the EXPLAIN plan structure (filter placement relative to pre_limit). Addresses review feedback from @Jeadie on PR #10157. 
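The difference between "top-K-then-filter" and "top-K-of-filtered-set" described in the search filter pushdown fix above is easy to see on a toy ranked list. The types and function names below are hypothetical and only model the ordering of the two operations, not DataFusion plan nodes.

```rust
/// One ranked search result; the slice passed around is assumed to be best-first.
struct Hit {
    id: u32,
    category: &'static str,
}

/// Limit first, then filter: matches outside the first `k` hits are lost.
fn limit_then_filter(hits: &[Hit], k: usize, category: &str) -> Vec<u32> {
    hits.iter()
        .take(k)
        .filter(|h| h.category == category)
        .map(|h| h.id)
        .collect()
}

/// Filter first, then limit: returns the best `k` hits among the matches.
fn filter_then_limit(hits: &[Hit], k: usize, category: &str) -> Vec<u32> {
    hits.iter()
        .filter(|h| h.category == category)
        .take(k)
        .map(|h| h.id)
        .collect()
}
```

With five ranked hits whose categories alternate, asking for the top 2 in one category returns a single id from the first function and two ids from the second, which is the correctness gap the restructured plan closes.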
* fix: cargo fmt formatting in megascience test match arm * fix: Update Search integration test snapshots * fix: Use pre_limit argument instead of SQL LIMIT in filtered index test Address review feedback from @Jeadie: the VectorSearchSqlFilteredIndexOnly test should pass the limit as the pre_limit argument to vector_search() rather than using a SQL LIMIT clause, since the test is specifically designed to verify filter pushdown below the pre_limit. * fix: Update github workflows snapshot after features.yml removal The `check all features` workflow (.github/workflows/features.yml) was removed from the repository, shifting the top-10 workflows query result. * fix: Update search snapshot for s3vectors_chunking_view_with_where Score for id 551 shifted from 0.28 to 0.29 (consistent across retries), changing result order when tied with id 1035. Update snapshot to match. * fix: Make search snapshot tests robust to cross-runner score variance model2vec similarity scores vary ±0.01 across CI runners (different macOS versions), causing snapshot tests to fail when scores land on different sides of truncation boundaries. Two fixes: 1. normalize_search_response_json: use round() instead of trunc() for score display and sorting. Scores like 0.289 now consistently round to 0.29 instead of truncating to 0.28 on some runners. 2. SQL test queries: reduce trunc(_score, 3) to trunc(_score, 2) to avoid flakiness at the 3rd decimal place (e.g., 0.556 vs 0.557). * fix: Apply cargo fmt to search test normalization * fix: Update OpenAI search snapshots for embedding model score shift OpenAI's text-embedding-3-small model scores shifted by +0.01, causing snapshot mismatches in the openai_test_search CI check. * fix: Scope score rounding to s3vectors tests only The previous change to use `round` instead of `trunc` for score display in `normalize_search_response_json` was applied globally, causing cascading snapshot failures in OpenAI search tests (0.65→0.66, etc.). 
This fix adds a `round_scores` flag to `SearchTestCase` and `run_search_w_explain` so that only s3vectors tests (which have non-deterministic model2vec scores that vary ±0.002 across CI runners) use rounding for display. All other tests (OpenAI, HF, text search) continue to use truncation, preserving their existing snapshots. Sort comparison still uses rounding universally to stabilize ordering. * fix: Revert OpenAI snapshots to truncated score values The previous commit incorrectly updated these snapshots to rounded values when the normalization was unconditionally using round(). Now that rounding is scoped to s3vectors tests only, OpenAI tests use truncation again - restore the original snapshot values. * fix: Also scope sort rounding to round_scores flag The sort comparison was unconditionally using rounded values, causing ordering mismatches with truncated display values in OpenAI tests. Now both sort and display use the same precision mode: raw floats when round_scores is false, rounded when true. * fix: Use score rounding for OpenAI search tests OpenAI embeddings are non-deterministic — scores vary by ±0.01 across CI runs, causing snapshot failures when truncation amplifies boundary effects. Switch OpenAI search tests to use score rounding (same as model2vec/s3vectors tests) for more stable comparisons. * fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters (#10217) * fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters DataFusion 52 defaults to map_string_types_to_utf8view=true, so string literals in WHERE clauses arrive as ScalarValue::Utf8View instead of ScalarValue::Utf8. The GitHub connector's ref filter extraction only matched Utf8, causing WHERE ref='...' to silently fail. 
Changes: - Add scalar_utf8_value() helpers to extract strings from all three ScalarValue string variants (Utf8, LargeUtf8, Utf8View) - Update ref filter pushdown in files, commits, and workflow_runs - Change files table ref filter from Inexact to Exact (ref is fully handled by the connector, no residual filter needed) - Fix validate_installation_access to skip when token-based auth is active, preventing autoloaded app credentials from interfering - Add GitHub App auth integration tests for commits, files, and issues * fix: streamline ref value handling in commits filter pushdown * fix: Correct round_scores=false for OpenAI tests, remove unused builder, update github workflows snapshot - OpenAI tests should use truncation (round_scores=false) since their embeddings are deterministic - Remove unused round_scores() builder method that triggered lint error - Update github workflows snapshot to reflect removed integration.yml workflow * fix: Update snapshot expression headers to match new function signatures All normalize_search_response and normalize_search_response_json calls now include the round_scores parameter. Update snapshot expression lines to match so insta doesn't flag expression mismatches. * fix: Update snapshot column aliases from trunc(_score,3) to trunc(_score,2) SQL test queries were changed from trunc(_score, 3) to trunc(_score, 2) in a previous commit. Update all snapshot files that reference the old Int64(3) column alias to use Int64(2). 
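The truncation-boundary flakiness discussed in the score commits above can be reproduced with plain floating-point arithmetic. The sketch below is an illustration only: `trunc_to` and `round_to` are hypothetical helpers, not the project's `normalize_search_response_json`, and the scores are example values taken from the ±0.002 variance mentioned in the commit messages.

```rust
/// Truncates `score` toward zero at `places` decimal places.
/// (Hypothetical helper for illustration.)
fn trunc_to(score: f64, places: i32) -> f64 {
    let factor = 10f64.powi(places);
    (score * factor).trunc() / factor
}

/// Rounds `score` to the nearest value at `places` decimal places.
/// (Hypothetical helper for illustration.)
fn round_to(score: f64, places: i32) -> f64 {
    let factor = 10f64.powi(places);
    (score * factor).round() / factor
}

/// Returns true when two runners' scores display identically
/// under the given normalization function.
fn same_display(a: f64, b: f64, normalize: fn(f64, i32) -> f64, places: i32) -> bool {
    (normalize(a, places) - normalize(b, places)).abs() < 1e-9
}
```

For two runners reporting 0.289 and 0.291, `same_display(0.289, 0.291, trunc_to, 2)` is false (0.28 versus 0.29), whereas `same_display(0.289, 0.291, round_to, 2)` is true (both 0.29). Rounding does not remove boundaries altogether; it moves them to the half-steps (for example 0.285), so scores that straddle a half-step can still differ.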
* ci: Revert autogenerated PR base branch back to trunk in GitHub workflows (#10222) * ci: Revert autogenerated PR base branch back to trunk in GitHub workflows * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: chunk MERGE delete filters and update Vortex for stack-safe IN-lists (#10207) * fix(databricks): Fix schema introspection and timestamp overflow (#10226) * fix(databricks): fix schema introspection and timestamp overflow - Fall back from full_data_type to data_type column when Databricks information_schema does not have full_data_type (UNRESOLVED_COLUMN) - Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL) gracefully for the data_type fallback path - Change declared timestamp types from Nanosecond to Microsecond to match what Databricks actually sends in Arrow IPC, preventing arithmetic overflow on far-future sentinel values (e.g. year 9999) - Add safe-cast fallback in try_cast_to for timestamp unit conversions that overflow, producing NULL instead of crashing * fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback * refactor: simplify array and batch creation in tests for clarity * fix(databricks): enhance error handling for unresolved columns in schema retrieval * fix(databricks): Fix schema introspection failures for non-Unity-Catalog environments (#10227) * fix(databricks): fix schema introspection and timestamp overflow - Fall back from full_data_type to data_type column when Databricks information_schema does not have full_data_type (UNRESOLVED_COLUMN) - Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL) gracefully for the data_type fallback path - Change declared timestamp types from Nanosecond to Microsecond to match what Databricks actually sends in Arrow IPC, preventing arithmetic overflow on far-future sentinel values (e.g. 
year 9999) - Add safe-cast fallback in try_cast_to for timestamp unit conversions that overflow, producing NULL instead of crashing * fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback * refactor: simplify array and batch creation in tests for clarity * fix: update comments to clarify overflow conditions for timestamp handling * Add test for UNRESOLVED_COLUMN on columns other than full_data_type; tighten fallback condition * Fix MapArray entries nullability: Arrow spec requires Map entries to be non-null The Databricks type parser was creating Map types with nullable entries (entries struct field nullable=true), but Arrow's MapArray validation requires entries to always be non-null. When the wire data arrived with non-null entries and the declared schema had nullable entries, the cast failed with 'MapArray entries cannot contain nulls'. Changed both parameterized (MAP<K,V>) and parameterless (MAP) parsing to set entries nullable=false, matching the Arrow specification. * Add cohorted spend view schema tests from real Databricks CSV * Replace edw references with generic test names * Enhance error handling in parser tests: replace unwrap_or_else with expect for clearer failure messages * Address PR review: add table to tracing, use expect in tests, fix doc backticks * Add GEOMETRY type support (maps to Binary/WKB); add mixed-type schema tests * Properly mark dataset as Ready on Scheduler (#10215) * Properly mark dataset as Ready on Scheduler * Lint and fixes * Fix * Lint * Enable reqwest compression and optimize HTTP client settings (#10154) * fix: report text_search validation errors as execution errors, not planning errors * fix(bedrock): return specific error messages for auth and stream failures Replace the generic TODO catch-all with specific error matching for each ConverseStreamOutputError variant (ThrottlingException, ValidationException, ModelStreamErrorException). 
For non-service SDK errors, detect authentication failures (UnrecognizedClientException, AccessDeniedException, etc.) and return a clear "authentication failed" message instead of "unhandled error". Fixes #6771 * refactor(bedrock): use into_service_error() for idiomatic error handling Replaces e.err() with e.into_service_error() which is the standard AWS SDK pattern for consuming SdkError and matching on the inner service error type. Addresses review feedback from @Jeadie. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(bedrock): use into_service_error() for typed error extraction * ci: Add merge_group trigger to integration test workflows * Add support for DF-native DML (#9931) * Add support for DF native DML * Lint * Cleanup DeletionTableProvider * Lint * Fix * Fix * Fix * Lint * Lint * Fix * Lint * Fix partition test * Add update to cayenne and polytable * Lint * Fix test * Fix Cargo.toml * Fix * Fix * Fix * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Update autogenerated PR base branch to develop in GitHub workflows (#10034) * Update autogenerated PR base branch to develop in GitHub workflows * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Cleanup worktree path (#10033) * Cleanup worktree path * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration (#9959) * fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration Arrow Dictionary-encoded columns (used for enums and categorical data) are not natively supported by the DuckDB and SQLite data accelerators. This causes failures when accelerating datasets that contain enum/dictionary type columns. 
Add `normalize_dictionary_types()` to convert Dictionary fields to their underlying value types (e.g. Dictionary(Int32, Utf8) -> Utf8) in the schema before it reaches the accelerator. The existing `SchemaCastScanExec` pipeline automatically casts the record batch data to match. Fixes #2889 Fixes #2891 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Address review comments for Dictionary type normalization - Handle nested container types (List, Struct, Map, etc.) in both `normalize_dictionary_data_type` and `has_dictionary_types`, not just top-level Dictionary fields - Preserve field metadata by using `field.with_data_type()` instead of `Field::new()` which drops metadata - Gate Dictionary normalization to only DuckDB, SQLite, and Turso engines that cannot handle Dictionary encoding natively, leaving Arrow/Cayenne/ PostgreSQL unaffected - Add read-back verification to the SQLite test to match the DuckDB test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: Fix rustfmt formatting and clippy doc_markdown lints Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix clippy doc_markdown lints in DuckDB and SQLite test comments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: Add coverage for nested dictionary types and field metadata preservation Add tests verifying that: - normalize_dictionary_types handles Dictionary inside List and Struct - has_dictionary_types detects Dictionary inside nested container types - Field-level metadata is preserved (not just schema-level metadata) These tests strengthen coverage for the review feedback on PR #9959. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix rustfmt formatting in test code Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Add Union type support to dictionary normalization and improve test assertions Extend `data_type_contains_dictionary` and `normalize_dictionary_data_type` to handle `DataType::Union` variants, ensuring dictionary types nested inside unions are properly detected and normalized. Also strengthen the SQLite dictionary round-trip test assertion to check actual row counts instead of just non-emptiness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix clippy needless_collect in Union type normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Remove accidentally committed worktree reference * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * fix: Enforce target_chunk_size as hard maximum in chunking (#9973) * fix: Enforce target_chunk_size as hard maximum in chunking When chunking is enabled for embeddings, the underlying text_splitter library may produce chunks that slightly exceed the configured target_chunk_size (e.g. 513-514 tokens with a 500 target). This causes embedding failures when the model has a strict token input limit (e.g. 512 tokens). Add post-processing enforcement that re-splits any oversized chunks using binary search to find the longest prefix fitting within the target. This ensures target_chunk_size acts as a hard maximum rather than a soft target. Overlap from text_splitter is also bounded since the enforcement applies to all output chunks. 
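The post-processing step described in the chunking fix above (re-splitting oversized chunks with a binary search for the longest fitting prefix) might look roughly like the following sketch. It is not the project's implementation: `count_tokens` is a stand-in that counts whitespace-separated words instead of calling a real tokenizer, and `longest_prefix_within` and `enforce_max_tokens` are hypothetical names. The binary search assumes that the token count never decreases as a prefix grows, which holds for this stand-in but would need checking for a real tokenizer.

```rust
/// Stand-in token counter (assumption): whitespace-separated words.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Byte length of the longest prefix of `text` with at most `max_tokens` tokens.
fn longest_prefix_within(text: &str, max_tokens: usize) -> usize {
    // Candidate cut points: every char boundary plus the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    // Invariant: bounds[lo] always fits (the empty prefix has zero tokens).
    let (mut lo, mut hi) = (0usize, bounds.len() - 1);
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2; // upper midpoint so the loop makes progress
        if count_tokens(&text[..bounds[mid]]) <= max_tokens {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    bounds[lo]
}

/// Re-splits any chunk that exceeds `max_tokens`, so the limit is a hard maximum.
fn enforce_max_tokens(chunks: Vec<String>, max_tokens: usize) -> Vec<String> {
    assert!(max_tokens > 0, "max_tokens must be positive");
    let mut out = Vec::new();
    for chunk in chunks {
        let mut rest = chunk.as_str();
        while count_tokens(rest) > max_tokens {
            let cut = longest_prefix_within(rest, max_tokens);
            out.push(rest[..cut].trim().to_string());
            rest = &rest[cut..];
        }
        if !rest.trim().is_empty() {
            out.push(rest.trim().to_string());
        }
    }
    out
}
```

Under these assumptions, `enforce_max_tokens(vec!["a b c d e".to_string()], 2)` yields the chunks `"a b"`, `"c d"`, and `"e"`, and chunks already within the limit pass through unchanged.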
Fixes #3326 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix rustfmt formatting in chunking tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Enable reqwest compression and optimize HTTP client settings - Enable gzip, brotli, zstd, deflate compression features on reqwest workspace dependency so all HTTP clients negotiate compressed responses - Fix per-request client construction in spice_cloud::post_json by moving the reqwest::Client to a struct field built once in new() - Replace reqwest::get() one-off in metrics status probe with a lazily-initialized shared client with tight timeouts - Add connect/request timeouts to clients missing them: - Kubernetes secret store: 10s connect, 30s request - OTEL HTTP exporter: 10s connect, 30s request - Google GenAI: 10s connect, 300s request - Databricks SQL Warehouse: 10s connect, 30s request * Change Spice Cloud client timeout from 30min to 15min * refactor: Simplify conditional expressions for clarity in multiple files * Add connect_timeout and timeouts to CLI HTTP clients - Add connect_timeout(10s) to RuntimeContext shared client - Add connect_timeout(10s) to GitHubClient - Add connect_timeout(10s) and timeout(30s) to login flow clients * Fix lint issues from develop merge - Add missing # Errors doc sections in cayenne catalog_provider.rs - Fix collapsible_if in file_based_retention_delete_test.rs - Remove unused import in on_conflict_edge_cases_test.rs - Add missing .await on csv_run_and_verify_query calls - Fix needless borrow in schema_evolution test - Fix formatting in cayenne source files * Add connect_timeout to remaining HTTP clients - Zipkin exporter and reachability check in spiced tracing - NSQL request client in repl - Spice Cloud catalog connector - 
CloudClient (new and with_timeout) - LLM provider create_http_client - All spidapter HTTP clients * Address review: propagate client build errors instead of silent fallback - Move reqwest::Client construction from SpiceExtension::new() to initialize(), returning a structured error on failure - Store client as Option<reqwest::Client> with ClientNotInitialized error - Use LazyLock<Result<...>> for metrics client to surface build errors instead of unwrap_or_default() * Propagate client build error in repl NSQL request path * Guard setup-spiceio steps with runner.os == 'macOS' Add if: runner.os == 'macOS' guards to all setup-spiceio, setup-sccache, sccache stats, and kill spiceio steps across workflow files. This prevents failures when jobs run on non-macOS runners (e.g. spiceai-dev-runners) where the gh CLI and spiceio dependencies are unavailable. Matches the existing pattern in codeql-analysis.yml. * Guard setup-spiceio on UNAS_SMB_PASS secret availability Add secrets.UNAS_SMB_PASS != '' condition to setup-spiceio steps and use steps.setup-spiceio.outputs.endpoint != '' for downstream sccache steps. This ensures the entire spiceio/sccache chain is skipped when the secret is unavailable (e.g. fork PRs, new runner environments). * Fix cayenne add_column snapshot and retry non-JSON 403 responses - Update cayenne add_column snapshot to include the lname column, matching the duckdb and sqlite snapshots. - Treat non-JSON HTTP 403 responses as retriable in the GraphQL client. A 403 with a non-JSON body indicates a transient upstream proxy or abuse-detection block (e.g. GitHub's 'Request forbidden by administrative rules'), not a genuine credentials/permissions error (which returns valid JSON). This prevents test flakiness from temporary GitHub abuse detection. * Address review: update 403 retry test, fix formatting - Update test_json_decode_client_error_not_retriable to exclude 403 and add dedicated test_json_decode_forbidden_retriable test. 
- Fix formatting for user_agent format strings (cargo fmt). * Increase SpiceExtension client timeout to 1800s (30 min) * feat: Initial support for write-through accelerated tables (#10115) * wip: mvp write through accelerated tables * docs: Delete notes * refactor: Address comments, simplify staging append, add cayenne partitioned staging write * review: Address comments * fix: Update partition expr from table provider multi-partition-by-expr * refactor: Replace WriteThroughAcceleratedTableProvider into AcceleratedTable * chore: fmt * chore: clippy * review: address comment * Revert "fix: executor startup failures" and "When executor connects, send DDL for existing tables." (#10175) * Revert "fix: executor startup failures (#10155)" This reverts commit 1b639be2fcf90a996fe900e05ba6614eff061b29. * Revert "When executor connects, send DDL for existing tables. (#9904)" This reverts commit 7c6abaa8d43ac0cb428dd3887667aa0960d63840. * fix formatting * fix missing import * fix linter * fix linter * fix: add missing `# Errors` doc section to satisfy clippy::missing_errors_doc * fix: add missing `# Errors` doc sections in staging_wal.rs --------- Co-authored-by: Jack Eadie <jack@spice.ai> * fix: remove PARTITION BY forwarding to Cayenne executors (#10182) * dont have partition by in executors * cleanup * revert: restore partition-table-provider feature for cayenne dependency --------- Co-authored-by: jeadie <jack@spice.ai> * fix: correct dispatch test assertions and runner type typo Fix ready_wait assertions in LoadArgs deserialization tests to expect None, since the load deserializer explicitly strips ready_wait as it is unsupported by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners -> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures. 
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml * fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpch benchmark snapshots for federated/mongodb.yaml * fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml * fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml * fix: Update tpch benchmark snapshots for federated/mssql.yaml * fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml * fix: Update tpch benchmark snapshots for federated/dynamodb.yaml * fix: Update tpch benchmark snapshots for federated/oracle.yaml * fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml * fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml * fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml * fix: Update tpch benchmark snapshots for federated/file[parquet].yaml * fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml * Use balanced expression tree for partition filter combination (#10185) * fix: resolve clippy lint errors in cayenne, write_through, iceberg_ddl, and planner (#10186) * fix: resolve clippy lints in write_through, iceberg_ddl, and planner * fix: use module-level #[expect(dead_code)] in cayenne test common module * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml (#10189) * fix: correct dispatch test assertions and runner type typo Fix ready_wait assertions in LoadArgs deserialization tests to expect None, since the load deserializer explicitly strips ready_wait as it is unsupported by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners -> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures. 
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml * fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpch benchmark snapshots for federated/mongodb.yaml * fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml * fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml * fix: Update tpch benchmark snapshots for federated/mssql.yaml * fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml * fix: Update tpch benchmark snapshots for federated/dynamodb.yaml * fix: Update tpch benchmark snapshots for federated/oracle.yaml * fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml * fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml * fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml * fix: Update tpch benchmark snapshots for federated/file[parquet].yaml * fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-cayenne[file]-indexes.yaml * fix: Update tpch benchmark snapshots for accelerated/spicecloud-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-arrow-indexes.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/dynamodb-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/mongodb-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/dynamodb-duckdb[file].yaml * fix: Update tpch benchmark snapshots for 
accelerated/file[parquet]-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file]turso.yaml * fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/mysql-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-duckdb[file]-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/postgres-arrow.yaml * fix: Update tpcds benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/file[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/databricks[delta_lake].yaml * fix: Update tpcds benchmark snapshots for accelerated/spicecloud-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/databricks[delta_lake]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-cayenne[file].yaml * fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml * fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots 
for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots for accelerated/postgres-arrow.yaml * Trigger CI * Update workspace configuration in Cargo.toml --------- Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> * feat: Add CREATE TABLE ... (LIKE ...) support (#10180) * feat: Add CREATE TABLE ... (LIKE ...) support * feat: Support struct() IN-list filters in Cayenne position-based delete (#10191) * feat: Support struct() IN-list filters in Cayenne position-based delete Decompose struct(k1, k2) IN (SET) expressions into balanced OR-trees of AND-equalities for Vortex pushdown. DataFusion converts tuple IN-lists (k1, k2) IN ((v1,w1), ...) to struct() IN-list which Vortex cannot handle. - Add try_decompose_struct_in_list() to position_based.rs - Handle CAST-wrapped struct() expressions (type coercion) - Cast literal values to match column types (e.g., Int64 to Int32) - Build balanced binary OR-tree for O(log N) depth vs O(N) linear chain - Add balanced_or_exprs() for MERGE delete filters (same stack overflow fix) * fix: Parse partition expressions to extract column references for MERGE validation (#10192) * fix: Parse partition expressions to extract column references for MERGE validation Restore strict MERGE primary_key/on_conflict validation by parsing partition expressions (e.g., bucket(5, c_nationkey)) with sqlparser AST Visitor to extract referenced column names, instead of requiring exact string matches on the partition column. 
- Add extract_partition_column_references() using sqlparser Visitor - Handle simple columns, compound identifiers, and transform expressions - Add unit tests for various partition expression formats * feat: Spidapter staging table support and SQL execution extraction * fix(BigQuery): fix Unsupported subqueries in JOIN ON predicates (TPC-H) (#10195) * fix: clippy auto-fixes from merge with develop --------- Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: claudespice <claude@spice.ai> Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation (#10229) * feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation Databricks Lakehouse Federation foreign tables (e.g. Neon PostgreSQL) have two issues with schema introspection: 1. information_schema.columns returns no rows (data_array is null) because the table has no column metadata registered in Unity Catalog. 2. When data_type column IS populated, it returns source-native type names (integer, text, numeric, timestamp without time zone) instead of Spark SQL types. 
Changes: - Add DESCRIBE TABLE as third-tier schema introspection fallback after information_schema.columns full_data_type and data_type attempts fail - Catch ParseError from information_schema path to also fall through to DESCRIBE TABLE for unrecognized native types - Add source-native type parsing: integer, text, numeric, real, double precision, character varying, varchar, timestamp with/without time zone - Extract get_schema_from_information_schema helper method - Add schema_from_describe_json parser for DESCRIBE TABLE responses (defaults all columns to nullable since DESCRIBE lacks nullability info) - Add create_describe_payload with backtick-quoted identifiers - Add comprehensive tests for all fallback paths and native types * fix: address PR review comments - Validate multi-word timestamp types (time/zone tokens) instead of blind advance() calls; add expect_identifier helper - Validate DECIMAL/NUMERIC precision <= 38 and scale <= precision - Validate varchar/character varying length is a Number token - Remove ParseError -> DESCRIBE TABLE fallback (DESCRIBE TABLE returns Spark types, so a ParseError from information_schema indicates a real problem that should surface as an error) * feat: Add pagination support to HTTP data connector (#10228) * feat: Add pagination support to HTTP data connector Add configurable pagination for HTTP API endpoints, supporting two modes: - URL mode: next page URL from response body (JSON pointer) or HTTP Link header with rel="next" - Token mode: cursor/token extracted via JSON pointer and passed as a query parameter in subsequent requests Configuration parameters: - pagination: enabled/disabled - pagination_next_pointer: JSON pointer to next URL/cursor - pagination_link_header: use Link header for pagination - pagination_token_param: query param name for cursor tokens - pagination_data_pointer: JSON pointer to data array per page - pagination_max_pages: safety limit (default: 100) Key design decisions: - Streaming execution via 
futures::stream::try_unfold yields one RecordBatch per page, avoiding buffering entire result sets
  - SSRF protection validates next-page URLs share the base URL origin
  - Works transparently with caching, append, and full (with refresh_sql) acceleration modes
  - Refactored batch creation into create_batch_from_rows for reuse between paginated and non-paginated paths

  Includes 18 unit tests covering Link header parsing, JSON pointer extraction, SSRF rejection, data pointer extraction, token/query building, config validation, and edge cases (null, empty, missing).

* fix: address PR review feedback for HTTP pagination
  - Return error instead of silently truncating when max_pages is reached
  - Fall through from missing JSON pointer to Link header check
  - Support relative URLs for next-page links via base_url.join()
  - Use url::form_urlencoded for proper query parameter encoding
  - Don't stop pagination on empty data rows if next page exists
  - Fix clippy and formatting issues

* fix: fail loudly on pagination JSON/pointer errors
  - Error on invalid JSON when pagination_next_pointer is configured
  - Error on non-string/non-null pointer values instead of silently ending
  - Error on missing/invalid data pointer instead of returning empty rows
  - Validate JSON Pointer syntax (must start with '/') in with_pagination
  - Add tests for all new error cases

* fix: address pagination review feedback (round 3)
  - Preserve base URL query params in token pagination via merge_queries()
  - Track actual per-page path/query for accurate request_* columns
  - Broaden Link header parsing: handle unquoted rel=next and multi-value rel
  - Avoid intermediate Vec allocation for content column (from_iter_values)
  - Add comment noting cache bypass for subsequent pages is intentional
  - Add tests for merge_queries, unquoted rel, and multi-rel Link headers

* feat: change pagination to auto/enabled/disabled mode

  Default is 'auto', which auto-detects Link headers on every HTTP response without requiring explicit config. 'enabled' requires explicit pagination config. 'disabled' turns off pagination. In auto mode with no other pagination params configured, only Link header detection is active. This means pagination 'just works' for APIs that use standard Link headers (GitHub, etc.).

* fix: address pagination review feedback (round 4)
  - has_dynamic_api_params only true when pagination explicitly configured or pagination params are set (not in auto-detect-only mode)
  - Loop internally to skip empty pages instead of yielding empty batches
  - Fix max_pages error message: 'query was aborted' not 'may be incomplete'
  - Parse response JSON once per page, reuse for both next-page and data extraction (avoids duplicate deserialization on large responses)

* fix: return data instead of error when max_pages reached
* fix: merge base URL query for page 0, validate pagination paths, treat auto/link/max_pages as dynamic
* fix: use top-level splitting for Link header parsing (RFC 8288 compliance)
* test: add comprehensive tests for Link header parsing functionality
* refactor: simplify query merging logic and enhance pagination handling
* fix: correct logic for determining link header usage in pagination
* Update snapshot

---------
Co-authored-by: Viktor Yershov <viktor@spice.ai>

* fix(databricks): harden HTTP retries, compression, and token refresh (#10232)
* fix(databricks): harden HTTP retries and encodings
* feat(databricks): implement response body draining for retry logic in HTTP requests
* refactor(databricks): replace write! macro with write_fmt for header formatting
* fix(databricks): clamp short-lived token refreshes
* fix(databricks): include nested HTTP error causes
* fix(databricks): implement bounded retry delay to clamp maximum backoff duration
* refactor: deduplicate HTTP retry logic into shared resilient_http module
  - Extract send_request_with_retry from databricks token provider into data_components::resilient_http (make pub)
  - Remove ~210 lines of duplicated retry logic, backoff, Retry-After parsing, and response body draining from databricks.rs
  - Truncate AWS APN app_name to 50 chars to avoid SDK warning
  - Add length assertion to app_name test
  - Remove unused httpdat…
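
For readers unfamiliar with the Link header handling that the pagination commits above describe (top-level splitting, unquoted rel=next, multi-value rel, and the same-origin check on next-page URLs), here is a minimal std-only sketch. The function names split_top_level, next_link, origin, and same_origin are hypothetical and are not taken from the Spice codebase; the commit text suggests the real connector works with the url crate (base_url.join(), url::form_urlencoded), which this sketch deliberately avoids so it stays self-contained. Backslash escapes inside quoted strings, relative URLs, and default-port normalization are not handled.

```rust
/// Split `input` on `sep`, ignoring separators inside <...> or "...".
fn split_top_level(input: &str, sep: char) -> Vec<&str> {
    let mut parts = Vec::new();
    let (mut start, mut in_angle, mut in_quote) = (0, false, false);
    for (i, c) in input.char_indices() {
        match c {
            '"' if !in_angle => in_quote = !in_quote,
            '<' if !in_quote => in_angle = true,
            '>' if !in_quote => in_angle = false,
            c if c == sep && !in_angle && !in_quote => {
                parts.push(&input[start..i]);
                start = i + c.len_utf8();
            }
            _ => {}
        }
    }
    parts.push(&input[start..]);
    parts
}

/// Return the target of the first link whose `rel` value contains `next`.
/// Accepts quoted (`rel="next"`), unquoted (`rel=next`), and
/// multi-value (`rel="prev next"`) forms.
fn next_link(header: &str) -> Option<String> {
    for link in split_top_level(header, ',') {
        let mut params = split_top_level(link, ';').into_iter();
        let target = params.next()?.trim();
        let Some(url) = target.strip_prefix('<').and_then(|t| t.strip_suffix('>')) else {
            continue;
        };
        for param in params {
            let Some((name, value)) = param.split_once('=') else {
                continue;
            };
            let is_rel = name.trim().eq_ignore_ascii_case("rel");
            let has_next = value
                .trim()
                .trim_matches('"')
                .split_ascii_whitespace()
                .any(|r| r.eq_ignore_ascii_case("next"));
            if is_rel && has_next {
                return Some(url.to_string());
            }
        }
    }
    None
}

/// Extract (scheme, authority) from an absolute URL, lowercased.
/// Simplification: userinfo and explicit ports stay part of the authority,
/// so any difference there is treated as a different origin.
fn origin(url: &str) -> Option<(String, String)> {
    let (scheme, rest) = url.split_once("://")?;
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let authority = &rest[..end];
    if scheme.is_empty() || authority.is_empty() {
        return None;
    }
    Some((scheme.to_ascii_lowercase(), authority.to_ascii_lowercase()))
}

/// True only when both URLs parse and share scheme and authority.
fn same_origin(base: &str, candidate: &str) -> bool {
    match (origin(base), origin(candidate)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    let base = "https://api.example.com/items";
    let header = r#"<https://api.example.com/items?page=2>; rel="next", <https://api.example.com/items?page=9>; rel="last""#;
    match next_link(header) {
        Some(url) if same_origin(base, &url) => println!("next page: {url}"),
        Some(url) => println!("rejected cross-origin next link: {url}"),
        None => println!("no next page"),
    }
}
```

Splitting at the top level matters because a URL inside the angle brackets may itself contain commas or semicolons; a plain split on ',' would cut such a target in two.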

Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
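
The Databricks commits above mention Retry-After parsing and a bounded retry delay that clamps the maximum backoff. As an illustration only, the idea can be sketched as follows. The function retry_delay and the constants BASE_DELAY and MAX_DELAY are hypothetical and their values are assumptions; this is not the code in data_components::resilient_http. Only the delta-seconds form of Retry-After is parsed, and jitter is omitted.

```rust
use std::time::Duration;

// Hypothetical values chosen for illustration; not taken from the PR.
const BASE_DELAY: Duration = Duration::from_millis(200);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Compute the wait before retry number `attempt` (0-based).
/// A numeric Retry-After hint wins over exponential backoff,
/// and either value is clamped to MAX_DELAY.
fn retry_delay(attempt: u32, retry_after: Option<&str>) -> Duration {
    // Delta-seconds form only; an HTTP-date value falls back to backoff.
    let hinted = retry_after
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs);

    // BASE_DELAY * 2^attempt, saturating instead of overflowing.
    let backoff = BASE_DELAY.saturating_mul(2u32.saturating_pow(attempt));

    hinted.unwrap_or(backoff).min(MAX_DELAY)
}

fn main() {
    for attempt in 0..10 {
        println!("attempt {attempt}: wait {:?}", retry_delay(attempt, None));
    }
    println!("server hint: wait {:?}", retry_delay(0, Some("3600")));
}
```

Clamping the server-provided hint as well as the computed backoff keeps a single large Retry-After value from stalling a caller for an unbounded time.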

* Release notes for v2.0.0-rc.3 (#10377)
  * Release notes for v2.0.0-rc.3
  * add refresh token config example
  * Update release notes for v2.0.0-rc.3 with improved descriptions and links for key features
  * fix http connector link

  ---------
  Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>

* Bump version to 2.0.0-rc.3

---------
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>

ewgenius and others added 7 commits April 17, 2026 20:26

fix: Update Search integration test snapshots (#10376)

df64891

Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>

pull Bot locked and limited conversation to collaborators Apr 18, 2026

pull Bot added the ⤵️ pull label Apr 18, 2026

pull Bot merged commit ccf11a1 into TheRakeshPurohit:trunk Apr 18, 2026
2 of 14 checks passed

pull[bot] merged 7 commits into TheRakeshPurohit:trunk from spiceai:trunk
