[pull] trunk from spiceai:trunk#750
Merged
Merged
Conversation
* Release notes for v2.0.0-rc.3 * add refresh token config example * Update release notes for v2.0.0-rc.3 with improved descriptions and links for key features * fix http connector link --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Use spiceai-macos for PR Lint * Remove Makefile Targets workflow and integrate make install steps into PR workflow * Remove make install-odbc step from build-test (we don't build with ODBC) * fix: override RUST_PROFILE for install targets in CI workflow (#10366)
* ci: skip artifact compression for test binaries/archives The test archives are already zstd-compressed and test binaries are native executables, so running actions/upload-artifact's zip compression on them adds CI time without meaningful size savings. Set compression-level: 0 and bump retention to 3 days. * ci: remove outdated Pittsburgh mirror from apt configuration
…ivy, rand (#10379) * chore(deps): update candle dependencies to latest revisions * chore(deps): bump aws-lc-rs 1.15.4 -> 1.16.3 (aws-lc-sys 0.37.1 -> 0.40.0) * chore(deps): bump spiceai/mistral.rs to 27405ba1 * chore(deps): update tantivy to version 0.26.0 and downgrade windows-sys to 0.60.2 * chore(deps): update rand and windows-sys dependencies to latest versions * chore(deps): drop unused OpenSSL license allowance after aws-lc-sys 0.40 bump * chore(deps): update rand usage to RngExt across multiple modules * Update crates/cache/Cargo.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * chore(deps): update rand usage to RngExt across multiple modules * chore(deps): replace gen_range with random_range for improved randomness in benchmarks --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* build(deps): bump tantivy from 0.25.0 to 0.26.0
Bumps [tantivy](https://github.com/quickwit-oss/tantivy) from 0.25.0 to 0.26.0.
- [Release notes](https://github.com/quickwit-oss/tantivy/releases)
- [Changelog](https://github.com/quickwit-oss/tantivy/blob/main/CHANGELOG.md)
- [Commits](https://github.com/quickwit-oss/tantivy/compare/0.25.0...0.26.0)
---
updated-dependencies:
- dependency-name: tantivy
dependency-version: 0.26.0
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* chore(deps): update lz4_flex and rustls-webpki to latest compatible versions
* chore(deps): update lz4_flex and rustls-webpki (#10378)
* Update candle and mistral.rs lock-step pins (#10278)
* Update candle and mistral.rs lock-step pins
* Update dependencies to use new git revisions for candle packages
* Add provenance comments to candle and mistralrs git dependency pins
* Bump mistral.rs and text-embeddings-inference pins
- mistral.rs: ac7063cd -> 9b4758762d6ebed08a42af7211c616ebc512c557
- text-embeddings-inference: 58b44fbb -> 88b7a84a2c2ad83707555183f8f18dd201897f12
Adapt to download_safetensors now taking Arc<ApiRepo>.
* docs: fix status badges in README (#10350)
* docs: fix status badges in README
- Use markdown badge syntax for all badges.
- Fix wrong workflow references:
- spiced_docker_nightly.yml -> spiced_docker_dev.yml
- build_and_release.yml -> build_nightly.yml
- benchmarks.yml -> testoperator_run_bench.yml
- Wrap badges in centered div blocks so GitHub renders markdown inside.
* docs: point unit tests badge to build_and_release on trunk
* docs: align CodeQL badge link filter with image filter
* Migrate ecrets to envs (#10354)
* Add limit pushdown and improve sort pushdown for Oracle and MSSQL (#10351)
* Implement sort pushdown support and fix pushdown gaps across providers
Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper
execution plans (CayenneAccelerationExec, SchemaCastScanExec,
BytesProcessedExec) by delegating to their child plans, and for SQL
providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses.
Also fix limit pushdown consistency in wrappers (delegate
supports_limit_pushdown/with_fetch/fetch to child plans instead of
returning mismatched values), and extend MSSQL filter pushdown to
support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and
Between expressions.
* fix: enhance sort pushdown error handling and improve filter classification logic
* Address PR review comments: improve sort pushdown correctness
- FlightSQL: Replace filter_map with fallible map in sql() ORDER BY
generation to return an error instead of silently dropping non-Column
sort expressions. Add InvalidSortExpression error variant.
- MSSQL: Make classify_mssql_filter recursively check time-related
expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to
prevent time-related filters from being pushed down via compound exprs.
- SchemaCastScanExec: Propagate input ordering through equivalence
properties and set maintains_input_order to true, since schema casting
preserves row order.
- FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported
for non-column, exact for column) and sql() ORDER BY clause generation.
* Remove unsafe ordering propagation from SchemaCastScanExec
Do not copy input ordering into EquivalenceProperties since schema
casting can change data types and projected columns, making the
input ordering invalid for the output schema. Retain
maintains_input_order=true since row order is preserved.
* Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space
- Restore ordering propagation in SchemaCastScanExec::new() that was
incorrectly removed, fixing SortPreservingMergeExec invariant
violations in partition integration tests.
- Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL
execution plans when order_expr is empty. Build SQL incrementally,
appending clauses only when non-empty.
- Update oracle test-framework snapshots to match corrected SQL output.
* Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341)
* Refactor BytesProcessedExec to simplify fetch and pushdown sort methods
* Fix schema_cast ordering: remap column indices by name and add tests
- Remap sort expression column indices from input to output schema by
name, since SchemaCastScanExec may reorder columns relative to input
- Only propagate ordering when ordered columns have identical types
- Add 3 unit tests: ordering propagated (same types), not propagated
(type differs), and indices remapped (reordered columns)
- Add branch comment to datafusion-federation git dependency
* Update refresh_max_timestamp_df plan snapshot
* Update cluster::distributed_cayenne_catalog snapshots
* Update duckdb_json_functions snapshots
* Update datafusion version
* Update datafusion version
* Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e
* Revert "Update refresh_max_timestamp_df plan snapshot"
This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe.
* Update distributed_acceleration snapshot
* Add limit pushdown and improve sort pushdown for Oracle and MSSQL
* Fix Exact->Inexact
* Revert "Fix Exact->Inexact"
This reverts commit f423db9007ea20de0c55eee3f0a74af465998371.
---------
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Co-authored-by: Jack Eadie <jack@spice.ai>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
* Fix ubuntu mirror configuration (#10359)
* Add step to verify apt mirror configuration in GitHub Action
* Fix apt mirror substitution to match only archive.ubuntu.com
Improve awk regex to avoid greedy matching and add error check to fail
if archive.ubuntu.com remains after substitution.
* Improve apt mirror substitution check for Pittsburgh mirror
The script now verifies that the Pittsburgh mirror is present in
ubuntu.sources after substitution, rather than checking for the absence
of archive.ubuntu.com, which is intentionally retained as a fallback.
This avoids false negatives and ensures the mirror substitution is
effective.
* Simplify deb822 mirror substitution using sed for archive URIs
* Update apt mirror check to use PRIMARY variable
Check for the configured primary mirror in ubuntu.sources using the
PRIMARY
variable instead of a hardcoded hostname. Update error message to
include the
actual PRIMARY value for clarity.
* fix: Increase throughput test default ready_wait from 30s to 300s (fixes #8207) (#10344)
The throughput workflow's `ready_wait` input defaulted to 30 seconds,
which is insufficient for tests loading data from external sources like
MongoDB. The dispatch configs specify adequate timeouts (e.g. 600s for
mongodb-arrow), but manual workflow triggers via the GitHub UI used
the low default, causing "Spiced instance not ready within 30s" failures.
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
* Add auth headers support to OTEL metrics exporter (#10347)
* Add auth headers support to OTEL metrics exporter
Add a 'headers' field to OtelExporterConfig for sending authentication
headers with OTEL metric export requests. Headers are applied as HTTP
headers for HTTP protocol or gRPC metadata entries for gRPC protocol.
Header values support secret replacement syntax (e.g. ${secrets:api_key}).
This enables authentication with services like Datadog (DD-API-KEY header)
and Grafana Cloud (Authorization header).
* Add default value for headers in Spicepod schema and enhance gRPC exporter error handling
* Refactor assertions in HTTP and gRPC exporter tests for improved readability
* Fix gRPC exporter tests to use tokio runtime
* Address review: rename shadowed vars, fix test runtime setup
* Address review: pass owned headers, document gRPC key constraint, add tokio runtime to test
* Add Clippy expectation for implicit hasher in create_otel_periodic_reader
* Address review: document resolved_headers vs config.headers; note gRPC key constraint in schema
* Fix YAML string formatting in test for OTEL exporter headers
* fix linter warnings
---------
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
Co-authored-by: ewgenius <hey@ewgenius.me>
* fix(github): shrink GraphQL page size on gateway errors; lower comment defaults (#10355)
* fix(github): shrink GraphQL page size on gateway errors; lower comment defaults
- Lower default `github_max_comments_fetched` from 75 to 25 to reduce
worst-case node count per page and keep queries within GitHub's secondary
rate limit budget. Cap remains 75.
- Reduce PR outer `first:` page size from 100 to 25 when
`include_comments` is enabled (review/discussion/all). Without
comments the page stays at 100.
- Reduce inner `comments(first: ...)` in issues query from 100 to 25 to
match.
- Add `PullRequestTableArgs::check_node_limit()` that estimates per-page
node count and rejects configurations that would exceed GitHub's
500,000 node hard limit with an actionable error. Invoked eagerly from
`read_provider` so misconfigurations fail fast rather than at query
time with an opaque 502.
- Graphql client: on 502/503/504 gateway errors, shrink the outer
`first:` page size via a reverse-Fibonacci ladder
(100,55,34,21,13,8,5,3,2,1) and rewrite the query AST on retry. This
lets very large queries against overloaded GitHub endpoints succeed on
a subsequent attempt with a smaller payload instead of replaying the
same oversized query.
Fixes the `spicehq_spiceai.pulls` 502 Bad Gateway errors observed with
`include_comments: all` against large repositories.
* fix(github): improve pagination handling and node count estimation for pull requests
* fix(graphql): improve error handling for page size override locking
* fix(graphql): preserve LIMIT 0 semantics, only clamp page_size_override
---------
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
* Relax apt mirror substitution failure to warning in CI action (#10361)
* Relax apt mirror substitution failure to warning in CI action
* Update .github/actions/configure-apt/action.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat(http): Add OAuth2 refresh-token auth to HTTP connector (#10348)
* feat(http): Add OAuth2 refresh-token auth to HTTP connector
Adds RFC 6749 §6 refresh-token grant support to the HTTP connector. When
configured, the connector exchanges the refresh token at startup, stamps
`Authorization: Bearer <access_token>` on every data request, and refreshes
in the background before the token expires. Rotated refresh tokens are
honored across the lifetime of the process.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Address review comments on OAuth2 refresh-token auth
- Sanitize & cap token endpoint error body (collapse whitespace, 512 byte cap)
- Reject non-Bearer `token_type` from the token endpoint
- Store preformatted, sensitive `HeaderValue` in the watch channel so every
data request is a cheap header clone instead of a new format!() allocation
- React to shutdown immediately via `tokio::select!` on `tx.closed()` instead
of waiting the full sleep interval
- Add `refresh_loop_uses_rotated_refresh_token` test that drives the live
background loop and asserts the rotated refresh token is used on the
second exchange
- Use `Parameters::user_param(...)` so configuration errors surface the
prefixed user-facing keys (`http_auth_refresh_token`, etc.) instead of
internal names
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Further review feedback on OAuth2 refresh-token auth
- Reject blank/whitespace-only http_auth_refresh_token at config time,
before any network round-trip to the token endpoint
- Classify RefreshTokenAuth errors: InvalidTokenUrl / InsecureTokenUrl /
UnsupportedTokenType and 4xx responses from the token endpoint surface as
InvalidConfiguration so users see configuration guidance instead of
"Cannot connect …"; network / 5xx / parse errors stay as
UnableToConnectInternal
- Inline Parameters::user_param() directly into format!() calls and rename
the Option<SecretString> local to client_credential so CodeQL's
cleartext-logging heuristic stops flagging parameter-*name* strings as
sensitive
- OauthServerState in the integration test now uses tokio::sync::Mutex so
the axum handlers do not block a Tokio worker thread
- Mock OAuth server no longer swallows serve errors via .unwrap_or_default
— .expect() surfaces failures directly
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* style(http): cargo fmt + switch test URL to https://example.invalid
- Apply rustfmt to auth.rs / https.rs / tests/http/mod.rs to clear the Rust
Lint CI failure (cargo fmt --check diffs)
- Swap the non-sent test URL in refresh_token_auth_applies_bearer_header to
https://example.invalid so CodeQL's rust/non-https-url rule stops flagging
a hardcoded http:// in test code. The request is only passed through
`.build()` to inspect the Authorization header, never sent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Address clippy pedantic warnings in auth.rs
- Wrap `OAuth2` in backticks in three doc-comments to satisfy
clippy::doc_markdown
- `sanitize_error_body` now takes `&str` instead of `String` (removes
clippy::needless_pass_by_value); updated call sites
- `MockResponse::{ok, status}` take `&serde_json::Value` (same reason);
updated callers in the test module
- Replace two `if … { panic!(...) }` blocks inside loop bodies with
`assert!(… < deadline, …)` to satisfy clippy::panic_in_result_fn /
clippy::manual_assert
Lint-only changes; no behavior change. Runtime tests and data_components
auth tests continue to pass (9/9 and 10/10 respectively).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): More Copilot-reviewer nits on OAuth2 refresh-token auth
- ParameterSpec descriptions now cross-reference the correct user-facing
key names: `auth_token_url` points at `http_auth_refresh_token`, and the
`http_auth_client_id` / `http_auth_client_secret` descriptions mention
each other explicitly
- `map_auth_error` narrows `InvalidConfiguration` mapping from "any 4xx" to
`matches!(status, 400 | 401 | 403)`. Transient 4xx (408 Request Timeout,
429 Too Many Requests, etc.) now route to `UnableToConnectInternal`,
consistent with the doc-comment
- Test doc-comments in `tests/http/mod.rs` updated to name the exact user-
facing keys (`http_auth_refresh_token` + `auth_token_url`)
- Clarified the whitespace-sanitize comment in `sanitize_error_body`:
whitespace characters are *replaced* with a regular space (runs become
runs of spaces); the comment no longer implies run-collapse
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Bounded streaming read of token endpoint error body + test cleanup
- Replace unbounded `response.text().await` on the error path with a new
`read_bounded_error_body` that streams via `Response::chunk()` and stops
after `MAX_ERROR_BODY_BYTES * 2` raw bytes. A hostile token endpoint can
no longer force us to buffer a large body just so we can surface the
first 512 bytes of it as a diagnostic.
- Rework `refresh_loop_uses_rotated_refresh_token` to use `tokio::time::
Instant` (instead of `std::time::Instant`) for its 5s deadline poll,
keeping real time so reqwest's `connect_timeout` doesn't race with paused
virtual time. Added a comment explaining why `start_paused = true` is
unsafe here (verified locally: 5/5 runs at 1.06s each, no flakes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix clippy lints in https connector
- Backtick OAuth2 in doc comment (clippy::doc_markdown).
- Collapse redundant match-guard into pattern literals
(clippy::redundant_guards).
* fix(http): Address more reviewer nits on OAuth2 auth module
- `current_bearer_value` is now `#[cfg(test)] pub(crate) fn` so downstream
callers in production code paths cannot accidentally surface the live
bearer token via this API
- Rename `sanitize_error_body_collapses_whitespace_and_truncates` to
`…_replaces_whitespace_and_truncates` — the implementation *replaces*
whitespace characters with spaces (runs preserved as runs of spaces),
which the old test name misrepresented as "collapses"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Enforce hard cap on sanitized error body size
`sanitize_error_body` previously filled `out` up to `MAX_ERROR_BODY_BYTES`
and then unconditionally appended `…<truncated>`, so the returned string
could exceed the documented 512-byte cap by the marker's length.
Introduce `TRUNCATION_MARKER` and a `CONTENT_BUDGET` constant that reserves
room for the marker, so the final returned string (content + marker) is
always ≤ `MAX_ERROR_BODY_BYTES`. Add a `debug_assert!` on the post-condition
and tighten the unit test to check `out.len()` directly rather than just
the content portion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): merge identical match arms in read_bounded_error_body
Clippy's `match_same_arms` lint (enabled via `-Dclippy::pedantic` in CI)
flagged the `Ok(None)` and `Err(_)` arms both returning `break`. Merged
them into a single `Ok(None) | Err(_)` arm with a comment explaining that
end-of-body and mid-stream transport errors are both terminal for a
best-effort error diagnostic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(http): Improve error message for insecure token URL and add custom headers access method
* fix(https): Refactor conditional check for Authorization custom header
* fix(http): Collapse identical match arms in map_auth_error
clippy::match_same_arms fired after introducing the
TokenEndpointStatus { status: 400 | 401 | 403, .. } arm that maps to
the same InvalidConfiguration variant as the URL-validation arms.
Merge the patterns into a single arm.
* Update crates/runtime/src/dataconnector/https.rs
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
* Replace .is_err() and .is_ok() in tests with expect
* fix clippy - add backticks
* refactor(http): Move bounded error body helpers into resilient_http
Address phillipleblanc review feedback: promote read_bounded_error_body
and sanitize_error_body from auth.rs into resilient_http as public
helpers so other HTTP-based data connectors can reuse them. Generalize
the cap to a parameter and expose TRUNCATED_BODY_MARKER as part of the
public API.
* fix(http): Small-cap sanitize_error_body + drop case-sensitive one_of
- `sanitize_error_body` previously returned an empty string when
`max_bytes < TRUNCATED_BODY_MARKER.len()` because the content budget
underflowed to 0. It now uses the full `max_bytes` as content budget
without a marker in that case, matching the documented upper bound.
New `sanitize_error_body_small_cap_fills_content_without_marker` test.
- Removed `.one_of(&["basic", "body"])` from the `auth_client_auth`
ParameterSpec: `Parameters::try_new` enforces `one_of` with exact
string matching, which rejected `BASIC` / `BODY` before
`ClientAuthMethod::parse` (case-insensitive) could accept them. Parser
is the single source of truth for validation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
Co-authored-by: ewgenius <hey@ewgenius.me>
* Upgrade Rust toolchain to 1.94.1 (#10353)
* Upgrade Rust toolchain to 1.94.1
Bump workspace rust-version, rust-toolchain.toml channel, Dockerfiles,
and workflow rustup installs from 1.93.1 to 1.94.1. Updates the
agent/copilot instructions baseline accordingly.
* Add #[expect(clippy::result_large_err)] annotations to multiple TryFrom implementations and update println! to use saturating_sub
* Handle order by and sort in PartitionedTableScanRewrite (#9656)
* Handle order by and sort in PartitionedTableScanRewrite
* formatting
* fix(ci): add merge_group trigger to pr-develop and scope trunk merge_group to target branches
* PR comments
* clean
* fix projection, change order of partition rule
* PR comments
* clippy
* fix distributed acceleration testing
* fix integ tests
* redact flight
* some better snapshots
* federation-revert
* update federation
* fix test
* linting
* remove comment
* fix UNION ALL
* improvements
* support filter predicates in topK pushdown
* fix ownership
* clippy
* Upgrade datafusion-table-providers to d1b911a5 and bump adbc to 0.23
* table provider and federation update
* snapshots
* comment
* snapshots
* snapshots
* FlightSQLTable cannot federate Extension LogicalPlans
* test updates
* fix join
* preserve table alias
* snapshots
* fix test
* snapshots
* dependencies
* snapshots
* update table-provider
* don't update table-providers and federation
* snapshots
* flightsql doesn't need federation; but physical optimisation to pushdown
* Fix clippy
* Executor: always clear partition_by setting at start
* Fix cluster::distributed_acceleration tests
* Update cluster::distributed_cayenne tests snapshots
* Fix doc comments
* Add test_distributed_acceleration_order_by_limit_pushdown integration test
* Fix lint
---------
Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
* Fix OTEL Exporter (#10363)
* Fix OTLP Exporter
* Fix
* Fix formatting in documentation comments
* Pin spiceai candle / TEI forks to merged revs; drop local [patch] overrides (#10362)
* Update candle and mistral.rs lock-step pins
* Update dependencies to use new git revisions for candle packages
* Add provenance comments to candle and mistralrs git dependency pins
* Bump mistral.rs and text-embeddings-inference pins
- mistral.rs: ac7063cd -> 9b4758762d6ebed08a42af7211c616ebc512c557
- text-embeddings-inference: 58b44fbb -> 88b7a84a2c2ad83707555183f8f18dd201897f12
Adapt to download_safetensors now taking Arc<ApiRepo>.
* Remove unnecessary Clippy expectations from tests
* Pin spiceai candle/TEI forks to merged revs; drop local [patch] overrides
All six fork PRs have merged; bump each git rev to the merge SHA on the
fork's `spiceai` branch, and drop the temporary /tmp/... [patch.*]
tables that were used while those fixes were in flight.
- spiceai/candle-cublaslt -> b74d30e0 (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle-layer-norm -> 62f936a1 (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle-rotary -> a4c4efcd (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle -> c87b9bc5 (run_mha FFI softcap dedup)
- spiceai/text-embeddings-inference -> b958dca5 (compute_cap -> cudarc 0.19;
bumps sibling crate pins to the revs above)
- candle-index-select-cu (crates.io -> spiceai fork) -> 397d7338 (fallback
shim for candle 0.10.1 / cudarc 0.19; patched via [patch.crates-io])
Verified with `cargo check --release --features cuda` on CUDA 12.6 /
CUDA_COMPUTE_CAP=90: clean finish in 10m52s.
* Update Cargo.toml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Implement sort pushdown and fix pushdown gaps across providers (#10337)
* Implement sort pushdown support and fix pushdown gaps across providers
Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper
execution plans (CayenneAccelerationExec, SchemaCastScanExec,
BytesProcessedExec) by delegating to their child plans, and for SQL
providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses.
Also fix limit pushdown consistency in wrappers (delegate
supports_limit_pushdown/with_fetch/fetch to child plans instead of
returning mismatched values), and extend MSSQL filter pushdown to
support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and
Between expressions.
* fix: enhance sort pushdown error handling and improve filter classification logic
* Address PR review comments: improve sort pushdown correctness
- FlightSQL: Replace filter_map with fallible map in sql() ORDER BY
generation to return an error instead of silently dropping non-Column
sort expressions. Add InvalidSortExpression error variant.
- MSSQL: Make classify_mssql_filter recursively check time-related
expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to
prevent time-related filters from being pushed down via compound exprs.
- SchemaCastScanExec: Propagate input ordering through equivalence
properties and set maintains_input_order to true, since schema casting
preserves row order.
- FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported
for non-column, exact for column) and sql() ORDER BY clause generation.
* Remove unsafe ordering propagation from SchemaCastScanExec
Do not copy input ordering into EquivalenceProperties since schema
casting can change data types and projected columns, making the
input ordering invalid for the output schema. Retain
maintains_input_order=true since row order is preserved.
* Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space
- Restore ordering propagation in SchemaCastScanExec::new() that was
incorrectly removed, fixing SortPreservingMergeExec invariant
violations in partition integration tests.
- Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL
execution plans when order_expr is empty. Build SQL incrementally,
appending clauses only when non-empty.
- Update oracle test-framework snapshots to match corrected SQL output.
* Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341)
* Refactor BytesProcessedExec to simplify fetch and pushdown sort methods
* Fix schema_cast ordering: remap column indices by name and add tests
- Remap sort expression column indices from input to output schema by
name, since SchemaCastScanExec may reorder columns relative to input
- Only propagate ordering when ordered columns have identical types
- Add 3 unit tests: ordering propagated (same types), not propagated
(type differs), and indices remapped (reordered columns)
- Add branch comment to datafusion-federation git dependency
* Update refresh_max_timestamp_df plan snapshot
* Update cluster::distributed_cayenne_catalog snapshots
* Update duckdb_json_functions snapshots
* Update datafusion version
* Update datafusion version
* Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e
* Revert "Update refresh_max_timestamp_df plan snapshot"
This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe.
* Update distributed_acceleration snapshot
---------
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
Co-authored-by: Jack Eadie <jack@spice.ai>
* Merge develop to trunk (2026-04-16) (#10345)
* fix: Update test snapshots (#10219)
Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai>
* fix: Update Search integration test snapshots (#10240)
* fix: Place search index filters below pre_limit for pushdown (fixes #10149)
SearchQueryProvider::scan() was adding the pre_limit to the logical plan
BEFORE adding filters. Since DataFusion's PushDownFilter optimizer cannot
push filters past a Limit node, filters never reached the underlying
search index table provider (e.g. S3VectorsQueryExec).
This caused both worse performance (server-side filtering not used) and
incorrect results (top-K-then-filter instead of top-K-of-filtered-set).
The fix restructures the plan building to add filters BEFORE the
pre_limit, allowing DataFusion to push them through the SubqueryAlias
into the inner TableScan.
* style: cargo fmt
* fix: Make filter pushdown test assertions more robust
Use `filters.iter().any(...)` instead of `filters[0]` to assert that at
least one recorded scan call contains the expected pushed-down predicate.
This avoids potential flakiness if DataFusion's physical planning invokes
scan() more than once during optimization.
Addresses copilot review feedback on PR #10157.
* refactor: Replace unit tests with insta snapshot test for filter pushdown
Replace the unit tests in SearchQueryProvider with a new
VectorSearchSqlFilteredIndexOnly snapshot test case in the megascience
integration test suite. The snapshot test exercises the index-only path
with a WHERE filter, verifying both correct query results and the EXPLAIN
plan structure (filter placement relative to pre_limit).
Addresses review feedback from @Jeadie on PR #10157.
* fix: cargo fmt formatting in megascience test match arm
* fix: Update Search integration test snapshots
* fix: Use pre_limit argument instead of SQL LIMIT in filtered index test
Address review feedback from @Jeadie: the VectorSearchSqlFilteredIndexOnly
test should pass the limit as the pre_limit argument to vector_search()
rather than using a SQL LIMIT clause, since the test is specifically
designed to verify filter pushdown below the pre_limit.
* fix: Update github workflows snapshot after features.yml removal
The `check all features` workflow (.github/workflows/features.yml) was
removed from the repository, shifting the top-10 workflows query result.
* fix: Update search snapshot for s3vectors_chunking_view_with_where
Score for id 551 shifted from 0.28 to 0.29 (consistent across retries),
changing result order when tied with id 1035. Update snapshot to match.
* fix: Make search snapshot tests robust to cross-runner score variance
model2vec similarity scores vary ±0.01 across CI runners (different
macOS versions), causing snapshot tests to fail when scores land on
different sides of truncation boundaries.
Two fixes:
1. normalize_search_response_json: use round() instead of trunc() for
score display and sorting. Scores like 0.289 now consistently round
to 0.29 instead of truncating to 0.28 on some runners.
2. SQL test queries: reduce trunc(_score, 3) to trunc(_score, 2) to
avoid flakiness at the 3rd decimal place (e.g., 0.556 vs 0.557).
* fix: Apply cargo fmt to search test normalization
* fix: Update OpenAI search snapshots for embedding model score shift
OpenAI's text-embedding-3-small model scores shifted by +0.01,
causing snapshot mismatches in the openai_test_search CI check.
* fix: Scope score rounding to s3vectors tests only
The previous change to use `round` instead of `trunc` for score display
in `normalize_search_response_json` was applied globally, causing
cascading snapshot failures in OpenAI search tests (0.65→0.66, etc.).
This fix adds a `round_scores` flag to `SearchTestCase` and
`run_search_w_explain` so that only s3vectors tests (which have
non-deterministic model2vec scores that vary ±0.002 across CI runners)
use rounding for display. All other tests (OpenAI, HF, text search)
continue to use truncation, preserving their existing snapshots.
Sort comparison still uses rounding universally to stabilize ordering.
* fix: Revert OpenAI snapshots to truncated score values
The previous commit incorrectly updated these snapshots to rounded
values when the normalization was unconditionally using round(). Now
that rounding is scoped to s3vectors tests only, OpenAI tests use
truncation again - restore the original snapshot values.
* fix: Also scope sort rounding to round_scores flag
The sort comparison was unconditionally using rounded values, causing
ordering mismatches with truncated display values in OpenAI tests.
Now both sort and display use the same precision mode: raw floats when
round_scores is false, rounded when true.
* fix: Use score rounding for OpenAI search tests
OpenAI embeddings are non-deterministic — scores vary by ±0.01 across
CI runs, causing snapshot failures when truncation amplifies boundary
effects. Switch OpenAI search tests to use score rounding (same as
model2vec/s3vectors tests) for more stable comparisons.
* fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters (#10217)
* fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters
DataFusion 52 defaults to map_string_types_to_utf8view=true, so string
literals in WHERE clauses arrive as ScalarValue::Utf8View instead of
ScalarValue::Utf8. The GitHub connector's ref filter extraction only
matched Utf8, causing WHERE ref='...' to silently fail.
Changes:
- Add scalar_utf8_value() helpers to extract strings from all three
ScalarValue string variants (Utf8, LargeUtf8, Utf8View)
- Update ref filter pushdown in files, commits, and workflow_runs
- Change files table ref filter from Inexact to Exact (ref is fully
handled by the connector, no residual filter needed)
- Fix validate_installation_access to skip when token-based auth is
active, preventing autoloaded app credentials from interfering
- Add GitHub App auth integration tests for commits, files, and issues
* fix: streamline ref value handling in commits filter pushdown
* fix: Correct round_scores=false for OpenAI tests, remove unused builder, update github workflows snapshot
- OpenAI tests should use truncation (round_scores=false) since their embeddings are deterministic
- Remove unused round_scores() builder method that triggered lint error
- Update github workflows snapshot to reflect removed integration.yml workflow
* fix: Update snapshot expression headers to match new function signatures
All normalize_search_response and normalize_search_response_json calls
now include the round_scores parameter. Update snapshot expression lines
to match so insta doesn't flag expression mismatches.
* fix: Update snapshot column aliases from trunc(_score,3) to trunc(_score,2)
SQL test queries were changed from trunc(_score, 3) to trunc(_score, 2)
in a previous commit. Update all snapshot files that reference the old
Int64(3) column alias to use Int64(2).
* ci: Revert autogenerated PR base branch back to trunk in GitHub workflows (#10222)
* ci: Revert autogenerated PR base branch back to trunk in GitHub workflows
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: chunk MERGE delete filters and update Vortex for stack-safe IN-lists (#10207)
* fix(databricks): Fix schema introspection and timestamp overflow (#10226)
* fix(databricks): fix schema introspection and timestamp overflow
- Fall back from full_data_type to data_type column when Databricks
information_schema does not have full_data_type (UNRESOLVED_COLUMN)
- Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL)
gracefully for the data_type fallback path
- Change declared timestamp types from Nanosecond to Microsecond to
match what Databricks actually sends in Arrow IPC, preventing
arithmetic overflow on far-future sentinel values (e.g. year 9999)
- Add safe-cast fallback in try_cast_to for timestamp unit conversions
that overflow, producing NULL instead of crashing
* fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback
* refactor: simplify array and batch creation in tests for clarity
* fix(databricks): enhance error handling for unresolved columns in schema retrieval
* fix(databricks): Fix schema introspection failures for non-Unity-Catalog environments (#10227)
* fix(databricks): fix schema introspection and timestamp overflow
- Fall back from full_data_type to data_type column when Databricks
information_schema does not have full_data_type (UNRESOLVED_COLUMN)
- Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL)
gracefully for the data_type fallback path
- Change declared timestamp types from Nanosecond to Microsecond to
match what Databricks actually sends in Arrow IPC, preventing
arithmetic overflow on far-future sentinel values (e.g. year 9999)
- Add safe-cast fallback in try_cast_to for timestamp unit conversions
that overflow, producing NULL instead of crashing
* fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback
* refactor: simplify array and batch creation in tests for clarity
* fix: update comments to clarify overflow conditions for timestamp handling
* Add test for UNRESOLVED_COLUMN on columns other than full_data_type; tighten fallback condition
* Fix MapArray entries nullability: Arrow spec requires Map entries to be non-null
The Databricks type parser was creating Map types with nullable entries
(entries struct field nullable=true), but Arrow's MapArray validation
requires entries to always be non-null. When the wire data arrived with
non-null entries and the declared schema had nullable entries, the cast
failed with 'MapArray entries cannot contain nulls'.
Changed both parameterized (MAP<K,V>) and parameterless (MAP) parsing
to set entries nullable=false, matching the Arrow specification.
* Add cohorted spend view schema tests from real Databricks CSV
* Replace edw references with generic test names
* Enhance error handling in parser tests: replace unwrap_or_else with expect for clearer failure messages
* Address PR review: add table to tracing, use expect in tests, fix doc backticks
* Add GEOMETRY type support (maps to Binary/WKB); add mixed-type schema tests
* Properly mark dataset as Ready on Scheduler (#10215)
* Properly mark dataset as Ready on Scheduler
* Lint and fixes
* Fix
* Lint
* Enable reqwest compression and optimize HTTP client settings (#10154)
* fix: report text_search validation errors as execution errors, not planning errors
* fix(bedrock): return specific error messages for auth and stream failures
Replace the generic TODO catch-all with specific error matching for each
ConverseStreamOutputError variant (ThrottlingException, ValidationException,
ModelStreamErrorException). For non-service SDK errors, detect authentication
failures (UnrecognizedClientException, AccessDeniedException, etc.) and return
a clear "authentication failed" message instead of "unhandled error".
Fixes #6771
* refactor(bedrock): use into_service_error() for idiomatic error handling
Replaces e.err() with e.into_service_error() which is the standard
AWS SDK pattern for consuming SdkError and matching on the inner
service error type.
Addresses review feedback from @Jeadie.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(bedrock): use into_service_error() for typed error extraction
* ci: Add merge_group trigger to integration test workflows
* Add support for DF-native DML (#9931)
* Add support for DF native DML
* Lint
* Cleanup DeletionTableProvider
* Lint
* Fix
* Fix
* Fix
* Lint
* Lint
* Fix
* Lint
* Fix partition test
* Add update to cayenne and polytable
* Lint
* Fix test
* Fix Cargo.toml
* Fix
* Fix
* Fix
* ci: Add merge_group trigger to integration test workflows
---------
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Update autogenerated PR base branch to develop in GitHub workflows (#10034)
* Update autogenerated PR base branch to develop in GitHub workflows
* ci: Add merge_group trigger to integration test workflows
---------
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Cleanup worktree path (#10033)
* Cleanup worktree path
* ci: Add merge_group trigger to integration test workflows
---------
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration (#9959)
* fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration
Arrow Dictionary-encoded columns (used for enums and categorical data)
are not natively supported by the DuckDB and SQLite data accelerators.
This causes failures when accelerating datasets that contain enum/dictionary
type columns.
Add `normalize_dictionary_types()` to convert Dictionary fields to their
underlying value types (e.g. Dictionary(Int32, Utf8) -> Utf8) in the
schema before it reaches the accelerator. The existing `SchemaCastScanExec`
pipeline automatically casts the record batch data to match.
Fixes #2889
Fixes #2891
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: Address review comments for Dictionary type normalization
- Handle nested container types (List, Struct, Map, etc.) in both
`normalize_dictionary_data_type` and `has_dictionary_types`, not just
top-level Dictionary fields
- Preserve field metadata by using `field.with_data_type()` instead of
`Field::new()` which drops metadata
- Gate Dictionary normalization to only DuckDB, SQLite, and Turso engines
that cannot handle Dictionary encoding natively, leaving Arrow/Cayenne/
PostgreSQL unaffected
- Add read-back verification to the SQLite test to match the DuckDB test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* style: Fix rustfmt formatting and clippy doc_markdown lints
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: Fix clippy doc_markdown lints in DuckDB and SQLite test comments
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: Add coverage for nested dictionary types and field metadata preservation
Add tests verifying that:
- normalize_dictionary_types handles Dictionary inside List and Struct
- has_dictionary_types detects Dictionary inside nested container types
- Field-level metadata is preserved (not just schema-level metadata)
These tests strengthen coverage for the review feedback on PR #9959.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: Fix rustfmt formatting in test code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: Add Union type support to dictionary normalization and improve test assertions
Extend `data_type_contains_dictionary` and `normalize_dictionary_data_type`
to handle `DataType::Union` variants, ensuring dictionary types nested
inside unions are properly detected and normalized. Also strengthen the
SQLite dictionary round-trip test assertion to check actual row counts
instead of just non-emptiness.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: Fix clippy needless_collect in Union type normalization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: Remove accidentally committed worktree reference
* ci: Add merge_group trigger to integration test workflows
---------
Co-authored-by: Claude <claude@spices-MacBook.localdomain>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* fix: Enforce target_chunk_size as hard maximum in chunking (#9973)
* fix: Enforce target_chunk_size as hard maximum in chunking
When chunking is enabled for embeddings, the underlying text_splitter
library may produce chunks that slightly exceed the configured
target_chunk_size (e.g. 513-514 tokens with a 500 target). This causes
embedding failures when the model has a strict token input limit (e.g.
512 tokens).
Add post-processing enforcement that re-splits any oversized chunks
using binary search to find the longest prefix fitting within the
target. This ensures target_chunk_size acts as a hard maximum rather
than a soft target. Overlap from text_splitter is also bounded since
the enforcement applies to all output chunks.
Fixes #3326
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: Fix rustfmt formatting in chunking tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* ci: Add merge_group trigger to integration test workflows
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Enable reqwest compression and optimize HTTP client settings
- Enable gzip, brotli, zstd, deflate compression features on reqwest
workspace dependency so all HTTP clients negotiate compressed responses
- Fix per-request client construction in spice_cloud::post_json by
moving the reqwest::Client to a struct field built once in new()
- Replace reqwest::get() one-off in metrics status probe with a
lazily-initialized shared client with tight timeouts
- Add connect/request timeouts to clients missing them:
- Kubernetes secret store: 10s connect, 30s request
- OTEL HTTP exporter: 10s connect, 30s request
- Google GenAI: 10s connect, 300s request
- Databricks SQL Warehouse: 10s connect, 30s request
* Change Spice Cloud client timeout from 30min to 15min
* refactor: Simplify conditional expressions for clarity in multiple files
* Add connect_timeout and timeouts to CLI HTTP clients
- Add connect_timeout(10s) to RuntimeContext shared client
- Add connect_timeout(10s) to GitHubClient
- Add connect_timeout(10s) and timeout(30s) to login flow clients
* Fix lint issues from develop merge
- Add missing # Errors doc sections in cayenne catalog_provider.rs
- Fix collapsible_if in file_based_retention_delete_test.rs
- Remove unused import in on_conflict_edge_cases_test.rs
- Add missing .await on csv_run_and_verify_query calls
- Fix needless borrow in schema_evolution test
- Fix formatting in cayenne source files
* Add connect_timeout to remaining HTTP clients
- Zipkin exporter and reachability check in spiced tracing
- NSQL request client in repl
- Spice Cloud catalog connector
- CloudClient (new and with_timeout)
- LLM provider create_http_client
- All spidapter HTTP clients
* Address review: propagate client build errors instead of silent fallback
- Move reqwest::Client construction from SpiceExtension::new() to
initialize(), returning a structured error on failure
- Store client as Option<reqwest::Client> with ClientNotInitialized error
- Use LazyLock<Result<...>> for metrics client to surface build errors
instead of unwrap_or_default()
* Propagate client build error in repl NSQL request path
* Guard setup-spiceio steps with runner.os == 'macOS'
Add if: runner.os == 'macOS' guards to all setup-spiceio, setup-sccache,
sccache stats, and kill spiceio steps across workflow files. This prevents
failures when jobs run on non-macOS runners (e.g. spiceai-dev-runners)
where the gh CLI and spiceio dependencies are unavailable.
Matches the existing pattern in codeql-analysis.yml.
* Guard setup-spiceio on UNAS_SMB_PASS secret availability
Add secrets.UNAS_SMB_PASS != '' condition to setup-spiceio steps and
use steps.setup-spiceio.outputs.endpoint != '' for downstream sccache
steps. This ensures the entire spiceio/sccache chain is skipped when
the secret is unavailable (e.g. fork PRs, new runner environments).
* Fix cayenne add_column snapshot and retry non-JSON 403 responses
- Update cayenne add_column snapshot to include the lname column,
matching the duckdb and sqlite snapshots.
- Treat non-JSON HTTP 403 responses as retriable in the GraphQL client.
A 403 with a non-JSON body indicates a transient upstream proxy or
abuse-detection block (e.g. GitHub's 'Request forbidden by
administrative rules'), not a genuine credentials/permissions error
(which returns valid JSON). This prevents test flakiness from
temporary GitHub abuse detection.
* Address review: update 403 retry test, fix formatting
- Update test_json_decode_client_error_not_retriable to exclude 403
and add dedicated test_json_decode_forbidden_retriable test.
- Fix formatting for user_agent format strings (cargo fmt).
* Increase SpiceExtension client timeout to 1800s (30 min)
* feat: Initial support for write-through accelerated tables (#10115)
* wip: mvp write through accelerated tables
* docs: Delete notes
* refactor: Address comments, simplify staging append, add cayenne partitioned staging write
* review: Address comments
* fix: Update partition expr from table provider multi-partition-by-expr
* refactor: Replace WriteThroughAcceleratedTableProvider into AcceleratedTable
* chore: fmt
* chore: clippy
* review: address comment
* Revert "fix: executor startup failures" and "When executor connects, send DDL for existing tables." (#10175)
* Revert "fix: executor startup failures (#10155)"
This reverts commit 1b639be2fcf90a996fe900e05ba6614eff061b29.
* Revert "When executor connects, send DDL for existing tables. (#9904)"
This reverts commit 7c6abaa8d43ac0cb428dd3887667aa0960d63840.
* fix formatting
* fix missing import
* fix linter
* fix linter
* fix: add missing `# Errors` doc section to satisfy clippy::missing_errors_doc
* fix: add missing `# Errors` doc sections in staging_wal.rs
---------
Co-authored-by: Jack Eadie <jack@spice.ai>
* fix: remove PARTITION BY forwarding to Cayenne executors (#10182)
* dont have partition by in executors
* cleanup
* revert: restore partition-table-provider feature for cayenne dependency
---------
Co-authored-by: jeadie <jack@spice.ai>
* fix: correct dispatch test assertions and runner type typo
Fix ready_wait assertions in LoadArgs deserialization tests to expect None,
since the load deserializer explicitly strips ready_wait as it is unsupported
by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners
-> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures.
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml
* fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/mongodb.yaml
* fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml
* fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml
* fix: Update tpch benchmark snapshots for federated/mssql.yaml
* fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml
* fix: Update tpch benchmark snapshots for federated/dynamodb.yaml
* fix: Update tpch benchmark snapshots for federated/oracle.yaml
* fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml
* fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml
* fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/file[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml
* Use balanced expression tree for partition filter combination (#10185)
* fix: resolve clippy lint errors in cayenne, write_through, iceberg_ddl, and planner (#10186)
* fix: resolve clippy lints in write_through, iceberg_ddl, and planner
* fix: use module-level #[expect(dead_code)] in cayenne test common module
* fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml (#10189)
* fix: correct dispatch test assertions and runner type typo
Fix ready_wait assertions in LoadArgs deserialization tests to expect None,
since the load deserializer explicitly strips ready_wait as it is unsupported
by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners
-> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures.
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml
* fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/mongodb.yaml
* fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml
* fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml
* fix: Update tpch benchmark snapshots for federated/mssql.yaml
* fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml
* fix: Update tpch benchmark snapshots for federated/dynamodb.yaml
* fix: Update tpch benchmark snapshots for federated/oracle.yaml
* fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml
* fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml
* fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/file[parquet].yaml
* fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml
* fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml
* fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-cayenne[file]-indexes.yaml
* fix: Update tpch benchmark snapshots for accelerated/spicecloud-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-arrow-indexes.yaml
* fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml
* fix: Update tpch benchmark snapshots for accelerated/dynamodb-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/mongodb-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/dynamodb-duckdb[file].yaml
* fix: Update tpch benchmark snapshots for accelerated/file[parquet]-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file]turso.yaml
* fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml
* fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml
* fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml
* fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml
* fix: Update tpch benchmark snapshots for accelerated/mysql-arrow.yaml
* fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-duckdb[file]-partitioned.yaml
* fix: Update tpch benchmark snapshots for accelerated/postgres-arrow.yaml
* fix: Update tpcds benchmark snapshots for federated/s3[parquet].yaml
* fix: Update tpcds benchmark snapshots for federated/abfs[parquet].yaml
* fix: Update tpcds benchmark snapshots for federated/file[parquet].yaml
* fix: Update tpcds benchmark snapshots for federated/databricks[delta_lake].yaml
* fix: Update tpcds benchmark snapshots for accelerated/spicecloud-arrow.yaml
* fix: Update tpcds benchmark snapshots for accelerated/databricks[delta_lake]-arrow.yaml
* fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml
* fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow.yaml
* fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-arrow.yaml
* fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-cayenne[file].yaml
* fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml
* fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml
* fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml
* fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml
* fix: Update tpcds benchmark snapshots for accelerated/postgres-arrow.yaml
* Trigger CI
* Update workspace configuration in Cargo.toml
---------
Co-authored-by: ewgenius <hey@ewgenius.me>
Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
* feat: Add CREATE TABLE ... (LIKE ...) support (#10180)
* feat: Add CREATE TABLE ... (LIKE ...) support
* feat: Support struct() IN-list filters in Cayenne position-based delete (#10191)
* feat: Support struct() IN-list filters in Cayenne position-based delete
Decompose struct(k1, k2) IN (SET) expressions into balanced OR-trees
of AND-equalities for Vortex pushdown. DataFusion converts tuple IN-lists
(k1, k2) IN ((v1,w1), ...) to struct() IN-list which Vortex cannot handle.
- Add try_decompose_struct_in_list() to position_based.rs
- Handle CAST-wrapped struct() expressions (type coercion)
- Cast literal values to match column types (e.g., Int64 to Int32)
- Build balanced binary OR-tree for O(log N) depth vs O(N) linear chain
- Add balanced_or_exprs() for MERGE delete filters (same stack overflow fix)
* fix: Parse partition expressions to extract column references for MERGE validation (#10192)
* fix: Parse partition expressions to extract column references for MERGE validation
Restore strict MERGE primary_key/on_conflict validation by parsing
partition expressions (e.g., bucket(5, c_nationkey)) with sqlparser AST
Visitor to extract referenced column names, instead of requiring exact
string matches on the partition column.
- Add extract_partition_column_references() using sqlparser Visitor
- Handle simple columns, compound identifiers, and transform expressions
- Add unit tests for various partition expression formats
* feat: Spidapter staging table support and SQL execution extraction
* fix(BigQuery): fix Unsupported subqueries in JOIN ON predicates (TPC-H) (#10195)
* fix: clippy auto-fixes from merge with develop
---------
Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Jack Eadie <jack@spice.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
Co-authored-by: claudespice <claude@spice.ai>
Co-authored-by: Claude <claude@spices-MacBook.localdomain>
Co-authored-by: William <98815791+peasee@users.noreply.github.com>
Co-authored-by: ewgenius <hey@ewgenius.me>
Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai>
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
* feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation (#10229)
* feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation
Databricks Lakehouse Federation foreign tables (e.g. Neon PostgreSQL) have
two issues with schema introspection:
1. information_schema.columns returns no rows (data_array is null) because
the table has no column metadata registered in Unity Catalog.
2. When data_type column IS populated, it returns source-native type names
(integer, text, numeric, timestamp without time zone) instead of Spark
SQL types.
Changes:
- Add DESCRIBE TABLE as third-tier schema introspection fallback after
information_schema.columns full_data_type and data_type attempts fail
- Catch ParseError from information_schema path to also fall through to
DESCRIBE TABLE for unrecognized native types
- Add source-native type parsing: integer, text, numeric, real,
double precision, character varying, varchar, timestamp with/without
time zone
- Extract get_schema_from_information_schema helper method
- Add schema_from_describe_json parser for DESCRIBE TABLE responses
(defaults all columns to nullable since DESCRIBE lacks nullability info)
- Add create_describe_payload with backtick-quoted identifiers
- Add comprehensive tests for all fallback paths and native types
* fix: address PR review comments
- Validate multi-word timestamp types (time/zone tokens) instead of
blind advance() calls; add expect_identifier helper
- Validate DECIMAL/NUMERIC precision <= 38 and scale <= precision
- Validate varchar/character varying length is a Number token
- Remove ParseError -> DESCRIBE TABLE fallback (DESCRIBE TABLE returns
Spark types, so a ParseError from information_schema indicates a real
problem that should surface as an error)
* feat: Add pagination support to HTTP data connector (#10228)
* feat: Add pagination support to HTTP data connector
Add configurable pagination for HTTP API endpoints, supporting two
modes:
- URL mode: next page URL from response body (JSON pointer) or HTTP
Link header with rel="next"
- Token mode: cursor/token extracted via JSON pointer and passed as
a query parameter in subsequent requests
Configuration parameters:
- pagination: enabled/disabled
- pagination_next_pointer: JSON pointer to next URL/cursor
- pagination_link_header: use Link header for pagination
- pagination_token_param: query param name for cursor tokens
- pagination_data_pointer: JSON pointer to data array per page
- pagination_max_pages: safety limit (default: 100)
Key design decisions:
- Streaming execution via futures::stream::try_unfold yields one
RecordBatch per page, avoiding buffering entire result sets
- SSRF protection validates next-page URLs share the base URL origin
- Works transparently with caching, append, and full (with
refresh_sql) acceleration modes
- Refactored batch creation into create_batch_from_rows for reuse
between paginated and non-paginated paths
Includes 18 unit tests covering Link header parsing, JSON pointer
extraction, SSRF rejection, data pointer extraction, token/query
building, config validation, and edge cases (null, empty, missing).
* fix: address PR review feedback for HTTP pagination
- Return error instead of silently truncating when max_pages is reached
- Fall through from missing JSON pointer to Link header check
- Support relative URLs for next-page links via base_url.join()
- Use url::form_urlencoded for proper query parameter encoding
- Don't stop pagination on empty data rows if next page exists
- Fix clippy and formatting issues
* fix: fail loudly on pagination JSON/pointer errors
- Error on invalid JSON when pagination_next_pointer is configured
- Error on non-string/non-null pointer values instead of silently ending
- Error on missing/invalid data pointer instead of returning empty rows
- Validate JSON Pointer syntax (must start with '/') in with_pagination
- Add tests for all new error cases
* fix: address pagination review feedback (round 3)
- Preserve base URL query params in token pagination via merge_queries()
- Track actual per-page path/query for accurate request_* columns
- Broaden Link header parsing: handle unquoted rel=next and multi-value rel
- Avoid intermediate Vec allocation for content column (from_iter_values)
- Add comment noting cache bypass for subsequent pages is intentional
- Add tests for merge_queries, unquoted rel, and multi-rel Link headers
* feat: change pagination to auto/enabled/disabled mode
Default is 'auto' which auto-detects Link headers on every HTTP
response without requiring explicit config. 'enabled' requires
explicit pagination config. 'disabled' turns off pagination.
In auto mode with no other pagination params configured, only
Link header detection is active. This means pagination 'just works'
for APIs that use standard Link headers (GitHub, etc.).
* fix: address pagination review feedback (round 4)
- has_dynamic_api_params only true when pagination explicitly configured
or pagination params are set (not in auto-detect-only mode)
- Loop internally to skip empty pages instead of yielding empty batches
- Fix max_pages error message: 'query was aborted' not 'may be incomplete'
- Parse response JSON once per page, reuse for both next-page and data
extraction (avoids duplicate deserialization on large responses)
* fix: return data instead of error when max_pages reached
* fix: merge base URL query for page 0, validate pagination paths, treat auto/link/max_pages as dynamic
* fix: use top-level splitting for Link header parsing (RFC 8288 compliance)
* test: add comprehensive tests for Link header parsing functionality
* refactor: simplify query merging logic and enhance pagination handling
* fix: correct logic for determining link header usage in pagination
* Update snapshot
---------
Co-authored-by: Viktor Yershov <viktor@spice.ai>
* fix(databricks): harden HTTP retries, compression, and token refresh (#10232)
* fix(databricks): harden HTTP retries and encodings
* feat(databricks): implement response body draining for retry logic in HTTP requests
* refactor(databricks): replace write! macro with write_fmt for header formatting
* fix(databricks): clamp short-lived token refreshes
* fix(databricks): include nested HTTP error causes
* fix(databricks): implement bounded retry delay to clamp maximum backoff duration
* refactor: deduplicate HTTP retry logic into shared resilient_http module
- Extract send_request_with_retry from databricks token provider into
data_components::resilient_http (make pub)
- Remove ~210 lines of duplicated retry logic, backoff, Retry-After
parsing, and response body draining from databricks.rs
- Truncate AWS APN app_name to 50 chars to avoid SDK warning
- Add length assertion to app_name test
- Remove unused httpdat…
Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Release notes for v2.0.0-rc.3 (#10377) * Release notes for v2.0.0-rc.3 * add refresh token config example * Update release notes for v2.0.0-rc.3 with improved descriptions and links for key features * fix http connector link --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Bump version to 2.0.0-rc.3 --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )