Commit f4e7238

Pin spiceai candle / TEI forks to merged revs; drop local [patch] overrides (spiceai#10362)

* Update candle and mistral.rs lock-step pins

* Update dependencies to use new git revisions for candle packages

* Add provenance comments to candle and mistralrs git dependency pins

* Bump mistral.rs and text-embeddings-inference pins

- mistral.rs: ac7063cd -> 9b4758762d6ebed08a42af7211c616ebc512c557
- text-embeddings-inference: 58b44fbb -> 88b7a84a2c2ad83707555183f8f18dd201897f12

Adapt to download_safetensors now taking Arc<ApiRepo>.

* Remove unnecessary Clippy expectations from tests

* Pin spiceai candle/TEI forks to merged revs; drop local [patch] overrides

All six fork PRs have merged; bump each git rev to the merge SHA on the fork's `spiceai` branch, and drop the temporary /tmp/... [patch.*] tables that were used while those fixes were in flight.

- spiceai/candle-cublaslt -> b74d30e0 (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle-layer-norm -> 62f936a1 (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle-rotary -> a4c4efcd (port to candle 0.10.1 / cudarc 0.19)
- spiceai/candle -> c87b9bc5 (run_mha FFI softcap dedup)
- spiceai/text-embeddings-inference -> b958dca5 (compute_cap -> cudarc 0.19; bumps sibling crate pins to the revs above)
- candle-index-select-cu (crates.io -> spiceai fork) -> 397d7338 (fallback shim for candle 0.10.1 / cudarc 0.19; patched via [patch.crates-io])

Verified with `cargo check --release --features cuda` on CUDA 12.6 / CUDA_COMPUTE_CAP=90: clean finish in 10m52s.

* Update Cargo.toml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Implement sort pushdown and fix pushdown gaps across providers (#10337)

* Implement sort pushdown support and fix pushdown gaps across providers

Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper execution plans (CayenneAccelerationExec, SchemaCastScanExec, BytesProcessedExec) by delegating to their child plans, and for SQL providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses.
Also fix limit pushdown consistency in wrappers (delegate supports_limit_pushdown/with_fetch/fetch to child plans instead of returning mismatched values), and extend MSSQL filter pushdown to support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and Between expressions.

* fix: enhance sort pushdown error handling and improve filter classification logic

* Address PR review comments: improve sort pushdown correctness

- FlightSQL: Replace filter_map with fallible map in sql() ORDER BY generation to return an error instead of silently dropping non-Column sort expressions. Add InvalidSortExpression error variant.
- MSSQL: Make classify_mssql_filter recursively check time-related expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to prevent time-related filters from being pushed down via compound exprs.
- SchemaCastScanExec: Propagate input ordering through equivalence properties and set maintains_input_order to true, since schema casting preserves row order.
- FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported for non-column, exact for column) and sql() ORDER BY clause generation.

* Remove unsafe ordering propagation from SchemaCastScanExec

Do not copy input ordering into EquivalenceProperties since schema casting can change data types and projected columns, making the input ordering invalid for the output schema. Retain maintains_input_order=true since row order is preserved.

* Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space

- Restore ordering propagation in SchemaCastScanExec::new() that was incorrectly removed, fixing SortPreservingMergeExec invariant violations in partition integration tests.
- Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL execution plans when order_expr is empty. Build SQL incrementally, appending clauses only when non-empty.
- Update oracle test-framework snapshots to match corrected SQL output.
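
For illustration only, the two SQL-generation fixes in #10337 (a fallible map instead of filter_map for ORDER BY, and appending clauses only when non-empty) might look roughly like the sketch below. The types `SortTarget`, `SortExpr`, `SqlError` and the functions `order_by_clause` and `build_sql` are hypothetical stand-ins, not the actual provider code; only the `InvalidSortExpression` variant name comes from the commit message, and the double-quoted identifier style is an assumption.

```rust
/// Hypothetical stand-in for a physical sort expression (not the DataFusion type).
#[derive(Debug, Clone, PartialEq)]
enum SortTarget {
    Column(String),
    Other(String),
}

#[derive(Debug, Clone, PartialEq)]
struct SortExpr {
    target: SortTarget,
    descending: bool,
}

#[derive(Debug, PartialEq)]
enum SqlError {
    InvalidSortExpression(String),
}

/// Builds an ORDER BY clause, failing on any non-column expression
/// instead of silently dropping it (as a filter_map would).
fn order_by_clause(sorts: &[SortExpr]) -> Result<String, SqlError> {
    let parts = sorts
        .iter()
        .map(|s| match &s.target {
            SortTarget::Column(name) => {
                let dir = if s.descending { "DESC" } else { "ASC" };
                Ok(format!("\"{name}\" {dir}"))
            }
            SortTarget::Other(desc) => Err(SqlError::InvalidSortExpression(desc.clone())),
        })
        .collect::<Result<Vec<_>, _>>()?;
    if parts.is_empty() {
        Ok(String::new())
    } else {
        Ok(format!("ORDER BY {}", parts.join(", ")))
    }
}

/// Assembles the statement incrementally so that an empty clause
/// never leaves a stray double space in the output.
fn build_sql(table: &str, where_clause: &str, sorts: &[SortExpr]) -> Result<String, SqlError> {
    let mut sql = format!("SELECT * FROM {table}");
    if !where_clause.is_empty() {
        sql.push_str(" WHERE ");
        sql.push_str(where_clause);
    }
    let order_by = order_by_clause(sorts)?;
    if !order_by.is_empty() {
        sql.push(' ');
        sql.push_str(&order_by);
    }
    Ok(sql)
}
```

Collecting into `Result<Vec<_>, _>` stops at the first unsupported expression, so a caller can report the sort as unsupported rather than emit a query with a different ordering than requested.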
* Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341) * Refactor BytesProcessedExec to simplify fetch and pushdown sort methods * Fix schema_cast ordering: remap column indices by name and add tests - Remap sort expression column indices from input to output schema by name, since SchemaCastScanExec may reorder columns relative to input - Only propagate ordering when ordered columns have identical types - Add 3 unit tests: ordering propagated (same types), not propagated (type differs), and indices remapped (reordered columns) - Add branch comment to datafusion-federation git dependency * Update refresh_max_timestamp_df plan snapshot * Update cluster::distributed_cayenne_catalog snapshots * Update duckdb_json_functions snapshots * Update datafusion version * Update datafusion version * Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e * Revert "Update refresh_max_timestamp_df plan snapshot" This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe. * Update distributed_acceleration snapshot --------- Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: Jack Eadie <jack@spice.ai> * Merge develop to trunk (2026-04-16) (#10345) * fix: Update test snapshots (#10219) Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> * fix: Update Search integration test snapshots (#10240) * fix: Place search index filters below pre_limit for pushdown (fixes #10149) SearchQueryProvider::scan() was adding the pre_limit to the logical plan BEFORE adding filters. Since DataFusion's PushDownFilter optimizer cannot push filters past a Limit node, filters never reached the underlying search index table provider (e.g. S3VectorsQueryExec). This caused both worse performance (server-side filtering not used) and incorrect results (top-K-then-filter instead of top-K-of-filtered-set). 
The fix restructures the plan building to add filters BEFORE the pre_limit, allowing DataFusion to push them through the SubqueryAlias into the inner TableScan. * style: cargo fmt * fix: Make filter pushdown test assertions more robust Use `filters.iter().any(...)` instead of `filters[0]` to assert that at least one recorded scan call contains the expected pushed-down predicate. This avoids potential flakiness if DataFusion's physical planning invokes scan() more than once during optimization. Addresses copilot review feedback on PR #10157. * refactor: Replace unit tests with insta snapshot test for filter pushdown Replace the unit tests in SearchQueryProvider with a new VectorSearchSqlFilteredIndexOnly snapshot test case in the megascience integration test suite. The snapshot test exercises the index-only path with a WHERE filter, verifying both correct query results and the EXPLAIN plan structure (filter placement relative to pre_limit). Addresses review feedback from @Jeadie on PR #10157. * fix: cargo fmt formatting in megascience test match arm * fix: Update Search integration test snapshots * fix: Use pre_limit argument instead of SQL LIMIT in filtered index test Address review feedback from @Jeadie: the VectorSearchSqlFilteredIndexOnly test should pass the limit as the pre_limit argument to vector_search() rather than using a SQL LIMIT clause, since the test is specifically designed to verify filter pushdown below the pre_limit. * fix: Update github workflows snapshot after features.yml removal The `check all features` workflow (.github/workflows/features.yml) was removed from the repository, shifting the top-10 workflows query result. * fix: Update search snapshot for s3vectors_chunking_view_with_where Score for id 551 shifted from 0.28 to 0.29 (consistent across retries), changing result order when tied with id 1035. Update snapshot to match. 
* fix: Make search snapshot tests robust to cross-runner score variance model2vec similarity scores vary ±0.01 across CI runners (different macOS versions), causing snapshot tests to fail when scores land on different sides of truncation boundaries. Two fixes: 1. normalize_search_response_json: use round() instead of trunc() for score display and sorting. Scores like 0.289 now consistently round to 0.29 instead of truncating to 0.28 on some runners. 2. SQL test queries: reduce trunc(_score, 3) to trunc(_score, 2) to avoid flakiness at the 3rd decimal place (e.g., 0.556 vs 0.557). * fix: Apply cargo fmt to search test normalization * fix: Update OpenAI search snapshots for embedding model score shift OpenAI's text-embedding-3-small model scores shifted by +0.01, causing snapshot mismatches in the openai_test_search CI check. * fix: Scope score rounding to s3vectors tests only The previous change to use `round` instead of `trunc` for score display in `normalize_search_response_json` was applied globally, causing cascading snapshot failures in OpenAI search tests (0.65→0.66, etc.). This fix adds a `round_scores` flag to `SearchTestCase` and `run_search_w_explain` so that only s3vectors tests (which have non-deterministic model2vec scores that vary ±0.002 across CI runners) use rounding for display. All other tests (OpenAI, HF, text search) continue to use truncation, preserving their existing snapshots. Sort comparison still uses rounding universally to stabilize ordering. * fix: Revert OpenAI snapshots to truncated score values The previous commit incorrectly updated these snapshots to rounded values when the normalization was unconditionally using round(). Now that rounding is scoped to s3vectors tests only, OpenAI tests use truncation again - restore the original snapshot values. 
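
As a small worked example of the truncation-boundary problem described above: two scores that differ by only 0.002 can display differently under truncation but identically under rounding. The helper names below (`trunc_to`, `round_to`, `display_score`) are hypothetical; only the `round_scores` flag name is taken from the commit message.

```rust
/// Truncates `score` toward zero at the given number of decimals.
fn trunc_to(score: f64, decimals: i32) -> f64 {
    let factor = 10f64.powi(decimals);
    (score * factor).trunc() / factor
}

/// Rounds `score` to the nearest value at the given number of decimals.
fn round_to(score: f64, decimals: i32) -> f64 {
    let factor = 10f64.powi(decimals);
    (score * factor).round() / factor
}

/// Formats a score for a snapshot, using rounding only when `round_scores` is set.
fn display_score(score: f64, round_scores: bool) -> String {
    let value = if round_scores {
        round_to(score, 2)
    } else {
        trunc_to(score, 2)
    };
    format!("{value:.2}")
}
```

With these helpers, 0.289 and 0.291 display as 0.28 and 0.29 under truncation, but both display as 0.29 under rounding. Rounding only moves the boundary (to 0.285 here) rather than removing it, which may be why the flag is scoped per test suite.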
* fix: Also scope sort rounding to round_scores flag The sort comparison was unconditionally using rounded values, causing ordering mismatches with truncated display values in OpenAI tests. Now both sort and display use the same precision mode: raw floats when round_scores is false, rounded when true. * fix: Use score rounding for OpenAI search tests OpenAI embeddings are non-deterministic — scores vary by ±0.01 across CI runs, causing snapshot failures when truncation amplifies boundary effects. Switch OpenAI search tests to use score rounding (same as model2vec/s3vectors tests) for more stable comparisons. * fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters (#10217) * fix: handle Utf8View/LargeUtf8 in GitHub connector ref filters DataFusion 52 defaults to map_string_types_to_utf8view=true, so string literals in WHERE clauses arrive as ScalarValue::Utf8View instead of ScalarValue::Utf8. The GitHub connector's ref filter extraction only matched Utf8, causing WHERE ref='...' to silently fail. 
Changes: - Add scalar_utf8_value() helpers to extract strings from all three ScalarValue string variants (Utf8, LargeUtf8, Utf8View) - Update ref filter pushdown in files, commits, and workflow_runs - Change files table ref filter from Inexact to Exact (ref is fully handled by the connector, no residual filter needed) - Fix validate_installation_access to skip when token-based auth is active, preventing autoloaded app credentials from interfering - Add GitHub App auth integration tests for commits, files, and issues * fix: streamline ref value handling in commits filter pushdown * fix: Correct round_scores=false for OpenAI tests, remove unused builder, update github workflows snapshot - OpenAI tests should use truncation (round_scores=false) since their embeddings are deterministic - Remove unused round_scores() builder method that triggered lint error - Update github workflows snapshot to reflect removed integration.yml workflow * fix: Update snapshot expression headers to match new function signatures All normalize_search_response and normalize_search_response_json calls now include the round_scores parameter. Update snapshot expression lines to match so insta doesn't flag expression mismatches. * fix: Update snapshot column aliases from trunc(_score,3) to trunc(_score,2) SQL test queries were changed from trunc(_score, 3) to trunc(_score, 2) in a previous commit. Update all snapshot files that reference the old Int64(3) column alias to use Int64(2). 
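
A minimal sketch of what a `scalar_utf8_value()` helper like the one in #10217 could look like. The `ScalarValue` enum here is a cut-down stand-in modeled on DataFusion's string variants, and the signature is an assumption; the real helper may differ.

```rust
/// Cut-down stand-in for DataFusion's `ScalarValue` (assumed shape, for illustration).
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

/// Returns the string payload for any of the three string variants,
/// or `None` for NULL strings and non-string scalars.
fn scalar_utf8_value(value: &ScalarValue) -> Option<&str> {
    match value {
        ScalarValue::Utf8(Some(s))
        | ScalarValue::LargeUtf8(Some(s))
        | ScalarValue::Utf8View(Some(s)) => Some(s.as_str()),
        _ => None,
    }
}
```

Matching all three variants in one arm means a filter such as WHERE ref='...' is recognized regardless of which string representation the planner chose for the literal.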
* ci: Revert autogenerated PR base branch back to trunk in GitHub workflows (#10222) * ci: Revert autogenerated PR base branch back to trunk in GitHub workflows * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: chunk MERGE delete filters and update Vortex for stack-safe IN-lists (#10207) * fix(databricks): Fix schema introspection and timestamp overflow (#10226) * fix(databricks): fix schema introspection and timestamp overflow - Fall back from full_data_type to data_type column when Databricks information_schema does not have full_data_type (UNRESOLVED_COLUMN) - Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL) gracefully for the data_type fallback path - Change declared timestamp types from Nanosecond to Microsecond to match what Databricks actually sends in Arrow IPC, preventing arithmetic overflow on far-future sentinel values (e.g. year 9999) - Add safe-cast fallback in try_cast_to for timestamp unit conversions that overflow, producing NULL instead of crashing * fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback * refactor: simplify array and batch creation in tests for clarity * fix(databricks): enhance error handling for unresolved columns in schema retrieval * fix(databricks): Fix schema introspection failures for non-Unity-Catalog environments (#10227) * fix(databricks): fix schema introspection and timestamp overflow - Fall back from full_data_type to data_type column when Databricks information_schema does not have full_data_type (UNRESOLVED_COLUMN) - Parse parameterless complex types (ARRAY, MAP, STRUCT, DECIMAL) gracefully for the data_type fallback path - Change declared timestamp types from Nanosecond to Microsecond to match what Databricks actually sends in Arrow IPC, preventing arithmetic overflow on far-future sentinel values (e.g. 
year 9999) - Add safe-cast fallback in try_cast_to for timestamp unit conversions that overflow, producing NULL instead of crashing * fix: match ArithmeticOverflow error variant in timestamp safe-cast fallback * refactor: simplify array and batch creation in tests for clarity * fix: update comments to clarify overflow conditions for timestamp handling * Add test for UNRESOLVED_COLUMN on columns other than full_data_type; tighten fallback condition * Fix MapArray entries nullability: Arrow spec requires Map entries to be non-null The Databricks type parser was creating Map types with nullable entries (entries struct field nullable=true), but Arrow's MapArray validation requires entries to always be non-null. When the wire data arrived with non-null entries and the declared schema had nullable entries, the cast failed with 'MapArray entries cannot contain nulls'. Changed both parameterized (MAP<K,V>) and parameterless (MAP) parsing to set entries nullable=false, matching the Arrow specification. * Add cohorted spend view schema tests from real Databricks CSV * Replace edw references with generic test names * Enhance error handling in parser tests: replace unwrap_or_else with expect for clearer failure messages * Address PR review: add table to tracing, use expect in tests, fix doc backticks * Add GEOMETRY type support (maps to Binary/WKB); add mixed-type schema tests * Properly mark dataset as Ready on Scheduler (#10215) * Properly mark dataset as Ready on Scheduler * Lint and fixes * Fix * Lint * Enable reqwest compression and optimize HTTP client settings (#10154) * fix: report text_search validation errors as execution errors, not planning errors * fix(bedrock): return specific error messages for auth and stream failures Replace the generic TODO catch-all with specific error matching for each ConverseStreamOutputError variant (ThrottlingException, ValidationException, ModelStreamErrorException). 
For non-service SDK errors, detect authentication failures (UnrecognizedClientException, AccessDeniedException, etc.) and return a clear "authentication failed" message instead of "unhandled error". Fixes #6771 * refactor(bedrock): use into_service_error() for idiomatic error handling Replaces e.err() with e.into_service_error() which is the standard AWS SDK pattern for consuming SdkError and matching on the inner service error type. Addresses review feedback from @Jeadie. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(bedrock): use into_service_error() for typed error extraction * ci: Add merge_group trigger to integration test workflows * Add support for DF-native DML (#9931) * Add support for DF native DML * Lint * Cleanup DeletionTableProvider * Lint * Fix * Fix * Fix * Lint * Lint * Fix * Lint * Fix partition test * Add update to cayenne and polytable * Lint * Fix test * Fix Cargo.toml * Fix * Fix * Fix * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Update autogenerated PR base branch to develop in GitHub workflows (#10034) * Update autogenerated PR base branch to develop in GitHub workflows * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Cleanup worktree path (#10033) * Cleanup worktree path * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration (#9959) * fix: Normalize Arrow Dictionary types for DuckDB and SQLite acceleration Arrow Dictionary-encoded columns (used for enums and categorical data) are not natively supported by the DuckDB and SQLite data accelerators. This causes failures when accelerating datasets that contain enum/dictionary type columns. 
Add `normalize_dictionary_types()` to convert Dictionary fields to their underlying value types (e.g. Dictionary(Int32, Utf8) -> Utf8) in the schema before it reaches the accelerator. The existing `SchemaCastScanExec` pipeline automatically casts the record batch data to match. Fixes #2889 Fixes #2891 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Address review comments for Dictionary type normalization - Handle nested container types (List, Struct, Map, etc.) in both `normalize_dictionary_data_type` and `has_dictionary_types`, not just top-level Dictionary fields - Preserve field metadata by using `field.with_data_type()` instead of `Field::new()` which drops metadata - Gate Dictionary normalization to only DuckDB, SQLite, and Turso engines that cannot handle Dictionary encoding natively, leaving Arrow/Cayenne/ PostgreSQL unaffected - Add read-back verification to the SQLite test to match the DuckDB test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: Fix rustfmt formatting and clippy doc_markdown lints Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix clippy doc_markdown lints in DuckDB and SQLite test comments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: Add coverage for nested dictionary types and field metadata preservation Add tests verifying that: - normalize_dictionary_types handles Dictionary inside List and Struct - has_dictionary_types detects Dictionary inside nested container types - Field-level metadata is preserved (not just schema-level metadata) These tests strengthen coverage for the review feedback on PR #9959. 
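
The recursive normalization described for #9959 can be sketched as follows. The `DataType` enum is a deliberately tiny stand-in for Arrow's type (only a few variants, struct fields as name/type pairs), and the function signatures are assumptions; only the function names are taken from the commit message.

```rust
/// Tiny stand-in for Arrow's `DataType` (assumed shape, for illustration).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Dictionary(Box<DataType>, Box<DataType>),
    List(Box<DataType>),
    Struct(Vec<(String, DataType)>),
}

/// Replaces every Dictionary(key, value) with its value type, recursing into containers.
fn normalize_dictionary_data_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Dictionary(_key, value) => normalize_dictionary_data_type(value),
        DataType::List(inner) => DataType::List(Box::new(normalize_dictionary_data_type(inner))),
        DataType::Struct(fields) => DataType::Struct(
            fields
                .iter()
                .map(|(name, ty)| (name.clone(), normalize_dictionary_data_type(ty)))
                .collect(),
        ),
        other => other.clone(),
    }
}

/// Reports whether a Dictionary type appears anywhere in the type tree.
fn has_dictionary_types(dt: &DataType) -> bool {
    match dt {
        DataType::Dictionary(_, _) => true,
        DataType::List(inner) => has_dictionary_types(inner),
        DataType::Struct(fields) => fields.iter().any(|(_, ty)| has_dictionary_types(ty)),
        _ => false,
    }
}
```

For example, a struct containing List(Dictionary(Int32, Utf8)) normalizes to the same struct containing List(Utf8), after which `has_dictionary_types` returns false.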
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix rustfmt formatting in test code Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Add Union type support to dictionary normalization and improve test assertions Extend `data_type_contains_dictionary` and `normalize_dictionary_data_type` to handle `DataType::Union` variants, ensuring dictionary types nested inside unions are properly detected and normalized. Also strengthen the SQLite dictionary round-trip test assertion to check actual row counts instead of just non-emptiness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix clippy needless_collect in Union type normalization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Remove accidentally committed worktree reference * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * fix: Enforce target_chunk_size as hard maximum in chunking (#9973) * fix: Enforce target_chunk_size as hard maximum in chunking When chunking is enabled for embeddings, the underlying text_splitter library may produce chunks that slightly exceed the configured target_chunk_size (e.g. 513-514 tokens with a 500 target). This causes embedding failures when the model has a strict token input limit (e.g. 512 tokens). Add post-processing enforcement that re-splits any oversized chunks using binary search to find the longest prefix fitting within the target. This ensures target_chunk_size acts as a hard maximum rather than a soft target. Overlap from text_splitter is also bounded since the enforcement applies to all output chunks. 
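
A rough sketch of the post-processing step described for #9973, under simplifying assumptions: whitespace-separated words stand in for a real tokenizer, and the function names (`count_tokens`, `longest_fitting_prefix`, `enforce_max_tokens`) are hypothetical. The binary search relies on the token count of a prefix never decreasing as the prefix grows, which holds for this stand-in tokenizer.

```rust
/// Hypothetical token counter: whitespace-separated words stand in for model tokens.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Binary-searches char boundaries for the longest prefix with at most `max_tokens` tokens.
/// Returns the prefix length in bytes.
fn longest_fitting_prefix(text: &str, max_tokens: usize) -> usize {
    // Candidate cut points: every char boundary plus the end of the text.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    // boundaries[0] is the empty prefix, which always fits.
    let (mut lo, mut hi) = (0usize, boundaries.len() - 1);
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2;
        if count_tokens(&text[..boundaries[mid]]) <= max_tokens {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    boundaries[lo]
}

/// Re-splits an oversized chunk so that every piece has at most `max_tokens` tokens.
fn enforce_max_tokens(chunk: &str, max_tokens: usize) -> Vec<String> {
    assert!(max_tokens > 0, "max_tokens must be at least 1");
    let mut pieces = Vec::new();
    let mut rest = chunk.trim();
    while !rest.is_empty() {
        if count_tokens(rest) <= max_tokens {
            pieces.push(rest.to_string());
            break;
        }
        // With max_tokens >= 1 the cut is always past the first character.
        let cut = longest_fitting_prefix(rest, max_tokens);
        pieces.push(rest[..cut].trim_end().to_string());
        rest = rest[cut..].trim_start();
    }
    pieces
}
```

Splitting "one two three four five" with a limit of 2 yields "one two", "three four", and "five". With a real tokenizer the monotonicity assumption would need checking, since subword tokenizers can merge tokens as text is appended.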
Fixes #3326 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: Fix rustfmt formatting in chunking tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: Add merge_group trigger to integration test workflows --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * Enable reqwest compression and optimize HTTP client settings - Enable gzip, brotli, zstd, deflate compression features on reqwest workspace dependency so all HTTP clients negotiate compressed responses - Fix per-request client construction in spice_cloud::post_json by moving the reqwest::Client to a struct field built once in new() - Replace reqwest::get() one-off in metrics status probe with a lazily-initialized shared client with tight timeouts - Add connect/request timeouts to clients missing them: - Kubernetes secret store: 10s connect, 30s request - OTEL HTTP exporter: 10s connect, 30s request - Google GenAI: 10s connect, 300s request - Databricks SQL Warehouse: 10s connect, 30s request * Change Spice Cloud client timeout from 30min to 15min * refactor: Simplify conditional expressions for clarity in multiple files * Add connect_timeout and timeouts to CLI HTTP clients - Add connect_timeout(10s) to RuntimeContext shared client - Add connect_timeout(10s) to GitHubClient - Add connect_timeout(10s) and timeout(30s) to login flow clients * Fix lint issues from develop merge - Add missing # Errors doc sections in cayenne catalog_provider.rs - Fix collapsible_if in file_based_retention_delete_test.rs - Remove unused import in on_conflict_edge_cases_test.rs - Add missing .await on csv_run_and_verify_query calls - Fix needless borrow in schema_evolution test - Fix formatting in cayenne source files * Add connect_timeout to remaining HTTP clients - Zipkin exporter and reachability check in spiced tracing - NSQL request client in repl - Spice Cloud catalog connector - 
CloudClient (new and with_timeout) - LLM provider create_http_client - All spidapter HTTP clients * Address review: propagate client build errors instead of silent fallback - Move reqwest::Client construction from SpiceExtension::new() to initialize(), returning a structured error on failure - Store client as Option<reqwest::Client> with ClientNotInitialized error - Use LazyLock<Result<...>> for metrics client to surface build errors instead of unwrap_or_default() * Propagate client build error in repl NSQL request path * Guard setup-spiceio steps with runner.os == 'macOS' Add if: runner.os == 'macOS' guards to all setup-spiceio, setup-sccache, sccache stats, and kill spiceio steps across workflow files. This prevents failures when jobs run on non-macOS runners (e.g. spiceai-dev-runners) where the gh CLI and spiceio dependencies are unavailable. Matches the existing pattern in codeql-analysis.yml. * Guard setup-spiceio on UNAS_SMB_PASS secret availability Add secrets.UNAS_SMB_PASS != '' condition to setup-spiceio steps and use steps.setup-spiceio.outputs.endpoint != '' for downstream sccache steps. This ensures the entire spiceio/sccache chain is skipped when the secret is unavailable (e.g. fork PRs, new runner environments). * Fix cayenne add_column snapshot and retry non-JSON 403 responses - Update cayenne add_column snapshot to include the lname column, matching the duckdb and sqlite snapshots. - Treat non-JSON HTTP 403 responses as retriable in the GraphQL client. A 403 with a non-JSON body indicates a transient upstream proxy or abuse-detection block (e.g. GitHub's 'Request forbidden by administrative rules'), not a genuine credentials/permissions error (which returns valid JSON). This prevents test flakiness from temporary GitHub abuse detection. * Address review: update 403 retry test, fix formatting - Update test_json_decode_client_error_not_retriable to exclude 403 and add dedicated test_json_decode_forbidden_retriable test. 
- Fix formatting for user_agent format strings (cargo fmt). * Increase SpiceExtension client timeout to 1800s (30 min) * feat: Initial support for write-through accelerated tables (#10115) * wip: mvp write through accelerated tables * docs: Delete notes * refactor: Address comments, simplify staging append, add cayenne partitioned staging write * review: Address comments * fix: Update partition expr from table provider multi-partition-by-expr * refactor: Replace WriteThroughAcceleratedTableProvider into AcceleratedTable * chore: fmt * chore: clippy * review: address comment * Revert "fix: executor startup failures" and "When executor connects, send DDL for existing tables." (#10175) * Revert "fix: executor startup failures (#10155)" This reverts commit 1b639be2fcf90a996fe900e05ba6614eff061b29. * Revert "When executor connects, send DDL for existing tables. (#9904)" This reverts commit 7c6abaa8d43ac0cb428dd3887667aa0960d63840. * fix formatting * fix missing import * fix linter * fix linter * fix: add missing `# Errors` doc section to satisfy clippy::missing_errors_doc * fix: add missing `# Errors` doc sections in staging_wal.rs --------- Co-authored-by: Jack Eadie <jack@spice.ai> * fix: remove PARTITION BY forwarding to Cayenne executors (#10182) * dont have partition by in executors * cleanup * revert: restore partition-table-provider feature for cayenne dependency --------- Co-authored-by: jeadie <jack@spice.ai> * fix: correct dispatch test assertions and runner type typo Fix ready_wait assertions in LoadArgs deserialization tests to expect None, since the load deserializer explicitly strips ready_wait as it is unsupported by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners -> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures. 
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml * fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpch benchmark snapshots for federated/mongodb.yaml * fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml * fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml * fix: Update tpch benchmark snapshots for federated/mssql.yaml * fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml * fix: Update tpch benchmark snapshots for federated/dynamodb.yaml * fix: Update tpch benchmark snapshots for federated/oracle.yaml * fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml * fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml * fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml * fix: Update tpch benchmark snapshots for federated/file[parquet].yaml * fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml * Use balanced expression tree for partition filter combination (#10185) * fix: resolve clippy lint errors in cayenne, write_through, iceberg_ddl, and planner (#10186) * fix: resolve clippy lints in write_through, iceberg_ddl, and planner * fix: use module-level #[expect(dead_code)] in cayenne test common module * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml (#10189) * fix: correct dispatch test assertions and runner type typo Fix ready_wait assertions in LoadArgs deserialization tests to expect None, since the load deserializer explicitly strips ready_wait as it is unsupported by the load workflow. Also fix a typo in the test YAML: spicehq-dev-large-runners -> spiceai-dev-large-runners, which caused SingleOrVec deserialization failures. 
* fix: Update tpch benchmark snapshots for federated/glue[csv].yaml * fix: Update tpch benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpch benchmark snapshots for federated/mongodb.yaml * fix: Update tpch benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[catalog].yaml * fix: Update tpch benchmark snapshots for federated/odbc[databricks].yaml * fix: Update tpch benchmark snapshots for federated/mssql.yaml * fix: Update tpch benchmark snapshots for federated/glue[catalog].yaml * fix: Update tpch benchmark snapshots for federated/dynamodb.yaml * fix: Update tpch benchmark snapshots for federated/oracle.yaml * fix: Update tpch benchmark snapshots for federated/odbc[athena].yaml * fix: Update tpch benchmark snapshots for federated/glue[parquet].yaml * fix: Update tpch benchmark snapshots for federated/iceberg[hadoop].yaml * fix: Update tpch benchmark snapshots for federated/abfs_standard_versioned[parquet].yaml * fix: Update tpch benchmark snapshots for federated/file[parquet].yaml * fix: Update tpch benchmark snapshots for federated/spicecloud[catalog].yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-cayenne[file]-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-cayenne[file]-indexes.yaml * fix: Update tpch benchmark snapshots for accelerated/spicecloud-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/indexes/file[parquet]-arrow-indexes.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/dynamodb-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/mongodb-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/dynamodb-duckdb[file].yaml * fix: Update tpch benchmark snapshots for 
accelerated/file[parquet]-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file]turso.yaml * fix: Update tpch benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml * fix: Update tpch benchmark snapshots for accelerated/mysql-arrow.yaml * fix: Update tpch benchmark snapshots for accelerated/s3[parquet]-duckdb[file]-partitioned.yaml * fix: Update tpch benchmark snapshots for accelerated/postgres-arrow.yaml * fix: Update tpcds benchmark snapshots for federated/s3[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/abfs[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/file[parquet].yaml * fix: Update tpcds benchmark snapshots for federated/databricks[delta_lake].yaml * fix: Update tpcds benchmark snapshots for accelerated/spicecloud-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/databricks[delta_lake]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow-partitioned.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-arrow.yaml * fix: Update tpcds benchmark snapshots for accelerated/s3[parquet]-cayenne[file].yaml * fix: Update tpcds benchmark snapshots for accelerated/file[parquet]-cayenne[file].yaml * fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-cayenne[file]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots for accelerated/on_zero_results/file[parquet]-duckdb[file]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots 
for accelerated/on_zero_results/file[parquet]-duckdb[memory]-on_zero_results.yaml * fix: Update tpcds benchmark snapshots for accelerated/postgres-arrow.yaml * Trigger CI * Update workspace configuration in Cargo.toml --------- Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> * feat: Add CREATE TABLE ... (LIKE ...) support (#10180) * feat: Add CREATE TABLE ... (LIKE ...) support * feat: Support struct() IN-list filters in Cayenne position-based delete (#10191) * feat: Support struct() IN-list filters in Cayenne position-based delete Decompose struct(k1, k2) IN (SET) expressions into balanced OR-trees of AND-equalities for Vortex pushdown. DataFusion converts tuple IN-lists (k1, k2) IN ((v1,w1), ...) to struct() IN-list which Vortex cannot handle. - Add try_decompose_struct_in_list() to position_based.rs - Handle CAST-wrapped struct() expressions (type coercion) - Cast literal values to match column types (e.g., Int64 to Int32) - Build balanced binary OR-tree for O(log N) depth vs O(N) linear chain - Add balanced_or_exprs() for MERGE delete filters (same stack overflow fix) * fix: Parse partition expressions to extract column references for MERGE validation (#10192) * fix: Parse partition expressions to extract column references for MERGE validation Restore strict MERGE primary_key/on_conflict validation by parsing partition expressions (e.g., bucket(5, c_nationkey)) with sqlparser AST Visitor to extract referenced column names, instead of requiring exact string matches on the partition column. 
- Add extract_partition_column_references() using sqlparser Visitor - Handle simple columns, compound identifiers, and transform expressions - Add unit tests for various partition expression formats * feat: Spidapter staging table support and SQL execution extraction * fix(BigQuery): fix Unsupported subqueries in JOIN ON predicates (TPC-H) (#10195) * fix: clippy auto-fixes from merge with develop --------- Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: claudespice <claude@spice.ai> Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation (#10229) * feat(databricks): DESCRIBE TABLE fallback and source-native type parsing for Lakehouse Federation Databricks Lakehouse Federation foreign tables (e.g. Neon PostgreSQL) have two issues with schema introspection: 1. information_schema.columns returns no rows (data_array is null) because the table has no column metadata registered in Unity Catalog. 2. When data_type column IS populated, it returns source-native type names (integer, text, numeric, timestamp without time zone) instead of Spark SQL types. 
Changes: - Add DESCRIBE TABLE as third-tier schema introspection fallback after information_schema.columns full_data_type and data_type attempts fail - Catch ParseError from information_schema path to also fall through to DESCRIBE TABLE for unrecognized native types - Add source-native type parsing: integer, text, numeric, real, double precision, character varying, varchar, timestamp with/without time zone - Extract get_schema_from_information_schema helper method - Add schema_from_describe_json parser for DESCRIBE TABLE responses (defaults all columns to nullable since DESCRIBE lacks nullability info) - Add create_describe_payload with backtick-quoted identifiers - Add comprehensive tests for all fallback paths and native types * fix: address PR review comments - Validate multi-word timestamp types (time/zone tokens) instead of blind advance() calls; add expect_identifier helper - Validate DECIMAL/NUMERIC precision <= 38 and scale <= precision - Validate varchar/character varying length is a Number token - Remove ParseError -> DESCRIBE TABLE fallback (DESCRIBE TABLE returns Spark types, so a ParseError from information_schema indicates a real problem that should surface as an error) * feat: Add pagination support to HTTP data connector (#10228) * feat: Add pagination support to HTTP data connector Add configurable pagination for HTTP API endpoints, supporting two modes: - URL mode: next page URL from response body (JSON pointer) or HTTP Link header with rel="next" - Token mode: cursor/token extracted via JSON pointer and passed as a query parameter in subsequent requests Configuration parameters: - pagination: enabled/disabled - pagination_next_pointer: JSON pointer to next URL/cursor - pagination_link_header: use Link header for pagination - pagination_token_param: query param name for cursor tokens - pagination_data_pointer: JSON pointer to data array per page - pagination_max_pages: safety limit (default: 100) Key design decisions: - Streaming execution via 
futures::stream::try_unfold yields one RecordBatch per page, avoiding buffering entire result sets - SSRF protection validates next-page URLs share the base URL origin - Works transparently with caching, append, and full (with refresh_sql) acceleration modes - Refactored batch creation into create_batch_from_rows for reuse between paginated and non-paginated paths Includes 18 unit tests covering Link header parsing, JSON pointer extraction, SSRF rejection, data pointer extraction, token/query building, config validation, and edge cases (null, empty, missing). * fix: address PR review feedback for HTTP pagination - Return error instead of silently truncating when max_pages is reached - Fall through from missing JSON pointer to Link header check - Support relative URLs for next-page links via base_url.join() - Use url::form_urlencoded for proper query parameter encoding - Don't stop pagination on empty data rows if next page exists - Fix clippy and formatting issues * fix: fail loudly on pagination JSON/pointer errors - Error on invalid JSON when pagination_next_pointer is configured - Error on non-string/non-null pointer values instead of silently ending - Error on missing/invalid data pointer instead of returning empty rows - Validate JSON Pointer syntax (must start with '/') in with_pagination - Add tests for all new error cases * fix: address pagination review feedback (round 3) - Preserve base URL query params in token pagination via merge_queries() - Track actual per-page path/query for accurate request_* columns - Broaden Link header parsing: handle unquoted rel=next and multi-value rel - Avoid intermediate Vec allocation for content column (from_iter_values) - Add comment noting cache bypass for subsequent pages is intentional - Add tests for merge_queries, unquoted rel, and multi-rel Link headers * feat: change pagination to auto/enabled/disabled mode Default is 'auto' which auto-detects Link headers on every HTTP response without requiring explicit config. 
'enabled' requires explicit pagination config. 'disabled' turns off pagination. In auto mode with no other pagination params configured, only Link header detection is active. This means pagination 'just works' for APIs that use standard Link headers (GitHub, etc.). * fix: address pagination review feedback (round 4) - has_dynamic_api_params only true when pagination explicitly configured or pagination params are set (not in auto-detect-only mode) - Loop internally to skip empty pages instead of yielding empty batches - Fix max_pages error message: 'query was aborted' not 'may be incomplete' - Parse response JSON once per page, reuse for both next-page and data extraction (avoids duplicate deserialization on large responses) * fix: return data instead of error when max_pages reached * fix: merge base URL query for page 0, validate pagination paths, treat auto/link/max_pages as dynamic * fix: use top-level splitting for Link header parsing (RFC 8288 compliance) * test: add comprehensive tests for Link header parsing functionality * refactor: simplify query merging logic and enhance pagination handling * fix: correct logic for determining link header usage in pagination * Update snapshot --------- Co-authored-by: Viktor Yershov <viktor@spice.ai> * fix(databricks): harden HTTP retries, compression, and token refresh (#10232) * fix(databricks): harden HTTP retries and encodings * feat(databricks): implement response body draining for retry logic in HTTP requests * refactor(databricks): replace write! 
macro with write_fmt for header formatting * fix(databricks): clamp short-lived token refreshes * fix(databricks): include nested HTTP error causes * fix(databricks): implement bounded retry delay to clamp maximum backoff duration * refactor: deduplicate HTTP retry logic into shared resilient_http module - Extract send_request_with_retry from databricks token provider into data_components::resilient_http (make pub) - Remove ~210 lines of duplicated retry logic, backoff, Retry-After parsing, and response body draining from databricks.rs - Truncate AWS APN app_name to 50 chars to avoid SDK warning - Add length assertion to app_name test - Remove unused httpdate dependency from runtime crate * refactor(databricks): simplify access token request logic by removing unnecessary line breaks * refactor(databricks): improve import statement for fibonacci_backoff module * refactor(databricks): update import statement for fibonacci_backoff and remove redundant test * refactor(databricks): enhance SQL Warehouse API with request concurrency control and update HTTP client configuration * refactor(unity_catalog): remove unused Duration import * refactor(databricks): enhance SQL Warehouse API with concurrency control and improve error handling * refactor(databricks): simplify concurrency permit acquisition and clean up assertion formatting * refactor(databricks): improve URL parsing in token endpoint function for better validation * refactor(resilient_http): streamline concurrency permit handling and improve variable naming * refactor(databricks): enhance token endpoint URL validation for localhost and simplify test attributes * fix(databricks): fall back to DESCRIBE TABLE on QueryFailure from information_schema When querying information_schema.columns fails with errors like UNSUPPORTED_DATA_SOURCE (e.g. for Lakehouse Federation foreign tables backed by unsupported providers), fall back to DESCRIBE TABLE instead of propagating the error. 
This matches the existing fallback behavior for TableSchemaNotRegistered and NoColumnsInDataset. * fix(databricks): add fallback to DESCRIBE TABLE for unsupported data sources in schema retrieval * fix(databricks): simplify error handling in schema retrieval fallback to DESCRIBE TABLE * fix: Full Text Search schema mismatch with ADBC connector (#10235) * fix: Compare batch schema before/after compute_index instead of against plan schema The IndexerExec schema guard was comparing each batch's schema against the execution plan's advertised schema (input_exec.schema()) using full Schema equality including metadata. ADBC connectors can return batches with slightly different schema metadata than advertised, causing a false 'Index full_text changed schema' error even though compute_index returns batches unchanged. Compare the batch schema before vs. after each compute_index call using fields-only comparison to correctly detect actual schema mutations while tolerating benign metadata differences. Fixes #10223 * test: Add regression test for metadata-only schema difference (#10223) Adds pipeline_tolerates_metadata_only_schema_difference test to verify that batches with identical fields but different schema metadata do not trigger a false 'changed schema' error from IndexerExec. * fix: Add spaces in schema changed error message for readability * refactor: Improve formatting and readability in test for schema metadata handling * fix: Validate incoming batch fields against advertised schema Add a fields-only check that incoming batches match the IndexerExec's advertised schema before running indexes. This catches input stream schema violations early with a clear error attributing the mismatch to the input rather than to an index. 
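The fields-only comparison described for the IndexerExec schema guard can be illustrated with a minimal sketch. The `Field` and `Schema` types below are simplified, hypothetical stand-ins (not the Arrow types), and `same_fields` is an illustrative name, not the function used in the fix.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for a schema and its fields.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    // Schema-level metadata; a connector may attach extra keys here.
    metadata: HashMap<String, String>,
}

/// Fields-only comparison: schema-level metadata is deliberately ignored.
fn same_fields(a: &Schema, b: &Schema) -> bool {
    a.fields == b.fields
}

/// Build a one-column example schema carrying the given metadata pairs.
fn example_schema(metadata: &[(&str, &str)]) -> Schema {
    Schema {
        fields: vec![Field {
            name: "body".to_string(),
            data_type: "Utf8".to_string(),
            nullable: true,
        }],
        metadata: metadata
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}
```

Full equality (`==`) on two such schemas fails when only the metadata differs, while `same_fields` still reports a match; that is the distinction the guard relies on.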
* docs: Fix regression test docstring to match actual behavior * fix: Update Search integration test snapshots --------- Co-authored-by: Claude <claude@Claudes-Mac-mini.local> Co-authored-by: claudespice <claude@spice.ai> Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix: Fix clippy doc_markdown lint in megascience test comments * fix: Use unit-appropriate timestamp methods in Turso RFC3339 parsing Previously, parsing RFC3339 timestamps in the Turso connector always converted through nanoseconds using timestamp_nanos_opt().unwrap_or(0). This silently returned epoch 0 (1970-01-01) for dates outside the i64 nanosecond range (~1677-2262), corrupting query results. Use chrono's unit-specific methods (timestamp(), timestamp_millis(), timestamp_micros()) for Second/Millisecond/Microsecond units to avoid the nanosecond overflow entirely. For Nanosecond unit, return NULL instead of epoch 0 when overflow occurs. Add regression test verifying year 2300 timestamps are handled correctly across all time units. 
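The overflow behind the Turso timestamp fix can be reproduced with plain integer arithmetic. This is a sketch only: it uses `i64` seconds rather than chrono values, and `TimeUnit`, `epoch_value`, and `YEAR_2300_SECS` are names assumed for illustration.

```rust
/// Seconds since the Unix epoch for 2300-01-01T00:00:00Z
/// (120,530 days * 86,400 seconds per day).
const YEAR_2300_SECS: i64 = 10_413_792_000;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Convert (seconds, sub-second nanoseconds) directly into the target unit.
/// Returns `None` on overflow instead of silently substituting epoch 0.
fn epoch_value(secs: i64, subsec_nanos: u32, unit: TimeUnit) -> Option<i64> {
    let nanos = i64::from(subsec_nanos);
    match unit {
        TimeUnit::Second => Some(secs),
        TimeUnit::Millisecond => secs.checked_mul(1_000)?.checked_add(nanos / 1_000_000),
        TimeUnit::Microsecond => secs.checked_mul(1_000_000)?.checked_add(nanos / 1_000),
        TimeUnit::Nanosecond => secs.checked_mul(1_000_000_000)?.checked_add(nanos),
    }
}
```

A year-2300 value fits comfortably in seconds, milliseconds, and microseconds, but exceeds `i64::MAX` in nanoseconds, so only the nanosecond path yields `None` (which a caller would map to NULL).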
* fix: resolve race condition in HashIndex::insert_or_replace length tracking The global length counter in HashIndex::insert_or_replace was updated by comparing shard.len() before and after the operation, but these reads used separate lock acquisitions. Between them, other threads could modify the shard, causing the global counter to drift (e.g., 103 or 125 instead of 100 for 100 unique keys). Fix: have ShardTable::insert_or_replace return whether the entry was newly inserted (true) or replaced (false), and use that to decide atomically whether to increment the global counter — all within the shard's write lock. Also apply cargo fmt to turso.rs test assertions. * fix: Update Search integration test snapshots (#10281) * Improve Snowflake/ADBC dataset registration performance and observability (#10266) * Improve Snowflake/ADBC dataset registration performance and observability - Add duration_ms timing logs to Snowflake connection pool creation, connectivity validation, schema/table listing, and query execution - Parallelize Snowflake catalog discovery: schema refreshes and table provider creation now use buffer_unordered(10) instead of sequential loops - Add duration_ms timing logs to ADBC catalog discovery and parallelize table provider creation with buffer_unordered(10) - Move dataset_load_parallelism semaphore to only gate schema inference (read_provider) and initial data load, not connector creation or validation - Add general dataset registration timing: connector creation, schema inference, and registration each log duration_ms for all connectors * Improve dataconnector parallelization * fix: Enhance error logging for Snowflake connectivity validation and improve async dataset loading syntax * Lint * Fix * fix: Refactor tracing macros for improved readability and consistency in Snowflake and ADBC modules * fix: address PR review comments - move type alias before statements, add semicolon * fix: narrow load semaphore to only gate read_provider (schema inference) 
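For the `HashIndex::insert_or_replace` length-tracking fix described earlier in this message, the idea of deciding "new vs. replaced" while the shard's write lock is held can be sketched as follows. `ShardedIndex` and `concurrent_len` are hypothetical names; the real `HashIndex` and `ShardTable` types are not shown in this commit page, so this is an approximation of the approach, not the actual code.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical sharded map with a global length counter.
struct ShardedIndex {
    shards: Vec<RwLock<HashMap<u64, u64>>>,
    len: AtomicUsize,
}

impl ShardedIndex {
    fn new(shard_count: usize) -> Self {
        Self {
            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
            len: AtomicUsize::new(0),
        }
    }

    fn insert_or_replace(&self, key: u64, value: u64) {
        let idx = (key as usize) % self.shards.len();
        let mut shard = self.shards[idx].write().expect("shard lock poisoned");
        // `HashMap::insert` returns `None` only when the key was absent, so the
        // "newly inserted" decision is made while the write lock is still held.
        let newly_inserted = shard.insert(key, value).is_none();
        if newly_inserted {
            self.len.fetch_add(1, Ordering::Relaxed);
        }
    }

    fn len(&self) -> usize {
        self.len.load(Ordering::Relaxed)
    }
}

/// Insert the same `unique_keys` keys from several threads; return the final length.
fn concurrent_len(threads: usize, unique_keys: u64) -> usize {
    let index = Arc::new(ShardedIndex::new(8));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let index = Arc::clone(&index);
            thread::spawn(move || {
                for key in 0..unique_keys {
                    index.insert_or_replace(key, key);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    index.len()
}
```

Because the counter is only incremented when the insert reports a previously absent key, the count cannot drift the way a before/after `len()` comparison across separate lock acquisitions can.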
--------- Co-authored-by: Viktor Yershov <viktor@spice.ai> * Fixes for kafka connector (#10263) * Fixes for kafka connector * Lint * Lint * fix(runtime): gate otel code tags, suppress aws sdk noise, and unblock connector init (#10260) * fix(tracing): gate otel code tags and suppress aws sdk noise * fix(lint): address clippy findings * refactor: streamline connector factory retrieval in parameters and mod.rs * refactor: simplify dataset creation and parameter building in mod.rs * fix(lint): address review and cfg warnings * fix(lint): gate feature-specific runtime warnings * fix: handle unsupported acceleration connections in update_fetched_at * refactor: update CatalogResponseItem type definition for clarity * fix(runtime): avoid regionless AWS SDK loads (#10271) * fix(tracing): gate otel code tags and suppress aws sdk noise * fix(lint): address clippy findings * refactor: streamline connector factory retrieval in parameters and mod.rs * refactor: simplify dataset creation and parameter building in mod.rs * fix(lint): address review and cfg warnings * fix(lint): gate feature-specific runtime warnings * fix: handle unsupported acceleration connections in update_fetched_at * refactor: update CatalogResponseItem type definition for clarity * fix(runtime): avoid regionless aws sdk loads * fix(runtime): streamline IAM role source handling in S3 connector * fix(runtime): address aws sdk review feedback * Improve Snowflake/ADBC dataset registration performance and observability (#10266) * Improve Snowflake/ADBC dataset registration performance and observability - Add duration_ms timing logs to Snowflake connection pool creation, connectivity validation, schema/table listing, and query execution - Parallelize Snowflake catalog discovery: schema refreshes and table provider creation now use buffer_unordered(10) instead of sequential loops - Add duration_ms timing logs to ADBC catalog discovery and parallelize table provider creation with buffer_unordered(10) - Move 
dataset_load_parallelism semaphore to only gate schema inference (read_provider) and initial data load, not connector creation or validation - Add general dataset registration timing: connector creation, schema inference, and registration each log duration_ms for all connectors * Improve dataconnector parallelization * fix: Enhance error logging for Snowflake connectivity validation and improve async dataset loading syntax * Lint * Fix * fix: Refactor tracing macros for improved readability and consistency in Snowflake and ADBC modules * fix: address PR review comments - move type alias before statements, add semicolon * fix: narrow load semaphore to only gate read_provider (schema inference) --------- Co-authored-by: Viktor Yershov <viktor@spice.ai> * fix(runtime): clarify region handling in SDK config documentation --------- Co-authored-by: Viktor Yershov <viktor@spice.ai> * Add versioned release install workflow coverage (#10276) * Add versioned coverage to release install E2E * Fix release install workflow failures * Fix WSL preview runtime restart * Address release install PR feedback * fix: Update Search integration test snapshots --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> * fix: Update Search integration test snapshots (#10268) Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> * fix: error on MSSQL/Oracle timestamp nanosecond overflow The MSSQL `DATETIME2` and `DATETIMEOFFSET` paths and the Oracle `TIMESTAMP WITH TIME ZONE` path used `.timestamp_nanos_opt().unwrap_or_default()` to convert `chrono` values to Arrow nanosecond timestamps. When the input falls outside the i64 nanosecond range (~1677-09-21 to 2262-04-11), `timestamp_nanos_opt()` returns `None` and `unwrap_or_default()` silently substitutes 0, which Arrow then interprets as 1970-01-01 UTC. 
DATETIME2 and DATETIMEOFFSET support years 0001-9999, and Oracle's TIMESTAMP types do as well, so any value outside the nanosecond window was being silently rewritten to epoch 0 in query results. This is the same class of bug fixed for the Turso connector in #10282. Replace the silent default with a typed error in both connectors. The Oracle change reuses the existing `FailedToConvertNaiveDateTimeToNanos` variant; MSSQL gets a new `FailedToConvertTimestampToNanos` variant. Both connectors now extract the conversion into a small helper so the overflow behavior is unit-testable without a live database. Adds regression tests covering both the in-range happy path and the out-of-range overflow case for `DATETIME2`, `DATETIMEOFFSET`, and Oracle `TIMESTAMP WITH TIME ZONE`. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: Claude <claude@Claudes-Mac-mini.local> Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Claude <claude@spices-MacBook.localdomain> Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> * docs: fix status badges in README (#10350) * docs: fix status badges in README - Use markdown badge syntax for all badges. 
- Fix wrong workflow references: - spiced_docker_nightly.yml -> spiced_docker_dev.yml - build_and_release.yml -> build_nightly.yml - benchmarks.yml -> testoperator_run_bench.yml - Wrap badges in centered div blocks so GitHub renders markdown inside. * docs: point unit tests badge to build_and_release on trunk * docs: align CodeQL badge link filter with image filter * Migrate secrets to envs (#10354) * Add limit pushdown and improve sort pushdown for Oracle and MSSQL (#10351) * Implement sort pushdown support and fix pushdown gaps across providers Implement DataFusion v52 `try_pushdown_sort` for transparent wrapper execution plans (CayenneAccelerationExec, SchemaCastScanExec, BytesProcessedExec) by delegating to their child plans, and for SQL providers (MSSQL, Oracle, FlightSQL) by generating ORDER BY clauses. Also fix limit pushdown consistency in wrappers (delegate supports_limit_pushdown/with_fetch/fetch to child plans instead of returning mismatched values), and extend MSSQL filter pushdown to support NotEq, And, Or, Not, IsNull, IsNotNull, Like, InList, and Between expressions. * fix: enhance sort pushdown error handling and improve filter classification logic * Address PR review comments: improve sort pushdown correctness - FlightSQL: Replace filter_map with fallible map in sql() ORDER BY generation to return an error instead of silently dropping non-Column sort expressions. Add InvalidSortExpression error variant. - MSSQL: Make classify_mssql_filter recursively check time-related expressions in And/Or/Not/IsNull/IsNotNull/Like sub-expressions to prevent time-related filters from being pushed down via compound exprs. - SchemaCastScanExec: Propagate input ordering through equivalence properties and set maintains_input_order to true, since schema casting preserves row order. - FlightSQL tests: Add unit tests for try_pushdown_sort (unsupported for non-column, exact for column) and sql() ORDER BY clause generation. 
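The FlightSQL review item above (a fallible `map` in place of `filter_map`) can be sketched like this. `SortKey`, `SqlError`, and `order_by_clause` are hypothetical names for illustration; only the `InvalidSortExpression` variant name comes from the commit message, and the real code operates on DataFusion sort expressions rather than this toy enum.

```rust
// Hypothetical sort key: either a plain column or some other expression.
#[derive(Debug, Clone, PartialEq)]
enum SortKey {
    Column { name: String, descending: bool },
    Expression(String),
}

#[derive(Debug, PartialEq)]
enum SqlError {
    InvalidSortExpression(String),
}

/// Build an ORDER BY clause, failing on the first non-column sort key
/// instead of silently dropping it (which would change result ordering).
fn order_by_clause(keys: &[SortKey]) -> Result<String, SqlError> {
    let parts = keys
        .iter()
        .map(|key| match key {
            SortKey::Column { name, descending } => {
                let direction = if *descending { "DESC" } else { "ASC" };
                // Double any embedded quotes so the identifier stays well-formed.
                let quoted = name.replace('"', "\"\"");
                Ok(format!("\"{quoted}\" {direction}"))
            }
            SortKey::Expression(expr) => Err(SqlError::InvalidSortExpression(expr.clone())),
        })
        .collect::<Result<Vec<String>, SqlError>>()?;

    if parts.is_empty() {
        Ok(String::new())
    } else {
        Ok(format!("ORDER BY {}", parts.join(", ")))
    }
}
```

Collecting into `Result<Vec<_>, _>` short-circuits on the first error, so a sort that cannot be expressed in SQL surfaces as an error rather than producing a query with a weaker ordering than the planner assumed.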
* Remove unsafe ordering propagation from SchemaCastScanExec Do not copy input ordering into EquivalenceProperties since schema casting can change data types and projected columns, making the input ordering invalid for the output schema. Retain maintains_input_order=true since row order is preserved. * Fix CI failures: restore SchemaCastScanExec ordering and fix SQL double-space - Restore ordering propagation in SchemaCastScanExec::new() that was incorrectly removed, fixing SortPreservingMergeExec invariant violations in partition integration tests. - Fix double-space in generated SQL for Oracle, FlightSQL, and MSSQL execution plans when order_expr is empty. Build SQL incrementally, appending clauses only when non-empty. - Update oracle test-framework snapshots to match corrected SQL output. * Upgrade datafusion-table-providers to 4e8b2b0bd0f0 (pushdown support) (#10341) * Refactor BytesProcessedExec to simplify fetch and pushdown sort methods * Fix schema_cast ordering: remap column indices by name and add tests - Remap sort expression column indices from input to output schema by name, since SchemaCastScanExec may reorder columns relative to input - Only propagate ordering when ordered columns have identical types - Add 3 unit tests: ordering propagated (same types), not propagated (type differs), and indices remapped (reordered columns) - Add branch comment to datafusion-federation git dependency * Update refresh_max_timestamp_df plan snapshot * Update cluster::distributed_cayenne_catalog snapshots * Update duckdb_json_functions snapshots * Update datafusion version * Update datafusion version * Update to datafusion-federation rev 42245bdd58ee3d7da8276e83d85fb1c52aec916e * Revert "Update refresh_max_timestamp_df plan snapshot" This reverts commit 244fb05060d3787555fff13fc62dd6df16c50bfe. 
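The double-space fix mentioned above ("Build SQL incrementally, appending clauses only when non-empty") might look roughly like the following. `join_clauses` is a hypothetical helper, not a function from the repository.

```rust
/// Join SQL fragments with single spaces, skipping fragments that are empty
/// or whitespace-only so that a missing clause leaves no double space behind.
fn join_clauses(clauses: &[&str]) -> String {
    let mut sql = String::new();
    for clause in clauses.iter().map(|c| c.trim()).filter(|c| !c.is_empty()) {
        if !sql.is_empty() {
            sql.push(' ');
        }
        sql.push_str(clause);
    }
    sql
}
```

By contrast, a fixed template such as `format!("{base} {where_clause} {order_by}")` emits consecutive spaces whenever a middle clause is empty, which is enough to invalidate text-based plan snapshots.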
* Update distributed_acceleration snapshot * Add limit pushdown and improve sort pushdown for Oracle and MSSQL * Fix Exact->Inexact * Revert "Fix Exact->Inexact" This reverts commit f423db9007ea20de0c55eee3f0a74af465998371. --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> * Fix ubuntu mirror configuration (#10359) * Add step to verify apt mirror configuration in GitHub Action * Fix apt mirror substitution to match only archive.ubuntu.com Improve awk regex to avoid greedy matching and add error check to fail if archive.ubuntu.com remains after substitution. * Improve apt mirror substitution check for Pittsburgh mirror The script now verifies that the Pittsburgh mirror is present in ubuntu.sources after substitution, rather than checking for the absence of archive.ubuntu.com, which is intentionally retained as a fallback. This avoids false negatives and ensures the mirror substitution is effective. * Simplify deb822 mirror substitution using sed for archive URIs * Update apt mirror check to use PRIMARY variable Check for the configured primary mirror in ubuntu.sources using the PRIMARY variable instead of a hardcoded hostname. Update error message to include the actual PRIMARY value for clarity. * fix: Increase throughput test default ready_wait from 30s to 300s (fixes #8207) (#10344) The throughput workflow's `ready_wait` input defaulted to 30 seconds, which is insufficient for tests loading data from external sources like MongoDB. The dispatch configs specify adequate timeouts (e.g. 600s for mongodb-arrow), but manual workflow triggers via the GitHub UI used the low default, causing "Spiced instance not ready within 30s" failures. 
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * Add auth headers support to OTEL metrics exporter (#10347) * Add auth headers support to OTEL metrics exporter Add a 'headers' field to OtelExporterConfig for sending authentication headers with OTEL metric export requests. Headers are applied as HTTP headers for HTTP protocol or gRPC metadata entries for gRPC protocol. Header values support secret replacement syntax (e.g. ${secrets:api_key}). This enables authentication with services like Datadog (DD-API-KEY header) and Grafana Cloud (Authorization header). * Add default value for headers in Spicepod schema and enhance gRPC exporter error handling * Refactor assertions in HTTP and gRPC exporter tests for improved readability * Fix gRPC exporter tests to use tokio runtime * Address review: rename shadowed vars, fix test runtime setup * Address review: pass owned headers, document gRPC key constraint, add tokio runtime to test * Add Clippy expectation for implicit hasher in create_otel_periodic_reader * Address review: document resolved_headers vs config.headers; note gRPC key constraint in schema * Fix YAML string formatting in test for OTEL exporter headers * fix linter warnings --------- Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> Co-authored-by: ewgenius <hey@ewgenius.me> * fix(github): shrink GraphQL page size on gateway errors; lower comment defaults (#10355) * fix(github): shrink GraphQL page size on gateway errors; lower comment defaults - Lower default `github_max_comments_fetched` from 75 to 25 to reduce worst-case node count per page and keep queries within GitHub's secondary rate limit budget. Cap remains 75. - Reduce PR outer `first:` page size from 100 to 25 when `include_comments` is enabled (review/discussion/all). Without comments the page stays at 100. - Reduce inner `comments(first: ...)` in issues query from 100 to 25 to match. 
- Add `PullRequestTableArgs::check_node_limit()` that estimates per-page node count and rejects configurations that would exceed GitHub's 500,000 node hard limit with an actionable error. Invoked eagerly from `read_provider` so misconfigurations fail fast rather than at query time with an opaque 502. - Graphql client: on 502/503/504 gateway errors, shrink the outer `first:` page size via a reverse-Fibonacci ladder (100,55,34,21,13,8,5,3,2,1) and rewrite the query AST on retry. This lets very large queries against overloaded GitHub endpoints succeed on a subsequent attempt with a smaller payload instead of replaying the same oversized query. Fixes the `spicehq_spiceai.pulls` 502 Bad Gateway errors observed with `include_comments: all` against large repositories. * fix(github): improve pagination handling and node count estimation for pull requests * fix(graphql): improve error handling for page size override locking * fix(graphql): preserve LIMIT 0 semantics, only clamp page_size_override --------- Co-authored-by: Evgenii Khramkov <evgenii@spice.ai> * Relax apt mirror substitution failure to warning in CI action (#10361) * Relax apt mirror substitution failure to warning in CI action * Update .github/actions/configure-apt/action.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update crates/llms/Cargo.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * chore: update cudarc version to 0.19.4 and remove unused 0.12.2 dependency * Bump candle / mistral.rs / TEI forks to pick up moe_gemm_wmma rename Resolves ld.lld duplicate-symbol link error when building spiced with cuda + models features: ld.lld: error: duplicate sym…
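
The reverse-Fibonacci page-size ladder from the GitHub GraphQL change (#10355) can be sketched as below. The ladder values and the 502/503/504 status codes come from the commit message; the function names are assumptions, and rewriting the query AST on retry is out of scope for this sketch.

```rust
/// Page sizes tried in order after successive gateway errors.
const PAGE_SIZE_LADDER: [usize; 10] = [100, 55, 34, 21, 13, 8, 5, 3, 2, 1];

/// Gateway errors that suggest the query payload was too heavy.
fn is_gateway_error(status: u16) -> bool {
    matches!(status, 502 | 503 | 504)
}

/// Largest ladder step strictly below `current`, or `None` once at the bottom.
fn next_smaller_page_size(current: usize) -> Option<usize> {
    PAGE_SIZE_LADDER.iter().copied().find(|&step| step < current)
}

/// Page size to use on the next attempt given the last HTTP status.
fn page_size_after(status: u16, current: usize) -> usize {
    if is_gateway_error(status) {
        next_smaller_page_size(current).unwrap_or(current)
    } else {
        current
    }
}
```

Starting from the comment-enabled default of 25, the first gateway error would step down to 21, then 13, and so on; non-gateway errors leave the page size untouched in this sketch.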

1 parent b31f696 commit f4e7238

6 files changed

Cargo.lock
Cargo.toml
crates
- llms
  - Cargo.toml
- runtime
  - src/cluster/partition
    - write_through.rs
  - tests
    - cayenne
      - mod.rs
    - databricks_sql_warehouse_permissions
      - mod.rs
