Add PollNow interrupt for Ballista executors to reduce task scheduling latency by phillipleblanc · Pull Request #9098 · spiceai/spiceai

phillipleblanc · 2026-01-26T07:02:35Z

This PR lets the scheduler send a PollNow message to connected executors over the bidirectional control stream. The message interrupts the 100ms idle sleep in the Ballista poll_loop, so the executor polls for new work immediately.

Currently, executors poll the scheduler every 100ms when idle. This means there can be up to 100ms latency between when work becomes available and when an executor picks it up. By sending a PollNow signal, we can reduce this latency to near-zero for interactive workloads.
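
The interruptible idle wait described above can be sketched roughly as follows. This is a minimal std-only illustration, not the code in this PR: `PollNowSignal`, `notify`, and `wait_idle` are hypothetical names, and a `Mutex<bool>` plus `Condvar` stands in for the shared async Notify handle that the PR wires into poll_loop.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical std-only stand-in for the shared Notify handle (assumption,
/// not the type used in the PR).
#[derive(Default)]
pub struct PollNowSignal {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl PollNowSignal {
    /// Called when a PollNow message arrives on the control stream.
    /// The flag is stored so a signal sent before the wait starts is not lost.
    pub fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_one();
    }

    /// Idle wait used between polls. Returns `true` if the wait was cut short
    /// by a PollNow signal and `false` if the full idle interval elapsed.
    pub fn wait_idle(&self, idle: Duration) -> bool {
        let guard = self.pending.lock().unwrap();
        // Sleep until the flag is set or the idle interval runs out.
        let (mut guard, _timeout) = self
            .cv
            .wait_timeout_while(guard, idle, |pending| !*pending)
            .unwrap();
        let woken = *guard;
        *guard = false; // consume the signal so the next wait starts clean
        woken
    }
}
```

In this sketch the poll loop would call `wait_idle(Duration::from_millis(100))` after an empty poll and then poll again regardless of the return value. The signal only shortens the wait, so a lost or never-sent PollNow degrades to the existing 100ms behavior.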

Architecture

┌─────────────────┐                    ┌─────────────────┐
│    Scheduler    │                    │    Executor     │
│                 │                    │                 │
│  ┌───────────┐  │   ControlStream    │  ┌───────────┐  │
│  │ ClusterSvc│──┼────────────────────┼──│CtrlStream │  │
│  │   Impl    │  │   PollNow msg      │  │  Client   │  │
│  └─────┬─────┘  │   ─────────────►   │  └─────┬─────┘  │
│        │        │                    │        │        │
│        │        │                    │        ▼        │
│        │        │                    │  ┌───────────┐  │
│        │        │                    │  │  Notify   │  │
│        │        │                    │  │(per-sched)│  │
│        │        │                    │  └─────┬─────┘  │
│        │        │                    │        │        │
│        │        │                    │        ▼        │
│  ┌───────────┐  │                    │  ┌───────────┐  │
│  │  poll_work│◄─┼────────────────────┼──│ poll_loop │  │
│  └───────────┘  │   gRPC request     │  └───────────┘  │
└─────────────────┘                    └─────────────────┘

Depends on datafusion-ballista#12 (Add poll_now_notify parameter to poll_loop and on_work_available callback)

- Add PollNowCommand proto message and update SchedulerControlMessage
- Add ExecutorStreamRegistry to track connected executor control streams
- Implement broadcast_poll_now() to send PollNow to all executors
- Update ControlStreamManager to provide shared Notify handle for poll wake-up
- Wire poll_now_notify through spawn_scheduler_poll_loop and update_scheduler_pollers
- Add executor_stream_registry field to DataFusion struct for scheduler access
- Update ClusterServiceImpl to use shared ExecutorStreamRegistry
- Update Cargo.toml to reference ballista fork commit e0ab27eeef9f
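
For illustration, a registry that tracks executor control streams and broadcasts PollNow could look roughly like the sketch below. It assumes one in-process channel per executor. `StreamRegistry`, `ControlMessage`, and the method signatures are hypothetical and do not reproduce the registry added in this PR, which sends a protobuf message over a gRPC stream.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical control message; the PR defines its own protobuf type.
#[derive(Debug, Clone, PartialEq)]
pub enum ControlMessage {
    PollNow,
}

/// Sketch of a registry holding one sender per connected executor.
#[derive(Default)]
pub struct StreamRegistry {
    streams: Mutex<HashMap<String, Sender<ControlMessage>>>,
}

impl StreamRegistry {
    /// Registers an executor and returns the receiving end of its stream.
    pub fn register(&self, executor_id: &str) -> Receiver<ControlMessage> {
        let (tx, rx) = channel();
        self.streams
            .lock()
            .unwrap()
            .insert(executor_id.to_string(), tx);
        rx
    }

    /// Removes an executor, for example when its control stream closes.
    pub fn unregister(&self, executor_id: &str) {
        self.streams.lock().unwrap().remove(executor_id);
    }

    /// Sends PollNow to every registered executor and drops entries whose
    /// receiver has gone away. Returns how many executors were reached.
    pub fn broadcast_poll_now(&self) -> usize {
        let mut streams = self.streams.lock().unwrap();
        streams.retain(|_, tx| tx.send(ControlMessage::PollNow).is_ok());
        streams.len()
    }
}
```

Pruning dead senders during the broadcast is a design choice of this sketch. It keeps the registry from accumulating entries for executors that disconnected without an explicit unregister.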

github-actions · 2026-01-26T07:02:47Z

✅ Pull with Spice Passed

Passing checks:

✅ Title meets minimum length requirement (10 characters)
✅ No banned labels detected
✅ Has a label from required category kind/
✅ Has a label from required category area/
✅ Has at least one assignee: phillipleblanc

Copilot

Pull request overview

This pull request implements a "PollNow interrupt" feature for Ballista executors to reduce task scheduling latency from up to 100ms (the idle poll interval) to near-zero for interactive workloads. The scheduler can now broadcast a PollNow command to all connected executors via bidirectional control streams, causing them to immediately poll for new work rather than waiting for the next poll interval.

Changes:

Added PollNow protobuf message for scheduler-to-executor control stream communication
Implemented ExecutorStreamRegistry to track connected executor control streams and broadcast PollNow commands
Wired poll_now_notify through executor control stream client, poll loops, and scheduler initialization
Updated Ballista fork dependency to e0ab27eeef9f which includes poll_loop modifications to accept Notify handle
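
As a rough picture of how an executor might dispatch on the new message, the sketch below models a protobuf oneof as an `Option` of a Rust enum, which is how prost-style generated code usually represents one. Only the type names `PollNowCommand` and `SchedulerControlMessage` come from this PR. The `command` field, the `Command` enum, `Action`, and `handle_control_message` are assumptions made for the example.

```rust
/// Empty command message; named after the proto message added in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct PollNowCommand {}

/// Hypothetical Rust view of the oneof inside SchedulerControlMessage.
#[derive(Debug, Clone, PartialEq)]
pub enum Command {
    PollNow(PollNowCommand),
}

/// Hypothetical message shape: the oneof is absent when no command is set.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct SchedulerControlMessage {
    pub command: Option<Command>,
}

/// What the executor-side client should do with a received message.
#[derive(Debug, PartialEq)]
pub enum Action {
    WakePollLoop,
    Ignore,
}

/// Maps a control message to an action. A message without a command is
/// ignored rather than treated as an error.
pub fn handle_control_message(msg: &SchedulerControlMessage) -> Action {
    match &msg.command {
        Some(Command::PollNow(_)) => Action::WakePollLoop,
        None => Action::Ignore,
    }
}
```

Ignoring a message with no command set is one way to tolerate a scheduler that sends a variant this executor does not know about yet. Whether the PR handles that case the same way is not stated here.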

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| plans/INTERRUPT_POLL.md | Comprehensive design document explaining architecture, implementation details, and testing strategy |
| crates/runtime-proto/proto/spice.proto | Added PollNowCommand message and updated SchedulerControlMessage with oneof field |
| crates/runtime/src/cluster/service.rs | Implemented ExecutorStreamRegistry with broadcast_poll_now, register/unregister methods; updated ClusterServiceImpl to use registry |
| crates/runtime/src/cluster/control_stream_client.rs | Updated to handle PollNow messages and provide shared Notify handle for poll loop wake-up |
| crates/runtime/src/cluster/mod.rs | Wired poll_now_notify through spawn_scheduler_poll_loop and update_scheduler_pollers; created on_work_available callback |
| crates/runtime/src/cluster/servers.rs | Updated to use shared ExecutorStreamRegistry when available for scheduler mode |
| crates/runtime/src/datafusion/mod.rs | Added executor_stream_registry field, bind_executor_stream_registry and getter methods |
| crates/runtime/src/datafusion/builder.rs | Initialized executor_stream_registry field in DataFusion builder |
| Cargo.toml | Updated ballista-* dependencies to new fork commit e0ab27eeef9f with PollNow interrupt support |
| Cargo.lock | Updated ballista dependencies and transitive dependency versions |

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated no new comments.

…arity and update references throughout the cluster and datafusion modules.

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

…nction

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated no new comments.

…g latency (#9098)

* Add PollNow interrupt for Ballista executors to reduce task scheduling latency
  - Add PollNowCommand proto message and update SchedulerControlMessage
  - Add ExecutorStreamRegistry to track connected executor control streams
  - Implement broadcast_poll_now() to send PollNow to all executors
  - Update ControlStreamManager to provide shared Notify handle for poll wake-up
  - Wire poll_now_notify through spawn_scheduler_poll_loop and update_scheduler_pollers
  - Add executor_stream_registry field to DataFusion struct for scheduler access
  - Update ClusterServiceImpl to use shared ExecutorStreamRegistry
  - Update Cargo.toml to reference ballista fork commit e0ab27eeef9f
* Fix testoperator dispatch (#9097)
* Rename ExecutorStreamRegistry to ExecutorControlStreamRegistry for clarity and update references throughout the cluster and datafusion modules.
* Remove unused scheduler configuration from create_scheduler_server function

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>

* build(deps): combined dependabot updates (#9128) * build(deps): bump actions/setup-python from 6.1.0 to 6.2.0 Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.1.0 to 6.2.0. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/83679a892e2d95755f2dac6acb0bfd1e9ac5d548...a309ff8b426b58ec0e2a45f0f869d46889d02405) --- updated-dependencies: - dependency-name: actions/setup-python dependency-version: 6.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump azure_core from 0.30.1 to 0.31.0 Bumps [azure_core](https://github.com/azure/azure-sdk-for-rust) from 0.30.1 to 0.31.0. - [Release notes](https://github.com/azure/azure-sdk-for-rust/releases) - [Commits](https://github.com/azure/azure-sdk-for-rust/compare/azure_core@0.30.1...azure_core@0.31.0) --- updated-dependencies: - dependency-name: azure_core dependency-version: 0.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump rustyline from 15.0.0 to 17.0.2 Bumps [rustyline](https://github.com/kkawakam/rustyline) from 15.0.0 to 17.0.2. - [Release notes](https://github.com/kkawakam/rustyline/releases) - [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md) - [Commits](https://github.com/kkawakam/rustyline/compare/v15.0.0...v17.0.2) --- updated-dependencies: - dependency-name: rustyline dependency-version: 17.0.2 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump rdkafka from 0.38.0 to 0.39.0 Bumps [rdkafka](https://github.com/fede1024/rust-rdkafka) from 0.38.0 to 0.39.0. 
- [Changelog](https://github.com/fede1024/rust-rdkafka/blob/master/changelog.md) - [Commits](https://github.com/fede1024/rust-rdkafka/commits/v0.39.0) --- updated-dependencies: - dependency-name: rdkafka dependency-version: 0.39.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Cayenne snapshots with shared metadata (#9118) * Cayenne snapshots with shared metadata * Enhance directory extraction with checksum verification and locking - Introduced global locks to prevent race conditions during concurrent extractions. - Added checksum verification for existing files when extracting archives, ensuring data integrity. - Updated `ExtractOptions` to include a flag for checksum verification and a map for expected checksums. - Modified extraction logic to handle checksum mismatches and log relevant information. - Improved tests to validate checksum verification behavior during extraction. 
* PR fixes * Improve error handling for URL tables with Azure URLs (#9129) - Add validation for Azure URLs (abfs/abfss) to detect missing storage account name - Return proper errors instead of silent Ok(None) when URL table resolution fails - Provide actionable error messages with valid URL formats and env var guidance - Add unit tests for Azure URL validation and error handling * Use datafustion-ballista with object-store passing (#9130) * fix Cargo.lock * Add missing Windows build step for spice CLI in build_and_release workflow (#9143) * Fix install-dev to use debug build path for spice binary (#9142) * Add max column count test for Cayenne and mark Beta criteria as complete (#9126) * fix: CLI builds (#9145) * fix: Broken E2E CLI builds, remove DataFusion dependency from flightrepl * fix: Bad import * fix: Don't feature gate flightrepl to unix * fix: Optimize CLI builds to not build uneccessary dependencies * fix: Specify features CLI was implicitly relying on * fix: Remove ansi_colors * fix: Remove ansi_term * fix * fix: Pass aws_allow_http to executors for Delta Lake distributed queries (#9146) * fix: Make CLI system and asset type detection more robust (#9148) * fix: Broken E2E CLI builds, remove DataFusion dependency from flightrepl * fix: Bad import * fix: Don't feature gate flightrepl to unix * fix: Optimize CLI builds to not build uneccessary dependencies * fix: Specify features CLI was implicitly relying on * fix: Remove ansi_colors * fix: Remove ansi_term * fix * fix: Asset URL parsing for CLI and runtime installs * Always create initial snapshots (unless bootstrapped) + when no snapshots exist (#9119) * Create snapshots only after runtime is ready * Fix * Add lock to renetion check task * Fix * Fix * Improve * Lint * Fix tests * Fix test * Fix substring matching bug in has_existing_snapshots The has_existing_snapshots() method used .contains() for path matching, which could incorrectly match dataset paths with similar prefixes (e.g., 'dataset=foo' 
would match 'dataset=foobar'). Changed to exact path segment matching using .split('/').any() to ensure only exact matches are detected. Added 6 unit tests covering the fix and edge cases. * fix lint * fix integration tests; rename initial_snapshot_created to checkpoint_counting_enabled --------- Co-authored-by: Phillip LeBlanc <phillip@spice.ai> Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> * fix(cayenne): Fix upsert with pending deletions causing duplicate PKs (#9152) * fix(cayenne): Fix upsert with pending deletions causing duplicate PKs * Include test * fix(flightrepl): Add chrono-tz feature to flightrepl for timezone formatting (#9153) * fix: Set MISTRALRS_METAL_PRECOMPILE=0 when metal feature is enabled (#9154) * fix(delta_lake): Preserve container name in ABFSS URLs for Azure Delta Lake tables (#9155) * fix: Set query set properly on benchmarks telemetry metrics attributes (#9162) * fix(cli): Several CLI fixes from the Go to Rust migration (#9157) * fix(chat): Add model validation and error handling for model not found * fix(chat): Enhance model selection with dialoguer for improved user experience * Improvements * Consolidate repls * Formatting * Lint * Update snapshots * Fix chat output * Additional fixes * Add missing headers and version check * Add tempfile dependency and improve error handling in utilities * Add tests for JSON array to JSONL conversion and cache control functionality * Lint * Add additional openai tests (#9159) * v1.11.0-stable release notes (#9156) * docs: Add release 1.11 stable release notes * docs: Add grafana dashboards section: * fix: Remove endgame change for release branch target * PM edits * docs: Update changelog * release note edits --------- Co-authored-by: peasee <98815791+peasee@users.noreply.github.com> Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> * Build Docker and binaries nightly (#9167) * Build Docker and CLI nightly * Re-use binary for nightly * Apply suggestions from code review Co-authored-by: 
Phillip LeBlanc <phillip@spice.ai> * fix: Update publish description and reorder setup steps in nightly build workflow * fix: Update dependencies for Docker build jobs to use correct Linux build jobs --------- Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * fix(ci): remove non-existent Makefile targets from workflow (#9168) Remove install-with-models target (never existed) and fix install-with-odbc to install-odbc (actual target name). * feat: Add nightly installer script for spice CLI and spiced runtime (#9170) * feat: Add nightly installer script for spice CLI and spiced runtime * fix(install): improve error handling for invalid run IDs in nightly installer Add proper error handling in getArtifactDownloadUrl() to gracefully handle cases where the GitHub Actions run ID doesn't exist or is inaccessible. Previously, the script would show a cryptic jq error when iterating over null artifacts. Now it checks for GitHub API error messages and empty artifact arrays, providing clear user-friendly error messages. 
* Fixes --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> * Fix Windows build by gating cayenne usage with #[cfg(not(windows))] (#9171) * fix: Cayenne TPCH SF100 benchmark shouldn't validate results (#9172) * Update warning message for Cayenne data accelerator (#9169) * fix(testoperator): Increase health check latency threshold to reduce false positives (#9184) * Update dev spicerack base URL to dev-api.spicerack.org (#9177) * Update dev spicerack base URL to dev-api.spicerack.org * Use dev endpoints only for -dev builds, not -unstable * Update release notes: remove Cayenne on_conflict support (#9186) * Use spice-rs v4.0.0 in CLI commands (#9144) * Use spice-rs v4.0.0 in CLI commands * Formatting * fix: Set MISTRALRS_METAL_PRECOMPILE=0 when metal feature is enabled (#9154) * fix(delta_lake): Preserve container name in ABFSS URLs for Azure Delta Lake tables (#9155) * fix: Set query set properly on benchmarks telemetry metrics attributes (#9162) * Use spicepod crate in CLI * fix: update spiceai dependency to version 3.2.0 * refactor: update query API client and improve logging format * Fix search * fix: refactor SQL EXPLAIN queries to use TryStreamExt for better error handling * Formatting --------- Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: William <98815791+peasee@users.noreply.github.com> * Improve snapshot output (#9189) * fix(cayenne): Include protected snapshots in conflict detection keyset scan (#9176) * fix(cayenne): Include protected snapshots in conflict detection keyset scan * Fix lint --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> * fix: E2E Release Installation version check (#9190) * fix: Release version check * fix: update CLI version extraction to handle commit hash * fix(cayenne): Fix data loss by preserving protected snapshots during cleanup (#9182) * fix(cayenne): Include protected snapshots in conflict detection keyset scan * Fix lint * fix(cayenne): Fix data loss by 
preserving protected snapshots during cleanup * Improve * Lint * Update * Use 'DESCRIBE <sql>' for testoperator (#9164) * use 'DESCRIBE <sql>' for testoperator * clippy * fix: Sqlite file Acceleration: "database is locked" (#9195) Fixes #8595 * fix: CLI upgrade command not upgrading CLI (#9198) * fix: CLI upgrade command only runtime * Formatting * feat: implement download_release_asset_with_fallback for improved asset retrieval * feat: enhance install and upgrade commands for improved flavor handling and asset retrieval * Fix Databricks struct field names for reserved keywords (#9196) * Fix Databricks struct field names for reserved keywords * fix lint * 1.11.0 release housekeeping (#9197) * Fix `DESCRIBE <sql>` (#9199) * use 'DESCRIBE <sql>' for testoperator * clippy * dont use schema, use columns * Fix nightly Docker build (#9211) * Cayenne: include append test configuration with PK and retention period (#9206) * feat(runtime-api-types): Add shared API types for runtime HTTP & CLI (#9194) * feat(runtime-api-types): Add shared API types for runtime HTTP endpoints and CLI * feat: Enhance API endpoints and documentation for catalogs, responses, search, tools, and workers * Update order * refactor: Simplify closure usage and remove unused re-exports in various modules * Add test for API key secret replacement (#9214) * Share semaphore across poll loops (#9213) * Support multiple partition expressions in partition_by configuration (#9201) Implements hierarchical/composite partitioning support for the runtime-table-partition crate. This allows users to specify multiple partition expressions (e.g., partition_by: [l_shipmode, l_returnflag]) and have data partitioned by all specified expressions. 
Key changes: - Changed Partition struct from partition_value: ScalarValue to partition_values: Vec<ScalarValue> - Updated PartitionCreator trait: create_partition now accepts Vec<ScalarValue> - Added encode_composite_key() function to encode multiple partition values as a path-like key (e.g., "2025/10/15") - Updated PartitionTableProvider to handle Vec<PartitionedBy> for multiple partition expressions - Updated scan() pruning logic to prune if ANY partition expression's filter excludes the partition - Updated data filter logic to exclude filters on any simple partition column - Added partition_batch_composite() for multi-expression partitioning (keeps partition_batch() for backwards compatibility) - Updated all callers (cayenne, partitioned_duckdb accelerators) The composite key encoding uses "/" as separator, creating hierarchical partition paths like "year=2025/month=10". Fixes #8539 * feat: Split data connectors into separate crates (#8936) * private: plans for dataconnector extraction * feat: extract postgres connector to separate crate Extract the postgres data connector from the runtime crate into its own connector-postgres crate under crates/data-connectors/. This enables faster incremental builds - changes to the postgres connector only require rebuilding the connector crate, not the entire runtime. Uses linkme distributed slices for automatic connector registration at link time. 
* feat: extract mysql connector to separate crate * feat: extract clickhouse connector to separate crate * feat: extract mssql connector to separate crate * feat: extract snowflake connector to separate crate * feat: extract duckdb connector to separate crate * feat: extract mongodb connector to separate crate * feat: extract oracle connector to separate crate * feat: extract spark connector to separate crate * feat: extract scylladb connector to separate crate * feat: extract flightsql connector to separate crate * feat: extract dremio connector to separate crate * feat: extract delta_lake connector to separate crate * feat: extract sharepoint connector to separate crate * feat: extract ftp connector to separate crate * feat: extract sftp connector to separate crate * feat: extract imap connector to separate crate * feat: extract nfs connector to separate crate * feat: extract smb connector to separate crate * feat: extract odbc connector to separate crate * fix: resolve clippy lint errors in connector crates * feat: extract databricks connector to separate crate Extracts the Databricks data connector from runtime to a standalone crate in crates/data-connectors/connector-databricks/. This follows the same pattern as other extracted connectors (clickhouse, snowflake, etc.). Key changes: - Create connector-databricks crate with data + catalog connector implementation - Add connector-databricks dependency to bin/spiced with databricks feature - Make token_providers module public in runtime for external connector access - Update catalogconnector/databricks.rs to use token_providers helpers - Remove databricks module from runtime dataconnector * fix: use explicit connector registration for runtime tests Remove feature gates from test connector registration since dev-dependencies are always compiled regardless of feature flags. This ensures all connectors are registered when running integration tests. 
Note: ODBC connector is excluded from dev-dependencies and test registration because it requires the unixODBC system library which may not be available in all test environments. Key changes: - search.rs: call register_test_connectors() in start_app() before Runtime - schema_evolution: call register_test_connectors() in initialize_runtime() - rehydration: call register_test_connectors() in init_spice_app() The connector registration must be called before each Runtime is created because Runtime::shutdown() clears the global connector registry. * fix lint * `spice nsql analyze` but in rust (#9209) * rewrite 'spice nsql analyze' in rust * fix sql count * formatting * Use shared build-spiced action (#9217) * chore: update runtime-api-types version to 2.0.0-unstable * Use shared `build-spiced` action * feat(build-spiced): enhance target directory handling and add Windows support for binary movement * Delete plans/dataconnector-extraction.md (#9225) * Distributed query executors expose Spice Flight Service. 
(#9220) * initial * fix 'ListActions' * clippy * Update crates/runtime/src/cluster/composite_flight_service.rs * fix: Disable TPCH SF10 results validation, add DuckDB TPCH SF10 spicepod (#9226) * fix: Disable TPCH SF10 results validation * feat: Add DuckDB TPCH SF10 spicepod * Update tools/testoperator/dispatch/tpch/sf10/accelerated/s3[parquet]-duckdb[file].yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: TPCH SF10 S3 target * fix: Update the tpch benchmark snapshots for: accelerated/s3[parquet]-cayenne[file].yaml * fix: Update the tpch benchmark snapshots for: accelerated/s3[parquet]-duckdb[file].yaml --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> * feat: Update `JobExecutor` to submit jobs via Ballista (#9221) * wip: JobExecutor submits jobs to ballista * fix: Add test utility for PKI generation * fix: Update jobexecutor to submit to ballista, add integration tests * chore: Clippy * fix: Duplicate cargo toml entries * chore: Address review * Update crates/test-framework/src/pki/mod.rs * chore: Clippy * Use claude-3-5-haiku for nsql test (#9229) * Use claude-3-5-haiku * Update * Update openapi.json (#9205) Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * New `object_store_occ`crate and use in SchedulerRegistryRunner (#9222) * initial * integration tests on S3 * and cargo.lock * Add Object store OCC crate; use in scheduler registry * clippy * format * handle bad deleteion error * use new Version in update method * Upgrade to Turso v0.4.4 (#9120) * Add blog outlines * Socials * Updates * Updates * More vortex * Updates * Update * Updates * Vortex blog improvements * Upgrade to Turso v0.4.3 - Remove with_mvcc as it's on by default - Add busy timeout wait - fixes #8826 * Remove extra blog * chore: Update dependencies for syn, prost, thiserror, and uuid * Update crates/cayenne/tests/shared_metastore_concurrency_test.rs 
Co-authored-by: William <98815791+peasee@users.noreply.github.com> * Lint * fix: improve documentation for concurrent operation tests in shared metastore * fix: update turso dependency to version 0.4.4 and remove obsolete packages from Cargo.lock * Use claude-3-5-haiku for nsql test (#9229) * Use claude-3-5-haiku * Update * Update openapi.json (#9205) Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * fix: update variable names for clarity in shared metastore concurrency tests --------- Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * feat: implement pretty printing for Arrow RecordBatches with data types (#9232) * feat: implement pretty printing for Arrow RecordBatches with data types * refactor: improve formatting logic and simplify data type handling in pretty printing * S3Vector client level tracing and simplify retry (#9141) * add tracing in client * pass inputs by reference * clean * clean * clean * mod middleware * test build * Use claude haiku for cost (#9245) * fix: Update Search integration test snapshots (#9242) * fix: Update the test snapshots * CI --------- Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: jeadie <jack@spice.ai> * Refactor S3 express one zone logic from Cayenne accelerator (#9139) * refactor Cayenne S3 code into separate file * clippy * refact * fix: Stream batches from completed async jobs into JobStore (#9228) * wip: JobExecutor submits jobs to ballista * fix: Add test utility for PKI generation * fix: Update jobexecutor to submit to ballista, add integration tests * chore: Clippy * fix: Duplicate cargo toml entries * chore: Address review * Update crates/test-framework/src/pki/mod.rs * chore: Clippy * fix: Stream async job batches into job store * fix: Make stream read error source a concrete error * fix: Apply schema with empty 
results for write_result_chunks, update tests * chore: clippy * fix: Enable MVCC for Turso 0.4.4 via PRAGMA journal_mode (#9247) In turso 0.4.4, the Builder.with_mvcc() method was removed. MVCC must now be enabled via PRAGMA journal_mode = 'experimental_mvcc' on the connection after database creation. This fixes test failures where BEGIN CONCURRENT was being used without MVCC enabled, causing 'Concurrent transaction mode is only supported when MVCC is enabled' errors. * Add distributed query mode support to testoperator (#9248) Adds --distributed flag to testoperator commands (bench, query, throughput, load) to enable benchmarking queries via the /v1/queries async API used for distributed query execution in cluster mode. Changes: - Add QUERIES_ENDPOINT constant in test-framework/constants.rs - Add --distributed CLI flag to QueryArgs and DatasetTestArgs - Add distributed_mode field to NotStarted test builder and SpiceTestQueryWorker - Implement execute_distributed() method that: - Submits query via POST /v1/queries - Polls /v1/queries/{query_id}/status until completion - Fetches results from /v1/queries/{query_id}/results - Returns error if HTTP client not configured or manifest fields missing - Update execute_query() to route to distributed mode when enabled - Update all commands (bench, query, throughput, load) to pass distributed flag - Add distributed field to dispatch LoadArgs for workflow configuration - Add #[expect] attributes for clippy::struct_excessive_bools and too_many_lines * fix: Provide better job store error handling (#9235) * fix: Provide better job store error handling * chore: fix build * fix: propogate all listing errors * docs: Update docstrings * chore: bad merge * update datafusion (#9249) * Add UDTF serialization support for distributed Ballista execution (#9200) * Add UDTF serialization support for distributed Ballista execution Implement protobuf-based serialization for UDTFs (list_udfs, text_search, vector_search, rrf) to enable 
distributed query execution in Ballista clusters. When queries with UDTFs are sent to remote executors, SpiceLogicalCodec now: - Encodes UDTF arguments to protobuf in try_encode_table_provider - Decodes and re-invokes the UDTF on the executor in try_decode_table_provider This fixes the error: 'SpiceLogicalCodec could not resolve table reference' when running UDTF queries in distributed mode. Closes #8806 * Switch RRF decay protobuf fields to double and harden UDTF serialization * Add UdtfExec for distributed UDTF execution support Enable UDTFs like list_udfs() to work in cluster/distributed mode by: - Create UdtfExec execution plan that wraps UDTF results with serializable args - Add UdtfExecNode protobuf message for physical plan serialization - Update EnsureSupportedFileScan optimizer to skip MemorySourceConfig validation for DataSourceExec nodes that are children of UdtfExec - Update ListUDFTable to wrap its scan result in UdtfExec - Add encode/decode support for UdtfExec in SpicePhysicalCodec The UdtfExec stores UDTF arguments (already defined in UdtfArgs proto) so that when the plan is distributed to remote executors, the UDTF can be re-invoked to produce the same results locally on each executor. * feat: Add protobuf serialization for UDTFs to enable distributed query execution This PR implements protobuf-based serialization for User-Defined Table Functions (UDTFs) to enable distributed query execution in Ballista clusters. It addresses the error 'SpiceLogicalCodec could not resolve table reference' that occurs when UDTF queries are sent to remote executors. 
Changes: - Added protobuf schema for serializing UDTF arguments (list_udfs, text_search, vector_search, rrf) - Implemented encoding/decoding logic in SpiceLogicalCodec to serialize UDTF-produced TableProviders - Enhanced UDTF providers to track their original invocation arguments for serialization - Added UdtfExec execution plan wrapper for physical plan serialization - Added EnsureSupportedFileScan physical optimizer to validate plans before distribution Feedback incorporated: - Changed invoke_udtf visibility to pub(crate) as it's an internal implementation detail - Added error handling for malformed RRF nested queries instead of silently skipping - Added documentation for blocking tokio pattern in physical codec deserialization - Made VectorSearchUDTFProvider.args private with getter method for encapsulation - Fixed clippy lints (doc_markdown, collapsible_if, match_same_arms, etc.) * Simplify retention filter expressions before pushdown (#9244) * Simplify retention filter expressions before pushdown * Update crates/runtime/src/datafusion/expr_utils.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update * Update crates/runtime/src/accelerated_table/retention.rs Co-authored-by: Jack Eadie <jack@spice.ai> * Update crates/runtime/src/datafusion/expr_utils.rs Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * Fix --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: Phillip LeBlanc <phillip@spice.ai> * feat: Support multiple `partition_by` expressions for Cayenne (#9241) * feat: Support multiple partition by expressions with Cayenne * chore: clippy * fix: Support LargeUtf8 in runtime partition builder * review: Address comments * feat: Update DuckDB to 1.4.4 (#9252) * feat: Update DuckDB to 1.4.4 * fix: Use spiceai-51 * Cayenne: row-based delete logic improvements (#9237) * cayenne: Per-file row-based deletion vectors with Vortex-native streaming scan * 
Update vortex ref * PositionBased row_ids: Vec<i64> -> row_ids: Vec<u64> * Update cargo * Upd * Update * Refactor * Improve * Fix headers * don't refresh accelerations on scheduler node (#9250) * don't refresh accelerations on scheduler node * Apply suggestion Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix test_retention_complex_sql (#9270) Updated execution properties to include query execution start time for correct evaluation of 'now()'. * build(deps): bump actions/cache from 5.0.2 to 5.0.3 (#9261) Bumps [actions/cache](https://github.com/actions/cache) from 5.0.2 to 5.0.3. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](https://github.com/actions/cache/compare/8b402f58fbc84540c8b491a91e594a4576fec3d7...cdf6c1fa76f9f475f3d7449005a359c84ca0f306) --- updated-dependencies: - dependency-name: actions/cache dependency-version: 5.0.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Dependabot updates (#9271) * build(deps): bump aws-sdk-cognitoidentityprovider Bumps [aws-sdk-cognitoidentityprovider](https://github.com/awslabs/aws-sdk-rust) from 1.106.0 to 1.107.0. - [Release notes](https://github.com/awslabs/aws-sdk-rust/releases) - [Commits](https://github.com/awslabs/aws-sdk-rust/commits) --- updated-dependencies: - dependency-name: aws-sdk-cognitoidentityprovider dependency-version: 1.107.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump zip from 6.0.0 to 7.2.0 Bumps [zip](https://github.com/zip-rs/zip2) from 6.0.0 to 7.2.0. 
- [Release notes](https://github.com/zip-rs/zip2/releases) - [Changelog](https://github.com/zip-rs/zip2/blob/master/CHANGELOG.md) - [Commits](https://github.com/zip-rs/zip2/compare/v6.0.0...v7.2.0) --- updated-dependencies: - dependency-name: zip dependency-version: 7.2.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump aws-smithy-runtime from 1.9.8 to 1.10.0 Bumps [aws-smithy-runtime](https://github.com/smithy-lang/smithy-rs) from 1.9.8 to 1.10.0. - [Release notes](https://github.com/smithy-lang/smithy-rs/releases) - [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/smithy-lang/smithy-rs/commits) --- updated-dependencies: - dependency-name: aws-smithy-runtime dependency-version: 1.10.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump docker/login-action from 3.6.0 to 3.7.0 Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](https://github.com/docker/login-action/compare/5e57cd118135c172c3672efd75eb46360885c0ef...c94ce9fb468520275223c153574b00df6fe4bcc9) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: 3.7.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump github/codeql-action from 4.31.11 to 4.32.0 Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.31.11 to 4.32.0. 
- [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](https://github.com/github/codeql-action/compare/19b2f06db2b6f5108140aeb04014ef02b648f789...b20883b0cd1f46c72ae0ba6d1090936928f9fa30) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.32.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump dawidd6/action-download-artifact from 12 to 14 Bumps [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) from 12 to 14. - [Release notes](https://github.com/dawidd6/action-download-artifact/releases) - [Commits](https://github.com/dawidd6/action-download-artifact/compare/0bd50d53a6d7fb5cb921e607957e9cc12b4ce392...5c98f0b039f36ef966fdb7dfa9779262785ecb05) --- updated-dependencies: - dependency-name: dawidd6/action-download-artifact dependency-version: '14' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump pingora-lru from 0.1.1 to 0.7.0 Bumps [pingora-lru](https://github.com/cloudflare/pingora) from 0.1.1 to 0.7.0. - [Release notes](https://github.com/cloudflare/pingora/releases) - [Changelog](https://github.com/cloudflare/pingora/blob/main/CHANGELOG.md) - [Commits](https://github.com/cloudflare/pingora/compare/0.1.1...0.7.0) --- updated-dependencies: - dependency-name: pingora-lru dependency-version: 0.7.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): bump tiktoken-rs from 0.6.0 to 0.9.1 Bumps [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs) from 0.6.0 to 0.9.1. 
- [Release notes](https://github.com/zurawiki/tiktoken-rs/releases) - [Commits](https://github.com/zurawiki/tiktoken-rs/compare/v0.6.0...v0.9.1) --- updated-dependencies: - dependency-name: tiktoken-rs dependency-version: 0.9.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * fix: wrap CoreBPE to satisfy chunk sizer trait * fix: use openai chunker helper in parsley * fix: silence doc markdown lint for CoreBpeSizer * fix: avoid Cow as_ref ambiguity in FlightSQL tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Replace deprecated serde_yaml with internal yaml implementation (#9215) * Share semaphore across poll loops * Replace deprecated serde_yaml with internal yaml implementation * Refactor code structure for improved readability and maintainability * chore: update runtime-api-types version to 2.0.0-unstable * Share semaphore across poll loops (#9213) * Support multiple partition expressions in partition_by configuration (#9201) Implements hierarchical/composite partitioning support for the runtime-table-partition crate. This allows users to specify multiple partition expressions (e.g., partition_by: [l_shipmode, l_returnflag]) and have data partitioned by all specified expressions. 
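As a rough illustration of the composite partitioning described in this commit, the sketch below groups rows by two partition expressions and joins their values with "/", the separator the commit notes mention. `LineItem` and `partition_items` are hypothetical, and plain strings stand in for the `ScalarValue` values the real crate uses; how values that themselves contain "/" are handled is not covered here.

```rust
use std::collections::BTreeMap;

// Hypothetical row type for illustration; the real code works on Arrow batches.
#[derive(Debug, Clone, PartialEq)]
struct LineItem {
    shipmode: &'static str,
    returnflag: &'static str,
    quantity: u32,
}

// Join the values of all partition expressions into one path-like key.
fn encode_composite_key(values: &[&str]) -> String {
    values.join("/")
}

// Assign each row to the partition identified by (l_shipmode, l_returnflag).
fn partition_items(items: &[LineItem]) -> BTreeMap<String, Vec<LineItem>> {
    let mut partitions: BTreeMap<String, Vec<LineItem>> = BTreeMap::new();
    for item in items {
        let key = encode_composite_key(&[item.shipmode, item.returnflag]);
        partitions.entry(key).or_default().push(item.clone());
    }
    partitions
}

fn main() {
    let items = vec![
        LineItem { shipmode: "AIR", returnflag: "R", quantity: 1 },
        LineItem { shipmode: "AIR", returnflag: "N", quantity: 2 },
        LineItem { shipmode: "AIR", returnflag: "R", quantity: 3 },
    ];
    for (key, rows) in partition_items(&items) {
        println!("{key}: {} row(s)", rows.len());
    }
}
```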
Key changes: - Changed Partition struct from partition_value: ScalarValue to partition_values: Vec<ScalarValue> - Updated PartitionCreator trait: create_partition now accepts Vec<ScalarValue> - Added encode_composite_key() function to encode multiple partition values as a path-like key (e.g., "2025/10/15") - Updated PartitionTableProvider to handle Vec<PartitionedBy> for multiple partition expressions - Updated scan() pruning logic to prune if ANY partition expression's filter excludes the partition - Updated data filter logic to exclude filters on any simple partition column - Added partition_batch_composite() for multi-expression partitioning (keeps partition_batch() for backwards compatibility) - Updated all callers (cayenne, partitioned_duckdb accelerators) The composite key encoding uses "/" as separator, creating hierarchical partition paths like "year=2025/month=10". Fixes #8539 * feat: Split data connectors into separate crates (#8936) * private: plans for dataconnector extraction * feat: extract postgres connector to separate crate Extract the postgres data connector from the runtime crate into its own connector-postgres crate under crates/data-connectors/. This enables faster incremental builds - changes to the postgres connector only require rebuilding the connector crate, not the entire runtime. Uses linkme distributed slices for automatic connector registration at link time. 
* feat: extract mysql connector to separate crate * feat: extract clickhouse connector to separate crate * feat: extract mssql connector to separate crate * feat: extract snowflake connector to separate crate * feat: extract duckdb connector to separate crate * feat: extract mongodb connector to separate crate * feat: extract oracle connector to separate crate * feat: extract spark connector to separate crate * feat: extract scylladb connector to separate crate * feat: extract flightsql connector to separate crate * feat: extract dremio connector to separate crate * feat: extract delta_lake connector to separate crate * feat: extract sharepoint connector to separate crate * feat: extract ftp connector to separate crate * feat: extract sftp connector to separate crate * feat: extract imap connector to separate crate * feat: extract nfs connector to separate crate * feat: extract smb connector to separate crate * feat: extract odbc connector to separate crate * fix: resolve clippy lint errors in connector crates * feat: extract databricks connector to separate crate Extracts the Databricks data connector from runtime to a standalone crate in crates/data-connectors/connector-databricks/. This follows the same pattern as other extracted connectors (clickhouse, snowflake, etc.). Key changes: - Create connector-databricks crate with data + catalog connector implementation - Add connector-databricks dependency to bin/spiced with databricks feature - Make token_providers module public in runtime for external connector access - Update catalogconnector/databricks.rs to use token_providers helpers - Remove databricks module from runtime dataconnector * fix: use explicit connector registration for runtime tests Remove feature gates from test connector registration since dev-dependencies are always compiled regardless of feature flags. This ensures all connectors are registered when running integration tests. 
Note: ODBC connector is excluded from dev-dependencies and test registration because it requires the unixODBC system library which may not be available in all test environments. Key changes: - search.rs: call register_test_connectors() in start_app() before Runtime - schema_evolution: call register_test_connectors() in initialize_runtime() - rehydration: call register_test_connectors() in init_spice_app() The connector registration must be called before each Runtime is created because Runtime::shutdown() clears the global connector registry. * fix lint * `spice nsql analyze` but in rust (#9209) * rewrite 'spice nsql analyze' in rust * fix sql count * formatting * Use shared build-spiced action (#9217) * chore: update runtime-api-types version to 2.0.0-unstable * Use shared `build-spiced` action * feat(build-spiced): enhance target directory handling and add Windows support for binary movement * Delete plans/dataconnector-extraction.md (#9225) * Distributed query executors expose Spice Flight Service. (#9220) * initial * fix 'ListActions' * clippy * Update crates/runtime/src/cluster/composite_flight_service.rs * fix: Disable TPCH SF10 results validation, add DuckDB TPCH SF10 spicepod (#9226) * fix: Disable TPCH SF10 results validation * feat: Add DuckDB TPCH SF10 spicepod * Update tools/testoperator/dispatch/tpch/sf10/accelerated/s3[parquet]-duckdb[file].yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: TPCH SF10 S3 target * fix: Update the tpch benchmark snapshots for: accelerated/s3[parquet]-cayenne[file].yaml * fix: Update the tpch benchmark snapshots for: accelerated/s3[parquet]-duckdb[file].yaml --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> * feat(pki): add PKI test helpers for clustered Spice instances - Introduced a new module `pki` in the test framework to facilitate the creation of a test Public Key Infrastructure (PKI). 
- Implemented functions to initialize a CA certificate and generate client certificates with support for Subject Alternative Names (SANs). - Added validation for client names to ensure they conform to allowed character sets. - Included comprehensive tests to verify the functionality of PKI initialization and client certificate creation. * Use claude-3-5-haiku for nsql test (#9229) * Use claude-3-5-haiku * Update * Update openapi.json (#9205) Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * feat: Enhance YAML parsing to handle multi-document errors and improve float comparison * feat(yaml): add multi-document YAML parsing support with comprehensive tests * style(tests): reformat multi-document YAML test cases for improved readability * refactor(tests): replace unwrap with expect for better error handling in YAML serialization and deserialization tests * refactor(tests): replace unwrap with expect for better error handling in YAML serialization and parsing tests * feat(yaml): add tests for YAML serialization and ensure output ends with newline * feat: Enhance Kafka integration by adding broker readiness verification * fix: Add stabilization delay and increase message timeout for Kafka producer * fix: Update topic name retrieval in broker readiness verification * fix: Set query execution start time for time-dependent expression simplification * test: Enhance error handling and assertions for YAML serialization and float equality --------- Co-authored-by: Phillip LeBlanc <phillip@spice.ai> Co-authored-by: Jack Eadie <jack@spice.ai> Co-authored-by: William <98815791+peasee@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Spice Benchmark Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * refactor: Update `Query` to support submitting async queries (#9254) * fix: 
Refactor async query handling * fix: reuse try_get_cached_result * fix: Ensure parameters are supplied to query builder * refactor: Improve query tracking, async job cancellation * chore: Change info to debug * chore: Clippy * chore: Clippy * review: Address comments * chore: remove dead code * chore: clippy * fix: Update Datafusion for preventing stack overflows on nested BinaryExpr (#9287) * wip * fix: Update Datafusion for preventing stack overflows on nested BinaryExpr * feat(libnfs): Add safe Rust bindings for libnfs C library (#9230) * feat(libnfs): Add safe Rust bindings for libnfs C library - Introduced a new crate `libnfs` for safe Rust bindings to the libnfs C library, enabling NFS client operations. - Implemented a build script to link against the libnfs library and generate bindings using bindgen. - Added functionality for NFS operations such as mounting, file handling, directory listing, and file statistics. - Updated `Cargo.toml` files to include `libnfs` as a member and dependency. - Refactored existing code in `runtime-object-store` to utilize the new `libnfs` bindings for NFS operations. - Enhanced error handling and added documentation for the new API. 
* fix: Update connector versions from enterprise-beta to unstable * fix: Allow clippy::allow_attributes lint for bindgen-generated code * fix: Update bindgen version and dependencies in Cargo.lock * fix: Update license in Cargo.toml and improve error handling in NFSObjectStore * feat: Add snafu error handling for NFS operations and update dependencies * fix: Import Command for macOS include path detection in build script * fix: Update bindgen version and simplify get_readmax/get_writemax return types * fix: Improve safety in NFS read operations by initializing buffer and clarifying extern block usage * feat: Enhance NFS support with conditional compilation for libnfs API version detection and improve timestamp handling * fix: Update libnfs dependency configuration in workspace and adjust versioning * fix: Update NFS method signatures to use immutable references for improved safety and performance * feat: Allow underscore fields in public structs and improve buffer length error handling * refactor(libnfs): improve error handling and enforce 64-bit pointer width requirement * refactor(nfs): improve code formatting for readability * refactor(nfs): simplify error handling in opendir method * Fix Cayenne partitioned table deletion support (#9267) * Fix Cayenne partitioned table deletion support * Fix cargo.lock * FlightSQL: add cookie middleware support (#9282) * feat(flight-client): add cookie middleware * feat(flightsql): persist cookies for sticky sessions * chore: add flight cookie test server * test: cover cookie persistence * fix(flight-client): trim cookie name/value * fix: appease clippy in flight-cookie-server * Remove scheduled dispatch for Testoperator (#9296) Removed scheduled dispatch step for Testoperator Text-to-SQL. 
* Apply `SchemaCastScanExec` before applying changes in `process_upsert_batch` (#9297) * Fix schema for DynamoDB/Cayenne * Update clap dependency in Cargo.lock * DynamoDB Streams Benchmarks (#9295) * DynamoDB Streams Benchmarks * Lint * Snapshots * Lint * Lint * Lint * Fixes * Lint * Update * Lint * fix: Respect timeout_seconds and maximum_size for async jobs (#9286) * fix: Respect timeout_seconds and maximum_size for async jobs * Update crates/runtime/src/http/v1/queries.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * review: address comment * review: Address comments, fix linting * chore: clippy and compile * chore: clippy * fix: job store integration test * chore: more clippy what is happening --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * cluster: reduce executor heartbeat timeout from 180s to 30s (#9300) Reduce heartbeat interval from 30s to 10s and executor timeout from 180s to 30s for faster failure detection and task reassignment when executors go down. With 10s heartbeat interval and 30s timeout, executors are detected as dead after missing ~3 consecutive heartbeats. Fixes #9288 * v1/search API to always return an array in matches (#9272) * v1/search API to always return an array in matches * fix(search): Update v1/search API to always return an array for matches * fix newline * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix bad unicode * update missed snapshots --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Log health check issues as warnings instead of failing benchmarks (#9301) During intensive benchmark/throughput runs, health check latency can spike due to CPU utilization, causing false positive failures. This change logs the health check issues as warnings instead of failing the entire test run. The health check metrics are still recorded via OpenTelemetry for monitoring purposes. 
* DuckDB (partition_mode: tables): rename partitioned_write_flush_threshold to partitioned_write_flush_threshold_rows (#9257) * runtime: deregister executors on shutdown (#9302) Send executor shutdown notifications over the control stream, use advertise host:port executor IDs, and add a regression check for prompt scheduler removal. Fixes #9289 * fix: Update Search integration test snapshots (#9299) * fix: Update the test snapshots * ci --------- Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: jeadie <jack@spice.ai> * Log health check issues as warnings instead of failing append benchmarks (#9304) * fix: Update test snapshots (#9307) * fix: Update the test snapshots * ci --------- Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: jeadie <jack@spice.ai> * fix: Update Search integration test snapshots (#9280) * fix: Update the test snapshots * ci --------- Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai> Co-authored-by: jeadie <jack@spice.ai> * From v1/sql in scheduler, use accelerated partitions from executor. (#9251) * initial work and refactor * scheduler uses accelerations from executors at query time * compile * runtime: deregister executors on shutdown Send executor shutdown notifications over the control stream, use advertise host:port executor IDs, and add a regression check for prompt scheduler removal. Fixes #9289 * Jeadie/26 02 02/testing (#9303) * don't refresh accelerations on scheduler node * Add distributed query mode support to testoperator (#9248) Adds --distributed flag to testoperator commands (bench, query, throughput, load) to enable benchmarking queries via the /v1/queries async API used for distributed query execution in cluster mode. 
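The submit, poll, fetch flow behind the --distributed flag can be sketched as below. This is an assumption-laden outline, not the testoperator code: the `QueryApi` trait, `JobStatus`, `FakeApi`, and the `max_polls` limit are hypothetical, and an in-memory fake replaces the HTTP calls to the /v1/queries endpoints.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum JobStatus {
    Running,
    Completed,
    Failed(String),
}

// Hypothetical abstraction over the async query API (submit, status, results).
trait QueryApi {
    fn submit(&mut self, sql: &str) -> Result<String, String>;
    fn status(&mut self, query_id: &str) -> Result<JobStatus, String>;
    fn results(&mut self, query_id: &str) -> Result<Vec<String>, String>;
}

// Submit the query, poll its status until it finishes, then fetch the results.
fn execute_distributed(
    api: &mut dyn QueryApi,
    sql: &str,
    max_polls: usize,
    poll_interval: Duration,
) -> Result<Vec<String>, String> {
    let query_id = api.submit(sql)?;
    for _ in 0..max_polls {
        match api.status(&query_id)? {
            JobStatus::Completed => return api.results(&query_id),
            JobStatus::Failed(reason) => {
                return Err(format!("query {query_id} failed: {reason}"))
            }
            JobStatus::Running => std::thread::sleep(poll_interval),
        }
    }
    Err(format!("query {query_id} did not complete after {max_polls} polls"))
}

// In-memory fake that reports Running a fixed number of times before completing.
struct FakeApi {
    polls_until_done: usize,
}

impl QueryApi for FakeApi {
    fn submit(&mut self, _sql: &str) -> Result<String, String> {
        Ok("q-1".to_string())
    }
    fn status(&mut self, _query_id: &str) -> Result<JobStatus, String> {
        if self.polls_until_done == 0 {
            Ok(JobStatus::Completed)
        } else {
            self.polls_until_done -= 1;
            Ok(JobStatus::Running)
        }
    }
    fn results(&mut self, query_id: &str) -> Result<Vec<String>, String> {
        Ok(vec![format!("rows for {query_id}")])
    }
}

fn main() {
    let mut api = FakeApi { polls_until_done: 2 };
    let rows = execute_distributed(&mut api, "SELECT 1", 5, Duration::from_millis(1));
    println!("{rows:?}");
}
```

Bounding the number of polls keeps a stuck job from hanging a benchmark run; whether the real worker uses a poll limit or a wall-clock timeout is not stated in the commit.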
Changes:
- Add QUERIES_ENDPOINT constant in test-framework/constants.rs
- Add --distributed CLI flag to QueryArgs and DatasetTestArgs
- Add distributed_mode field to NotStarted test builder and SpiceTestQueryWorker
- Implement execute_distributed() method that:
  - Submits query via POST /v1/queries
  - Polls /v1/queries/{query_id}/status until completion
  - Fetches results from /v1/queries/{query_id}/results
  - Returns error if HTTP client not configured or manifest fields missing
- Update execute_query() to route to distributed mode when enabled
- Update all commands (bench, query, throughput, load) to pass distributed flag
- Add distributed field to dispatch LoadArgs for workflow configuration
- Add #[expect] attributes for clippy::struct_excessive_bools and too_many_lines

* fix: Provide better job store error handling (#9235)
* fix: Provide better job store error handling
* chore: fix build
* fix: propagate all listing errors
* docs: Update docstrings
* chore: bad merge
* fix flight scan rule
---------
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
Co-authored-by: William <98815791+peasee@users.noreply.github.com>
* add cookies
* clean up:
* fix
* clippy
* clippy
* temp
* fixes
* clippy
---------
Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
Co-authored-by: William <98815791+peasee@users.noreply.github.com>
* fix: Improve Cayenne error for unsupported partition values (#9312)
* chore: update ballista to 68afffbdb36e with in-memory shuffle fix (#9310)
* test: add integration test for in-memory shuffle with multiple executors
* chore: update ballista to 68afffbdb36e with in-memory shuffle fix
* refactor: simplify file write operation in in-memory shuffle test
* fix: correct CSV string formatting in in-memory shuffle test
* fix: Ensure the job store makes conditional writes (#9309)
* fix: Ensure the job store makes conditional writes
* test: Add test for conditional chunk writes
* feat(http): Return all HTTP responses as data, skip caching 5xx (#9313)
* Add response_status
* Update retry and supported status codes
* Don't cache 5xx response data
* Fix empty content and add tests
* lint
* Add more tests
* lint
* Fix test test_swr_handle_cache_hit_refreshes_only_accessed_entry
* fix: Update the test snapshots (#9319)
Co-authored-by: Spice Snapshot Update Bot <spiceaibot@spice.ai>
* Add engine to snapshot metadata & fix row count (#9193)
* Add engine to snapshot metadata & fix row count
* Refactor snapshot creation calls for improved readability
* feat: Enhance snapshot functionality with accelerator integration and add performance tests for lock contention
* refactor: Improve code readability in snapshot lock contention tests
* refactor: Simplify error handling in snapshot workload table retrieval
* refactor: Simplify result handling in snapshot lock contention tests and improve linting exceptions
* refactor: Add Clippy expectations for precision loss and truncation in contention metrics
* fix(snapshot): Improve snapshot handling and row count conversion logic
* fix(snapshot): Simplify row count conversion logic in get_row_count function
* chore: update AWS SDK dependencies to latest versions (#9320)

| Crate | Old | New |
| --- | --- | --- |
| aws-config | 1.8.12 | 1.8.13 |
| aws-credential-types | 1.2.10 | 1.2.11 |
| aws-runtime | 1.5.16 | 1.6.0 |
| aws-sdk-bedrockruntime | 1.120.0 | 1.124.0 |
| aws-sdk-cognitoidentity | 1.91.0 | 1.94.0 |
| aws-sdk-cognitoidentityprovider | 1.107.0 | 1.108.0 |
| aws-sdk-dynamodb | 1.100.0 | 1.104.0 |
| aws-sdk-dynamodbstreams | 1.87.0 | 1.94.0 |
| aws-sdk-glue | 1.132.0 | 1.137.0 |
| aws-sdk-s3 | 1.119.0 | 1.122.0 |
| aws-sdk-s3vectors | 1.17.0 | 1.19.0 |
| aws-sdk-secretsmanager | 1.95.0 | 1.99.0 |
| aws-sdk-sts | 1.94.0 | 1.97.0 |
| aws-smithy-async | 1.2.6 | 1.2.11 |
| aws-smithy-runtime | 1.10.0 | 1.10.0 (unchanged) |
| aws-smithy-runtime-api | 1.9.2 | 1.11.3 |
| aws-smithy-types | 1.3.4 | 1.4.3 |

* feat: Update QueryHandle to wait for job completion using job notifier (#9306)
* feat: Update QueryHandle to wait for job completion using ballista job notifier
* chore: clippy
* fix: build
*
chore: clippy * fix: Update async openapi fork to disable recursion for CompoundFilter (#9322) * fix: Update async openapi fork to disable recursion for CompoundFilter * Update openapi.json (#9323) Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Spice Schema Bot <schema-bot@spice.ai> * Fix sccache setup failing on fork PRs due to missing secrets (#9327) * fix(caching): Handle timestamp precision mismatch and add more tests (#9315) * fix(caching): Handle timestamp precision mismatch and optimize batch writes * Update * Update crates/runtime/src/accelerated_table/caching.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix lint * Fix merge --------- Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix Show sccache stats in build-test to check SCCACHE_SETUP (#9328) * Add Databricks SQL dialect for DataFusion unparser (#9314) * Add Databricks SQL dialect for DataFusion unparser Implements a custom Databricks dialect that generates Spark SQL-compatible queries instead of using the generic CustomDialect. This enables proper function translation for Databricks SQL Warehouse queries. 
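To make the dialect behaviour concrete, here is a small sketch of the two translations this commit describes: mapping array_has/list_has to Spark SQL's array_contains, and quoting identifiers with backticks. The function names are hypothetical and this does not implement DataFusion's Dialect trait; escaping an embedded backtick by doubling it is an assumption about Spark SQL identifier rules rather than something stated in the commit.

```rust
// Map DataFusion function names to their Spark SQL equivalents.
// Names without a known translation pass through unchanged.
fn translate_function_name(name: &str) -> &str {
    match name {
        "array_has" | "list_has" => "array_contains",
        other => other,
    }
}

// Quote an identifier with backticks. Doubling an embedded backtick is an
// assumption about the target dialect's escaping rules.
fn quote_identifier(ident: &str) -> String {
    format!("`{}`", ident.replace('`', "``"))
}

// Render a function call with already-rendered argument expressions.
fn render_call(name: &str, args: &[String]) -> String {
    format!("{}({})", translate_function_name(name), args.join(", "))
}

fn main() {
    let sql = render_call(
        "array_has",
        &[quote_identifier("tags"), "'urgent'".to_string()],
    );
    println!("SELECT * FROM {} WHERE {sql}", quote_identifier("events"));
}
```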
Key changes:
- Add DatabricksDialect implementing DataFusion's Dialect trait
- Translate array_has/list_has to Spark SQL's array_contains
- Use backtick identifier quoting for Databricks compatibility
- Configure MySQL-style intervals
- Comprehensive test coverage for dialect behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix copyright year and use .expect() directly in test
* Fix lint
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
* DynamoDB Streams Table Rebootstrapping (#9305)
* WIP
* DynamoDB Table Rebootstrapping
* Refactor
* Fix
* Lint
* Improvements
* Lint
* Lint
* Lint
* Fixes
* Minor improvements for streaming benchmarks (#9329)
* Fix streaming benchmarks
* Fix
* Lint
* Fix
* Fix
* runtime: avoid double projection in federated task history

Query local task_history batches without projection and apply projection once when constructing MemorySourceConfig in FederatedTaskHistoryTable::scan().

Adds a regression test for sparse projections (e.g. [7, 10]) to ensure projected scans execute without out-of-bounds errors.

Fixes #9324 (#9326)
* Update ballista to support object_store in distributed queries (#9331)
* Snapshots Improvements (#9318)
* Fix
* Lint
* Fix
* Fix
* Add compaction
* Fix
* Fix
* Retries for SnapshotManager (#9334)
* Initial implementation of Ducklake catalog & data connectors
* Add module declarations for Debezium, Ducklake, and DynamoDB data connectors
* Add Cluster Observability (Metrics+Dashboard) (#9066)
* feat(cluster): Add scheduler_id column to task_history for cluster observability (Phase 1 & 2)

Phase 1: Add protobuf schema for cluster observability RPCs
- Add GetTaskHistory RPC for federated task history queries
- Add GetMetrics RPC for cluster-wide metrics collection
- Add ControlStream RPC for bidirectional executor-scheduler communication
- Add associated message types (request/response, heartbeat, metrics)

Phase 2: Add conditional scheduler_id column to task_history
- Modify table_schema() to conditionally include scheduler_id in cluster mode
- Update TaskSpan struct with scheduler_id field
- Update TaskHistoryExporter to accept and populate scheduler_id
- Compute scheduler_id from cluster config (host:port format)
- Add stub implementations for new gRPC RPCs in ClusterServiceImpl
- Schema is now fetched from registered table provider for consistency

The scheduler_id column enables federated queries across cluster schedulers by identifying which node executed each task.
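The Phase 2 idea (a scheduler_id column that only exists in cluster mode, with the ID derived from host:port) can be sketched as follows. The function names and the base column list are illustrative assumptions and do not reproduce the actual task_history schema.

```rust
// Build the scheduler ID in the host:port format mentioned above.
fn scheduler_id(host: &str, port: u16) -> String {
    format!("{host}:{port}")
}

// Return the column names for the task history table. The base columns here
// are placeholders; scheduler_id is appended only when running in cluster mode.
fn task_history_columns(cluster_mode: bool) -> Vec<&'static str> {
    let mut columns = vec!["trace_id", "span_id", "task", "start_time", "end_time"];
    if cluster_mode {
        columns.push("scheduler_id");
    }
    columns
}

fn main() {
    println!("standalone: {:?}", task_history_columns(false));
    println!("cluster:    {:?}", task_history_columns(true));
    println!("id:         {}", scheduler_id("10.0.0.5", 50052));
}
```

Keeping the column conditional means single-node deployments see an unchanged schema.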
* feat(cluster): Implement GetTaskHistory RPC and FederatedTaskHistoryTable (Phase 3)
* feat(cluster): Implement Executor Control Stream for scheduler-executor communication (Phase 4)
* feat(cluster): Implement Cluster Metrics Endpoint with OTLP to Prometheus conversion (Phase 5)
  - Add ?scope=cluster query parameter support to /metrics endpoint
  - Implement OTLP-to-Prometheus text format conversion for cluster metrics
  - Update metrics server to accept optional ClusterMetricsCollector
  - Add helper functions for metric/label name sanitization and escaping
  - Add comprehensive tests for OTLP to Prometheus conversion
  - Clean up control_stream_client.rs (defer executor metrics to Phase 6)

  Phase 5 of cluster observability implementation. The metrics server now supports cluster-wide metrics collection when ?scope=cluster is specified. Actual wiring of ClusterMetricsCollector is deferred to Phase 6 integration.
* feat(cluster): Wire cluster observability components for /metrics?scope=cluster endpoint (Phase 6)
* feat(cluster): Add comprehensive OpenTelemetry metrics for cluster mode

  Add 31 new metrics for monitoring distributed query execution in cluster mode:
  - Node status: node_status, scheduler_active_executors_count, scheduler_count
  - Task metrics: node_tasks_total, node_tasks_active, node_task_duration_ms, node_task_failures, node_task_retries, scheduler_task_queue_depth, scheduler_task_scheduling_latency_ms
  - Stage metrics: scheduler_stages_total, scheduler_stage_duration_ms, scheduler_stage_failures, scheduler_stage_retries, scheduler_tasks_per_stage
  - Executor metrics: executor_tasks_active, executor_tasks_total, executor_task_failures, executor_memory_available_bytes
  - Shuffle metrics: node_shuffle_write_bytes, node_shuffle_write_rows, node_shuffle_write_duration_ms, node_shuffle_read_bytes, node_shuffle_read_rows, node_shuffle_read_duration_ms
  - Scheduler ops: scheduler_job_queue_depth, scheduler_planning_duration_ms, scheduler_executor_assignments
  - Add OtelExecutorMetricsCollector implementing Ballista ExecutorMetricsCollector
  - Add OtelSchedulerMetricsCollector implementing Ballista SchedulerMetricsCollector
  - Wire up collectors in cluster initialization (replaces LoggingMetricsCollector)
  - Instrument scheduler_registry.rs to track scheduler_count when peers change
  - Add node_status metric recording in status.rs

  Requires corresponding Ballista fork changes for trait extensions.
* fix: use local_task_history table for cluster RPC to avoid infinite recursion
* feat: add shuffle locality and result fetch metrics for cluster mode
  - Add executor shuffle read locality metrics (local vs remote)
    - executor_shuffle_read_local_bytes/rows/count/duration_ms
    - executor_shuffle_read_remote_bytes/rows/count/duration_ms
  - Add scheduler result fetch metrics
    - scheduler_result_fetch_bytes/rows/count/duration_ms
  - Rename shuffle metrics to be role-specific:
    - node_shuffle_write_* -> executor_shuffle_write_*
    - node_task_duration_ms -> executor_task_duration_ms
  - Remove generic node_shuffle_read_* (replaced by locality metrics)
  - Fix MetricsReader registration on executors without --metrics flag
  - Rename metric labels: spice_node_id -> node_id, spice_node_role -> node_role
  - Add FederatedTaskHistoryTable.insert_into() for cluster task history writes
  - Update Ballista dependency to include shuffle read metrics callbacks
* fix: prevent duplicate node_id labels in cluster metrics and update Ballista dependency
  - Fix add_labels_to_metric_data_points to check if node_id/node_role labels already exist before adding them, preventing duplicate labels like 'node_id="x",node_id="x"'
  - Update Ballista dependency to 201f2ee7 with:
    - Shuffle affinity metrics instrumentation
    - Actual task scheduling latency tracking (scheduler_task_scheduling_latency_ms now reports real values instead of 0)
* dashboard plan
* add max slots metric
* Add distributed Grafana dashboard and local observability stack
* Update distributed dashboard and executor utilization
* remove plans
* fix lint
* proto for 'CayenneAccelerationExec' (#9094)
* basic script to run distributed spice
* improve
* proto for 'CayenneAccelerationExec'
* build(deps): bump github/codeql-action from 4.31.10 to 4.31.11 (#9108)

  Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.31.10 to 4.31.11.
  - [Release notes](https://github.com/github/codeql-action/releases)
  - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
  - [Commits](https://github.com/github/codeql-action/compare/cdefb33c0f6224e58673d9004f47f7cb3e328b89...19b2f06db2b6f5108140aeb04014ef02b648f789)

  ---
  updated-dependencies:
  - dependency-name: github/codeql-action
    dependency-version: 4.31.11
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Remove `DistributeFileScanOptimizer` and `UnionProjectionPushdownOptimizer` & set `target_partitions` dynamically based on cluster capacity (#9100)
* Remove redundant cluster physical optimizer rules

  Remove DistributeFileScanOptimizer and UnionProjectionPushdownOptimizer as they are redundant with vanilla Ballista/DataFusion behavior:
  - DistributeFileScanOptimizer: Ballista already distributes file groups as separate tasks natively. Each file group in FileScanConfig becomes one partition/task automatically. This optimizer was adding unnecessary shuffle stages (network + disk I/O) without benefit.
  - UnionProjectionPushdownOptimizer: DataFusion's built-in ProjectionPushdown optimizer already handles projection pushdown through UnionExec via try_swapping_with_projection. This was only needed to complement DistributeFileScanOptimizer.
* Set target_partitions dynamically based on cluster capacity
  - Remove hardcoded target_partitions=16 default for distributed queries
  - Add background task that polls cluster state for total executor slots
  - Update session_builder to use dynamic target_partitions = sum(executor.task_slots)
  - Falls back to 16 partitions when no executors are registered yet
  - Ensures query parallelism scales with cluster size automatically
* set to debug log
* build(deps): bump zip from 2.4.2 to 6.0.0 (#9111)

  Bumps [zip](https://github.com/zip-rs/zip2) from 2.4.2 to 6.0.0.
  - [Release notes](https://github.com/zip-rs/zip2/releases)
  - [Changelog](https://github.com/zip-rs/zip2/blob/master/CHANGELOG.md)
  - [Commits](https://github.com/zip-rs/zip2/compare/v2.4.2...v6.0.0)

  ---
  updated-dependencies:
  - dependency-name: zip
    dependency-version: 6.0.0
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* Add PollNow interrupt for Ballista executors to reduce task scheduling latency (#9098)
* Add PollNow interrupt for Ballista executors to reduce task scheduling latency
  - Add PollNowCommand proto message and update SchedulerControlMessage
  - Add ExecutorStreamRegistry to track connected executor control streams
  - Implement broadcast_poll_now() to send PollNow to all executors
  - Update ControlStreamManager to provide shared Notify handle for poll wake-up
  - Wire poll_now_notify through spawn_scheduler_poll_loop and update_scheduler_pollers
  - Add executor_stream_registry field to DataFusion struct for scheduler access
  - Update ClusterServiceImpl to use shared ExecutorStreamRegistry
  - Update Cargo.toml to reference ballista fork commit e0ab27eeef9f
* Fix testoperator dispatch (#9097)
* Rename ExecutorStreamRegistry to ExecutorControlStreamRegistry for clarity and update references throughout the cluster and datafusion modules.
* Remove unused scheduler configuration from create_scheduler_server function

  ---------

  Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
  Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
* fix: update Ballista fork to include executor timeout fix (#9124)

  Update datafusion-ballista to rev 626181fbc04a6f964abc459f99f62cf7742f75b4 which includes the fix for executor timeout due to lock contention in the scheduler event loop.
* Properly propagate SIGINT/SIGTERM from CLI to runtime (#9127)
* Properly propagate SIGINT/SIGTERM to spice runtime
* Lint
* fix: Use the same vortex dependency as ballista (#9123)
* fix: Use the same vortex dependency as ballista
* fix: Add patch
* fix: Update to specific rev
* build(deps): combined dependabot updates (#9128)
* build(deps): bump actions/setup-python from 6.1.0 to 6.2.0

  Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.1.0 to 6.2.0.
  - [Release notes](https://github.com/actions/setup-python/releases)
  - [Commits](https://github.com/actions/setup-python/compare/83679a892e2d95755f2dac6acb0bfd1e9ac5d548...a309ff8b426b58ec0e2a45f0f869d46889d02405)

  ---
  updated-dependencies:
  - dependency-name: actions/setup-python
    dependency-version: 6.2.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
* build(deps): bump azure_core from 0.30.1 to 0.31.0

  Bumps [azure_core](https://github.com/azure/azure-sdk-for-rust) from 0.30.1 to 0.31.0.
  - [Release notes](https://github.com/azure/azure-sdk-for-rust/releases)
  - [Commits](https://github.com/azure/azure-sdk-for-rust/compare/azure_core@0.30.1...azure_core@0.31.0)

  ---
  updated-dependencies:
  - dependency-name: azure_core
    dependency-version: 0.31.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
* build(deps): bump rustyline from 15.0.0 to 17.0.2

  Bumps [rustyline](https://github.com/kkawakam/rustyline) from 15.0.0 to 17.0.2.
  - [Release notes](https://github.com/kkawakam/rustyline/releases)
  - [Changelog](https://github.com/kkawak…
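For readers who want a feel for the PollNow wake-up described in the #9098 commit above, here is a minimal sketch of an interruptible idle sleep. It is not the code from this PR: the commit mentions a shared Notify handle, whereas this sketch uses only the standard library (`Mutex` plus `Condvar`) so it stays self-contained. The names `PollNowSignal`, `notify`, `wait_idle`, and `demo_interrupt` are hypothetical and chosen for illustration only.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared wake-up handle (assumption: the real
/// code uses an async Notify; this std-only version mimics its one-permit behavior).
#[derive(Default)]
struct PollNowSignal {
    pending: Mutex<bool>,
    wake: Condvar,
}

impl PollNowSignal {
    /// Called when a PollNow message arrives on the control stream.
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.wake.notify_one();
    }

    /// Sleeps for at most `idle`, returning early if a signal is pending.
    /// Returns true when the wait was cut short by `notify`.
    fn wait_idle(&self, idle: Duration) -> bool {
        let guard = self.pending.lock().unwrap();
        // Keep waiting while no signal is pending; spurious wake-ups are
        // handled by the predicate being re-checked.
        let (mut guard, _timed_out) = self
            .wake
            .wait_timeout_while(guard, idle, |pending| !*pending)
            .unwrap();
        let woken = *guard;
        *guard = false; // consume the signal so the next wait sleeps again
        woken
    }
}

/// Runs one idle wait while another thread sends the signal shortly after.
/// Returns whether the wait was interrupted and how long it took.
fn demo_interrupt(idle: Duration) -> (bool, Duration) {
    let signal = Arc::new(PollNowSignal::default());
    let sender = Arc::clone(&signal);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        sender.notify();
    });
    let start = Instant::now();
    let woken = signal.wait_idle(idle);
    let elapsed = start.elapsed();
    handle.join().unwrap();
    (woken, elapsed)
}
```

In a poll loop shaped like this, the idle branch would call `wait_idle` with the 100ms interval and then poll the scheduler either way, so a missed signal only costs one ordinary idle period.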

phillipleblanc added this to the v1.12.0 milestone Jan 26, 2026

phillipleblanc self-assigned this Jan 26, 2026

phillipleblanc requested a review from a team as a code owner January 26, 2026 07:02

Copilot AI review requested due to automatic review settings January 26, 2026 07:02

phillipleblanc added the kind/enhancement (New feature or request) and area/distributed-query labels Jan 26, 2026

github-actions Bot added the area/config, area/docs, kind/dependencies (Pull requests that update a dependency file), and size/xl labels Jan 26, 2026

Copilot started reviewing on behalf of phillipleblanc January 26, 2026 07:03

Copilot AI reviewed Jan 26, 2026

Comment thread crates/runtime/src/datafusion/mod.rs Outdated

Jeadie reviewed Jan 26, 2026


Comment thread crates/runtime/src/datafusion/mod.rs Outdated

Jeadie reviewed Jan 26, 2026


Comment thread crates/runtime/src/cluster/service.rs Outdated

Jeadie previously approved these changes Jan 26, 2026


phillipleblanc dismissed Jeadie’s stale review via 9389649 January 26, 2026 10:49

phillipleblanc force-pushed the phillip/260126-poll-loop-interrupt branch from a9bad9b to 9389649 January 26, 2026 10:49

Copilot AI review requested due to automatic review settings January 26, 2026 19:53

phillipleblanc force-pushed the phillip/260126-poll-loop-interrupt branch from 9389649 to f9afc41 January 26, 2026 19:53

github-actions Bot added the size/l label Jan 26, 2026

Copilot started reviewing on behalf of phillipleblanc January 26, 2026 19:53

Copilot AI reviewed Jan 26, 2026

sgrebnov and others added 2 commits January 26, 2026 12:13

Fix testoperator dispatch (#9097)

70e60e8

Rename ExecutorStreamRegistry to ExecutorControlStreamRegistry for clarity and update references throughout the cluster and datafusion modules.

c3b30dd

lukekim previously approved these changes Jan 26, 2026


Merge branch 'trunk' into phillip/260126-poll-loop-interrupt

6a7dd5b

Copilot AI review requested due to automatic review settings January 26, 2026 20:29

lukekim dismissed their stale review via 6a7dd5b January 26, 2026 20:29

Copilot started reviewing on behalf of lukekim January 26, 2026 20:30

Copilot AI reviewed Jan 26, 2026

Comment thread crates/runtime/src/cluster/mod.rs Outdated

Comment thread Cargo.toml

Remove unused scheduler configuration from create_scheduler_server function

4bf244e

lukekim previously approved these changes Jan 26, 2026


Merge branch 'trunk' into phillip/260126-poll-loop-interrupt

65d13c4

Copilot AI review requested due to automatic review settings January 26, 2026 21:56

lukekim dismissed their stale review via 65d13c4 January 26, 2026 21:56

lukekim enabled auto-merge January 26, 2026 21:56

lukekim approved these changes Jan 26, 2026

Copilot started reviewing on behalf of lukekim January 26, 2026 21:56

Copilot AI reviewed Jan 26, 2026

lukekim added this pull request to the merge queue Jan 26, 2026

Merged via the queue into trunk with commit 50ca902 Jan 26, 2026
60 of 67 checks passed

lukekim deleted the phillip/260126-poll-loop-interrupt branch January 26, 2026 22:58


Add PollNow interrupt for Ballista executors to reduce task scheduling latency #9098
lukekim merged 6 commits into trunk from phillip/260126-poll-loop-interrupt


5 participants


github-actions Bot commented Jan 26, 2026 (edited)

✅ Pull with Spice Passed

Passing checks:
