[pull] trunk from spiceai:trunk by pull[bot] · Pull Request #834 · TheRakeshPurohit/spiceai

pull · 2026-05-14T15:06:19Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

* Update spiceio setup action to v0.5.2
* fix: update spiceio setup action to v0.5.4 in workflow files

* Enable parallel Cayenne Vortex writes
* Refactor CayenneTableProvider to accept target_partitions in staged append and update related methods
* Add upload and write concurrency options to main function
* Add upload and write concurrency options to configuration
* Refactor Cayenne catalog parameters to remove prefix from component names

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>

* Add range fallback for join accumulator
* Add join accumulator tests and benchmarks
* Defer join accumulator range bounds
* Improve
* Add CayenneJoinRewriter to DataFusionBuilder and implement tests for optimizer rules
* Enhance Cayenne join optimization and add benchmarks
  - Refactor join logic to use a unified hash_join function for better readability.
  - Introduce new benchmarks for join accumulator transitions and memory limits.
  - Implement memory limit configuration for hash joins in DataFusionBuilder.
  - Add tests to validate memory limit behavior for exact join filters.
  - Create snapshots for Cayenne probe join optimizations.
* Implement min/max value calculations for various data types in RangeBounds
* Add datafusion-pruning dependency and enhance memory management for exact join filters
* Enhance Cayenne context and execution plan with projection pushdown and improve table statistics persistence
* Add support for MongoDB change streams and replica set initialization in DuckDB feature
* Enhance Cayenne table provider with cached table statistics management and loading functionality
* Add tests for table statistics serialization and inexact value downgrades in CayenneTableProvider
* Enhance SpicePhysicalCodec with support for serializable hash joins and nested physical plan decoding
* Refactor bloom_hashes to use BloomHashStream for improved hashing with multiple streams
* Refactor in-list memory management to use shared memory limit across accumulators
* Refactor BloomFilter and range handling for improved memory management and type safety
* Add shared in-list memory budget configuration and clean up unused DuckDB code
* Implement PkDeletionSnapshot for improved deletion handling and add test for empty batch aggregation
* Enhance KeyBasedDeletionFilterStream to handle empty batches and improve error handling for primary key column indices
* Add twox-hash dependency and refactor BloomFilter for improved handling of NaN values
* Replace twox-hash with blake3 for improved hashing in BloomFilter and update Cargo dependencies
* Remove blake3 dependency from Cargo.toml and Cargo.lock; refactor BloomFilter to use DataFusion's hashing utilities
* Refactor CayenneTableProvider to improve code readability and maintainability
* Refactor runtime_env function and simplify memory limit calculations in builder.rs
* Refactor join key extraction in HashJoinExec and update plan_snapshot function signature for consistency
* refactor: rename and update in-list memory budget function for clarity and correctness
* refactor: enhance test module by importing additional Arrow types for improved memory source configuration

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>

* Add MongoDB Change Streams support
* fix(mongodb): improve error messages for Change Stream connection failures
* feat(mongodb): enhance Change Stream functionality with resume token support
* Fix MongoDB change stream clippy lints
* Fix test
* Persist MongoDB Change Stream resume tokens across restarts (#10817)

  Adds a `spice_sys_mongodb` sidecar table (per-engine: duckdb, sqlite, postgres, turso) that stores the most recent Change Stream resume token, optional cluster time, and an optional Arrow schema snapshot for drift detection. The MongoDB connector commits the token to the sidecar after each batch is persisted to the accelerator (at-least-once), so a restart resumes the stream from the last persisted position instead of always re-snapshotting the collection.

  - Mirrors the existing `spice_sys_dynamodb_streams`/`spice_sys_kafka` pattern: `MongoSys::try_new` / `get` / `upsert` / `delete` with per-engine impls behind feature gates.
  - Wires a `MongoResumeTokenCommitter` into `build_changes_stream`, forking on a persisted token: resume from token (skip snapshot) vs. cold bootstrap (truncate + snapshot + ready + commit captured token). The initial token is piggy-backed onto the ready signal envelope so a crash mid-snapshot re-bootstraps cleanly on the next start.
  - Detects stale-token responses from the driver (`ChangeStreamHistoryLost` 286 / `ChangeStreamFatalError` 280) and dispatches on a new `mongodb_resume_token_invalid_behavior` parameter (`error` default, `rebootstrap` opt-in), modeled on the DynamoDB lag fallback.
  - Logs a warning on schema drift between runs (does not fail).
  - Documents schema, write cadence, and recovery semantics under docs/features/mongodb-change-streams.md.

  Co-authored-by: Claude <noreply@anthropic.com>
* Lint
* Enhance MongoDB integration: update Change Stream handling, improve SQL queries, and refine test dataset creation
* Refactor primary key extraction to accept row index and remove CDC config setup from RuntimeBuilder
* Enhance CDC Configuration and Hash Index Handling in Arrow Acceleration
  - Introduced a mechanism to set CDC tunables at runtime startup, allowing for better performance by avoiding repeated lookups.
  - Updated the `is_hash_index_enabled` method to reflect changes in how hash indexing is determined, now relying on primary keys and indexes rather than explicit parameters.
  - Removed the `hash_index` parameter from Arrow and PartitionedArrow accelerators, simplifying the configuration and ensuring hash indexing is automatically enabled when primary keys or indexes are present.
  - Added tests to validate the new behavior of hash indexing and ensure that parameters are correctly ignored when not applicable.
  - Enhanced documentation for PostgreSQL replication parameters, including new runtime CDC tuning options for improved throughput and performance.
* Fix formatting in documentation for MongoDB Change Stream resume tokens
* Refactor MongoDB Change Stream functions to improve handling of optional parameters and streamline code

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Claude <noreply@anthropic.com>
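For readers skimming the resume-token commit above, the startup fork and the stale-token dispatch it describes can be sketched roughly as follows. This is an illustrative sketch only, not code from the repository: the types `InvalidTokenBehavior`, `StartupPlan`, and `StaleTokenAction` and the functions `plan_startup`, `is_stale_token_error`, and `on_driver_error` are hypothetical names. Only the two error codes, the parameter values (`error`, `rebootstrap`), and the resume-versus-bootstrap split come from the commit message.

```rust
/// Hypothetical stand-in for the `mongodb_resume_token_invalid_behavior` parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InvalidTokenBehavior {
    /// `error`: the default according to the commit message.
    Error,
    /// `rebootstrap`: opt-in.
    Rebootstrap,
}

/// What the connector does at startup (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum StartupPlan {
    /// A token was persisted: resume from it and skip the snapshot.
    ResumeFromToken(String),
    /// No token: truncate, snapshot, signal ready, then commit the captured token.
    ColdBootstrap,
}

/// Outcome of inspecting a driver error code (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum StaleTokenAction {
    Fail,
    Rebootstrap,
    NotStale,
}

// Server error codes named in the commit message.
const CHANGE_STREAM_FATAL_ERROR: i32 = 280;
const CHANGE_STREAM_HISTORY_LOST: i32 = 286;

/// Fork on whether the sidecar table returned a persisted resume token.
fn plan_startup(persisted_token: Option<String>) -> StartupPlan {
    match persisted_token {
        Some(token) => StartupPlan::ResumeFromToken(token),
        None => StartupPlan::ColdBootstrap,
    }
}

/// True when the error code indicates the stored token can no longer be used.
fn is_stale_token_error(code: i32) -> bool {
    matches!(code, CHANGE_STREAM_FATAL_ERROR | CHANGE_STREAM_HISTORY_LOST)
}

/// Dispatch a driver error code on the configured behavior.
fn on_driver_error(code: i32, behavior: InvalidTokenBehavior) -> StaleTokenAction {
    if !is_stale_token_error(code) {
        return StaleTokenAction::NotStale;
    }
    match behavior {
        InvalidTokenBehavior::Error => StaleTokenAction::Fail,
        InvalidTokenBehavior::Rebootstrap => StaleTokenAction::Rebootstrap,
    }
}
```

Under these assumptions, a restart with a persisted token maps to `StartupPlan::ResumeFromToken`, and a 286 response with the default behavior maps to `StaleTokenAction::Fail`. The actual connector presumably threads this through `build_changes_stream` and the `MongoResumeTokenCommitter`, whose signatures are not shown in this PR page.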

lukekim and others added 7 commits May 14, 2026 18:36

Update spiceio setup action to v0.5.4 (#10830)

217392e

Fix minio setup step (#10834)

d76d7ab

Move point lookup query files out of test/spicepods to fix validator (#10835)

e002354

Fix CH-BenCHmark replication lag metrics calculation (#10836)

2f87657

pull Bot locked and limited conversation to collaborators May 14, 2026

pull Bot added the ⤵️ pull label May 14, 2026

pull Bot merged commit 2f87657 into TheRakeshPurohit:trunk May 14, 2026
0 of 3 checks passed

pull[bot] merged 7 commits into TheRakeshPurohit:trunk from spiceai:trunk

2 participants
