[pull] trunk from spiceai:trunk #834
Merged
* Update spiceio setup action to v0.5.2
* fix: update spiceio setup action to v0.5.4 in workflow files
* Enable parallel Cayenne Vortex writes
* Refactor CayenneTableProvider to accept target_partitions in staged append and update related methods
* Add upload and write concurrency options to main function
* Add upload and write concurrency options to configuration
* Refactor Cayenne catalog parameters to remove prefix from component names

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
* Add range fallback for join accumulator
* Add join accumulator tests and benchmarks
* Defer join accumulator range bounds
* Improve
* Add CayenneJoinRewriter to DataFusionBuilder and implement tests for optimizer rules
* Enhance Cayenne join optimization and add benchmarks
  - Refactor join logic to use a unified hash_join function for better readability.
  - Introduce new benchmarks for join accumulator transitions and memory limits.
  - Implement memory limit configuration for hash joins in DataFusionBuilder.
  - Add tests to validate memory limit behavior for exact join filters.
  - Create snapshots for Cayenne probe join optimizations.
* Implement min/max value calculations for various data types in RangeBounds
* Add datafusion-pruning dependency and enhance memory management for exact join filters
* Enhance Cayenne context and execution plan with projection pushdown and improve table statistics persistence
* Add support for MongoDB change streams and replica set initialization in DuckDB feature
* Enhance Cayenne table provider with cached table statistics management and loading functionality
* Add tests for table statistics serialization and inexact value downgrades in CayenneTableProvider
* Enhance SpicePhysicalCodec with support for serializable hash joins and nested physical plan decoding
* Refactor bloom_hashes to use BloomHashStream for improved hashing with multiple streams
* Refactor in-list memory management to use shared memory limit across accumulators
* Refactor BloomFilter and range handling for improved memory management and type safety
* Add shared in-list memory budget configuration and clean up unused DuckDB code
* Implement PkDeletionSnapshot for improved deletion handling and add test for empty batch aggregation
* Enhance KeyBasedDeletionFilterStream to handle empty batches and improve error handling for primary key column indices
* Add twox-hash dependency and refactor BloomFilter for improved handling of NaN values
* Replace twox-hash with blake3 for improved hashing in BloomFilter and update Cargo dependencies
* Remove blake3 dependency from Cargo.toml and Cargo.lock; refactor BloomFilter to use DataFusion's hashing utilities
* Refactor CayenneTableProvider to improve code readability and maintainability
* Refactor runtime_env function and simplify memory limit calculations in builder.rs
* Refactor join key extraction in HashJoinExec and update plan_snapshot function signature for consistency
* refactor: rename and update in-list memory budget function for clarity and correctness
* refactor: enhance test module by importing additional Arrow types for improved memory source configuration

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
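The commit list above mentions a "range fallback" for the join accumulator together with an in-list memory budget. The sketch below is one plausible reading of that idea, not the code from this PR: exact build-side keys are collected until a budget is exceeded, after which only min/max bounds are kept. Every name here (`KeyAccumulator`, `JoinKeyFilter`, `max_exact_keys`) is hypothetical, keys are simplified to `i64`, the budget is a key count rather than bytes, and bounds are tracked eagerly even though the PR says it defers them.

```rust
use std::collections::BTreeSet;

/// Filter that could be pushed to the probe side of a join (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JoinKeyFilter {
    /// Exact set of build-side keys.
    InList(BTreeSet<i64>),
    /// Coarser fallback: only the key range is known.
    Range { min: i64, max: i64 },
}

/// Collects build-side join keys under a budget (hypothetical type).
#[derive(Debug)]
pub struct KeyAccumulator {
    /// Assumed budget, expressed as a number of distinct keys for simplicity.
    max_exact_keys: usize,
    /// `None` once the budget was exceeded and the exact set was dropped.
    exact: Option<BTreeSet<i64>>,
    /// Running (min, max) over every key seen so far.
    bounds: Option<(i64, i64)>,
}

impl KeyAccumulator {
    pub fn new(max_exact_keys: usize) -> Self {
        Self { max_exact_keys, exact: Some(BTreeSet::new()), bounds: None }
    }

    pub fn push(&mut self, key: i64) {
        // Always widen the range so the fallback is available at any time.
        self.bounds = Some(match self.bounds {
            Some((lo, hi)) => (lo.min(key), hi.max(key)),
            None => (key, key),
        });
        // Keep the exact set only while it fits the budget.
        let over_budget = match self.exact.as_mut() {
            Some(set) => {
                set.insert(key);
                set.len() > self.max_exact_keys
            }
            None => false,
        };
        if over_budget {
            self.exact = None;
        }
    }

    /// Returns the most precise filter still available, or `None` if no keys were seen.
    pub fn finish(self) -> Option<JoinKeyFilter> {
        match (self.exact, self.bounds) {
            (Some(set), Some(_)) => Some(JoinKeyFilter::InList(set)),
            (None, Some((min, max))) => Some(JoinKeyFilter::Range { min, max }),
            (_, None) => None,
        }
    }
}
```

With a budget of two keys, pushing `5` and `1` yields an `InList` filter, while pushing `7`, `3`, and `9` degrades to `Range { min: 3, max: 9 }`. A range filter is less selective than an exact list but has constant memory cost, which is the trade-off a memory-limited accumulator would presumably be making.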
* Add MongoDB Change Streams support
* fix(mongodb): improve error messages for Change Stream connection failures
* feat(mongodb): enhance Change Stream functionality with resume token support
* Fix MongoDB change stream clippy lints
* Fix test
* Persist MongoDB Change Stream resume tokens across restarts (#10817)

  Adds a `spice_sys_mongodb` sidecar table (per-engine: duckdb, sqlite, postgres, turso) that stores the most recent Change Stream resume token, optional cluster time, and an optional Arrow schema snapshot for drift detection. The MongoDB connector commits the token to the sidecar after each batch is persisted to the accelerator (at-least-once), so a restart resumes the stream from the last persisted position instead of always re-snapshotting the collection.

  - Mirrors the existing `spice_sys_dynamodb_streams`/`spice_sys_kafka` pattern: `MongoSys::try_new` / `get` / `upsert` / `delete` with per-engine impls behind feature gates.
  - Wires a `MongoResumeTokenCommitter` into `build_changes_stream`, forking on a persisted token: resume from token (skip snapshot) vs. cold bootstrap (truncate + snapshot + ready + commit captured token). The initial token is piggy-backed onto the ready signal envelope so a crash mid-snapshot re-bootstraps cleanly on the next start.
  - Detects stale-token responses from the driver (`ChangeStreamHistoryLost` 286 / `ChangeStreamFatalError` 280) and dispatches on a new `mongodb_resume_token_invalid_behavior` parameter (`error` default, `rebootstrap` opt-in), modeled on the DynamoDB lag fallback.
  - Logs a warning on schema drift between runs (does not fail).
  - Documents schema, write cadence, and recovery semantics under docs/features/mongodb-change-streams.md.

  Co-authored-by: Claude <noreply@anthropic.com>
* Lint
* Enhance MongoDB integration: update Change Stream handling, improve SQL queries, and refine test dataset creation
* Refactor primary key extraction to accept row index and remove CDC config setup from RuntimeBuilder
* Enhance CDC Configuration and Hash Index Handling in Arrow Acceleration
  - Introduced a mechanism to set CDC tunables at runtime startup, allowing for better performance by avoiding repeated lookups.
  - Updated the `is_hash_index_enabled` method to reflect changes in how hash indexing is determined, now relying on primary keys and indexes rather than explicit parameters.
  - Removed the `hash_index` parameter from Arrow and PartitionedArrow accelerators, simplifying the configuration and ensuring hash indexing is automatically enabled when primary keys or indexes are present.
  - Added tests to validate the new behavior of hash indexing and ensure that parameters are correctly ignored when not applicable.
  - Enhanced documentation for PostgreSQL replication parameters, including new runtime CDC tuning options for improved throughput and performance.
* Fix formatting in documentation for MongoDB Change Stream resume tokens
* Refactor MongoDB Change Stream functions to improve handling of optional parameters and streamline code

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Claude <noreply@anthropic.com>
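The resume-token commit message describes two decisions: whether to resume or cold-bootstrap at startup, and what to do when the server rejects a stale token. A minimal sketch of that control flow follows, assuming nothing about the connector's real types. `StartupPlan`, `InvalidTokenBehavior`, `plan_startup`, and `on_stream_error` are hypothetical names for illustration; only the error codes 280 and 286 and the `error` / `rebootstrap` behaviors come from the commit message.

```rust
/// Values of the `mongodb_resume_token_invalid_behavior` parameter as described
/// in the commit message (the enum itself is a hypothetical stand-in).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InvalidTokenBehavior {
    /// Described default: surface the failure.
    Error,
    /// Opt-in: discard the token and snapshot the collection again.
    Rebootstrap,
}

/// What the connector would do on start (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StartupPlan {
    /// A token was persisted: skip the snapshot and resume the stream from it.
    ResumeFrom(String),
    /// No token: truncate, snapshot, signal ready, then commit the captured token.
    ColdBootstrap,
}

/// Server error codes named in the commit message.
pub const CHANGE_STREAM_FATAL_ERROR: i32 = 280;
pub const CHANGE_STREAM_HISTORY_LOST: i32 = 286;

/// Forks on whether the sidecar table held a resume token.
pub fn plan_startup(persisted_token: Option<String>) -> StartupPlan {
    match persisted_token {
        Some(token) => StartupPlan::ResumeFrom(token),
        None => StartupPlan::ColdBootstrap,
    }
}

/// True for the two codes the commit message treats as a stale resume token.
pub fn is_stale_token_error(code: i32) -> bool {
    matches!(code, CHANGE_STREAM_FATAL_ERROR | CHANGE_STREAM_HISTORY_LOST)
}

/// Dispatches a stream failure on the configured behavior.
/// Non-stale errors are always returned as errors in this sketch.
pub fn on_stream_error(
    code: i32,
    behavior: &InvalidTokenBehavior,
) -> Result<StartupPlan, String> {
    if !is_stale_token_error(code) {
        return Err(format!("change stream failed with code {code}"));
    }
    match behavior {
        InvalidTokenBehavior::Error => {
            Err(format!("resume token is no longer valid (code {code})"))
        }
        InvalidTokenBehavior::Rebootstrap => Ok(StartupPlan::ColdBootstrap),
    }
}
```

In this sketch, `plan_startup(None)` selects the cold bootstrap path, and `on_stream_error(286, &InvalidTokenBehavior::Rebootstrap)` returns the same plan, so a rejected token falls back to a fresh snapshot only when the operator opted in. The at-least-once commit of the token after each persisted batch is not modeled here.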
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )