[pull] trunk from spiceai:trunk by pull[bot] · Pull Request #803 · TheRakeshPurohit/spiceai

pull · 2026-05-04T21:06:18Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* feat(tools): add tool registry support and enhance tool usage options Co-authored-by: Copilot <copilot@github.com> * style: format code for better readability in ToolDocument and tokenization functions * feat(tools): add support for tool embedding model in registry and responses Co-authored-by: Copilot <copilot@github.com> * feat(tools): enhance tool options and registry embedding model handling Co-authored-by: Copilot <copilot@github.com> * feat(rrf): integrate DEFAULT_RRF_K for configurable ranking and enhance fusion scoring Co-authored-by: Copilot <copilot@github.com> * feat(tools): enhance tool handling with new options for searchable discovery and auto-selection Co-authored-by: Copilot <copilot@github.com> * feat(tools): add 'All' option for direct access to all builtin tools Co-authored-by: Copilot <copilot@github.com> * feat(tools): update descriptions for tool options and enhance tool retrieval logic * feat(tools): enhance error handling for table allowlist creation and update tool descriptions * feat(tools): improve embedding model selection logic for searchable tool discovery Co-authored-by: Copilot <copilot@github.com> * feat(tools): add searchable tool registry endpoint and enhance tool retrieval logic Co-authored-by: Copilot <copilot@github.com> * feat(tools): add optional embedding model parameter to searchable tool registry endpoint * feat(tools): rename tool_embedding_model to embedding_model in SearchToolsQuery * feat(tools): refactor table allowlist creation and enhance tool document handling Co-authored-by: Copilot <copilot@github.com> * feat(tools): refactor table allowlist creation and enhance error handling in tool registry * feat(tests): simplify mock tool parameters in test cases Co-authored-by: Copilot <copilot@github.com> * refactor(tools): improve tool registry and utility functions for better clarity and performance Co-authored-by: Copilot <copilot@github.com> * feat(tools): enhance tool registry with caching and improve tool retrieval logic Co-authored-by: Copilot <copilot@github.com> * ci: guard Install protoc step with relevant_changes check (#10646) The Install protoc step in the Build Test Binary job referenced the local ./.github/actions/install-protoc action but lacked the relevant_changes conditional guard that gates the surrounding steps. When a PR contains no relevant code changes (e.g. openapi.json or spicepod schema only), actions/checkout is skipped, leaving the workspace empty. The Install protoc step then fails immediately at job setup with "Can't find action.yml... install-protoc". This blocks all non-Rust PRs from completing the integration test workflow. Adding the same `if: needs.check_changes.outputs.relevant_changes == 'true'` guard restores the intended skip-on-no-relevant-changes behavior. * fix(benchmarks): redact non-deterministic partition_sizes in explain plan snapshots (#10641) * refactor(tools): update tool_registry_tools and caching logic for improved clarity and performance Co-authored-by: Copilot <copilot@github.com> * fix(tools): address searchable registry review feedback * feat(tools): enhance tool registry preparation logic and add specific tool handling Co-authored-by: Copilot <copilot@github.com> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: claudespice <claude@spice.ai> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>

…tore slice (stacked) (#10651) * fix(snapshot): flush SQLite/Turso WAL before snapshotting SQLite and Turso accelerators run with WAL journal mode. The pre-existing snapshot path performed `fs::copy(live_db, temp_copy)` directly, which captured only the durable pages — every uncheckpointed write (typically all writes since the last automatic checkpoint) lived only in the `-wal` sidecar and was silently lost in the upload. This bug was invisible under `refresh_mode: full|append|changes` because the federated source repopulates on bootstrap. It surfaces as data loss with `refresh_mode: snapshot` (separate change) where the snapshot is the only data source. Fixes by: 1. Adding `SnapshotEngine::checkpoint_live` (default no-op) — invoked by `SnapshotManager::create_file_snapshot` *before* `fs::copy` while the accelerator write lock is held. 2. New `SqliteSnapshotEngine` overrides `checkpoint_live` to run `PRAGMA wal_checkpoint(TRUNCATE)` against the live file via a short-lived rusqlite connection. WAL mode allows multiple connections; the existing accelerator write lock guarantees no concurrent writers race the checkpoint. 3. `SqliteSnapshotEngine::prepare_for_upload` switches the *copied* file to `journal_mode=DELETE` as defense-in-depth: even if the copy ever ended up with stale `-wal`/`-shm` sidecars next to it, the uploaded file is guaranteed to be self-contained. 4. New `TursoSnapshotEngine` reuses the SqliteSnapshotEngine logic (libsql is on-disk-compatible with SQLite). The `turso` feature now pulls in rusqlite for this purpose. 5. `create_snapshot_engine` now returns the engine-specific impls for `AccelerationEngine::Sqlite` and `::Turso` instead of `DefaultSnapshotEngine`. Closes #10643 * feat(acceleration): add refresh_mode: snapshot for snapshot-only refreshes Adds a new refresh mode that loads accelerated tables only by polling the snapshot store for newer snapshots, never querying the federated source. Use cases: read replicas, expensive/slow sources, air-gapped propagation. Behavior: * `RefreshMode::Snapshot` is a new variant accepted in spicepod `acceleration.refresh_mode` (and is rejected by the validator unless `acceleration.snapshots` is enabled). * On startup, the dataset bootstraps from the snapshot store as today. * The refresh task polls the snapshot store on a configurable interval (default 60 s) via the new `SnapshotManager::download_if_newer(current_local_id, validate_schema)` API. The schema validator runs against the snapshot's `metadata.json` *before* download so a mismatched snapshot can never overwrite the primary file. * Per-engine reload is performed via a new `Accelerator::reload_from_snapshot` trait method. DuckDB/SQLite/ Turso impls evict pool connections and reopen at the same primary path; the live federated query path serves the prior inode until the swap completes via `SwappableTableProvider` (RwLock-backed `Arc<dyn TableProvider>` that caches schema/constraints/table_type). * `INSERT INTO <dataset>` is rejected with a clear error when the accelerator's refresh_mode is Snapshot. * Reload + swap is serialized against concurrent accelerator writes via the existing accelerator write mutex. * Atomic file write on download (temp + fsync + rename + parent fsync). Integration tests: * `snapshot_refresh::duckdb` — bootstrap → writer change → reader picks up new snapshot → swap → query reflects change. * `snapshot_refresh::sqlite` — same shape; passes thanks to the SQLite WAL flush in the parent commit. * `snapshot_refresh::turso` — same shape; same WAL flush dependency. CI: new `Snapshot Refresh Mode Integration Tests` job in integration.yml runs DuckDB/SQLite/Turso variants against MinIO. Cayenne is intentionally not covered in this commit: portable Cayenne snapshots require a per-dataset metastore export/import refactor (addressed in the next commit). The snapshot_refresh test directory omits Cayenne until then. Depends on the SQLite/Turso WAL flush from the previous commit; without it the SQLite/Turso integration tests cannot pass. Pulls in the upstream pool-eviction APIs from datafusion-contrib/datafusion-table-providers#635 (`invalidate_instance`, `invalidate_file_instance`). * feat(cayenne): add per-dataset metastore export/import (snapshot foundation) Adds the foundation for portable, per-dataset Cayenne metastore snapshots: a versioned-JSON 'slice' format that captures only one dataset's rows from cayenne_table and its dependent tables, with path columns rewritten to be relative to a configurable anchor. Why: The legacy Cayenne snapshot format archived the entire cayenne.db SQLite file. That approach forced a one-dataset-per-metadata-directory limit (see validate_cayenne_snapshot_consistency) because two snapshots would clobber each other's metastore rows on extract, and produced snapshots that were not portable across nodes whose data directories did not match the writer's absolute paths (#10642). It also raced with the reader process's eager metastore initialization (#10649): SQLite journal sidecars created during init triggered checksum mismatches against the archive on extract. This commit lays the groundwork to fix all three issues by introducing: * `crates/cayenne/src/metastore/snapshot.rs` — the slice format and export/import logic: - `DatasetMetastoreSlice` (versioned JSON, format_version: 1) with one entry per metastore table from EXPECTED_TABLES. - `SliceValue` mirrors MetastoreValue but JSON-friendly (blobs base64-encoded under the 'x' tag). - `export_dataset(metastore, dataset_name, anchor)` selects every row that belongs to the dataset (the cayenne_table row plus all rows in dependent tables matching the same table_id) and emits a slice. Path columns (cayenne_table.path, cayenne_delete_file.path, cayenne_partition.path) are rewritten relative to `anchor` so the slice contains no absolute filesystem paths. - `import_dataset(metastore, slice, anchor)` runs inside a single transaction: deletes any existing rows for the same dataset_name (FK ON DELETE CASCADE removes all dependent rows), then inserts the slice's rows with paths rewritten back to absolute form anchored at `anchor`. Unsupported format_version or engine mismatch are rejected up front. * `MetastoreRow::get_value(index) -> MetastoreValue` — a new trait method that returns the raw column value without type coercion, so generic export logic can serialize columns without knowing each column's expected type at compile time. SQLite and Turso impls add a one-line implementation that clones from their existing values Vec. Tests in `crates/cayenne/src/metastore/snapshot.rs::tests`: * `round_trip_preserves_rows_and_relocates_paths` — exports a dataset from one metastore, imports into a fresh metastore at a different anchor, verifies all partitions resolve to the new anchor. * `import_replaces_prior_dataset_rows_wholesale` — confirms the DuckDB-style snapshot-refresh semantic that import wipes any prior rows for the same dataset_name before inserting (no leftover partitions). * `import_leaves_other_datasets_untouched` — confirms slicing is correctly scoped: importing dataset A does not affect dataset B's rows in the same shared metastore. * `rejects_unsupported_format_version` — both wrong format_version and wrong engine identifier are rejected with clear error messages *before* any DML runs. * `json_round_trip` — slice serializes and parses back to an equivalent slice. Follow-up (separate commit / PR): wire `CayenneSnapshotEngine` into `SnapshotManager`'s directory create/extract paths so the new format is actually used by snapshot upload/download. With that follow-up: * the cayenne.db file leaves the archive (replaces #10649), * paths become portable (closes #10642), * the single-dataset-per-metadata-dir validation can be lifted, * Cayenne refresh_mode: snapshot becomes a passing integration test. * feat(cayenne): wire CayenneSnapshotEngine into snapshot pipeline This is Commit 4 of the refresh_mode: snapshot stack. Commit 3 added the metastore export/import API; this commit wires it into the SnapshotManager upload/download pipeline so Cayenne snapshots actually use the per-dataset slice format instead of shipping raw cayenne.db. Closes: - #10642 (Cayenne snapshots not portable across data dirs) - #10649 (Cayenne metastore-init races with checksum) Architecture: * Trait extension in runtime-acceleration::snapshot::engine: Adds two new SnapshotEngine methods, each with a no-op default so DuckDB/SQLite/Turso continue to behave exactly as before: - prepare_directory_snapshot(dirs, dataset_name) -> DirectorySnapshotPlan Returns (a) a list of file paths *relative to each source dir* that should be excluded from the tar, and (b) extra in-memory entries to append at the end of the tar. - finalize_directory_snapshot(dirs, dataset_name, extras) Engine-specific post-extract hook. Plus a new SnapshotEngineError::Custom { message } variant and a SnapshotEngineError::from_display() constructor so engines defined outside runtime-acceleration (CayenneSnapshotEngine in the runtime crate) can surface their own rich errors without needing a feature-gated variant in the trait crate. * directory_archive: new archive_directories_with_plan() that takes skip_relative_paths + extras. The original archive_directories() is preserved as a thin wrapper. The new add_directory_to_archive_filtered helper handles the recursive walk while honoring the skip set. The module remains engine-agnostic — the skip predicate is just a HashSet passed in. * SnapshotManager: - create_directory_snapshot now calls prepare_directory_snapshot first, then archive_directories_with_plan with the engine's skip list and extras. - download_to_directories now calls finalize_directory_snapshot after extract. - New with_snapshot_engine(self, Arc<dyn SnapshotEngine>) builder so accelerators can inject a custom engine. * download_snapshot_if_needed and snapshot_before_recreate gain an Option<Arc<dyn SnapshotEngine>> parameter. DuckDB/SQLite/Turso pass None (default behavior). Cayenne builds and passes a CayenneSnapshotEngine. * Cayenne accelerator (crates/runtime/src/dataaccelerator/cayenne): - New module snapshot_engine.rs implements CayenneSnapshotEngine. Holds an Arc<dyn MetadataCatalog>, dataset_name, and data_dir_anchor. prepare_directory_snapshot exports the per-dataset slice via MetadataCatalog::export_dataset_slice and instructs the archiver to skip cayenne.db / -wal / -shm. finalize_directory_snapshot reads the slice JSON back from the extracted directory and calls MetadataCatalog::import_dataset_slice. - The cayenne accelerator's bootstrap path constructs a CayenneSnapshotEngine using its existing get_or_create_catalog machinery and threads it through download_snapshot_if_needed. * cayenne::MetadataCatalog: new export_dataset_slice and import_dataset_slice trait methods (default impls return InvalidOperationNoSource); CayenneCatalog overrides both with dispatch to the underlying SQLite or libsql metastore. * Lifted validate_cayenne_snapshot_consistency's SharedAcceleration restriction. With per-dataset slices, multiple Cayenne datasets sharing a metastore directory can each ship their own snapshot without clobbering each other's metastore rows on extract. The InconsistentSnapshotSettings (mixed enabled/disabled) check stays. * Re-enabled the Cayenne snapshot_refresh integration test (snapshot_refresh::cayenne::snapshot_refresh_cayenne_bootstrap_then_refresh). Tests: - 3 new unit tests in cayenne::snapshot_engine::tests: * create_directory_snapshot_skips_cayenne_db_and_emits_slice * refuses_mismatched_dataset * finalize_missing_slice_returns_clear_error - validate_snapshots tests updated; the test_cayenne_shared_acceleration_with_snapshots_errors test is renamed to ..._now_supported and asserts Ok. - Existing 91 runtime-acceleration::snapshot lib tests still pass. - Cayenne integration test runs against real S3 in CI (same harness as the DuckDB/SQLite/Turso refresh-mode tests). * fix(snapshot): bootstrap dataset checkpoint from snapshot metadata When a downloaded snapshot doesn't carry a populated _dataset_checkpoint row (the steady state for Cayenne, where the on-disk archive ships the per-dataset metastore slice JSON instead of the raw cayenne.db), the post-extract Checkpointer::get_schema() returns None and bootstrap was failing with MissingSchema. The metadata.json carries the dataset's verified schema; after import we now materialize that schema into the local checkpoint via Checkpointer::checkpoint(metadata_schema, None) so the spice_sys side matches the snapshot. For DuckDB / SQLite / Turso this branch is unreachable in steady state (their archives ship _dataset_checkpoint already); on a corrupted snapshot the self-heal is harmless because the metadata schema is the same one we'd otherwise have validated against. This unblocks refresh_mode: snapshot for Cayenne and re-enables the snapshot_refresh::cayenne integration test by default. Closes #10658.

* Add Delta Lake Azure tenant parameter * Fix Delta Lake tenant parameter lint

* Add provider-aware LLM prompt caching * feat: Add prompt cache key support and related tests for OpenAI models * fix: Update prompt tokens details initialization to use `then_some` for clarity * refactor: Update LOCAL_LLM_MAX_SEQS to NonZeroUsize and improve request handling in ResponsesWrapper * docs: Add LLM prompt caching internals * docs: Expand project docs index * feat: Add tests for prompt cache point handling in Bedrock and Databricks * ci: guard Install protoc step with relevant_changes check (#10646) The Install protoc step in the Build Test Binary job referenced the local ./.github/actions/install-protoc action but lacked the relevant_changes conditional guard that gates the surrounding steps. When a PR contains no relevant code changes (e.g. openapi.json or spicepod schema only), actions/checkout is skipped, leaving the workspace empty. The Install protoc step then fails immediately at job setup with "Can't find action.yml... install-protoc". This blocks all non-Rust PRs from completing the integration test workflow. Adding the same `if: needs.check_changes.outputs.relevant_changes == 'true'` guard restores the intended skip-on-no-relevant-changes behavior. * fix(benchmarks): redact non-deterministic partition_sizes in explain plan snapshots (#10641) * fix(mistral): refine paged attention support check and update documentation for CUDA backend * fix(docs): format provider mappings table for clarity in llm-prompt-caching documentation * fix(mistral): update LOCAL_LLM_MAX_SEQS initialization for clarity and safety fix(wrapper): improve error message for model parameter parsing * docs(llms): explain local LLM scheduler limit * Fix Anthropic streaming cache usage accounting * feat: use saturating addition for token counts in CompletionUsage --------- Co-authored-by: claudespice <claude@spice.ai> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>

* Fix DuckDB HNSW vector indexes lost after data refresh * Improve * Improve tracing * Fix lint

lukekim and others added 6 commits May 4, 2026 14:44

Add integration test for HTTP dynamic headers with pagination (#10663)

a434828

Add Delta Lake Azure tenant parameter (#10671)

393aef4

* Add Delta Lake Azure tenant parameter * Fix Delta Lake tenant parameter lint

Fix DuckDB HNSW vector indexes lost after data refresh (#10668)

5b1b824

* Fix DuckDB HNSW vector indexes lost after data refresh * Improve * Improve tracing * Fix lint

pull Bot locked and limited conversation to collaborators May 4, 2026

pull Bot added the ⤵️ pull label May 4, 2026

pull Bot merged commit 5b1b824 into TheRakeshPurohit:trunk May 4, 2026
1 of 13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] trunk from spiceai:trunk#803

[pull] trunk from spiceai:trunk#803
pull[bot] merged 6 commits into
TheRakeshPurohit:trunkfrom
spiceai:trunk

pull Bot commented May 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pull Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pull Bot commented May 4, 2026 •

edited

Loading