
[pull] trunk from spiceai:trunk #763

Merged
pull[bot] merged 5 commits into TheRakeshPurohit:trunk from spiceai:trunk
Apr 21, 2026

pull[bot] commented Apr 21, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


phillipleblanc and others added 5 commits April 21, 2026 14:57
In distributed Ballista mode, decoded ParquetSource plans on executors
lose their per-scan parquet_file_reader_factory during proto round-trip
and fall back to runtime_env().object_store(url). The delta_lake
connector did not implement DataConnector::register_object_stores, so
executors built a default S3 store with no region configured, surfacing
as 'Received redirect without LOCATION' errors against buckets outside
us-east-1.

Mirror the databricks connector's implementation: retain the connector
Parameters on DeltaLake, and in register_object_stores parse the
dataset path as the storage URL, encode the AWS/Azure/GCS subset of
params via Parameters::storage_registry_params() into the URL fragment,
and call runtime_env.object_store(&listing_url) so
SpiceObjectStoreRegistry builds and caches a properly-configured store
on each executor.
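
The fragment-encoding step described above can be pictured with a std-only sketch. Everything in it is hypothetical: the helper name `storage_url_with_params`, the key prefixes, and the `key=value&...` fragment layout are assumptions for illustration, not the actual output of `Parameters::storage_registry_params()` or the format `SpiceObjectStoreRegistry` parses.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: keep only storage-related params and append them to
/// the dataset path as a URL fragment, so a registry could rebuild a
/// configured store from the URL alone.
fn storage_url_with_params(dataset_path: &str, params: &BTreeMap<String, String>) -> String {
    // Assumed prefixes for the AWS/Azure/GCS subset; the real key names may differ.
    const STORAGE_PREFIXES: [&str; 3] = ["aws_", "azure_", "google_"];

    // BTreeMap iterates in key order, so the fragment is deterministic.
    // Values are not percent-encoded here; a real implementation would need that.
    let fragment: Vec<String> = params
        .iter()
        .filter(|(k, _)| STORAGE_PREFIXES.iter().any(|p| k.starts_with(*p)))
        .map(|(k, v)| format!("{k}={v}"))
        .collect();

    if fragment.is_empty() {
        dataset_path.to_string()
    } else {
        format!("{dataset_path}#{}", fragment.join("&"))
    }
}
```

With `aws_region = eu-west-1` among the params, `s3://bucket/table` would become `s3://bucket/table#aws_region=eu-west-1`, which is the kind of URL an executor could resolve into a store with the right region.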
* Revert "Revert DF-native DML (#10114)"

This reverts commit bd75c67.

* Lint

* Lint

* Address copilot comments

* Add tests

* Fix

* Lint

* Update distributed Cayenne catalog snapshots for sort pushdown into FlightSQL

Query plans changed: ORDER BY is now pushed into FlightSqlExec SQL queries
instead of a separate SortExec node, and SubqueryAlias wraps TableScan in
the logical plan. Update 5 affected insta snapshots to match new output.
#10434)

* ci: run Build and Test on spiceai-macos; split install jobs by profile

- Move build-test to spiceai-macos runners with spiceio setup + cleanup and log upload, matching lint-rust.
- Split make install variants into two jobs: build-install-dev (make install-dev) and build-install-release (make install, install-cli, install-runtime), each on spiceai-macos with spiceio setup + cleanup.

* ci: gate spiceio log upload and snapshot push on relevant_changes

Addresses Copilot review comments on #10434: avoid noisy/empty artifacts and stale log uploads by gating unconditional always() steps on relevant_changes and a non-empty setup-spiceio pid.

* ci: merge lint-rust into build-test; one job per profile

Since lint and build-test both run on spiceai-macos under the lint profile, consolidate them into a single 'Lint, Build, and Test' job. Workflow now has three Rust jobs, one per profile:

- build-test (lint profile): make lint-rust + build-cli + nextest + testoperator
- build-install-dev (dev profile): make install-dev
- build-install-release (release profile): make install / install-cli / install-runtime

* Revert "ci: merge lint-rust into build-test; one job per profile"

This reverts commit 0ecad35.
* Improve search UDTFs: text_search, vector_search, rrf

text_search:
- Use Tantivy QueryParser as primary path to honor operators (AND/OR/NOT),
  phrases ("exact match"), field-scoped queries (title:foo) and boosts
  (term^2). Falls back to bag-of-words OR clause on parser errors.
- Fix pagination loop bug that issued empty queries after index exhaustion:
  decrement remaining_limit by actual hit count and short-circuit on partial pages.
- Fix multi-index column selection: pick the FTS index containing the requested
  column rather than an arbitrary pop().
- Filter spice.parameter_name passthrough literals in parse_args so RRF named
  args (e.g. rank_weight) don't confuse text_search.
- "Did you mean?" suggestions and column listings on column-not-found errors.
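
The pagination fix in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `fetch_page` and `collect_hits` are hypothetical names, and a plain slice stands in for the full-text index; only the two ideas from the commit message (decrement by the actual hit count, stop on a partial page) come from the source.

```rust
/// Hypothetical page fetcher: returns up to `page_size` hits starting at `offset`.
fn fetch_page(index: &[u32], offset: usize, page_size: usize) -> Vec<u32> {
    index.iter().skip(offset).take(page_size).copied().collect()
}

/// Collect up to `limit` hits by paging through the index.
/// Returns the hits and the number of queries issued.
fn collect_hits(index: &[u32], limit: usize, page_size: usize) -> (Vec<u32>, usize) {
    let mut hits = Vec::new();
    let mut remaining_limit = limit;
    let mut offset = 0;
    let mut queries = 0;

    while remaining_limit > 0 && page_size > 0 {
        let want = remaining_limit.min(page_size);
        let page = fetch_page(index, offset, want);
        queries += 1;

        let got = page.len();
        hits.extend(page);
        // Decrement by the actual hit count, not the requested page size.
        remaining_limit -= got;
        offset += got;

        // A partial (or empty) page means the index is exhausted: stop here
        // instead of issuing another, empty query.
        if got < want {
            break;
        }
    }
    (hits, queries)
}
```

With five indexed items, a limit of 10, and a page size of 4, this issues two queries and stops as soon as the second page comes back short.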

vector_search:
- Column suggestions on missing indexed column via Levenshtein.
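
A "Did you mean?" suggestion based on Levenshtein distance could look roughly like this. The sketch is self-contained and does not reproduce the repository's code: `closest_column` here is a hypothetical helper, and the `max_distance` cutoff is an assumption. Candidates are sorted first so that ties resolve the same way on every run.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();

    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical helper: pick the indexed column closest to `requested`,
/// ignoring anything further away than `max_distance` edits.
fn closest_column<'a>(
    requested: &str,
    indexed: &[&'a str],
    max_distance: usize,
) -> Option<&'a str> {
    let mut candidates: Vec<&'a str> = indexed.to_vec();
    candidates.sort_unstable(); // deterministic tie-breaking
    candidates
        .into_iter()
        .map(|c| (levenshtein(requested, c), c))
        .filter(|(d, _)| *d <= max_distance)
        .min_by_key(|(d, _)| *d)
        .map(|(_, c)| c)
}
```

For example, a request for `embeding` against the columns `title`, `embedding`, and `body` would suggest `embedding`, while a request for `xyz` with a cutoff of 1 would suggest nothing.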

rrf:
- New 'limit' named arg. Propagates to subqueries with a 4x candidate-pool
  multiplier to preserve recall; post-fuse .limit caps final output exactly.
- Fail fast on invalid recency_decay instead of silently dropping it.
- Accept bare identifier for time_column/join_key named args, e.g.
  'time_column => picked_at' in addition to 'time_column => \'picked_at\''.
- Limit propagated through distributed serialization (proto + codec).
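
The interaction between `limit`, the 4x candidate pool, and the post-fuse cap can be sketched with the standard reciprocal rank fusion formula, where each item scores the sum of `1 / (k + rank)` over the rankings it appears in. The function names, the in-memory rankings, and the tie-break by id are assumptions made for this sketch; the actual UDTF operates on DataFusion plans.

```rust
use std::collections::HashMap;

/// Candidate-pool multiplier named in the commit message.
const CANDIDATE_MULTIPLIER: usize = 4;

/// Per-subquery limit derived from the user-facing `limit`.
fn subquery_limit(limit: usize) -> usize {
    limit.saturating_mul(CANDIDATE_MULTIPLIER)
}

/// Fuse ranked id lists with reciprocal rank fusion and cap the output.
fn rrf(rankings: &[Vec<&str>], k: f64, limit: usize) -> Vec<(String, f64)> {
    let pool = subquery_limit(limit);
    let mut scores: HashMap<&str, f64> = HashMap::new();

    for ranking in rankings {
        // Each subquery contributes at most `pool` candidates.
        for (i, id) in ranking.iter().take(pool).enumerate() {
            let rank = (i + 1) as f64; // ranks are 1-based
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }

    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, s)| (id.to_string(), s))
        .collect();
    // Highest score first; the id tie-break only keeps this sketch deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused.truncate(limit); // post-fuse cap on the final output
    fused
}
```

Over-fetching from each subquery gives the fusion step more candidates to rank, while the final `truncate` guarantees the caller never sees more than `limit` rows.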

Tests:
- Unit tests for text_search (parse_args filter, column helper, suggestions,
  levenshtein).
- Unit tests for vector_search (closest_column, levenshtein).
- Unit tests for rrf (limit named arg, identifier named args, serialization).

Verified: 1041 runtime lib tests + 22 search-crate tests pass.

* feat: add distance metric support to vector search UDTFs

* fix: format distance metric mapping for better readability

* fix: update error messages for clarity in SQL query execution failures

* fix: optimize cloning in TextSearchTableFunc and improve error handling in ReciprocalRankFusionArgs

* fix: add backticks to DataFusion in rrf doc comment for clippy

Fixes clippy::doc_markdown lint on test-only doc comment.

* fix: enhance argument parsing in UDTFs to support named parameters and improve error handling

* fix: format code for consistency in TextSearchTableFunc

* fix: enhance named argument handling in text_search UDTF for optional fields

* fix: simplify column extraction logic in TextSearchTableFunc

* fix: update constraints in SearchQueryProvider to maintain PK indices based on include_score

* fix: streamline limit and include_score handling in UDTFs for improved clarity

* fix: address UDTF review comments

- vector_search: fail fast on out-of-range `limit` (match text_search)
- vector_search: reject `distance_metric => 'dot'` at parse time instead
  of silently constructing args that fail with NotImplemented later
- vector_search: sort indexed-column lists so error messages and
  Levenshtein suggestions are deterministic across runs
- rrf: drop synthetic `__spice_rrf_row_id` after the secondary-sort step
  when no user join_key was provided, so it doesn't leak into the
  user-visible output schema of `rrf(...)`
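
Rejecting `distance_metric => 'dot'` at parse time amounts to making the unsupported value unrepresentable after parsing. A minimal sketch, assuming a hypothetical `DistanceMetric` enum and `parse_distance_metric` function (the real types live in the runtime and may differ; case-insensitive matching is also an assumption):

```rust
/// The two metrics the commit messages describe as supported.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum DistanceMetric {
    Cosine,
    L2,
}

/// Hypothetical parser for the `distance_metric` named argument.
/// Unsupported values fail here, at parse time, rather than later
/// during execution.
fn parse_distance_metric(raw: &str) -> Result<DistanceMetric, String> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "cosine" => Ok(DistanceMetric::Cosine),
        "l2" => Ok(DistanceMetric::L2),
        other => Err(format!(
            "unsupported distance_metric '{other}'; supported values: cosine, l2"
        )),
    }
}
```

Because the enum has no `Dot` variant, no later stage can receive arguments that would only fail with a not-implemented error.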

* Lint

* fix: address PR review feedback on search UDTFs

- Replace duplicate Levenshtein implementations in FTS and embeddings
  UDTFs with util::levenshtein::distance.
- Defer all_indexed_fields collection in text_search to error/disambig
  paths so the happy query path skips it.
- Fix include_score named-arg override: keep the positional default as
  None and apply Some(true) only after named-arg merge so
  include_score => false isn't silently dropped.
- Drop the secondary join-key sort in RRF that was added purely for
  test determinism.
- Drop "dot" from distance_metric docs in provider.rs and spice.proto
  to match runtime behavior (only cosine and l2 are supported).
- Move FieldMetadata/BTreeMap imports out of vector_search to_expr
  function body.
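
The `include_score` fix in the list above comes down to when the default is applied. A small sketch of the described ordering, with a hypothetical `resolve_include_score` function; the rule that a named argument takes precedence over a positional one is an assumption of this sketch:

```rust
/// Hypothetical merge of positional and named `include_score` values.
/// Both inputs stay `None` when the user did not supply them, and the
/// default of `true` is applied only after the merge.
fn resolve_include_score(positional: Option<bool>, named: Option<bool>) -> bool {
    named.or(positional).unwrap_or(true)
}
```

If the positional slot were pre-filled with `Some(true)` before merging, a merge that prefers already-set values could never observe `include_score => false`; keeping it `None` until the end avoids that.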

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address copilot review on error kind and distance_metric

- Use DataFusionError::Plan in TextSearchTableFuncArgs::column for
  invalid user arguments to match the rest of the module's planning
  errors and keep the error classification consistent upstream.
- Fail fast in vector_search when distance_metric is set against an
  index-backed provider (S3 vectors, Elasticsearch, chunked) instead
  of silently using the index's configured metric.
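
The fail-fast check for index-backed providers can be pictured as a guard that runs before any search is executed. The `ProviderKind` enum, the function name, and the error text are hypothetical and only illustrate the rule stated above:

```rust
/// Hypothetical classification of vector search providers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProviderKind {
    /// The distance is computed at query time, so a metric can be chosen per query.
    QueryTime,
    /// The metric is fixed by the index configuration.
    IndexBacked,
}

/// Reject an explicit `distance_metric` when the provider cannot honor it.
fn check_distance_metric(provider: ProviderKind, metric: Option<&str>) -> Result<(), String> {
    match (provider, metric) {
        (ProviderKind::IndexBacked, Some(m)) => Err(format!(
            "distance_metric '{m}' cannot be set: the index's configured metric is used"
        )),
        _ => Ok(()),
    }
}
```

Returning an error here surfaces the mismatch to the user instead of quietly running the query with a different metric than the one requested.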

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: apply rustfmt to new include_score test helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update UdtfSource::VectorSearch signature to include distance_metric

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: update snapshot to match DataFusionError::Plan classification

The Execution→Plan change in TextSearchTableFuncArgs::column surfaces
as "Error during planning:" in the HTTP error message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix distance

* Lint

* Lint

* lint: fix cast_possible_truncation in duckdb test helper

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sformers layouts (#10444)

* fix(model2vec): Improve robustness of model loading for sentence-transformers layouts

* Update sha
pull[bot] locked and limited conversation to collaborators Apr 21, 2026
pull[bot] added the ⤵️ pull label Apr 21, 2026
pull[bot] merged commit e9768bf into TheRakeshPurohit:trunk Apr 21, 2026
1 of 11 checks passed

4 participants