[pull] trunk from spiceai:trunk #763
Merged
In distributed Ballista mode, decoded ParquetSource plans on executors lose their per-scan parquet_file_reader_factory during proto round-trip and fall back to runtime_env().object_store(url).
The delta_lake connector did not implement DataConnector::register_object_stores, so executors built a default S3 store with no region configured, surfacing as 'Received redirect without LOCATION' errors against buckets outside us-east-1.
Mirror the databricks connector's implementation: retain the connector Parameters on DeltaLake, and in register_object_stores parse the dataset path as the storage URL, encode the AWS/Azure/GCS subset of params via Parameters::storage_registry_params() into the URL fragment, and call runtime_env.object_store(&listing_url) so SpiceObjectStoreRegistry builds and caches a properly-configured store on each executor.
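As a rough illustration of the "encode params into the URL fragment" step, the sketch below appends a sorted key/value map to a storage URL as a percent-encoded fragment. It is not the connector's code: `with_params_fragment`, `encode`, and the `aws_region` key are hypothetical names chosen for this example, and the real `Parameters::storage_registry_params()` encoding may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: append storage parameters to `base_url` as a
/// `key=value&key=value` fragment. A BTreeMap keeps the key order stable,
/// so the same parameters always produce the same URL (and cache key).
fn with_params_fragment(base_url: &str, params: &BTreeMap<String, String>) -> String {
    if params.is_empty() {
        return base_url.to_string();
    }
    let fragment = params
        .iter()
        .map(|(k, v)| format!("{}={}", encode(k), encode(v)))
        .collect::<Vec<_>>()
        .join("&");
    format!("{base_url}#{fragment}")
}

/// Percent-encode every byte except RFC 3986 unreserved characters.
fn encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char);
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}
```

Carrying the configuration in the URL itself means an executor that only sees the URL can still rebuild an equivalent store, which is the property the fix relies on.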
* Revert "Revert DF-native DML (#10114)"
This reverts commit bd75c67.
* Lint
* Lint
* Address copilot comments
* Add tests
* Fix
* Lint
* Update distributed Cayenne catalog snapshots for sort pushdown into FlightSQL
Query plans changed: ORDER BY is now pushed into FlightSqlExec SQL queries instead of a separate SortExec node, and SubqueryAlias wraps TableScan in the logical plan. Update 5 affected insta snapshots to match new output.
#10434)
* ci: run Build and Test on spiceai-macos; split install jobs by profile
- Move build-test to spiceai-macos runners with spiceio setup + cleanup and log upload, matching lint-rust.
- Split make install variants into two jobs: build-install-dev (make install-dev) and build-install-release (make install, install-cli, install-runtime), each on spiceai-macos with spiceio setup + cleanup.
* ci: gate spiceio log upload and snapshot push on relevant_changes
Addresses Copilot review comments on #10434: avoid noisy/empty artifacts and stale log uploads by gating unconditional always() steps on relevant_changes and a non-empty setup-spiceio pid.
* ci: merge lint-rust into build-test; one job per profile
Since lint and build-test both run on spiceai-macos under the lint profile, consolidate them into a single 'Lint, Build, and Test' job. Workflow now has three Rust jobs, one per profile:
- build-test (lint profile): make lint-rust + build-cli + nextest + testoperator
- build-install-dev (dev profile): make install-dev
- build-install-release (release profile): make install / install-cli / install-runtime
* Revert "ci: merge lint-rust into build-test; one job per profile"
This reverts commit 0ecad35.
* Improve search UDTFs: text_search, vector_search, rrf
text_search:
- Use Tantivy QueryParser as primary path to honor operators (AND/OR/NOT),
phrases ("exact match"), field-scoped queries (title:foo) and boosts
(term^2). Falls back to bag-of-words OR clause on parser errors.
- Fix pagination loop bug that issued empty queries after index exhaustion:
decrement remaining_limit by actual hit count and short-circuit on partial pages.
- Fix multi-index column selection: pick the FTS index containing the requested
column rather than an arbitrary pop().
- Filter spice.parameter_name passthrough literals in parse_args so RRF named
args (e.g. rank_weight) don't confuse text_search.
- "Did you mean?" suggestions and column listings on column-not-found errors.
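The pagination fix described above can be sketched roughly as follows. This is a minimal illustration, not the text_search implementation: `collect_hits` and the `fetch_page(offset, page_size)` callback are hypothetical stand-ins for the real index search call.

```rust
/// Collect up to `limit` hits by repeatedly calling `fetch_page(offset, page_size)`.
/// `fetch_page` is a hypothetical stand-in for a paged index search.
fn collect_hits<T>(
    limit: usize,
    page_size: usize,
    mut fetch_page: impl FnMut(usize, usize) -> Vec<T>,
) -> Vec<T> {
    let mut hits = Vec::new();
    let mut remaining = limit;
    let mut offset = 0;
    while remaining > 0 && page_size > 0 {
        let want = remaining.min(page_size);
        let page = fetch_page(offset, want);
        let got = page.len();
        hits.extend(page);
        // Decrement by what actually came back, not by what was requested.
        remaining = remaining.saturating_sub(got);
        offset += got;
        // A partial (or empty) page means the index is exhausted, so stop
        // instead of issuing further queries that can only return nothing.
        if got < want {
            break;
        }
    }
    // Guard against a page source that returns more than it was asked for.
    hits.truncate(limit);
    hits
}
```

With a five-document index, `limit = 10`, and `page_size = 2`, this issues three page requests and stops on the partial third page rather than continuing to query an exhausted index.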
vector_search:
- Column suggestions on missing indexed column via Levenshtein.
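A "Did you mean?" column suggestion of this kind can be sketched with a plain Levenshtein distance. The code below is an assumption-laden illustration: the commit's tests mention a `closest_column` helper, but the signature, the `max_distance` threshold, and the sort-before-compare step shown here are choices made for this sketch only.

```rust
/// Classic two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical signature: return the indexed column closest to `requested`,
/// if any is within `max_distance` edits. Candidates are sorted first so that
/// ties resolve the same way on every run.
fn closest_column<'a>(
    requested: &str,
    columns: &[&'a str],
    max_distance: usize,
) -> Option<&'a str> {
    let mut sorted = columns.to_vec();
    sorted.sort_unstable();
    sorted
        .into_iter()
        .map(|c| (levenshtein(requested, c), c))
        .filter(|(d, _)| *d <= max_distance)
        .min_by_key(|(d, _)| *d)
        .map(|(_, c)| c)
}
```

For example, a request for `titel` against the columns `body` and `title` would suggest `title`, while an unrelated name returns no suggestion.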
rrf:
- New 'limit' named arg. Propagates to subqueries with a 4x candidate-pool
multiplier to preserve recall; post-fuse .limit caps final output exactly.
- Fail fast on invalid recency_decay instead of silently dropping it.
- Accept bare identifier for time_column/join_key named args, e.g.
'time_column => picked_at' in addition to 'time_column => \'picked_at\''.
- Limit propagated through distributed serialization (proto + codec).
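The interaction between the new `limit` argument, the 4x candidate pool, and the final cap can be sketched as below. This is a simplified in-memory model, not the DataFusion plan the UDTF builds: `rrf`, `subquery_limit`, the smoothing constant `k`, and the id tie-break (added only to keep the sketch deterministic) are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Multiplier from the commit message: each subquery fetches 4x the final limit.
const CANDIDATE_POOL_MULTIPLIER: usize = 4;

/// Hypothetical helper: how many candidates each subquery is asked for.
fn subquery_limit(limit: usize) -> usize {
    limit.saturating_mul(CANDIDATE_POOL_MULTIPLIER)
}

/// Reciprocal rank fusion over already-ranked id lists (best first).
/// Each list contributes 1 / (k + rank) per id, with rank starting at 1.
fn rrf(rankings: &[Vec<&str>], k: f64, limit: usize) -> Vec<(String, f64)> {
    let pool = subquery_limit(limit);
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        // Only the candidate pool from each subquery takes part in fusion.
        for (rank, id) in ranking.iter().take(pool).enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest fused score first; ties broken by id for this sketch only.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    // The post-fuse cap makes the output size exact.
    fused.truncate(limit);
    fused
}
```

Over-fetching per subquery matters because an item ranked just outside the top `limit` in one list can still win after fusion if it ranks highly in another.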
Tests:
- Unit tests for text_search (parse_args filter, column helper, suggestions,
levenshtein).
- Unit tests for vector_search (closest_column, levenshtein).
- Unit tests for rrf (limit named arg, identifier named args, serialization).
Verified: 1041 runtime lib tests + 22 search-crate tests pass.
* feat: add distance metric support to vector search UDTFs
* fix: format distance metric mapping for better readability
* fix: update error messages for clarity in SQL query execution failures
* fix: optimize cloning in TextSearchTableFunc and improve error handling in ReciprocalRankFusionArgs
* fix: add backticks to DataFusion in rrf doc comment for clippy
Fixes clippy::doc_markdown lint on test-only doc comment.
* fix: enhance argument parsing in UDTFs to support named parameters and improve error handling
* fix: format code for consistency in TextSearchTableFunc
* fix: enhance named argument handling in text_search UDTF for optional fields
* fix: simplify column extraction logic in TextSearchTableFunc
* fix: update constraints in SearchQueryProvider to maintain PK indices based on include_score
* fix: streamline limit and include_score handling in UDTFs for improved clarity
* fix: address UDTF review comments
- vector_search: fail fast on out-of-range `limit` (match text_search)
- vector_search: reject `distance_metric => 'dot'` at parse time instead
of silently constructing args that fail with NotImplemented later
- vector_search: sort indexed-column lists so error messages and
Levenshtein suggestions are deterministic across runs
- rrf: drop synthetic `__spice_rrf_row_id` after the secondary-sort step
when no user join_key was provided, so it doesn't leak into the
user-visible output schema of `rrf(...)`
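Rejecting an unsupported metric at parse time, rather than constructing arguments that fail later, might look like the sketch below. `DistanceMetric` and `parse_distance_metric` are hypothetical names, and the case-insensitive matching is an assumption; the only detail taken from the commits is that `cosine` and `l2` are supported and `dot` is not.

```rust
/// Metrics the commits describe as supported at runtime.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DistanceMetric {
    Cosine,
    L2,
}

/// Hypothetical parser: validate `distance_metric` while parsing arguments so
/// an unsupported value such as "dot" is reported before any plan is built.
fn parse_distance_metric(raw: &str) -> Result<DistanceMetric, String> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "cosine" => Ok(DistanceMetric::Cosine),
        "l2" => Ok(DistanceMetric::L2),
        other => Err(format!(
            "unsupported distance_metric '{other}'; supported values: cosine, l2"
        )),
    }
}
```

Listing the supported values in the error gives the caller something actionable, in the same spirit as the column suggestions.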
* Lint
* fix: address PR review feedback on search UDTFs
- Replace duplicate Levenshtein implementations in FTS and embeddings
UDTFs with util::levenshtein::distance.
- Defer all_indexed_fields collection in text_search to error/disambig
paths so the happy query path skips it.
- Fix include_score named-arg override: keep the positional default as
None and apply Some(true) only after named-arg merge so
include_score => false isn't silently dropped.
- Drop the secondary join-key sort in RRF that was added purely for
test determinism.
- Drop "dot" from distance_metric docs in provider.rs and spice.proto
to match runtime behavior (only cosine and l2 are supported).
- Move FieldMetadata/BTreeMap imports out of vector_search to_expr
function body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
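The include_score fix above comes down to when the default is applied. A minimal sketch, assuming the named argument takes precedence over the positional one and that the default is `true` (both inferred from the commit text, not confirmed against the code); `resolve_include_score` is a hypothetical name:

```rust
/// Merge positional and named `include_score` values.
/// Both inputs stay `None` when the caller did not supply them, and the
/// default is applied only after the merge. Filling in `Some(true)` before
/// the merge would make an explicit `include_score => false` indistinguishable
/// from "not provided" and risk dropping it.
fn resolve_include_score(positional: Option<bool>, named: Option<bool>) -> bool {
    named.or(positional).unwrap_or(true)
}
```

Under these assumptions, `resolve_include_score(None, Some(false))` yields `false`, and the default only applies when neither form was given.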
* fix: address copilot review on error kind and distance_metric
- Use DataFusionError::Plan in TextSearchTableFuncArgs::column for
invalid user arguments to match the rest of the module's planning
errors and keep the error classification consistent upstream.
- Fail fast in vector_search when distance_metric is set against an
index-backed provider (S3 vectors, Elasticsearch, chunked) instead
of silently using the index's configured metric.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: apply rustfmt to new include_score test helper
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: update UdtfSource::VectorSearch signature to include distance_metric
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: update snapshot to match DataFusionError::Plan classification
The Execution→Plan change in TextSearchTableFuncArgs::column surfaces
as "Error during planning:" in the HTTP error message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix distance
* Lint
* Lint
* lint: fix cast_possible_truncation in duckdb test helper
---------
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sformers layouts (#10444)
* fix(model2vec): Improve robustness of model loading for sentence-transformers layouts
* Update sha
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)