[pull] trunk from spiceai:trunk#761
Merged
Merged
Conversation
* feat: Add Azure Cosmos DB (NoSQL) data connector (alpha) Adds a first-pass, read-only Azure Cosmos DB NoSQL / Core SQL API data connector built on the azure_data_cosmos 0.30 SDK. - New data_components::cosmosdb module with CosmosDBClient, CosmosDBTableProvider, and CosmosDBExec. - Cross-partition scan via Cosmos SQL (defaults to SELECT * FROM c). - Arrow schema inferred from a sample of documents; system fields (_rid/_self/_etag/_attachments/_ts) are stripped. - Key-based auth (connection string or account endpoint + key). - New 'cosmosdb' Cargo feature wired through data_components, runtime, and spiced (included in the default spiced distribution). - Makefile lint target + SPICED_DATA_FEATURES entry. - Docs stub and initial RC criteria table row. Alpha quality: no filter/projection push-down, no writes, no change feed, no Microsoft Entra ID auth yet. * refactor: Clean up code formatting and improve readability in Cosmos DB modules * fix: Simplify endpoint assignment in CosmosDBClient implementation * fix(cosmosdb): address review comments for Cosmos DB connector - Drop `.required()` from `database` param; falls back to path segment. - Validate `schema_infer_max_records`; warn-and-default on 0/non-integer. - Use prefixed param names (`cosmosdb_*`) in the auth error message. - Split path parsing into `parse_database_and_container` and reject empty db/container segments with a descriptive message. - Fix `infer_schema` doc comment to match actual empty-sample behavior. - Add unit tests for `parse_database_and_container`, `strip_system_fields`, and `infer_schema`. - Docs: use unprefixed `query` / `schema_infer_max_records` (runtime params). * fix(cosmosdb): address follow-up review comments - parse_database_and_container error now references the user-facing 'cosmosdb_database' parameter name. - execute() wraps Cosmos query errors with Error::QueryFailed so failures include the database/container context (matches fetch_samples). - Add unit tests for decode_batch covering multi-doc decoding, projection, empty input, and missing-field-fills-null behavior. * feat(cosmosdb): bring connector to RC quality Closes the remaining gaps against docs/criteria/connectors/rc.md for the Azure Cosmos DB (NoSQL) connector: - Connection resilience: per-account concurrency semaphore, exponential/fibonacci backoff, Retry-After / x-ms-retry-after-ms handling, permanent-error (401/403/404) detection that latches the connector disabled. New CosmosResilienceConfig in crates/data_components/src/cosmosdb/resilience.rs with 16 unit tests. - inflight_operations metric gauge via MetricsProvider, keyed off the account endpoint so datasets sharing an account aggregate cleanly. - New runtime parameters: max_concurrent_requests, http_max_retries, backoff_method, disable_on_permanent_error. - unsupported_type_action plumbing — all-null inferred columns (DataType::Null) are warn-dropped by default, with user-configurable Error/Ignore/String alternatives. - Integration-test scaffold at crates/runtime/tests/cosmosdb/ with an offline smoke test plus two #[ignore]'d live tests that read COSMOSDB_CONNECTION_STRING (or ENDPOINT+KEY) from the environment. - Cookbook recipe at examples/cosmosdb-connector/ (README + spicepod.yaml + queries.sql). - Docs: status flip to RC, JSON -> Arrow type mapping table, RC exceptions call-out, resilience parameter reference, integration-test run instructions. - Criteria tables: add Cosmos DB rows to alpha.md / beta.md (quality + feature matrix); flip rc.md row to DRI @lukekim. Test coverage: - 32 unit tests in data_components::cosmosdb (resilience, schema, provider incl. 1024-column max-width test, unsupported-type actions). - 12 unit tests in runtime::dataconnector::cosmosdb (shared_semaphore, shared_disabled_flag, path parsing). - 1 offline smoke + 2 ignored live tests in crates/runtime/tests/cosmosdb. * refactor(cosmosdb): simplify post-review nits - Redact CosmosDBCredential Debug impl (was deriving Debug on an enum holding raw account keys / connection strings; would surface secrets in any tracing::debug! or panic). - Drop CosmosDBMetrics wrapper struct; store the Arc<AtomicU64> inflight counter directly on CosmosDB and on CosmosDBMetricsProvider. Removes one level of Arc indirection. - Use azure_core::http::headers::X_MS_RETRY_AFTER_MS constant instead of hand-rolling HeaderName::from_static for the Cosmos-specific retry-ms header. - Swap Ordering::SeqCst -> Acquire/Release on the disabled latch; a single AtomicBool only needs acquire/release pairing, and the change is free on x86 but saves a fence on ARM. - Avoid allocating String endpoint on the streaming scan hot path; pass client.endpoint() (&str) directly to handle_stream_error. - Add eviction-note docstring to COSMOS_CONCURRENCY_LIMITS and COSMOS_DISABLED_FLAGS, matching the git-connector statics. - Trim narrating / RC-pointing doc comments from PARAMETERS, CosmosDBTableProviderConfig, and apply_unsupported_type_action. * perf(cosmosdb): decode JSON docs directly via arrow-json decoder Replace decode_batch's NDJSON round-trip (serialize each Value to bytes, push newline, re-parse through ReaderBuilder::build) with the serde-aware path: ReaderBuilder::build_decoder() + decoder.serialize(&docs) + decoder.flush(). Avoids the per-batch serialize -> parse overhead and matches the pattern used in crates/data_components/src/s3_vectors/ (list_provider.rs, query_provider.rs). Addresses Copilot review feedback on PR #10392. * docs(cosmosdb): clarify retry scope (schema vs streaming) http_max_retries / backoff_method apply to the schema-inference pass at dataset registration; mid-stream pager errors propagate directly because a FeedPager cannot be rewound once rows have been emitted. The permanent-error latch still fires on both paths. Addresses Copilot review feedback on PR #10392. * style(cosmosdb): apply cargo fmt * docs(cosmosdb): align doc comments with RC quality - Update crate-level mod.rs doc to state RC quality and reference docs - Update provider.rs doc comment to RC quality with retry-scope limitation - Clarify inflight_operations metric is per-dataset (shared concurrency budget is enforced via endpoint-keyed COSMOS_CONCURRENCY_LIMITS map) - Update http_max_retries and backoff_method descriptions to state they apply to the schema-inference sampling pass only * docs(cosmosdb): clarify inflight_operations gauge is per-dataset scope * refactor(cosmosdb): replace CosmosDBClient wrapper with build_container_client Each CosmosDBTableProvider is pinned to a single (database, container) pair, so building a ContainerClient once at connector setup and reusing it — rather than holding a CosmosDBClient wrapper and re-deriving the ContainerClient on every scan — simplifies the API. - Delete the CosmosDBClient struct and its re-export - Add build_container_client(credential, database, container) free function that returns (ContainerClient, endpoint) - Store ContainerClient + endpoint directly on CosmosDBTableProvider and CosmosDBExec - Update fetch_samples and execute to use the pre-built ContainerClient (no per-call .container_client() indirection) * docs(cosmosdb): clarify example comment for max_concurrent_requests=8 The comment said 'tighten' the per-account concurrency budget, but the value (8) raises it above the default (4). Update the comment to say 'raise' and explicitly note the default, so the direction matches the demonstrated value. * Lint * Lint * Lint * fix(cosmosdb): normalize endpoint + fix Fibonacci comment indexing - Normalize the account endpoint (trim trailing '/', lowercase) before returning it as the resilience key, so the per-account concurrency budget is shared across datasets configured with benign URL formatting differences (trailing slash / casing). - Fix the Fibonacci backoff test comment: factors follow F(attempt+2) with F(1)=F(2)=1, not F(attempt+1). The asserted sequence 1, 2, 3, 5, 8 already matched the implementation; only the comment was off by one. * Lint * Lint * Lint * fix(cosmosdb): remove unused schema_override; fix schema-pinning docs Spicepod's dataset.columns is semantic-only and has no type field, so the docs/example claims about pinning Cosmos schema via 'columns:' were misleading — schema pinning is not actually supported by this connector today. This commit aligns the code and docs with reality: - Remove the unused schema_override field + with_schema_override from CosmosDBTableProviderConfig; simplify try_new to always infer schema - Rewrite the 'pinned schema' example to a resilience-tuning example (widens schema_infer_max_records to stabilize inference instead) - Update docs/dev/cosmosdb.md type map and 'What's supported' to remove columns:-based pinning claims and point to schema_infer_max_records - Clarify inflight_operations metric description: it counts operations holding a concurrency permit (incremented once per operation, held across retry backoff sleeps) rather than strictly in-flight HTTP requests * Lint * Lint --------- Co-authored-by: Viktor Yershov <viktor@spice.ai>
* feat(datafusion): add flatten_json_properties and json_tree UDTFs (M1) M1 skeleton of the `flatten_json_properties` table function from #10399 — recursively walks a JSON-Schema-shaped document's `properties` tree and emits one row per field with path, parent_path, name, description, type, required, format, enum_values, and metadata columns. Also adds `json_tree`, a schema-agnostic recursive JSON walker modeled on DuckDB/SQLite's function of the same name (cols: key, value, type, atom, id, parent, fullkey, path) so users have a generic alternative when their input isn't JSON-Schema-shaped. Both are experimental and gated behind `flatten-json-properties` and `json-tree` Cargo features (off by default). M1 accepts only literal JSON string arguments; per-row LATERAL invocation with a column reference lands in M2 alongside `$ref` / `allOf` / `oneOf` / `anyOf` resolution, `items.properties`, `additionalProperties` maps, the options struct, cycle detection, and metrics. Refs #10399 * feat(datafusion): complete M2-M4 for flatten_json_properties + json_tree M1 shipped a `properties`-only skeleton behind a feature flag. This commit lands the rest of the milestones for both functions. M2 — Full shape coverage: - `items.properties` — arrays of objects; leaves emit at `array.field`. - `additionalProperties` — typed maps; `type = "map"` and children at `map.child`. - `allOf` / `oneOf` / `anyOf` — fields merged across branches with first- declaration dedupe; `required` is union across branches. - Local `$ref` resolution (JSON Pointer syntax, including `~0` / `~1` escapes) with an active-ref set for cycle detection — cycles yield a `kind=cycle` metric, no stack overflow. - External `$ref` URIs — surfaced as `type = "ref"` rows with the URI captured in `metadata`. Never dereferenced (no network / file IO). - Options surface (named args on both UDTF and planning path): `max_depth`, `max_rows`, `max_bytes`, `dialect`, `include_internal`, `path_style` (`dot` or `json-pointer`). - OpenTelemetry counters: `flatten_json_properties_invocations_total`, `_rows_emitted_total`, `_errors_total{kind}` where kind ∈ {parse, depth_exceeded, row_cap_hit, cycle, input_too_large}. Same set for `json_tree` (with applicable kinds). - Scalar UDF companion registered under the same name, returning `List<Struct<...>>` — gives per-row / LATERAL semantics via `UNNEST(flatten_json_properties(s.body))`. `json_tree` brought to parity: max_depth / max_bytes options, scalar UDF variant, cycle-independent depth cap, metrics. M3 — UX + perf: - Cookbook recipe at `examples/flatten-json-properties/` with a worked spicepod.yaml (dataset → view via UNNEST → DuckDB acceleration → column-level embeddings → vector_search) plus a 3-document sample. - Bench harness at `crates/runtime/benches/flatten_json_properties.rs` with Criterion groups for flat-schema fan-out, nested depth, and a 1k-schema catalog simulation. M4 — Release decision: - Feature flags dropped. Both UDTFs + UDFs register unconditionally on every build. Default behavior change vs M1: `include_internal` is now `false` (spec default), so container rows (`object` / `array` / `map`) are suppressed unless the caller opts in. 32 unit tests covering the full shape matrix, ref resolution, cycle termination, option parsing, limit tripping, path-style variants, scalar UDF per-row dispatch with NULLs, and UDTF plan integration. Refs #10399 * refactor(datafusion/udtf): simplify walker per /simplify review - Replace hand-rolled `resolve_local_ref` with `serde_json::Value::pointer`. - Delete `collect_effective_owned` and the `Cow<'static>` lifetime-laundering dance; everything walked lives under the walker's `&'a Value` root, so `&'a Value` suffices. Removes two identical recursion paths and the deep target clone on every `$ref` resolution. - Drop the dead `depth` parameter from `collect_effective`. - Hoist `property_fields` / `tree_fields` into static `LazyLock<Fields>` handles so the schema isn't reallocated on every call. - Extract `build_tree_arrays` in `json_tree` so `rows_to_batch` and the scalar-UDF struct-array builder share one implementation. - Borrow-not-clone for `HashSet<&str>` required / seen_names in the walker. - Strip WHAT-style comments and task-references from the bench. * fix(datafusion/udtf): address PR review feedback - Update copyright headers to 2024-2026 across the new UDTF files. - Tighten scalar UDF signatures (`flatten_json_properties` / `json_tree`) to accept Utf8 / LargeUtf8 / Utf8View; normalize via `cast` so non-Utf8 string columns no longer panic in `as_string_array`. - Cap combinator / `$ref` expansion in `collect_effective` by threading a ref-depth counter through recursion; prevents pathological chains from bypassing `max_depth` / exhausting the stack. - Clarify `dialect` option semantics in docs: currently only tags invocation metrics; OpenAPI-specific walker behavior is future scope. - `compute_type` no longer treats non-object `properties` / non-object-or- array `items` as `object` / `array`. - Collapse duplicate-row emission in `handle_field`: recurse once on the original `spec` so `walk_schema`'s `seen_names` de-duplicates fields across allOf/oneOf/anyOf / `$ref` branches. - Document single-node-only scan for both UDTFs (cluster mode requires a `UdtfArgs` proto variant + codec, tracked as follow-up). - Fix three branch-local clippy `collapsible_if` errors and annotate `emit_row`'s argument count. * fix(datafusion/udtf): address second round of PR review - `json_tree`: root row now emits `path = NULL` (field is nullable) to match DuckDB / SQLite `json_tree` semantics; children still carry the parent fullkey as `path`. - `json_tree`: array element rows now set `key = idx.to_string()` so consumers can distinguish array siblings (previously NULL). - `flatten_json_properties`: container fields with no walkable children (array of primitives, map of primitives, empty object) are now emitted as leaf rows in `include_internal = false` mode, so the field still appears in output. - Deny `flatten_json_properties` / `json_tree` scalar UDFs for federation pushdown; add them to the existing `deny_list_blocks_spice_builtins` test so regressions are caught. README double-pipe comment was a false positive (the file already uses single `|` with `\|` escapes inside cells). * fix(datafusion/udtf): address round-3 PR review - `json_tree`: add `max_rows` option (default 1,000,000) so bounded `max_bytes` input can't explode into unbounded row counts. Walker records `row_cap_hit` metric when hit and truncates cleanly. - `json_tree`: clarify module-level docs — named options are UDTF-form only; the scalar UDF takes just the JSON argument with default caps. - Both scalar UDFs now truncate deterministically at `i32::MAX` flattened rows (with a `row_cap_hit` metric) instead of returning a query-level `Execution` error on `List<Struct>` offset overflow. Preserves the "never a query-level error" contract. Not addressed: re-raised comments on `DataSourceExec` / cluster-mode `UdtfExec` wrapping — documented as follow-up scope in the prior commit; wrapping requires a new `UdtfArgs` proto variant + codec. * style: cargo fmt line-wrap in flatten_json_properties scalar UDF * fix(datafusion/udtf): bracket-quote JSON-path keys with hyphens SQLite / DuckDB `json_tree` path shorthand only accepts identifier-style keys; anything else must be bracket-quoted so consumers can re-parse the `fullkey`. Previously a key like `has-hyphen` was rendered as `$.a-b`, which isn't a valid shorthand. Now forces bracket-quoting for keys with any non-identifier character, and extends the existing special-character test to cover hyphens. * fix(datafusion/udtf): switch scalar UDFs to LargeList<Struct<...>> Copilot flagged that i32 ListArray offsets could silently truncate results when the flattened row count across a batch exceeds i32::MAX (only a metric signal was emitted). Silent incomplete results risk query correctness. Switching to LargeList (i64 offsets) makes overflow effectively impossible with no behavior change — UNNEST works transparently on both variants. Drops the `max_flattened_rows` truncation path entirely. * style(datafusion/udtf): fix pedantic clippy + fmt errors CI's `make lint-rust` uses `clippy::pedantic + clippy::allow_attributes + clippy::unwrap_used + clippy::expect_used`, which surfaced: - `#[allow(clippy::too_many_arguments)]` → `#[expect(...)]` with reason (lint 1.81+ requires explicit expect for cleared warnings). - `doc_markdown`: backtick-wrap `UInt`, `Bool`, `Utf8`, numeric defaults, `DuckDB`, `SQLite`, `DoS`, `DataFusion`, `OpenAPI` in module docs. - `single_match_else` + `match_like_matches_macro`: rewrite the `serde_json::from_str` match as `let Ok(root) = ... else { ... }`. - `.unwrap()` on `key.chars().next()` in `escape_object_key` → `is_some_and`. - `name.to_string()` on `&String` → `name.clone()`. - `all_rows.len() as i64` → `i64::try_from(...).unwrap_or(i64::MAX)` (walker caps bound the count well under i64::MAX; saturate instead of unwrap since the lint config bans `.unwrap()`/`.expect()`). * fix(datafusion/udtf): type-union ordering + fail-loud on offset overflow - `compute_type`: when `"type"` is an array (JSON-Schema nullable syntax, e.g. `["null", "string"]`), pick the first non-null entry so optional fields classify as their real type instead of `"null"`. Falls back to `"null"` only when it's the sole type. Extended test coverage. - Both scalar UDFs: `i64::try_from(row_count)` now returns a `DataFusionError::Execution` on overflow instead of saturating to `i64::MAX`. Saturation would silently misalign `LargeList` offsets; erroring surfaces the (unreachable-in-practice) condition loudly. * fix(datafusion/udtf): cross-walk cycle detection + batch row cap - `walk_schema` now persists `$ref` insertion in `visited_refs` for the duration of the tree-walk recursion, not just for a single `collect_effective` pass. Fixes a leak where schemas like `{$defs: {Node: {properties: {next: {$ref: #/$defs/Node}}}}, properties: {root: {$ref: #/$defs/Node}}}` could descend past the first resolution boundary. Tightened `local_ref_cycle_terminates` to assert stopping at `root.next`. - Both scalar UDFs now error on `DataFusionError::Execution` if the accumulated cross-batch row count exceeds `SCALAR_BATCH_MAX_ROWS` (10M). Per-document caps bound single rows, but a wide batch could previously reach `number_rows * max_rows` in memory before returning. * fix(udtf): pass projection to MemorySourceConfig in json_properties and json_tree Both UDTFs were ignoring the projection parameter in scan(), causing a schema mismatch error when selecting specific columns (e.g. SELECT path, name, type FROM flatten_json_properties(...)). Pass projection.cloned() to MemorySourceConfig::try_new() so DataFusion can push column pruning down into the scan. * fix: format MemorySourceConfig initialization for better readability * Tests + Lint * fix(tests): improve error handling and assertions in JSON property tests * fix(tests): update projection comments for clarity in JSON schema tests --------- Co-authored-by: Viktor Yershov <viktor@spice.ai>
#10365) * Harden /v1/tools and /v1/nsql against unauthenticated / LLM-driven SQL Addresses threat model items #50 and #51 (docs/threat_models/v2.0.0.md): - Add strict read-only SQL validator (validate_sql_query_read_only) that rejects every DDL/DML/COPY/non-prepared Statement node regardless of per-catalog writability. - Plumb a read_only flag through QueryBuilder/Query and apply the validator at all three plan execution sites (local, Ballista, async). - Default the built-in `sql` tool to read-only; operators may opt in via SqlTool::allow_writes(). LLM tool-use can no longer mutate data through the sql tool. - Run LLM-generated SQL from /v1/nsql under the read-only validator so prompt-injection-driven writes cannot reach writable catalogs. - Gate /v1/tools/* behind a require_auth_configured middleware: when runtime.auth is not set, these routes return 401 rather than invoking tool.call anonymously with attacker-controlled bodies. - Record the new mitigations in the v2.0.0 threat model. * refactor: clarify read-only SQL validation comments and enhance documentation for DDL/DML restrictions * Refactor authentication error response to use JSON format and add SQL tool descriptions for read-only and writable modes * Fix collapsible_if clippy lint in read-only validation path * Reject write-capable extension nodes in read-only validator Spice's planner can represent DDL/DML as LogicalPlan::Extension nodes (DdlExtensionNode, DmlExtensionNode, DistributedCayenne{Insert,Update, Delete,Merge}Node, CayenneMergeNode). The previous read-only validator only matched Ddl/Dml/Copy/Statement and would have let those plan shapes through, defeating the read-only guarantee on /v1/tools/sql and /v1/nsql. - Add Extension arm to validate_sql_query_read_only that denies any node whose UserDefinedLogicalNodeCore::name matches a curated list of write-capable extension names. - Test the deny mechanism with a stub UserDefinedLogicalNode and verify a non-write extension name is still allowed. - Add an integration test that exercises Spice's create_logical_plan wrapper end-to-end (cfg(not(windows))). - Reflect the PREPARE/EXECUTE/DEALLOCATE rejection in the SqlTool read-only description so LLM/tool-selection logic knows the posture. - Replace the PR-contextual 'Unverified in this review' phrasing in the threat model with the durable 'Unverified mitigation'. * Bypass SQL results cache for read-only query paths When ctx.read_only is set (e.g. the /v1/tools/sql read-only tool and the /v1/nsql LLM SQL path), both the SQL-keyed and plan-keyed results-cache lookups are now skipped inside get_plan_or_cached, and the returned RequestCacheManager is forced to CacheDisabled. Previously, a cache hit from a prior writable execution could short-circuit validate_sql_query_read_only, letting a cached result produced by a write-capable plan (e.g. LogicalPlan::Extension nodes like DmlExtension or DistributedCayenneInsert) be served on a read-only surface. Also move WRITE_CAPABLE_EXTENSION_NAMES into the cache crate as the single source of truth, and extend cache_is_enabled_for_plan to reject write-capable LogicalPlan::Extension nodes. Defense-in-depth: even on writable paths, write-capable extension plans are now never cached or populated in the results cache. * fix: flatten write-capable extension check to match guard in cache eligibility Removes one level of nesting as requested in review. --------- Co-authored-by: Viktor Yershov <viktor@spice.ai>
…ion (#10408) * feat(embeddings): multi-vector embeddings with MaxSim + late-interaction Extends column-level embeddings to accept list-of-string columns and produces one embedding vector per list element, stored as List<FixedSizeList<F32, D>> per row. Enables tag/synonym/attribute-style columns to participate in vector search without users having to pre-flatten into a separate table. Search adds two new scoring modes alongside the existing chunked-scalar path: - Single-query × multi-vector (MaxSim / Mean / Sum): per-row score is max/mean/sum over the list element cosines. Default is MaxSim (ColBERT-style). _match returns the element that produced the top cosine. - Multi-query × multi-vector (late-interaction): SUM_{q in Q} MAX_{d in D} cos(q, d). Opt in by passing an array query, e.g. vector_search(tbl, ['foo','bar'], col). Config surface (both ColumnLevelEmbeddingConfig and the legacy ColumnEmbeddingConfig) gains two new fields: - aggregation: max|mean|sum (default max); rejected on scalar columns. - max_elements_per_row: default 32, hard cap 1024; excess elements are dropped with a tracing::warn. Mode is auto-detected from the column Arrow type: Utf8/Utf8View/LargeUtf8 → Scalar, List<Utf8>/LargeList<Utf8> → ListMulti. Chunking is rejected on list columns; multi-vector options on scalar columns error with a clear message. Implementation: - EmbeddingInputMode { Scalar, ListMulti } threaded through EmbeddingTable; resolve_input_mode enforces all validation rules. - decompose_list_of_strings handles ListArray, LargeListArray and all three Utf8 element types; respects max_elements_per_row truncation. - get_vectors_per_list_element (async + sync) embeds via one model call, respecting null-row (→ empty output list) and null/empty-element (→ null vector) semantics. - base_table_has_embedding_column relaxes the offset-column requirement when the source column is list-typed; multi-vector uses element index as the implicit offset. - ChunkedNonIndexVectorGeneration grows a VectorScanMode with three variants; search() dispatches to search_chunked_scalar, search_list_multi, or search_late_interaction. Aggregation uses Expr::AggregateFunction.partition_by windowing. Late-interaction unions per-query subplans tagged with q_idx and does a two-step aggregate (pk, q_idx → MAX; pk → SUM). - VectorSearchTableFuncArgs gains a queries: Vec<String> alongside the existing query: String. parse_query_arg accepts either a Utf8 literal or a make_array(...) expression; to_expr round-trips both forms. Dispatcher errors when multi-query is paired with a scalar column. - Telemetry track_vector_search gets multi_vector and multi_vector_aggregation KeyValue dims when applicable. Accelerator compatibility: multi-vector output Arrow shape is identical to the chunked-scalar path's output, so Arrow, Cayenne, and DuckDB round-trip transparently. Turso serializes nested lists to JSON TEXT today (turso.rs:581-583); SQLite inherits via datafusion-table-providers. The compat matrix is documented at the head of EmbeddingInputMode. A native typed side-table for SQLite/Turso is a future optimization that would benefit chunked-scalar equally. Tests: 41 new unit tests across embeddings::table (25), embeddings::execution_plan (12), and embeddings::udtf::parser_tests (4) cover type detection, input-mode resolution, list decomposition with null/empty/truncation edge cases, multi-vector list-array construction, end-to-end per-element embedding via a mock embedder, and the make_array query parser. * Fix + Lint * Lint * fix(embeddings): address review comments on multi-vector PR - build_multi_vector_list_array validates each embedding's length against vector_length before appending, returning a structured error instead of letting the FixedSizeListBuilder panic on mismatch. - decompose_generic_list hoists value_offsets() outside the loop and resolves the string-array variant once via a three-way downcast, removing per-element dynamic dispatch. Introduces a generic build_rows helper parameterised by a closure. - parse_query_arg rejects make_array(...) with more than VECTOR_SEARCH_MAX_QUERIES (32) elements to prevent late-interaction plans from blowing up on unbounded input. - to_expr derives the single-query literal from args.queries.first() so the single- and multi-query branches stay consistent. * fix(embeddings): round 2 of review fixes — format, missing-column error, grammar - cargo fmt fixup on decompose_generic_list after the string-variant refactor. - try_new now returns Error::EmbeddingColumnNotInSchema when a configured embedding source column is missing from the base schema, instead of silently dropping the column. Misconfiguration fails fast during table construction. - Grammar: "Cannot use it create an embeddings" → "Cannot use it to create embeddings" in the base_table_has_embedding_column warning. * Lint * feat: Implement JSON schema decomposition for HTTP connector * fix(vector): unify aggregation handling for ChunkedScalar and LateInteraction modes * Lint * fix(tests): improve variable naming for clarity in embedding tests * fix(clippy): replace unwrap with expect and Arc::clone in embedding tests --------- Co-authored-by: Viktor Yershov <viktor@spice.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )