Skip to content

[pull] trunk from spiceai:trunk#761

Merged
pull[bot] merged 4 commits into
TheRakeshPurohit:trunkfrom
spiceai:trunk
Apr 21, 2026
Merged

[pull] trunk from spiceai:trunk#761
pull[bot] merged 4 commits into
TheRakeshPurohit:trunkfrom
spiceai:trunk

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Apr 21, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

lukekim and others added 4 commits April 21, 2026 04:10
* feat: Add Azure Cosmos DB (NoSQL) data connector (alpha)

Adds a first-pass, read-only Azure Cosmos DB NoSQL / Core SQL API
data connector built on the azure_data_cosmos 0.30 SDK.

- New data_components::cosmosdb module with CosmosDBClient,
  CosmosDBTableProvider, and CosmosDBExec.
- Cross-partition scan via Cosmos SQL (defaults to SELECT * FROM c).
- Arrow schema inferred from a sample of documents; system fields
  (_rid/_self/_etag/_attachments/_ts) are stripped.
- Key-based auth (connection string or account endpoint + key).
- New 'cosmosdb' Cargo feature wired through data_components, runtime,
  and spiced (included in the default spiced distribution).
- Makefile lint target + SPICED_DATA_FEATURES entry.
- Docs stub and initial RC criteria table row.

Alpha quality: no filter/projection push-down, no writes, no change
feed, no Microsoft Entra ID auth yet.

* refactor: Clean up code formatting and improve readability in Cosmos DB modules

* fix: Simplify endpoint assignment in CosmosDBClient implementation

* fix(cosmosdb): address review comments for Cosmos DB connector

- Drop `.required()` from `database` param; falls back to path segment.
- Validate `schema_infer_max_records`; warn-and-default on 0/non-integer.
- Use prefixed param names (`cosmosdb_*`) in the auth error message.
- Split path parsing into `parse_database_and_container` and reject
  empty db/container segments with a descriptive message.
- Fix `infer_schema` doc comment to match actual empty-sample behavior.
- Add unit tests for `parse_database_and_container`, `strip_system_fields`,
  and `infer_schema`.
- Docs: use unprefixed `query` / `schema_infer_max_records` (runtime params).

* fix(cosmosdb): address follow-up review comments

- parse_database_and_container error now references the user-facing
  'cosmosdb_database' parameter name.
- execute() wraps Cosmos query errors with Error::QueryFailed so failures
  include the database/container context (matches fetch_samples).
- Add unit tests for decode_batch covering multi-doc decoding, projection,
  empty input, and missing-field-fills-null behavior.

* feat(cosmosdb): bring connector to RC quality

Closes the remaining gaps against docs/criteria/connectors/rc.md for the
Azure Cosmos DB (NoSQL) connector:

- Connection resilience: per-account concurrency semaphore,
  exponential/fibonacci backoff, Retry-After / x-ms-retry-after-ms
  handling, permanent-error (401/403/404) detection that latches the
  connector disabled. New CosmosResilienceConfig in
  crates/data_components/src/cosmosdb/resilience.rs with 16 unit tests.
- inflight_operations metric gauge via MetricsProvider, keyed off the
  account endpoint so datasets sharing an account aggregate cleanly.
- New runtime parameters: max_concurrent_requests, http_max_retries,
  backoff_method, disable_on_permanent_error.
- unsupported_type_action plumbing — all-null inferred columns
  (DataType::Null) are warn-dropped by default, with user-configurable
  Error/Ignore/String alternatives.
- Integration-test scaffold at crates/runtime/tests/cosmosdb/ with an
  offline smoke test plus two #[ignore]'d live tests that read
  COSMOSDB_CONNECTION_STRING (or ENDPOINT+KEY) from the environment.
- Cookbook recipe at examples/cosmosdb-connector/ (README +
  spicepod.yaml + queries.sql).
- Docs: status flip to RC, JSON -> Arrow type mapping table, RC
  exceptions call-out, resilience parameter reference, integration-test
  run instructions.
- Criteria tables: add Cosmos DB rows to alpha.md / beta.md (quality +
  feature matrix); flip rc.md row to DRI @lukekim.

Test coverage:
- 32 unit tests in data_components::cosmosdb (resilience, schema,
  provider incl. 1024-column max-width test, unsupported-type actions).
- 12 unit tests in runtime::dataconnector::cosmosdb
  (shared_semaphore, shared_disabled_flag, path parsing).
- 1 offline smoke + 2 ignored live tests in
  crates/runtime/tests/cosmosdb.

* refactor(cosmosdb): simplify post-review nits

- Redact CosmosDBCredential Debug impl (was deriving Debug on an enum
  holding raw account keys / connection strings; would surface secrets
  in any tracing::debug! or panic).
- Drop CosmosDBMetrics wrapper struct; store the Arc<AtomicU64> inflight
  counter directly on CosmosDB and on CosmosDBMetricsProvider. Removes
  one level of Arc indirection.
- Use azure_core::http::headers::X_MS_RETRY_AFTER_MS constant instead of
  hand-rolling HeaderName::from_static for the Cosmos-specific retry-ms
  header.
- Swap Ordering::SeqCst -> Acquire/Release on the disabled latch; a
  single AtomicBool only needs acquire/release pairing, and the change
  is free on x86 but saves a fence on ARM.
- Avoid allocating String endpoint on the streaming scan hot path;
  pass client.endpoint() (&str) directly to handle_stream_error.
- Add eviction-note docstring to COSMOS_CONCURRENCY_LIMITS and
  COSMOS_DISABLED_FLAGS, matching the git-connector statics.
- Trim narrating / RC-pointing doc comments from PARAMETERS,
  CosmosDBTableProviderConfig, and apply_unsupported_type_action.

* perf(cosmosdb): decode JSON docs directly via arrow-json decoder

Replace decode_batch's NDJSON round-trip (serialize each Value to bytes,
push newline, re-parse through ReaderBuilder::build) with the serde-aware
path: ReaderBuilder::build_decoder() + decoder.serialize(&docs) +
decoder.flush(). Avoids the per-batch serialize -> parse overhead and
matches the pattern used in crates/data_components/src/s3_vectors/
(list_provider.rs, query_provider.rs).

Addresses Copilot review feedback on PR #10392.

* docs(cosmosdb): clarify retry scope (schema vs streaming)

http_max_retries / backoff_method apply to the schema-inference pass at
dataset registration; mid-stream pager errors propagate directly because
a FeedPager cannot be rewound once rows have been emitted. The
permanent-error latch still fires on both paths.

Addresses Copilot review feedback on PR #10392.

* style(cosmosdb): apply cargo fmt

* docs(cosmosdb): align doc comments with RC quality

- Update crate-level mod.rs doc to state RC quality and reference docs
- Update provider.rs doc comment to RC quality with retry-scope limitation
- Clarify inflight_operations metric is per-dataset (shared concurrency
  budget is enforced via endpoint-keyed COSMOS_CONCURRENCY_LIMITS map)
- Update http_max_retries and backoff_method descriptions to state they
  apply to the schema-inference sampling pass only

* docs(cosmosdb): clarify inflight_operations gauge is per-dataset scope

* refactor(cosmosdb): replace CosmosDBClient wrapper with build_container_client

Each CosmosDBTableProvider is pinned to a single (database, container)
pair, so building a ContainerClient once at connector setup and reusing
it — rather than holding a CosmosDBClient wrapper and re-deriving the
ContainerClient on every scan — simplifies the API.

- Delete the CosmosDBClient struct and its re-export
- Add build_container_client(credential, database, container) free
  function that returns (ContainerClient, endpoint)
- Store ContainerClient + endpoint directly on CosmosDBTableProvider
  and CosmosDBExec
- Update fetch_samples and execute to use the pre-built ContainerClient
  (no per-call .container_client() indirection)

* docs(cosmosdb): clarify example comment for max_concurrent_requests=8

The comment said 'tighten' the per-account concurrency budget, but the
value (8) raises it above the default (4). Update the comment to say
'raise' and explicitly note the default, so the direction matches the
demonstrated value.

* Lint

* Lint

* Lint

* fix(cosmosdb): normalize endpoint + fix Fibonacci comment indexing

- Normalize the account endpoint (trim trailing '/', lowercase) before
  returning it as the resilience key, so the per-account concurrency
  budget is shared across datasets configured with benign URL
  formatting differences (trailing slash / casing).
- Fix the Fibonacci backoff test comment: factors follow F(attempt+2)
  with F(1)=F(2)=1, not F(attempt+1). The asserted sequence 1, 2, 3, 5, 8
  already matched the implementation; only the comment was off by one.

* Lint

* Lint

* Lint

* fix(cosmosdb): remove unused schema_override; fix schema-pinning docs

Spicepod's dataset.columns is semantic-only and has no type field, so
the docs/example claims about pinning Cosmos schema via 'columns:' were
misleading — schema pinning is not actually supported by this connector
today. This commit aligns the code and docs with reality:

- Remove the unused schema_override field + with_schema_override from
  CosmosDBTableProviderConfig; simplify try_new to always infer schema
- Rewrite the 'pinned schema' example to a resilience-tuning example
  (widens schema_infer_max_records to stabilize inference instead)
- Update docs/dev/cosmosdb.md type map and 'What's supported' to remove
  columns:-based pinning claims and point to schema_infer_max_records
- Clarify inflight_operations metric description: it counts operations
  holding a concurrency permit (incremented once per operation, held
  across retry backoff sleeps) rather than strictly in-flight HTTP
  requests

* Lint

* Lint

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
* feat(datafusion): add flatten_json_properties and json_tree UDTFs (M1)

M1 skeleton of the `flatten_json_properties` table function from #10399 —
recursively walks a JSON-Schema-shaped document's `properties` tree and
emits one row per field with path, parent_path, name, description, type,
required, format, enum_values, and metadata columns.

Also adds `json_tree`, a schema-agnostic recursive JSON walker modeled on
DuckDB/SQLite's function of the same name (cols: key, value, type, atom,
id, parent, fullkey, path) so users have a generic alternative when their
input isn't JSON-Schema-shaped.

Both are experimental and gated behind `flatten-json-properties` and
`json-tree` Cargo features (off by default). M1 accepts only literal JSON
string arguments; per-row LATERAL invocation with a column reference lands
in M2 alongside `$ref` / `allOf` / `oneOf` / `anyOf` resolution,
`items.properties`, `additionalProperties` maps, the options struct,
cycle detection, and metrics.

Refs #10399

* feat(datafusion): complete M2-M4 for flatten_json_properties + json_tree

M1 shipped a `properties`-only skeleton behind a feature flag. This commit
lands the rest of the milestones for both functions.

M2 — Full shape coverage:
- `items.properties` — arrays of objects; leaves emit at `array.field`.
- `additionalProperties` — typed maps; `type = "map"` and children at
  `map.child`.
- `allOf` / `oneOf` / `anyOf` — fields merged across branches with first-
  declaration dedupe; `required` is union across branches.
- Local `$ref` resolution (JSON Pointer syntax, including `~0` / `~1`
  escapes) with an active-ref set for cycle detection — cycles yield a
  `kind=cycle` metric, no stack overflow.
- External `$ref` URIs — surfaced as `type = "ref"` rows with the URI
  captured in `metadata`. Never dereferenced (no network / file IO).
- Options surface (named args on both UDTF and planning path):
  `max_depth`, `max_rows`, `max_bytes`, `dialect`, `include_internal`,
  `path_style` (`dot` or `json-pointer`).
- OpenTelemetry counters: `flatten_json_properties_invocations_total`,
  `_rows_emitted_total`, `_errors_total{kind}` where kind ∈ {parse,
  depth_exceeded, row_cap_hit, cycle, input_too_large}. Same set for
  `json_tree` (with applicable kinds).
- Scalar UDF companion registered under the same name, returning
  `List<Struct<...>>` — gives per-row / LATERAL semantics via
  `UNNEST(flatten_json_properties(s.body))`.

`json_tree` brought to parity: max_depth / max_bytes options, scalar UDF
variant, cycle-independent depth cap, metrics.

M3 — UX + perf:
- Cookbook recipe at `examples/flatten-json-properties/` with a worked
  spicepod.yaml (dataset → view via UNNEST → DuckDB acceleration →
  column-level embeddings → vector_search) plus a 3-document sample.
- Bench harness at `crates/runtime/benches/flatten_json_properties.rs`
  with Criterion groups for flat-schema fan-out, nested depth, and a
  1k-schema catalog simulation.

M4 — Release decision:
- Feature flags dropped. Both UDTFs + UDFs register unconditionally on
  every build.

Default behavior change vs M1: `include_internal` is now `false` (spec
default), so container rows (`object` / `array` / `map`) are suppressed
unless the caller opts in.

32 unit tests covering the full shape matrix, ref resolution, cycle
termination, option parsing, limit tripping, path-style variants, scalar
UDF per-row dispatch with NULLs, and UDTF plan integration.

Refs #10399

* refactor(datafusion/udtf): simplify walker per /simplify review

- Replace hand-rolled `resolve_local_ref` with `serde_json::Value::pointer`.
- Delete `collect_effective_owned` and the `Cow<'static>` lifetime-laundering
  dance; everything walked lives under the walker's `&'a Value` root, so
  `&'a Value` suffices. Removes two identical recursion paths and the deep
  target clone on every `$ref` resolution.
- Drop the dead `depth` parameter from `collect_effective`.
- Hoist `property_fields` / `tree_fields` into static `LazyLock<Fields>`
  handles so the schema isn't reallocated on every call.
- Extract `build_tree_arrays` in `json_tree` so `rows_to_batch` and the
  scalar-UDF struct-array builder share one implementation.
- Borrow-not-clone for `HashSet<&str>` required / seen_names in the walker.
- Strip WHAT-style comments and task-references from the bench.

* fix(datafusion/udtf): address PR review feedback

- Update copyright headers to 2024-2026 across the new UDTF files.
- Tighten scalar UDF signatures (`flatten_json_properties` / `json_tree`)
  to accept Utf8 / LargeUtf8 / Utf8View; normalize via `cast` so non-Utf8
  string columns no longer panic in `as_string_array`.
- Cap combinator / `$ref` expansion in `collect_effective` by threading a
  ref-depth counter through recursion; prevents pathological chains from
  bypassing `max_depth` / exhausting the stack.
- Clarify `dialect` option semantics in docs: currently only tags
  invocation metrics; OpenAPI-specific walker behavior is future scope.
- `compute_type` no longer treats non-object `properties` / non-object-or-
  array `items` as `object` / `array`.
- Collapse duplicate-row emission in `handle_field`: recurse once on the
  original `spec` so `walk_schema`'s `seen_names` de-duplicates fields
  across allOf/oneOf/anyOf / `$ref` branches.
- Document single-node-only scan for both UDTFs (cluster mode requires a
  `UdtfArgs` proto variant + codec, tracked as follow-up).
- Fix three branch-local clippy `collapsible_if` errors and annotate
  `emit_row`'s argument count.

* fix(datafusion/udtf): address second round of PR review

- `json_tree`: root row now emits `path = NULL` (field is nullable) to
  match DuckDB / SQLite `json_tree` semantics; children still carry the
  parent fullkey as `path`.
- `json_tree`: array element rows now set `key = idx.to_string()` so
  consumers can distinguish array siblings (previously NULL).
- `flatten_json_properties`: container fields with no walkable children
  (array of primitives, map of primitives, empty object) are now emitted
  as leaf rows in `include_internal = false` mode, so the field still
  appears in output.
- Deny `flatten_json_properties` / `json_tree` scalar UDFs for federation
  pushdown; add them to the existing `deny_list_blocks_spice_builtins`
  test so regressions are caught.

README double-pipe comment was a false positive (the file already uses
single `|` with `\|` escapes inside cells).

* fix(datafusion/udtf): address round-3 PR review

- `json_tree`: add `max_rows` option (default 1,000,000) so bounded
  `max_bytes` input can't explode into unbounded row counts. Walker
  records `row_cap_hit` metric when hit and truncates cleanly.
- `json_tree`: clarify module-level docs — named options are UDTF-form
  only; the scalar UDF takes just the JSON argument with default caps.
- Both scalar UDFs now truncate deterministically at `i32::MAX` flattened
  rows (with a `row_cap_hit` metric) instead of returning a query-level
  `Execution` error on `List<Struct>` offset overflow. Preserves the
  "never a query-level error" contract.

Not addressed: re-raised comments on `DataSourceExec` / cluster-mode
`UdtfExec` wrapping — documented as follow-up scope in the prior commit;
wrapping requires a new `UdtfArgs` proto variant + codec.

* style: cargo fmt line-wrap in flatten_json_properties scalar UDF

* fix(datafusion/udtf): bracket-quote JSON-path keys with hyphens

SQLite / DuckDB `json_tree` path shorthand only accepts identifier-style
keys; anything else must be bracket-quoted so consumers can re-parse the
`fullkey`. Previously a key like `has-hyphen` was rendered as `$.a-b`,
which isn't a valid shorthand. Now forces bracket-quoting for keys with
any non-identifier character, and extends the existing special-character
test to cover hyphens.

* fix(datafusion/udtf): switch scalar UDFs to LargeList<Struct<...>>

Copilot flagged that i32 ListArray offsets could silently truncate
results when the flattened row count across a batch exceeds i32::MAX
(only a metric signal was emitted). Silent incomplete results risk
query correctness.

Switching to LargeList (i64 offsets) makes overflow effectively
impossible with no behavior change — UNNEST works transparently on both
variants. Drops the `max_flattened_rows` truncation path entirely.

* style(datafusion/udtf): fix pedantic clippy + fmt errors

CI's `make lint-rust` uses `clippy::pedantic + clippy::allow_attributes +
clippy::unwrap_used + clippy::expect_used`, which surfaced:

- `#[allow(clippy::too_many_arguments)]` → `#[expect(...)]` with reason
  (lint 1.81+ requires explicit expect for cleared warnings).
- `doc_markdown`: backtick-wrap `UInt`, `Bool`, `Utf8`, numeric defaults,
  `DuckDB`, `SQLite`, `DoS`, `DataFusion`, `OpenAPI` in module docs.
- `single_match_else` + `match_like_matches_macro`: rewrite the
  `serde_json::from_str` match as `let Ok(root) = ... else { ... }`.
- `.unwrap()` on `key.chars().next()` in `escape_object_key` → `is_some_and`.
- `name.to_string()` on `&String` → `name.clone()`.
- `all_rows.len() as i64` → `i64::try_from(...).unwrap_or(i64::MAX)`
  (walker caps bound the count well under i64::MAX; saturate instead of
  unwrap since the lint config bans `.unwrap()`/`.expect()`).

* fix(datafusion/udtf): type-union ordering + fail-loud on offset overflow

- `compute_type`: when `"type"` is an array (JSON-Schema nullable syntax,
  e.g. `["null", "string"]`), pick the first non-null entry so optional
  fields classify as their real type instead of `"null"`. Falls back to
  `"null"` only when it's the sole type. Extended test coverage.

- Both scalar UDFs: `i64::try_from(row_count)` now returns a
  `DataFusionError::Execution` on overflow instead of saturating to
  `i64::MAX`. Saturation would silently misalign `LargeList` offsets;
  erroring surfaces the (unreachable-in-practice) condition loudly.

* fix(datafusion/udtf): cross-walk cycle detection + batch row cap

- `walk_schema` now persists `$ref` insertion in `visited_refs` for the
  duration of the tree-walk recursion, not just for a single
  `collect_effective` pass. Fixes a leak where schemas like `{$defs:
  {Node: {properties: {next: {$ref: #/$defs/Node}}}}, properties: {root:
  {$ref: #/$defs/Node}}}` could descend past the first resolution
  boundary. Tightened `local_ref_cycle_terminates` to assert stopping at
  `root.next`.

- Both scalar UDFs now error on `DataFusionError::Execution` if the
  accumulated cross-batch row count exceeds `SCALAR_BATCH_MAX_ROWS`
  (10M). Per-document caps bound single rows, but a wide batch could
  previously reach `number_rows * max_rows` in memory before returning.

* fix(udtf): pass projection to MemorySourceConfig in json_properties and json_tree

Both UDTFs were ignoring the projection parameter in scan(), causing a
schema mismatch error when selecting specific columns (e.g. SELECT path,
name, type FROM flatten_json_properties(...)). Pass projection.cloned()
to MemorySourceConfig::try_new() so DataFusion can push column pruning
down into the scan.

* fix: format MemorySourceConfig initialization for better readability

* Tests + Lint

* fix(tests): improve error handling and assertions in JSON property tests

* fix(tests): update projection comments for clarity in JSON schema tests

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
#10365)

* Harden /v1/tools and /v1/nsql against unauthenticated / LLM-driven SQL

Addresses threat model items #50 and #51 (docs/threat_models/v2.0.0.md):

- Add strict read-only SQL validator (validate_sql_query_read_only) that
  rejects every DDL/DML/COPY/non-prepared Statement node regardless of
  per-catalog writability.
- Plumb a read_only flag through QueryBuilder/Query and apply the
  validator at all three plan execution sites (local, Ballista, async).
- Default the built-in `sql` tool to read-only; operators may opt in via
  SqlTool::allow_writes(). LLM tool-use can no longer mutate data through
  the sql tool.
- Run LLM-generated SQL from /v1/nsql under the read-only validator so
  prompt-injection-driven writes cannot reach writable catalogs.
- Gate /v1/tools/* behind a require_auth_configured middleware: when
  runtime.auth is not set, these routes return 401 rather than invoking
  tool.call anonymously with attacker-controlled bodies.
- Record the new mitigations in the v2.0.0 threat model.

* refactor: clarify read-only SQL validation comments and enhance documentation for DDL/DML restrictions

* Refactor authentication error response to use JSON format and add SQL tool descriptions for read-only and writable modes

* Fix collapsible_if clippy lint in read-only validation path

* Reject write-capable extension nodes in read-only validator

Spice's planner can represent DDL/DML as LogicalPlan::Extension nodes
(DdlExtensionNode, DmlExtensionNode, DistributedCayenne{Insert,Update,
Delete,Merge}Node, CayenneMergeNode). The previous read-only validator
only matched Ddl/Dml/Copy/Statement and would have let those plan shapes
through, defeating the read-only guarantee on /v1/tools/sql and /v1/nsql.

- Add Extension arm to validate_sql_query_read_only that denies any node
  whose UserDefinedLogicalNodeCore::name matches a curated list of
  write-capable extension names.
- Test the deny mechanism with a stub UserDefinedLogicalNode and verify
  a non-write extension name is still allowed.
- Add an integration test that exercises Spice's create_logical_plan
  wrapper end-to-end (cfg(not(windows))).
- Reflect the PREPARE/EXECUTE/DEALLOCATE rejection in the SqlTool
  read-only description so LLM/tool-selection logic knows the posture.
- Replace the PR-contextual 'Unverified in this review' phrasing in the
  threat model with the durable 'Unverified mitigation'.

* Bypass SQL results cache for read-only query paths

When ctx.read_only is set (e.g. the /v1/tools/sql read-only tool and the
/v1/nsql LLM SQL path), both the SQL-keyed and plan-keyed results-cache
lookups are now skipped inside get_plan_or_cached, and the returned
RequestCacheManager is forced to CacheDisabled. Previously, a cache hit
from a prior writable execution could short-circuit
validate_sql_query_read_only, letting a cached result produced by a
write-capable plan (e.g. LogicalPlan::Extension nodes like DmlExtension
or DistributedCayenneInsert) be served on a read-only surface.

Also move WRITE_CAPABLE_EXTENSION_NAMES into the cache crate as the
single source of truth, and extend cache_is_enabled_for_plan to reject
write-capable LogicalPlan::Extension nodes. Defense-in-depth: even on
writable paths, write-capable extension plans are now never cached or
populated in the results cache.

* fix: flatten write-capable extension check to match guard in cache eligibility

Removes one level of nesting as requested in review.

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
…ion (#10408)

* feat(embeddings): multi-vector embeddings with MaxSim + late-interaction

Extends column-level embeddings to accept list-of-string columns and
produces one embedding vector per list element, stored as
List<FixedSizeList<F32, D>> per row. Enables tag/synonym/attribute-style
columns to participate in vector search without users having to
pre-flatten into a separate table.

Search adds two new scoring modes alongside the existing
chunked-scalar path:

- Single-query × multi-vector (MaxSim / Mean / Sum): per-row score is
  max/mean/sum over the list element cosines. Default is MaxSim
  (ColBERT-style). _match returns the element that produced the top
  cosine.
- Multi-query × multi-vector (late-interaction): SUM_{q in Q} MAX_{d in D}
  cos(q, d). Opt in by passing an array query, e.g.
  vector_search(tbl, ['foo','bar'], col).

Config surface (both ColumnLevelEmbeddingConfig and the legacy
ColumnEmbeddingConfig) gains two new fields:

- aggregation: max|mean|sum (default max); rejected on scalar columns.
- max_elements_per_row: default 32, hard cap 1024; excess elements are
  dropped with a tracing::warn.

Mode is auto-detected from the column Arrow type:
Utf8/Utf8View/LargeUtf8 → Scalar, List<Utf8>/LargeList<Utf8> → ListMulti.
Chunking is rejected on list columns; multi-vector options on scalar
columns error with a clear message.

Implementation:

- EmbeddingInputMode { Scalar, ListMulti } threaded through
  EmbeddingTable; resolve_input_mode enforces all validation rules.
- decompose_list_of_strings handles ListArray, LargeListArray and all
  three Utf8 element types; respects max_elements_per_row truncation.
- get_vectors_per_list_element (async + sync) embeds via one model
  call, respecting null-row (→ empty output list) and null/empty-element
  (→ null vector) semantics.
- base_table_has_embedding_column relaxes the offset-column requirement
  when the source column is list-typed; multi-vector uses element index
  as the implicit offset.
- ChunkedNonIndexVectorGeneration grows a VectorScanMode with three
  variants; search() dispatches to search_chunked_scalar,
  search_list_multi, or search_late_interaction. Aggregation uses
  Expr::AggregateFunction.partition_by windowing. Late-interaction
  unions per-query subplans tagged with q_idx and does a two-step
  aggregate (pk, q_idx → MAX; pk → SUM).
- VectorSearchTableFuncArgs gains a queries: Vec<String> alongside the
  existing query: String. parse_query_arg accepts either a Utf8 literal
  or a make_array(...) expression; to_expr round-trips both forms.
  Dispatcher errors when multi-query is paired with a scalar column.
- Telemetry track_vector_search gets multi_vector and
  multi_vector_aggregation KeyValue dims when applicable.

Accelerator compatibility: multi-vector output Arrow shape is identical
to the chunked-scalar path's output, so Arrow, Cayenne, and DuckDB
round-trip transparently. Turso serializes nested lists to JSON TEXT
today (turso.rs:581-583); SQLite inherits via datafusion-table-providers.
The compat matrix is documented at the head of EmbeddingInputMode. A
native typed side-table for SQLite/Turso is a future optimization that
would benefit chunked-scalar equally.

Tests: 41 new unit tests across embeddings::table (25),
embeddings::execution_plan (12), and embeddings::udtf::parser_tests (4)
cover type detection, input-mode resolution, list decomposition with
null/empty/truncation edge cases, multi-vector list-array construction,
end-to-end per-element embedding via a mock embedder, and the
make_array query parser.

* Fix + Lint

* Lint

* fix(embeddings): address review comments on multi-vector PR

- build_multi_vector_list_array validates each embedding's length
  against vector_length before appending, returning a structured error
  instead of letting the FixedSizeListBuilder panic on mismatch.
- decompose_generic_list hoists value_offsets() outside the loop and
  resolves the string-array variant once via a three-way downcast,
  removing per-element dynamic dispatch. Introduces a generic build_rows
  helper parameterised by a closure.
- parse_query_arg rejects make_array(...) with more than
  VECTOR_SEARCH_MAX_QUERIES (32) elements to prevent late-interaction
  plans from blowing up on unbounded input.
- to_expr derives the single-query literal from args.queries.first() so
  the single- and multi-query branches stay consistent.

* fix(embeddings): round 2 of review fixes — format, missing-column error, grammar

- cargo fmt fixup on decompose_generic_list after the string-variant
  refactor.
- try_new now returns Error::EmbeddingColumnNotInSchema when a
  configured embedding source column is missing from the base schema,
  instead of silently dropping the column. Misconfiguration fails fast
  during table construction.
- Grammar: "Cannot use it create an embeddings" → "Cannot use it to
  create embeddings" in the base_table_has_embedding_column warning.

* Lint

* feat: Implement JSON schema decomposition for HTTP connector

* fix(vector): unify aggregation handling for ChunkedScalar and LateInteraction modes

* Lint

* fix(tests): improve variable naming for clarity in embedding tests

* fix(clippy): replace unwrap with expect and Arc::clone in embedding tests

---------

Co-authored-by: Viktor Yershov <viktor@spice.ai>
@pull pull Bot locked and limited conversation to collaborators Apr 21, 2026
@pull pull Bot added the ⤵️ pull label Apr 21, 2026
@pull pull Bot merged commit b9e71a9 into TheRakeshPurohit:trunk Apr 21, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant