Skip to content

[pull] trunk from spiceai:trunk#752

Merged
pull[bot] merged 4 commits into
TheRakeshPurohit:trunkfrom
spiceai:trunk
Apr 19, 2026
Merged

[pull] trunk from spiceai:trunk#752
pull[bot] merged 4 commits into
TheRakeshPurohit:trunkfrom
spiceai:trunk

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Apr 19, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

lukekim and others added 4 commits April 18, 2026 19:59
* Add on_schema_resolved dataset ready state

Introduces a new dataset readiness mode where the dataset is marked
ready as soon as the federated source's schema is resolved (which also
implies access has been verified), without waiting for the initial data
refresh. Queries fall back to the federated source until the initial
load completes.

- Adds OnSchemaResolved variant to spicepod and runtime ReadyState enums
- Marks the dataset ready immediately for FederatedTable::Immediate
- Spawns a background task for FederatedTable::Deferred to mark ready
  once the provider resolves
- Extends the scan() federated fallback to include the new variant
- Forces OnLoad when the source is a DeferredConnector (e.g. Databricks
  U2M before trigger) so a placeholder schema is never exposed as Ready
- Adds 4 integration tests (native/federated x arrow/duckdb)
- Updates schema JSON and test YAML

* Enhance dataset readiness handling in federated table provider

- Update the logic to only mark the dataset as ready if the deferred provider successfully connects.
- Introduce logging for unresolved federated providers to avoid misleading status updates.
- Modify tests and add new snapshots for various ready states to ensure correctness.

* Address PR review: don't overwrite Error status, clarify fallback comment

- Preserve existing Error status when schema-resolved readiness arrives
  after the refresh path already reported an error, so schema-resolution
  readiness can't mask refresh failures surfaced via status/metrics.
- Update the fallback-branch comment to reflect that the federated
  provider may still be resolved asynchronously before the initial load
  completes (the prior comment about a prior checkpoint was misleading).

* feat(runtime): enhance dataset readiness handling and introduce federated resolution error management
…0258)

* feat: Add Elasticsearch data connector with hybrid search support

Add a new Elasticsearch data connector that supports querying ES indexes
as SQL tables, kNN vector similarity search (vector_search UDTF),
native BM25 full-text search (text_search UDTF), and hybrid search
via the rrf UDTF.

New crates:
- elasticsearch: thin HTTP client wrapping reqwest
- connector-elasticsearch: DataConnectorFactory/DataConnector impl
- data_components/elasticsearch: DataFusion TableProvider impls
- search/index/elasticsearch: SearchIndex + VectorIndex impls
- runtime/embeddings/index/elasticsearch: vector engine integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Address review feedback for Elasticsearch connector

- Client: add timeouts (connect 10s, request 30s), user-agent header
- Client: include response body in error messages for debuggability
- Client: propagate JSON serialization errors in bulk_index instead of
  silently dropping via unwrap_or_default()
- Client: fix double-slash URL bug by storing base_url as trimmed String
- Query table: use checked integer casts (try_from) instead of lossy `as`
- Search table: move hits_to_record_batch outside per-field loop (O(1) vs O(n))
- Search table: add QueryEmbedder trait for lazy embedding computation
- Search table: validate query_vector is non-empty before kNN request
- Search index: compute query embeddings at execution time via
  EmbedQueryAdapter instead of passing empty placeholder vector
- Runtime: use get_primary_keys() fallback when row_ids not configured
- Runtime: remove unused imports (Expr, SchemaRef, ExposeSecret)
- Makefile: add elasticsearch to lint features
- Add integration tests (7 client + 5 data_components) with ES 8.17.0
- Add CI workflow for Elasticsearch integration tests
- Add test spicepod for basic Elasticsearch connector config

* fix: Improve ES CI health check and add readiness verification

Simplify Docker health check to use `curl -sf` (avoids grep escaping
issues in CI). Add start period, increase retries, and add an explicit
readiness step that waits for yellow cluster status before running tests.

* fix: Drop RwLock guard before awaiting embedding size inference

Clone the embedding model Arc before releasing the read lock on
embedding_models, avoiding holding the lock across the async
get_or_infer_size() call.

* fix: Update dense vector handling and add null mask in Elasticsearch query table

* feat: Enhance Elasticsearch support by adding conditional compilation and index handling

* refactor: Simplify import statements and improve code readability in UDTF

* fix: address PR review comments - plumb scan limit and add ES UDTF discovery

* test: ignore tests requiring a running Elasticsearch instance

docs: update ElasticsearchTextIndex documentation for clarity

fix: enhance index name formatting to ensure lowercase and valid characters

* fix: remove unnecessary secret flag from Elasticsearch endpoint parameter

* fix: align arrow type mapping with elsticsearch http client output

* fix: normalize elasticsearch table schema for HasJoinExec compatibility with accelerated tables

* fix(elasticsearch): validate dense_vector dims, fix index name sanitization, and append vector field to source schema

* fix(elasticsearch): normalize base table schema for HashJoinExec compatibility with accelerated tables

* fix(elasticsearch): address clippy lint issues

* fix(elasticsearch): improve primary key resolution and error handling in try_from_table

* fix(elasticsearch): saturating_mul for kNN num_candidates and sanitize error body

- Use k.saturating_mul(2) to avoid potential usize overflow on large k.
- Replace newlines/carriage returns in HTTP error response bodies before
  formatting, so error messages stay single-line per project conventions.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ewgenius <hey@ewgenius.me>
Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
* ci: bump test archive upload compression-level to 1

The integration_models test archive upload sets compression-level: 0
which skips the artifact zip wrapper compression. Set it to 1 (fastest)
so the artifact still benefits from minimal zip-level compression
without meaningfully increasing upload time.

* ci: bump compression-level to 1 for all integration test artifact uploads

Apply the same change across integration.yml, integration_llms.yml, and
the second integration_models.yml upload step.
* feat(git-connector): add HTTPS/SSH auth, git-lfs, and RC-level resilience

Bring the Git data connector up to RC quality:

- HTTP(S) auth via git_username/git_password and git_token
- SSH auth via git_ssh_key + git_ssh_passphrase, plus opt-in ssh-agent
- git-lfs support (enable_lfs) via the git-lfs CLI after clone/fetch
- Connection resilience: max_concurrent_requests semaphore per repo URL,
  bounded retries with exponential/fibonacci backoff, permanent-error
  circuit breaking (401/403), and an inflight_operations gauge
- Partial filter push-down hint for path/name/sha/version predicates
- Local-repo integration tests and expanded documentation
- Adds the Git connector to the alpha/beta/rc criteria tables

* fix(git-connector): address Copilot PR review comments

Round of fixes driven by the Copilot review on #10385:

- Do not override libgit2's default TLS certificate verification
- Drop git:// from is_ssh_url; it is the anonymous git protocol
- Replace `git lfs fetch --all` with a scoped fetch of the current ref
- Source all runtime params and credentials from self.params only so
  secret-store references are resolved and the parameter validator is
  honored
- Add a redacted Debug impl for GitCredentials/GitTableConfig/GitClient
  so secret material cannot leak via tracing
- Also fetch refs/tags/*:refs/tags/* so tag-pinned datasets stay current
- Warn when datasets sharing a repo_url request different max_concurrent_requests
- Include the libgit2 source in the AuthenticationFailed display for triage
- Update copyright header to 2024-2026 on new .rs files

* style: cargo fmt for git connector changes

* fix(git-connector): LFS content, SSH agent honoring, clippy

Address the second round of review comments:

- LFS: when enable_lfs is true, read file content from the
  LFS-materialized working tree instead of the Git blob (which still
  holds the pointer file). Falls back to the blob when the on-disk
  file is missing or exceeds max_file_bytes.
- SSH: `ssh_use_agent` is now the sole switch for ssh-agent fallback;
  we no longer auto-fall back based on the URL shape when the
  operator explicitly disabled agent use.
- Document the bounded memory trade-off of the global repo-URL maps.
- Appease clippy: type alias for the semaphore entry map and
  `#[allow(clippy::too_many_arguments)]` on GitClient::new.

* style(git-connector): satisfy strict clippy lint profile

- Replace `#[allow]` with `#[expect]` reason (clippy::allow_attributes)
- Drop u64→u32 casts via checked conversion
- Remove unused `_url`/`is_ssh` arguments and helper
- Collapse unit-pattern `match` to `if let`
- Hoist closure `unwrap_or_else` to `PoisonError::into_inner`
- Use `unwrap_or` instead of map+to_string+unwrap_or_else
- Annotate Arc clones with concrete types to satisfy clone_on_ref_ptr

* fix(git-connector): round-3 Copilot review fixes

- Add a per-cache-path mutex that serializes clone/fetch/checkout/LFS
  operations so max_concurrent_requests > 1 never corrupts a shared
  on-disk cache
- Check out the working tree to the requested ref before invoking
  git-lfs, so non-HEAD queries return content from the right commit
- Treat LFS-materialized files that are missing or exceed max_file_bytes
  as "no content" rather than silently returning the LFS pointer
- Clarify git_token docs: token is the password, username defaults to
  `x-access-token` and can be overridden via git_username

* style(git-connector): replace match-over-single-pattern with enum-driven decode

Extract an `LfsLookup` enum and a `decode` closure so the content branch
avoids the `match` single-pattern destructures that `-D clippy::pedantic`
flags.

* fix(git-connector): round-4 Copilot review fixes

- Add `sanitize_repo_url` and use it for log fields and the global
  semaphore/disabled-flag map keys so a URL with inline userinfo
  (e.g. https://user:token@host/repo) never leaks into logs or
  long-lived in-memory state
- Return `ErrorCode::Auth` from the credentials callback when no
  HTTP credentials are configured, so missing-credentials errors
  are classified as permanent and trip the circuit breaker
- Align the `GitCredentials` doc comment with the actual `Default`
  derive and explain how the runtime flips `ssh_use_agent` on

* style(git-connector): fix clippy lints in test helpers

- Replace `assert!(result.is_err())` with a `let Err(_) = ... else` pattern
  (clippy::assertions_on_result_states)
- Replace `match { Err(e) => e, Ok(_) => panic! }` with `let Err(_) = ... else`
  (clippy::manual_let_else)

* fix(git-connector): round-5 Copilot fixes + lint followups

Round-5 Copilot review fixes:
- Sanitize the repo URL when logging the "Connecting to Git repository"
  debug line in the runtime connector
- Store a sanitized URL in `Error::AuthenticationFailed` so the
  user-facing display never carries inline credentials
- Hold the per-cache-path mutex for the duration of the scan/read when
  LFS is enabled, so a concurrent query against a different ref cannot
  swap the working tree underneath us
- Use the on-disk byte length as the row's `size` for LFS-tracked files
  (and use the same value for the `max_file_bytes` check) so the column
  no longer reflects the LFS pointer

Plus cargo fmt / clippy followups flagged by CI:
- Inline the `args` variable into the `bail!` format string
- Replace `|b| b.num_rows()` with the trait method pointer

* style(git-connector): apply rustfmt line wrap

* style(git-connector): u64->usize via try_from

* fix(git-connector): round-6 Copilot fixes

- parse_git_url now ignores `@` that lives in the URL's userinfo
  component so HTTPS URLs such as `https://user:token@host/repo@trunk`
  are split at the correct `@`.
- Introduce an RAII `InflightGuard` so the `inflight_operations` gauge
  is always decremented, including on future cancellation.
- Hold the per-cache-path mutex from clone/fetch through LFS checkout
  through the working-tree scan so concurrent queries against the same
  cache can't interleave mutations with reads.

* fix(git-connector): wire credentials into git-lfs and redact CLI errors

Round-7 Copilot fixes:
- `git lfs fetch/checkout` now runs with the connector's credentials
  applied in a non-persistent way: HTTP(S) token/password are injected
  as an `http.extraHeader` via `-c`, SSH keys are pointed to via
  `GIT_SSH_COMMAND`, and `GIT_TERMINAL_PROMPT=0` prevents interactive
  auth prompts.
- Sanitize subprocess stderr before bubbling it up in
  `Error::GitLfsFailed` so configured secrets and URL userinfo can't
  leak into user-facing error messages.

* fix(git-connector): clippy let-binding + GIT_DISABLED_FLAGS doc + Array import

- Inline the `let redacted = { ... }; redacted` block to satisfy the
  `let_and_return` clippy lint
- Rephrase the `GIT_DISABLED_FLAGS` doc comment to describe what the
  map actually does (shared latch across Git connector instances) —
  no implicit reference to the metrics provider
- Keep `use arrow::array::Array as _` (aliased) in the integration
  tests: rustc still needs the trait in scope for `StringArray::len`,
  but the alias makes the "unused import" warning go away

* fix(git-connector): round-8 security hardening

- Replace `-c http.extraHeader=...` with a 0o700 temp `GIT_ASKPASS`
  script so HTTP(S) credentials never appear in argv (visible via `ps`
  on shared hosts) and never touch the on-disk git config
- Unset `GIT_TRACE*` / `GIT_CURL_VERBOSE` / `GCM_TRACE` on every git
  subprocess so ambient tracing can't leak auth headers/helper output
- Redact `Authorization: <scheme> <value>` from subprocess stderr
  before embedding it in `Error::GitLfsFailed`
- Reject LFS-tracked symlinks and non-regular files, canonicalize the
  on-disk path, and require it to stay inside the repo cache before
  reading content with `std::fs::read` (TOCTOU-narrowed)
- Use `Url::from_file_path` in integration tests so Windows paths
  produce a valid `file://` URL

* style(git-connector): satisfy clippy lint profile

- Move `AUTH_HEADER_RE` LazyLock to module scope (items_after_statements)
- Suppress the single `expect` on a compile-time constant regex pattern
  with a reasoned `#[expect(clippy::expect_used)]`
- Refactor `basic_auth_parts` to use `Option::map` for the password path

* fix(git-connector): round-9 Copilot fixes

- Swap the outer `GIT_CACHE_MUTEXES` from `std::sync::Mutex<HashMap>`
  to `dashmap::DashMap`. Sharded internal locking avoids blocking Tokio
  worker threads on the hot `cache_mutex_for` path.
- Add deprecated `component` specs for the `git_`-prefixed runtime
  options (`git_include`, `git_fetch_content`, `git_cache_path`,
  `git_max_files`, `git_max_file_bytes`, `git_enable_lfs`) so existing
  spicepods that used those keys keep working and emit a clear
  deprecation warning instead of silently being dropped.
- Replace `.expect("valid regex")` on `AUTH_HEADER_RE` with an
  `unwrap_or_else(|_| unreachable!())` pattern so the module compiles
  under the project-wide `clippy::expect_used` lint.
@pull pull Bot locked and limited conversation to collaborators Apr 19, 2026
@pull pull Bot added the ⤵️ pull label Apr 19, 2026
@pull pull Bot merged commit c994b93 into TheRakeshPurohit:trunk Apr 19, 2026
1 of 12 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant