[pull] trunk from spiceai:trunk#752
Merged
Merged
Conversation
* Add on_schema_resolved dataset ready state Introduces a new dataset readiness mode where the dataset is marked ready as soon as the federated source's schema is resolved (which also implies access has been verified), without waiting for the initial data refresh. Queries fall back to the federated source until the initial load completes. - Adds OnSchemaResolved variant to spicepod and runtime ReadyState enums - Marks the dataset ready immediately for FederatedTable::Immediate - Spawns a background task for FederatedTable::Deferred to mark ready once the provider resolves - Extends the scan() federated fallback to include the new variant - Forces OnLoad when the source is a DeferredConnector (e.g. Databricks U2M before trigger) so a placeholder schema is never exposed as Ready - Adds 4 integration tests (native/federated x arrow/duckdb) - Updates schema JSON and test YAML * Enhance dataset readiness handling in federated table provider - Update the logic to only mark the dataset as ready if the deferred provider successfully connects. - Introduce logging for unresolved federated providers to avoid misleading status updates. - Modify tests and add new snapshots for various ready states to ensure correctness. * Address PR review: don't overwrite Error status, clarify fallback comment - Preserve existing Error status when schema-resolved readiness arrives after the refresh path already reported an error, so schema-resolution readiness can't mask refresh failures surfaced via status/metrics. - Update the fallback-branch comment to reflect that the federated provider may still be resolved asynchronously before the initial load completes (the prior comment about a prior checkpoint was misleading). * feat(runtime): enhance dataset readiness handling and introduce federated resolution error management
…0258) * feat: Add Elasticsearch data connector with hybrid search support Add a new Elasticsearch data connector that supports querying ES indexes as SQL tables, kNN vector similarity search (vector_search UDTF), native BM25 full-text search (text_search UDTF), and hybrid search via the rrf UDTF. New crates: - elasticsearch: thin HTTP client wrapping reqwest - connector-elasticsearch: DataConnectorFactory/DataConnector impl - data_components/elasticsearch: DataFusion TableProvider impls - search/index/elasticsearch: SearchIndex + VectorIndex impls - runtime/embeddings/index/elasticsearch: vector engine integration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: Address review feedback for Elasticsearch connector - Client: add timeouts (connect 10s, request 30s), user-agent header - Client: include response body in error messages for debuggability - Client: propagate JSON serialization errors in bulk_index instead of silently dropping via unwrap_or_default() - Client: fix double-slash URL bug by storing base_url as trimmed String - Query table: use checked integer casts (try_from) instead of lossy `as` - Search table: move hits_to_record_batch outside per-field loop (O(1) vs O(n)) - Search table: add QueryEmbedder trait for lazy embedding computation - Search table: validate query_vector is non-empty before kNN request - Search index: compute query embeddings at execution time via EmbedQueryAdapter instead of passing empty placeholder vector - Runtime: use get_primary_keys() fallback when row_ids not configured - Runtime: remove unused imports (Expr, SchemaRef, ExposeSecret) - Makefile: add elasticsearch to lint features - Add integration tests (7 client + 5 data_components) with ES 8.17.0 - Add CI workflow for Elasticsearch integration tests - Add test spicepod for basic Elasticsearch connector config * fix: Improve ES CI health check and add readiness verification Simplify Docker health check to use `curl -sf` (avoids grep escaping issues in CI). Add start period, increase retries, and add an explicit readiness step that waits for yellow cluster status before running tests. * fix: Drop RwLock guard before awaiting embedding size inference Clone the embedding model Arc before releasing the read lock on embedding_models, avoiding holding the lock across the async get_or_infer_size() call. * fix: Update dense vector handling and add null mask in Elasticsearch query table * feat: Enhance Elasticsearch support by adding conditional compilation and index handling * refactor: Simplify import statements and improve code readability in UDTF * fix: address PR review comments - plumb scan limit and add ES UDTF discovery * test: ignore tests requiring a running Elasticsearch instance docs: update ElasticsearchTextIndex documentation for clarity fix: enhance index name formatting to ensure lowercase and valid characters * fix: remove unnecessary secret flag from Elasticsearch endpoint parameter * fix: align arrow type mapping with elsticsearch http client output * fix: normalize elasticsearch table schema for HasJoinExec compatibility with accelerated tables * fix(elasticsearch): validate dense_vector dims, fix index name sanitization, and append vector field to source schema * fix(elasticsearch): normalize base table schema for HashJoinExec compatibility with accelerated tables * fix(elasticsearch): address clippy lint issues * fix(elasticsearch): improve primary key resolution and error handling in try_from_table * fix(elasticsearch): saturating_mul for kNN num_candidates and sanitize error body - Use k.saturating_mul(2) to avoid potential usize overflow on large k. - Replace newlines/carriage returns in HTTP error response bodies before formatting, so error messages stay single-line per project conventions. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: ewgenius <hey@ewgenius.me> Co-authored-by: Evgenii Khramkov <evgenii@spice.ai>
* ci: bump test archive upload compression-level to 1 The integration_models test archive upload sets compression-level: 0 which skips the artifact zip wrapper compression. Set it to 1 (fastest) so the artifact still benefits from minimal zip-level compression without meaningfully increasing upload time. * ci: bump compression-level to 1 for all integration test artifact uploads Apply the same change across integration.yml, integration_llms.yml, and the second integration_models.yml upload step.
* feat(git-connector): add HTTPS/SSH auth, git-lfs, and RC-level resilience Bring the Git data connector up to RC quality: - HTTP(S) auth via git_username/git_password and git_token - SSH auth via git_ssh_key + git_ssh_passphrase, plus opt-in ssh-agent - git-lfs support (enable_lfs) via the git-lfs CLI after clone/fetch - Connection resilience: max_concurrent_requests semaphore per repo URL, bounded retries with exponential/fibonacci backoff, permanent-error circuit breaking (401/403), and an inflight_operations gauge - Partial filter push-down hint for path/name/sha/version predicates - Local-repo integration tests and expanded documentation - Adds the Git connector to the alpha/beta/rc criteria tables * fix(git-connector): address Copilot PR review comments Round of fixes driven by the Copilot review on #10385: - Do not override libgit2's default TLS certificate verification - Drop git:// from is_ssh_url; it is the anonymous git protocol - Replace `git lfs fetch --all` with a scoped fetch of the current ref - Source all runtime params and credentials from self.params only so secret-store references are resolved and the parameter validator is honored - Add a redacted Debug impl for GitCredentials/GitTableConfig/GitClient so secret material cannot leak via tracing - Also fetch refs/tags/*:refs/tags/* so tag-pinned datasets stay current - Warn when datasets sharing a repo_url request different max_concurrent_requests - Include the libgit2 source in the AuthenticationFailed display for triage - Update copyright header to 2024-2026 on new .rs files * style: cargo fmt for git connector changes * fix(git-connector): LFS content, SSH agent honoring, clippy Address the second round of review comments: - LFS: when enable_lfs is true, read file content from the LFS-materialized working tree instead of the Git blob (which still holds the pointer file). Falls back to the blob when the on-disk file is missing or exceeds max_file_bytes. - SSH: `ssh_use_agent` is now the sole switch for ssh-agent fallback; we no longer auto-fall back based on the URL shape when the operator explicitly disabled agent use. - Document the bounded memory trade-off of the global repo-URL maps. - Appease clippy: type alias for the semaphore entry map and `#[allow(clippy::too_many_arguments)]` on GitClient::new. * style(git-connector): satisfy strict clippy lint profile - Replace `#[allow]` with `#[expect]` reason (clippy::allow_attributes) - Drop u64→u32 casts via checked conversion - Remove unused `_url`/`is_ssh` arguments and helper - Collapse unit-pattern `match` to `if let` - Hoist closure `unwrap_or_else` to `PoisonError::into_inner` - Use `unwrap_or` instead of map+to_string+unwrap_or_else - Annotate Arc clones with concrete types to satisfy clone_on_ref_ptr * fix(git-connector): round-3 Copilot review fixes - Add a per-cache-path mutex that serializes clone/fetch/checkout/LFS operations so max_concurrent_requests > 1 never corrupts a shared on-disk cache - Check out the working tree to the requested ref before invoking git-lfs, so non-HEAD queries return content from the right commit - Treat LFS-materialized files that are missing or exceed max_file_bytes as "no content" rather than silently returning the LFS pointer - Clarify git_token docs: token is the password, username defaults to `x-access-token` and can be overridden via git_username * style(git-connector): replace match-over-single-pattern with enum-driven decode Extract an `LfsLookup` enum and a `decode` closure so the content branch avoids the `match` single-pattern destructures that `-D clippy::pedantic` flags. * fix(git-connector): round-4 Copilot review fixes - Add `sanitize_repo_url` and use it for log fields and the global semaphore/disabled-flag map keys so a URL with inline userinfo (e.g. https://user:token@host/repo) never leaks into logs or long-lived in-memory state - Return `ErrorCode::Auth` from the credentials callback when no HTTP credentials are configured, so missing-credentials errors are classified as permanent and trip the circuit breaker - Align the `GitCredentials` doc comment with the actual `Default` derive and explain how the runtime flips `ssh_use_agent` on * style(git-connector): fix clippy lints in test helpers - Replace `assert!(result.is_err())` with a `let Err(_) = ... else` pattern (clippy::assertions_on_result_states) - Replace `match { Err(e) => e, Ok(_) => panic! }` with `let Err(_) = ... else` (clippy::manual_let_else) * fix(git-connector): round-5 Copilot fixes + lint followups Round-5 Copilot review fixes: - Sanitize the repo URL when logging the "Connecting to Git repository" debug line in the runtime connector - Store a sanitized URL in `Error::AuthenticationFailed` so the user-facing display never carries inline credentials - Hold the per-cache-path mutex for the duration of the scan/read when LFS is enabled, so a concurrent query against a different ref cannot swap the working tree underneath us - Use the on-disk byte length as the row's `size` for LFS-tracked files (and use the same value for the `max_file_bytes` check) so the column no longer reflects the LFS pointer Plus cargo fmt / clippy followups flagged by CI: - Inline the `args` variable into the `bail!` format string - Replace `|b| b.num_rows()` with the trait method pointer * style(git-connector): apply rustfmt line wrap * style(git-connector): u64->usize via try_from * fix(git-connector): round-6 Copilot fixes - parse_git_url now ignores `@` that lives in the URL's userinfo component so HTTPS URLs such as `https://user:token@host/repo@trunk` are split at the correct `@`. - Introduce an RAII `InflightGuard` so the `inflight_operations` gauge is always decremented, including on future cancellation. - Hold the per-cache-path mutex from clone/fetch through LFS checkout through the working-tree scan so concurrent queries against the same cache can't interleave mutations with reads. * fix(git-connector): wire credentials into git-lfs and redact CLI errors Round-7 Copilot fixes: - `git lfs fetch/checkout` now runs with the connector's credentials applied in a non-persistent way: HTTP(S) token/password are injected as an `http.extraHeader` via `-c`, SSH keys are pointed to via `GIT_SSH_COMMAND`, and `GIT_TERMINAL_PROMPT=0` prevents interactive auth prompts. - Sanitize subprocess stderr before bubbling it up in `Error::GitLfsFailed` so configured secrets and URL userinfo can't leak into user-facing error messages. * fix(git-connector): clippy let-binding + GIT_DISABLED_FLAGS doc + Array import - Inline the `let redacted = { ... }; redacted` block to satisfy the `let_and_return` clippy lint - Rephrase the `GIT_DISABLED_FLAGS` doc comment to describe what the map actually does (shared latch across Git connector instances) — no implicit reference to the metrics provider - Keep `use arrow::array::Array as _` (aliased) in the integration tests: rustc still needs the trait in scope for `StringArray::len`, but the alias makes the "unused import" warning go away * fix(git-connector): round-8 security hardening - Replace `-c http.extraHeader=...` with a 0o700 temp `GIT_ASKPASS` script so HTTP(S) credentials never appear in argv (visible via `ps` on shared hosts) and never touch the on-disk git config - Unset `GIT_TRACE*` / `GIT_CURL_VERBOSE` / `GCM_TRACE` on every git subprocess so ambient tracing can't leak auth headers/helper output - Redact `Authorization: <scheme> <value>` from subprocess stderr before embedding it in `Error::GitLfsFailed` - Reject LFS-tracked symlinks and non-regular files, canonicalize the on-disk path, and require it to stay inside the repo cache before reading content with `std::fs::read` (TOCTOU-narrowed) - Use `Url::from_file_path` in integration tests so Windows paths produce a valid `file://` URL * style(git-connector): satisfy clippy lint profile - Move `AUTH_HEADER_RE` LazyLock to module scope (items_after_statements) - Suppress the single `expect` on a compile-time constant regex pattern with a reasoned `#[expect(clippy::expect_used)]` - Refactor `basic_auth_parts` to use `Option::map` for the password path * fix(git-connector): round-9 Copilot fixes - Swap the outer `GIT_CACHE_MUTEXES` from `std::sync::Mutex<HashMap>` to `dashmap::DashMap`. Sharded internal locking avoids blocking Tokio worker threads on the hot `cache_mutex_for` path. - Add deprecated `component` specs for the `git_`-prefixed runtime options (`git_include`, `git_fetch_content`, `git_cache_path`, `git_max_files`, `git_max_file_bytes`, `git_enable_lfs`) so existing spicepods that used those keys keep working and emit a clear deprecation warning instead of silently being dropped. - Replace `.expect("valid regex")` on `AUTH_HEADER_RE` with an `unwrap_or_else(|_| unreachable!())` pattern so the module compiles under the project-wide `clippy::expect_used` lint.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )