[pull] trunk from spiceai:trunk by pull[bot] · Pull Request #753 · TheRakeshPurohit/spiceai

pull · 2026-04-19T09:06:14Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* feat(postgres): stream WAL directly to Spice accelerators Adds a native logical-replication path so a Postgres dataset with `acceleration.refresh_mode: changes` streams INSERT/UPDATE/DELETE from WAL into the local accelerator (DuckDB, SQLite, Postgres, Cayenne, Arrow) without Debezium or Kafka. Implementation - New `data_components::postgres_replication` module with pgoutput decoder, slot/publication setup, REPEATABLE READ snapshot bootstrap, and an `LsnCommitter` that acknowledges durable LSN back to Postgres so WAL can be reclaimed. - `connector-postgres` now implements `supports_changes_stream` / `changes_stream` behind a new `postgres-replication` Cargo feature (on by default). Adds `pg_replication_slot`, `pg_publication`, `pg_replication_initial_snapshot`, `pg_replication_temporary_slot`, and `pg_replication_status_interval` params. - Each Spice replica auto-gets a distinct slot name (`spice_<dataset>_<hash(SPICE_INSTANCE_ID|hostname)>`), so multi- replica deployments just work; publication is shared. - Slot's server-side `confirmed_flush_lsn` is the durable checkpoint: restarts resume without re-bootstrapping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postgres): expose WAL replication metrics via MetricsProvider Emits 14 OpenTelemetry observables per replicated dataset under `dataset_postgres_replication_*`, matching the pattern used by MySQL, Databricks, and DynamoDB: Freshness - replication_lag_ms (gauge, auto) - replication_lag_bytes (gauge, auto) - replication_confirmed_flush_lsn (gauge) - replication_server_wal_end_lsn (gauge) Throughput - replication_transactions_total (counter, auto) - replication_inserts_total (counter, auto) - replication_updates_total (counter, auto) - replication_deletes_total (counter, auto) - replication_truncates_total (counter) Bootstrap - replication_bootstrap_rows_total (counter) - replication_bootstrap_complete (gauge, auto) Errors - replication_decode_errors_total (counter, auto) - replication_schema_mismatch_errors_total (counter, auto) - replication_recv_errors_total (counter, auto) Implementation mirrors the DynamoDB Streams pattern: a MetricsCollector (AtomicU64 counters + RwLock<SystemTime> watermark) is shared between the replication stream and the connector's MetricsProvider impl. Docs updated with a full Metrics section including Prometheus alert/throughput queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(postgres): end-to-end TPC-H replication integration test New `crates/runtime/tests/postgres/replication_tpch.rs` exercises the full stack: Postgres docker container (wal_level=logical) → connector-postgres replication → DuckDB accelerator → SQL queries through Runtime::datafusion(). Covers a TPC-H-shaped subset (region, nation, customer, orders) with int4, int8, text, float8, and date columns. The test asserts: - Initial snapshot bootstrap populates all four accelerated tables. - INSERT/UPDATE/DELETE propagate through WAL to the accelerator. - Date and float types round-trip accurately. - Joins across the replicated tables return consistent results. - A final pretty-printed region snapshot (insta) guards the row shape. CI: add `postgres-replication` to the FEATURES list in .github/workflows/integration.yml so nextest picks it up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): address PR review feedback Correctness - KeepAlive now ACKs the last *applied* LSN, never the server's wal_end (was silently advancing the slot and risked losing un-applied changes across restarts). - Reject `pg_sslmode` values other than `disable` at setup with an actionable error, instead of silently falling back to `NoTls` against TLS-enforced Postgres. - `Value::Unchanged` (TOAST omitted) and `Value::Binary` now fail the stream instead of silently coercing to NULL — avoids accelerator data corruption on UPDATE. - `parse_lsn` returns `Result` and errors on malformed input; LSN 0 is no longer a silent fallback. - Guard the Commit path against publications that contain multiple tables (would mix relations into one ChangeBatch). - `replication_lag_ms` now uses pgoutput's `commit_time_micros` (source commit time) instead of ingest time, so the metric matches its description and reflects end-to-end freshness. - Reject `DataType::LargeUtf8` in both the bootstrap and WAL-change builders (was producing a Utf8 `StringArray` that fails struct schema validation). Hygiene - Prefer `acceleration.primary_key` over the source provider's `Constraints::PrimaryKey` when routing `ChangeBatch` PKs — matches what the accelerator write path uses. - Precompute dataset→relation column mapping once per batch (O(rows × fields²) → O(rows × fields)). - Replace `.expect("epoch")` with `unreachable!` in two places. - Drop `SPICE_INSTANCE_ID` env-var mutation in a unit test (flaky under concurrent test runs). - Fix mod.rs doc comment: "COPY-based snapshot" → row streaming. - Clarify the at-least-once snapshot/WAL boundary and note exported- snapshot handshake as a follow-up. - Trim unused deps on connector-postgres (arrow, twox-hash, tracing, paste, linkme). Docs - Update orphaned-slot query to use `inactive_since` (PG14+) instead of the non-existent `stats_reset`. - Minimal spicepod example uses `pg_sslmode: disable` with a note about TLS being a follow-up. - Limitations section now states TLS is rejected at startup rather than silently ignored. Tests: 24 passing (+2 new LSN-parse rejection tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postgres-replication): resilient reconnect + address review feedback Resilience - New `resilience` module with `Backoff` (500ms → 30s with ±20% jitter), transient-error classifiers for both `pgwire_replication::PgWireError` and `tokio_postgres::Error` (SQLSTATE 08xxx connection-exception and 57P0x admin-shutdown are retriable), and a generic `retry_async` wrapper. - WAL client now wraps its recv loop in an outer reconnect loop: transient connect/recv failures drop the client, back off, and reconnect. Relation cache + in-flight txn buffer are reset per reconnect (Postgres re-sends Relation on demand). Only fatal errors (auth, slot missing, schema mismatch) surface to the caller. - Initial connect attempt done upfront so startup misconfig surfaces immediately; transient failures on that first connect are demoted to a warning so the dataset comes up once Postgres is reachable. - `setup_slot_and_publication` wraps its connect+catalog SQL in the same retry logic (2-minute budget). Idempotent SQL means no harm on retry. - New `dataset_postgres_replication_reconnects_total` counter, auto- registered. Docs section explaining the resilience model + a new "Resilience" subsection under Operations. Review feedback (4 threads) - Bootstrap loop now precomputes the dataset-field → pg-column index map once from the first row's metadata (O(fields²) on first row, O(fields) per row thereafter). - `replication_lag_bytes` unit: "by" → "By" (UCUM for bytes). - Client overrides pgoutput `Relation.is_key` from the dataset's declared `acceleration.primary_key` before routing to the change builder, so REPLICA IDENTITY FULL (which flags every column as key) no longer explodes `ChangeBatch.primary_keys`. - Slot-resume path now prepends a zero-row `is_dataset_ready=true` envelope with a no-op committer, so the dataset becomes ready without waiting for the first WAL change. Important for quiet sources on restart. Tests: 28 passing (+4 new — backoff doubling/cap, transient classifier, retry_async success and fatal paths). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): address another round of review feedback - pgoutput.rs header comment: "Postgres 15+" → "Postgres 10+". pgoutput has existed since PG 10. - config.rs: slot + publication names now truncate the sanitized dataset portion to keep the final identifier within Postgres' 63-byte NAMEDATALEN limit. Added slot_name_is_truncated_to_postgres_limit and publication_name_is_truncated_to_postgres_limit tests. - connector-postgres/replication.rs: fail fast with an actionable StreamError when no primary key is available from either acceleration.primary_key or the source provider's constraints — refresh_mode: changes cannot route UPDATE/DELETE without one. - bootstrap.rs: module-level comment no longer implies LSN-consistent snapshot semantics (it's at-least-once; the authoritative note is on `start_replication_stream`). - bootstrap.rs: `metrics.mark_bootstrap_complete()` now runs *after* the REPEATABLE READ transaction's COMMIT succeeds, so `replication_bootstrap_complete=1` reflects the true durable bootstrap state (relevant when operators use the metric as a readiness signal). Tests: 30 passing (+2 new — slot/publication truncation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postgres): fold replication into the postgres feature, finish TLS + TRUNCATE + LargeUtf8 Feature consolidation - No separate `postgres-replication` feature flag. The WAL streaming path is now always compiled with the `postgres` feature on data_components and is always on in connector-postgres. Removes the feature entry in data_components, connector-postgres, and runtime; drops `postgres- replication` from the CI integration.yml FEATURES list. - `data_components::postgres_replication` is now gated on `postgres`. TLS (full implementation) - New `SslMode` variants: `VerifyCa` and `VerifyFull`, matching libpq vocabulary. `SslMode::{requires_tls, verifies_certificate, verifies_hostname}` helpers. - `pg_sslrootcert` parameter threaded through `ReplicationParams`. - Setup + bootstrap connections now build a `postgres_native_tls:: MakeTlsConnector` from the configured mode; `disable` still uses `NoTls`, everything else negotiates TLS with the appropriate verification level. - WAL replication (pgwire-replication) wires our `SslMode` to the crate's `TlsConfig`: disabled / require / verify_ca / verify_full, with the optional CA path plumbed through. - Minimal PEM certificate parser that splits a CA bundle into individual `native_tls::Certificate`s so users can point `pg_sslrootcert` at a multi-cert file. - Removes the old `TlsNotSupported` fast-fail in setup. TRUNCATE (full implementation) - pgoutput `Truncate` is no longer just a warning. The client enqueues a synthetic `ChangeOp::Truncate` into the transaction buffer; the downstream `ChangeBatch` path already groups these into a `ChangeOperationType::Truncate` sub-batch. - Runtime's refresh_task/changes.rs gains `process_truncate` which calls `DeletionTableProvider::delete_from(&[])` to unconditionally drop all rows in the accelerated table. - Docs updated: `replication_truncates_total` metric now documented as "received and applied" rather than "received (skipped)"; Limitations section no longer flags TRUNCATE. LargeUtf8 (full implementation) - FieldBuilder (WAL path) and BootstrapBuilder (snapshot path) now handle `DataType::LargeUtf8` via `LargeStringBuilder` instead of erroring. No more "not yet supported" rejection. Misc - Docs minimal example uses `pg_sslmode: verify-full` with an optional `pg_sslrootcert` pointing at a CA PEM. - Dropped the "No TLS" entry from the Limitations section. - Workspace deps: added postgres-native-tls 0.5, native-tls 0.2. 30 unit tests pass. Clippy + fmt clean. Runtime tests compile with `--features postgres,duckdb`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 5 review fixes + trunk merge - process_truncate now passes `[lit(true)]` as the delete filter instead of `&[]`. Existing DeletionTableProvider impls (DuckDB/SQLite/Postgres) intentionally treat empty filters as a no-op to guard against accidental full-table deletes, so `TRUNCATE` would have been silently skipped. An always-true literal produces `DELETE FROM t WHERE TRUE` which is the correct behavior. - pgoutput TRUNCATE now uses `relation_ids` from the message instead of picking the first cached relation. Errors clearly on empty or multi- relation truncates (this replication path requires exactly one table per publication). Also errors clearly if the referenced relation hasn't been cached yet. - Metric description for `replication_truncates_total`: "currently skipped" → "applied to the accelerator". - Docs troubleshooting table: "TRUNCATE not yet implemented" entry replaced with an informational entry describing the new queued- for-apply log line. - Docs metric-naming sentence: `dataset_postgres_<metric>` → `dataset_postgres_replication_<metric>` to match actual series names. Also includes the clean merge of origin/trunk (2 new commits: OTEL exporter fix, candle/TEI fork pinning). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postgres-replication): full pg↔arrow type coverage + nullability fix Type coverage (matches datafusion-table-providers' postgres mapping) - Timestamp(Nanosecond) for both timestamp and timestamptz — this was the critical gap; DTP's read_provider returns Nanosecond precision, so any real timestamp column would previously fail `FieldBuilder` construction with "unsupported Arrow data type". - Binary (bytea) — parse pgoutput `\x...` hex text in WAL path, read `Vec<u8>` directly in bootstrap. - Decimal128(precision, scale) for NUMERIC — parse text "123.45" to i128 with Arrow scale; rejects NaN/Infinity/overscale values with actionable errors. - Int8 (pg CHAR), UInt32 (pg OID/XID), Time64(Nanosecond) (pg TIME). - LargeUtf8 now fully handled (earlier hot-fix had it rejected). - Arrays / Lists / Intervals rejected with explicit errors pointing the operator at a cast-to-scalar workaround. Nullability fix (the DELETE regression) - `build_change_batch` and bootstrap's `finish_batch` now construct the internal `data` StructArray with all fields *nullable*, regardless of the dataset schema's original nullability. This is required because REPLICA IDENTITY DEFAULT populates only PK columns in a DELETE event; non-PK columns are sent as null. Building the struct with the original nullable=false fields would fail StructArray::new validation and kill the stream. Downstream writes still respect the accelerator's nullability because `SchemaCastScanExec` re-casts to the strict schema on INSERT. DELETE doesn't touch non-PK columns anyway. Comprehensive unit tests (49 passing, up from 30) - INSERT / UPDATE / DELETE (both REPLICA IDENTITY DEFAULT and FULL) / TRUNCATE on both nullable and NOT NULL schemas. - Mixed ops in a single transaction preserve order. - Composite primary keys populate all PK entries correctly. - Type coverage: Int8/16/32/64, UInt32, Float32/64, Boolean, Date32, Time64(ns), Timestamp(us/ns both naive and tz-aware), Binary(bytea), Decimal128(38, 2). - `FieldBuilder::new` rejects Array / List / Interval with actionable messages. - Numeric parser: standard cases, sign handling, NaN/Infinity rejection, overscale rejection. - Hex decoder: round-trip, odd length, invalid digit rejection. - Unchanged-TOAST during UPDATE errors clearly (regression guard). Bootstrap + WAL paths share the same type coverage and nullability semantics so ChangeBatches are structurally identical regardless of which path produced them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 10 review fixes — scientific notation, LSN seeding, commit-before-ready, arrow deletes Correctness - `expand_scientific_notation` no longer goes through `f64` (which loses precision and can re-emit scientific notation that breaks the `combined.parse::<i128>()` downstream). Now a pure string-based expansion that preserves full Decimal128 range and emits structured errors for malformed exponents/significands. - `confirmed_flush` atomic is now seeded with the slot's `consistent_lsn` after setup (previously pinned to 0 until the first Commit, which made KeepAlive ACKs and `replication_lag_bytes` report wildly inflated values at startup). - Bootstrap now commits the REPEATABLE READ transaction BEFORE yielding the ready-signalling envelope. If COMMIT fails the runtime never sees `is_dataset_ready=true`, matching durable state. Docs - Arrow accelerator row in the engine table: `DELETE` and `TRUNCATE` are applied via Arrow's `DeletionTableProvider` (previously claimed they were ignored — they weren't). `UPDATE`s still degrade to inserts because Arrow lacks `on_conflict` semantics. - Limitations section reconciled with the fix above — Arrow is append-only for updates, but deletes and truncates work. Integration tests - `replication.rs` now commits the bootstrap / UPDATE / DELETE envelopes (was `let (_committer, ...)` in all three). Exercises the LSN acknowledgement path the runtime actually uses. - Added `#![allow(clippy::expect_used)]` at crate level so CI's `-D clippy::expect_used` lint doesn't fire on the test file. 49 tests pass. Runtime builds clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): resume from confirmed_flush_lsn + TLS/hostname hardening - slot.rs: on resume, query pg_replication_slots.confirmed_flush_lsn and return it as consistent_lsn so the caller seeds its in-memory atomic (and metrics) from the server's durable checkpoint instead of the 0 sentinel. Handles NULL catalog values by falling back to 0. - config.rs: drop /etc/hostname blocking-I/O fallback; env-var only. - config.rs: PEM parser returns TruncatedPem on unmatched BEGIN marker and callers return EmptyCaBundle if zero certs parse, instead of silently loading nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): warn when TLS cert verification is disabled CodeQL flagged the native-tls danger_accept_invalid_* calls. The code is correct by design — it implements the libpq sslmode contract, and each dangerous flag is gated on an explicit user opt-in via pg_sslmode — but we now emit a tracing::warn each time the connector is built in a non-verifying mode, so operators see the security posture of their deployment in the logs. Production recommendation remains pg_sslmode=verify-full (no dangerous flags set, chain + hostname both verified). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): map sslmode=prefer to plaintext for WAL stream Previously `SslMode::Prefer` mapped to `TlsConfig::require()` — this silently strengthened our parsing default into "TLS required" and broke WAL replication against non-TLS Postgres instances that the regular connector happily talks to (the setup connection uses tokio_postgres's real Prefer semantics). pgwire-replication doesn't expose a safe "try TLS, fall back to plaintext" path, so the only honest mappings are Disabled or Require. Matching libpq's "don't block on missing TLS" intent and staying symmetric with the setup path makes Disabled the right default — operators who want TLS on replication set Require, VerifyCa, or VerifyFull explicitly. A tracing::warn already fires when any non-verifying mode is in effect. Docs now document per-sslmode transport/verification behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): pass CI pedantic clippy Resolves the `Rust Lint` CI failure. CI runs clippy with -Dclippy::pedantic plus -Dclippy::unwrap_used / clone_on_ref_ptr / allow_attributes; these weren't being caught by the looser local clippy invocation used during earlier ticks. Changes: - bootstrap.rs: drop unused `async`, drop the unused `_snapshot_name` pub field, replace `expect` with `let-else + unreachable`, use `Arc::clone(tz)` instead of `tz.clone()`, remove redundant param clone, fold the column-map lookup into the iteration path. - changes.rs: `append` now takes `Option<&Value>` (ref_option fix); `Arc::clone(tz)` for TimestampMicros/Nanos finish; `.expect()` in tests (unwrap_used); `unwrap_err()` via `.expect_err` instead of `assert!(...is_err())` (assertions_on_result_states); `let-else` for FieldBuilder rejection tests (manual_let_else); doc backticks. - client.rs: fold Begin/Commit/Other into one no-op arm; `u64::try_from` for pg_epoch_to_system_time (cast_sign_loss). - config.rs: merge ParseCa/BuildConnector in `source()` (both native_tls::Error — match_same_arms); scoped `#[expect]` on the intentional u64→u32 truncation in xxh3_short_hash. - mod.rs: `#[must_use]` on start_replication_stream; `stream_error(&Error)`; scoped `#[expect]` on err_to_stream's by-value signature (used as function pointer in map_err). - resilience.rs: scoped `#[expect]` on the intentional bounded- millis casts in jitter(). - slot.rs: scoped `#[expect]` on the intentional u64→u32 truncation in format_lsn; drop redundant #[allow(dead_code)] (parse_lsn_or_zero is exercised by a test). - connector-postgres lib.rs: rename `postgres_factory` to `factory` (struct_field_names). - connector-postgres replication.rs: `map_or(true, ...)` → `is_none_or(...)`; `is_some_and(...)` for the false-default; method path over closure in PK mapping. - runtime tests: `u16::try_from(port).expect(...)` (cast truncation); `r"..."` instead of `r#"..."#` where no quotes embedded; inlined format args; `expect_err` idioms. No behavioral changes. Library test count unchanged (49 passing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 4 review fixes — publication race, async CA read, TRUNCATE comment, insta snapshot - slot.rs: ensure_publication now routes CREATE/ALTER PUBLICATION through a new `ignore_duplicate_object` helper that treats SQLSTATE 42710 as success. Two replicas racing into publication setup no longer fail on the loser. - config.rs: native_tls_connector is now async and reads sslrootcert via tokio::fs::read instead of std::fs::read, so it doesn't block a Tokio reactor thread during setup/bootstrap. Callers (slot + bootstrap) await the connector; bootstrap moved the call inside the try_stream! block with explicit err_to_stream conversion. - refresh_task/changes.rs: clarified the TRUNCATE comment to say *some* DeletionTableProvider impls (DuckDB in particular, with the no-op guard at data_components/src/duckdb.rs:137-140) treat empty filters as a no-op, which is why we pass `&[lit(true)]` for a uniform DELETE WHERE TRUE across engines. - runtime/tests/postgres/replication_tpch.rs: replaced the `insta::assert_snapshot!` call with an inlined `assert_eq!` against a `concat!`-built expected string. The integration test no longer depends on a separate .snap artifact landing in the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 4 review fixes — tpch test regions, COPY wording, param description - replication_tpch.rs: expected_regions table now matches the 3-row seed (AFRICA, AMERICA, ASIA). The previous copy included EUROPE and MIDDLE EAST rows that were never seeded; the assertion would have failed deterministically once CI ran the TPC-H integration path. - bootstrap.rs: BOOTSTRAP_BATCH_SIZE doc updated from "COPY-streaming" to "streaming the initial snapshot query" — the path uses `SELECT *` with query_raw/portal, not Postgres COPY. - connector-postgres/src/lib.rs: replication_initial_snapshot parameter description changed from "COPY the table's existing rows" to "take an initial snapshot of the table's existing rows" so the wording matches the actual REPEATABLE READ + SELECT * path. - docs/features/postgres-replication.md: same wording fix in the parameter reference table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): enable SCRAM auth + correct tpch order count Two CI integration-test failures on 36f823c: 1. SCRAM authentication — pgwire-replication was pulled in with default-features = false, which disables its `scram` feature. All postgres:16 containers require SCRAM-SHA-256 auth by default, so every replication connect attempt failed with `authentication error: SCRAM authentication required but 'scram' feature not enabled`. Enable the `scram` feature alongside `tls-rustls` in the workspace dep. Fixes: - postgres::replication::bootstrap_then_stream_changes - postgres::replication::two_replicas_have_independent_slots 2. tpch_postgres_replication_end_to_end: the assertion `SELECT count(*) FROM tpch_orders WHERE o_orderdate < '1995-01-01'` was compared against 3, but the seed data has 5 matching rows (ids 3, 5, 6, 9, 10 with dates in 1992–1994). Corrected the expected count to 5 with a comment documenting which seed rows qualify. Local clippy (-Dwarnings -Dclippy::pedantic) clean on data_components with the new feature flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): switch regions assertion back to insta snapshot Per PR feedback, use the repo's standard Insta snapshot pattern for the final region-table assertion instead of an inline assert_eq!. Committed the generated .snap file so CI runs deterministically without needing INSTA_UPDATE=1. Values match the 3-row seed (AFRICA, AMERICA, ASIA). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 5 review fixes — on_conflict guard, doc wording, comment accuracy - connector-postgres: `build_changes_stream` now fails fast when the engine supports upsert (DuckDB / SQLite / PostgreSQL / Cayenne) but the dataset has no `acceleration.on_conflict` entry with an `Upsert` variant. Arrow / PartitionedArrow are exempted — they're documented as append-only for UPDATEs. Error tells operators exactly what to add. - docs/features/postgres-replication.md: dropped the misleading `/etc/hostname` row from the replica-identity sources table. The hostname fallback is env-var only (intentionally avoids blocking I/O) — matches the actual implementation. - data_components/postgres_replication/metrics.rs: module doc now names the OTel observable pattern correctly: `dataset_postgres_<metric_spec_name>` with `dataset_postgres_replication_lag_ms` as a concrete example. - data_components/postgres_replication/slot.rs: dropped the incorrect "PG 16+" framing from the `create_logical_slot` comment. The `(slot_name, lsn)` return shape of `pg_create_logical_replication_slot` has been stable for many versions; the real limitation is that the SQL-function path doesn't give us `EXPORT_SNAPSHOT` semantics, hence the REPEATABLE READ bootstrap. Also updated the PR description so its "postgres-replication Cargo feature" bullet and test-plan commands match the actual build surface (replication is part of the existing `postgres` feature). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): add dataset-name hash to default slot/publication names Two long dataset names that share the same first ~48 characters previously truncated to the same default slot/publication name because only the instance hash disambiguated the slot suffix. That means the slot/publication on Postgres would alias across datasets on the same replica. Added a 6-char `xxh3` hash of the full dataset name between the truncated prefix and the instance hash: slot: `spice_{truncated}_{ds-hash6}_{instance-hash8}` pub: `spice_{truncated}_{ds-hash6}_pub` Added `truncated_prefix_collisions_are_disambiguated` regression test and updated the other naming tests for the new format. Docs updated (feature doc + PR description) to reflect the new default shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): port-fallback + NULL confirmed_flush_lsn handling Two review fixes: 1. runtime/tests/postgres/common.rs — the new `start_postgres_docker_container_with_logical_wal` helper no longer uses `port.try_into().unwrap_or(15432)`. A port that doesn't fit in u16 now returns a clear anyhow error instead of silently binding on 15432 (which would mask the misconfiguration as a confusing connection failure later). 2. data_components slot.rs — do_setup now explicitly distinguishes three catalog states for the named slot: - None → no slot, create + bootstrap - Some(0) → slot exists but NULL confirmed_flush_lsn (crash between slot creation and first standby status); warn and treat as bootstrap-required so the accelerator isn't silently left missing historical rows - Some(lsn) lsn>0 → normal resume Previously Some(0) was treated as a resume with created_fresh=false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(postgres-replication): sync default slot/pub names to actual naming logic The docs and parameter descriptions still referenced the old `spice_<dataset>_<instance-hash>` shape after the dataset-hash segment was added in 90013a2. Updated: - docs/features/postgres-replication.md overview step, "Start the runtime" bullet list, `pg_replication_slot` row in the config reference, and the sample `pg_replication_slots` SQL output - connector-postgres parameter descriptions for `replication_slot` and `publication` Operators can now match the actual slot names they'll see in `pg_replication_slots` when inspecting or decommissioning replicas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): 3 review fixes — on_conflict PK match, count-column strictness, commits in two-replica test 1. connector-postgres replication.rs: tightened the `has_upsert` check. Previously we accepted any `Upsert` entry in `acceleration.on_conflict`; a misconfigured target on a non-PK column would pass but still let UPDATE events append duplicate rows. Now we look up the declared `primary_key` directly in the `on_conflict` map and require an `Upsert(_)` there. The error message suggests the actual PK column instead of a generic placeholder. 2. runtime/tests/postgres/replication_tpch.rs: `wait_for_row_count` previously used `unwrap_or_default()` which swallowed schema/type regressions (empty batch, unexpected column type, negative count) as "still 0 rows" and turned them into a 60-second timeout. Each mis-shape now produces an explicit error with the sql + actual type in the message. 3. runtime/tests/postgres/replication.rs: `two_replicas_have_independent_slots` now calls `env.commit().await?` on both bootstrap and live envelopes for both replicas. Previously the test skipped commits, so confirmed_flush_lsn never advanced and slot state could drift — source of potential WAL-retention / cleanup flakiness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(postgres-replication): move upsert validation after PK resolution Per review: the missing_upsert_error was computed from `acceleration.primary_key`/`on_conflict` BEFORE `primary_keys` was resolved. That meant the extract_primary_keys fallback was effectively dead for engines that need upsert — a dataset with no declared `acceleration.primary_key` but a source-table PK would always trip the error with a useless `<pk>` hint. Validation now runs inside the try_stream! block after `primary_keys` is determined, and distinguishes two cases: 1. Declared PK set but `on_conflict` missing the matching Upsert entry — error names the PK and suggests the exact fix. 2. Declared PK unset (source-table fallback in play) — error explicitly says BOTH `acceleration.primary_key` AND a matching `on_conflict` are required, because the accelerator's write path consults only the declared config, not the source table's PK. The error messages now contain a concrete PK name (either the declared one or the one extracted from the source table) instead of a generic placeholder. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pull Bot locked and limited conversation to collaborators Apr 19, 2026

pull Bot added the ⤵️ pull label Apr 19, 2026

pull Bot merged commit 3582d66 into TheRakeshPurohit:trunk Apr 19, 2026
1 of 11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] trunk from spiceai:trunk#753

[pull] trunk from spiceai:trunk#753
pull[bot] merged 1 commit into
TheRakeshPurohit:trunkfrom
spiceai:trunk

pull Bot commented Apr 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pull Bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pull Bot commented Apr 19, 2026 •

edited

Loading