Skip to content

[pull] trunk from spiceai:trunk#753

Merged
pull[bot] merged 1 commit into
TheRakeshPurohit:trunkfrom
spiceai:trunk
Apr 19, 2026
Merged

[pull] trunk from spiceai:trunk#753
pull[bot] merged 1 commit into
TheRakeshPurohit:trunkfrom
spiceai:trunk

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Apr 19, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* feat(postgres): stream WAL directly to Spice accelerators

Adds a native logical-replication path so a Postgres dataset with
`acceleration.refresh_mode: changes` streams INSERT/UPDATE/DELETE from
WAL into the local accelerator (DuckDB, SQLite, Postgres, Cayenne,
Arrow) without Debezium or Kafka.

Implementation
- New `data_components::postgres_replication` module with pgoutput
  decoder, slot/publication setup, REPEATABLE READ snapshot bootstrap,
  and an `LsnCommitter` that acknowledges durable LSN back to Postgres
  so WAL can be reclaimed.
- `connector-postgres` now implements `supports_changes_stream` /
  `changes_stream` behind a new `postgres-replication` Cargo feature
  (on by default). Adds `pg_replication_slot`, `pg_publication`,
  `pg_replication_initial_snapshot`, `pg_replication_temporary_slot`,
  and `pg_replication_status_interval` params.
- Each Spice replica auto-gets a distinct slot name
  (`spice_<dataset>_<hash(SPICE_INSTANCE_ID|hostname)>`), so multi-
  replica deployments just work; publication is shared.
- Slot's server-side `confirmed_flush_lsn` is the durable checkpoint:
  restarts resume without re-bootstrapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postgres): expose WAL replication metrics via MetricsProvider

Emits 14 OpenTelemetry observables per replicated dataset under
`dataset_postgres_replication_*`, matching the pattern used by MySQL,
Databricks, and DynamoDB:

Freshness
- replication_lag_ms                  (gauge, auto)
- replication_lag_bytes               (gauge, auto)
- replication_confirmed_flush_lsn     (gauge)
- replication_server_wal_end_lsn      (gauge)

Throughput
- replication_transactions_total      (counter, auto)
- replication_inserts_total           (counter, auto)
- replication_updates_total           (counter, auto)
- replication_deletes_total           (counter, auto)
- replication_truncates_total         (counter)

Bootstrap
- replication_bootstrap_rows_total    (counter)
- replication_bootstrap_complete      (gauge, auto)

Errors
- replication_decode_errors_total     (counter, auto)
- replication_schema_mismatch_errors_total (counter, auto)
- replication_recv_errors_total       (counter, auto)

Implementation mirrors the DynamoDB Streams pattern: a
MetricsCollector (AtomicU64 counters + RwLock<SystemTime> watermark)
is shared between the replication stream and the connector's
MetricsProvider impl. Docs updated with a full Metrics section
including Prometheus alert/throughput queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(postgres): end-to-end TPC-H replication integration test

New `crates/runtime/tests/postgres/replication_tpch.rs` exercises the
full stack: Postgres docker container (wal_level=logical) →
connector-postgres replication → DuckDB accelerator → SQL queries
through Runtime::datafusion().

Covers a TPC-H-shaped subset (region, nation, customer, orders) with
int4, int8, text, float8, and date columns. The test asserts:

- Initial snapshot bootstrap populates all four accelerated tables.
- INSERT/UPDATE/DELETE propagate through WAL to the accelerator.
- Date and float types round-trip accurately.
- Joins across the replicated tables return consistent results.
- A final pretty-printed region snapshot (insta) guards the row shape.

CI: add `postgres-replication` to the FEATURES list in
.github/workflows/integration.yml so nextest picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): address PR review feedback

Correctness
- KeepAlive now ACKs the last *applied* LSN, never the server's wal_end
  (was silently advancing the slot and risked losing un-applied changes
  across restarts).
- Reject `pg_sslmode` values other than `disable` at setup with an
  actionable error, instead of silently falling back to `NoTls` against
  TLS-enforced Postgres.
- `Value::Unchanged` (TOAST omitted) and `Value::Binary` now fail the
  stream instead of silently coercing to NULL — avoids accelerator
  data corruption on UPDATE.
- `parse_lsn` returns `Result` and errors on malformed input; LSN 0 is
  no longer a silent fallback.
- Guard the Commit path against publications that contain multiple
  tables (would mix relations into one ChangeBatch).
- `replication_lag_ms` now uses pgoutput's `commit_time_micros` (source
  commit time) instead of ingest time, so the metric matches its
  description and reflects end-to-end freshness.
- Reject `DataType::LargeUtf8` in both the bootstrap and WAL-change
  builders (was producing a Utf8 `StringArray` that fails struct
  schema validation).

Hygiene
- Prefer `acceleration.primary_key` over the source provider's
  `Constraints::PrimaryKey` when routing `ChangeBatch` PKs — matches
  what the accelerator write path uses.
- Precompute dataset→relation column mapping once per batch
  (O(rows × fields²) → O(rows × fields)).
- Replace `.expect("epoch")` with `unreachable!` in two places.
- Drop `SPICE_INSTANCE_ID` env-var mutation in a unit test (flaky under
  concurrent test runs).
- Fix mod.rs doc comment: "COPY-based snapshot" → row streaming.
- Clarify the at-least-once snapshot/WAL boundary and note exported-
  snapshot handshake as a follow-up.
- Trim unused deps on connector-postgres (arrow, twox-hash, tracing,
  paste, linkme).

Docs
- Update orphaned-slot query to use `inactive_since` (PG14+) instead of
  the non-existent `stats_reset`.
- Minimal spicepod example uses `pg_sslmode: disable` with a note about
  TLS being a follow-up.
- Limitations section now states TLS is rejected at startup rather than
  silently ignored.

Tests: 24 passing (+2 new LSN-parse rejection tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postgres-replication): resilient reconnect + address review feedback

Resilience
- New `resilience` module with `Backoff` (500ms → 30s with ±20% jitter),
  transient-error classifiers for both `pgwire_replication::PgWireError`
  and `tokio_postgres::Error` (SQLSTATE 08xxx connection-exception and
  57P0x admin-shutdown are retriable), and a generic `retry_async`
  wrapper.
- WAL client now wraps its recv loop in an outer reconnect loop:
  transient connect/recv failures drop the client, back off, and
  reconnect. Relation cache + in-flight txn buffer are reset per
  reconnect (Postgres re-sends Relation on demand). Only fatal errors
  (auth, slot missing, schema mismatch) surface to the caller.
- Initial connect attempt done upfront so startup misconfig surfaces
  immediately; transient failures on that first connect are demoted to
  a warning so the dataset comes up once Postgres is reachable.
- `setup_slot_and_publication` wraps its connect+catalog SQL in the
  same retry logic (2-minute budget). Idempotent SQL means no harm on
  retry.
- New `dataset_postgres_replication_reconnects_total` counter, auto-
  registered. Docs section explaining the resilience model + a new
  "Resilience" subsection under Operations.

Review feedback (4 threads)
- Bootstrap loop now precomputes the dataset-field → pg-column index
  map once from the first row's metadata (O(fields²) on first row,
  O(fields) per row thereafter).
- `replication_lag_bytes` unit: "by" → "By" (UCUM for bytes).
- Client overrides pgoutput `Relation.is_key` from the dataset's
  declared `acceleration.primary_key` before routing to the change
  builder, so REPLICA IDENTITY FULL (which flags every column as key)
  no longer explodes `ChangeBatch.primary_keys`.
- Slot-resume path now prepends a zero-row `is_dataset_ready=true`
  envelope with a no-op committer, so the dataset becomes ready
  without waiting for the first WAL change. Important for quiet
  sources on restart.

Tests: 28 passing (+4 new — backoff doubling/cap, transient
classifier, retry_async success and fatal paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): address another round of review feedback

- pgoutput.rs header comment: "Postgres 15+" → "Postgres 10+". pgoutput
  has existed since PG 10.
- config.rs: slot + publication names now truncate the sanitized
  dataset portion to keep the final identifier within Postgres' 63-byte
  NAMEDATALEN limit. Added slot_name_is_truncated_to_postgres_limit and
  publication_name_is_truncated_to_postgres_limit tests.
- connector-postgres/replication.rs: fail fast with an actionable
  StreamError when no primary key is available from either
  acceleration.primary_key or the source provider's constraints —
  refresh_mode: changes cannot route UPDATE/DELETE without one.
- bootstrap.rs: module-level comment no longer implies LSN-consistent
  snapshot semantics (it's at-least-once; the authoritative note is
  on `start_replication_stream`).
- bootstrap.rs: `metrics.mark_bootstrap_complete()` now runs *after*
  the REPEATABLE READ transaction's COMMIT succeeds, so
  `replication_bootstrap_complete=1` reflects the true durable
  bootstrap state (relevant when operators use the metric as a
  readiness signal).

Tests: 30 passing (+2 new — slot/publication truncation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postgres): fold replication into the postgres feature, finish TLS + TRUNCATE + LargeUtf8

Feature consolidation
- No separate `postgres-replication` feature flag. The WAL streaming path
  is now always compiled with the `postgres` feature on data_components
  and is always on in connector-postgres. Removes the feature entry in
  data_components, connector-postgres, and runtime; drops `postgres-
  replication` from the CI integration.yml FEATURES list.
- `data_components::postgres_replication` is now gated on `postgres`.

TLS (full implementation)
- New `SslMode` variants: `VerifyCa` and `VerifyFull`, matching libpq
  vocabulary. `SslMode::{requires_tls, verifies_certificate,
  verifies_hostname}` helpers.
- `pg_sslrootcert` parameter threaded through `ReplicationParams`.
- Setup + bootstrap connections now build a `postgres_native_tls::
  MakeTlsConnector` from the configured mode; `disable` still uses
  `NoTls`, everything else negotiates TLS with the appropriate
  verification level.
- WAL replication (pgwire-replication) wires our `SslMode` to the
  crate's `TlsConfig`: disabled / require / verify_ca / verify_full,
  with the optional CA path plumbed through.
- Minimal PEM certificate parser that splits a CA bundle into
  individual `native_tls::Certificate`s so users can point
  `pg_sslrootcert` at a multi-cert file.
- Removes the old `TlsNotSupported` fast-fail in setup.

TRUNCATE (full implementation)
- pgoutput `Truncate` is no longer just a warning. The client enqueues
  a synthetic `ChangeOp::Truncate` into the transaction buffer; the
  downstream `ChangeBatch` path already groups these into a
  `ChangeOperationType::Truncate` sub-batch.
- Runtime's refresh_task/changes.rs gains `process_truncate` which
  calls `DeletionTableProvider::delete_from(&[])` to unconditionally
  drop all rows in the accelerated table.
- Docs updated: `replication_truncates_total` metric now documented as
  "received and applied" rather than "received (skipped)"; Limitations
  section no longer flags TRUNCATE.

LargeUtf8 (full implementation)
- FieldBuilder (WAL path) and BootstrapBuilder (snapshot path) now
  handle `DataType::LargeUtf8` via `LargeStringBuilder` instead of
  erroring. No more "not yet supported" rejection.

Misc
- Docs minimal example uses `pg_sslmode: verify-full` with an optional
  `pg_sslrootcert` pointing at a CA PEM.
- Dropped the "No TLS" entry from the Limitations section.
- Workspace deps: added postgres-native-tls 0.5, native-tls 0.2.

30 unit tests pass. Clippy + fmt clean. Runtime tests compile with
`--features postgres,duckdb`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 5 review fixes + trunk merge

- process_truncate now passes `[lit(true)]` as the delete filter instead
  of `&[]`. Existing DeletionTableProvider impls (DuckDB/SQLite/Postgres)
  intentionally treat empty filters as a no-op to guard against
  accidental full-table deletes, so `TRUNCATE` would have been silently
  skipped. An always-true literal produces `DELETE FROM t WHERE TRUE`
  which is the correct behavior.
- pgoutput TRUNCATE now uses `relation_ids` from the message instead of
  picking the first cached relation. Errors clearly on empty or multi-
  relation truncates (this replication path requires exactly one table
  per publication). Also errors clearly if the referenced relation
  hasn't been cached yet.
- Metric description for `replication_truncates_total`: "currently
  skipped" → "applied to the accelerator".
- Docs troubleshooting table: "TRUNCATE not yet implemented"
  entry replaced with an informational entry describing the new queued-
  for-apply log line.
- Docs metric-naming sentence: `dataset_postgres_<metric>` →
  `dataset_postgres_replication_<metric>` to match actual series names.

Also includes the clean merge of origin/trunk (2 new commits: OTEL
exporter fix, candle/TEI fork pinning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postgres-replication): full pg↔arrow type coverage + nullability fix

Type coverage (matches datafusion-table-providers' postgres mapping)
- Timestamp(Nanosecond) for both timestamp and timestamptz — this was
  the critical gap; DTP's read_provider returns Nanosecond precision,
  so any real timestamp column would previously fail `FieldBuilder`
  construction with "unsupported Arrow data type".
- Binary (bytea) — parse pgoutput `\x...` hex text in WAL path, read
  `Vec<u8>` directly in bootstrap.
- Decimal128(precision, scale) for NUMERIC — parse text "123.45" to
  i128 with Arrow scale; rejects NaN/Infinity/overscale values with
  actionable errors.
- Int8 (pg CHAR), UInt32 (pg OID/XID), Time64(Nanosecond) (pg TIME).
- LargeUtf8 now fully handled (earlier hot-fix had it rejected).
- Arrays / Lists / Intervals rejected with explicit errors pointing
  the operator at a cast-to-scalar workaround.

Nullability fix (the DELETE regression)
- `build_change_batch` and bootstrap's `finish_batch` now construct the
  internal `data` StructArray with all fields *nullable*, regardless of
  the dataset schema's original nullability. This is required because
  REPLICA IDENTITY DEFAULT populates only PK columns in a DELETE event;
  non-PK columns are sent as null. Building the struct with the
  original nullable=false fields would fail StructArray::new validation
  and kill the stream. Downstream writes still respect the accelerator's
  nullability because `SchemaCastScanExec` re-casts to the strict schema
  on INSERT. DELETE doesn't touch non-PK columns anyway.

Comprehensive unit tests (49 passing, up from 30)
- INSERT / UPDATE / DELETE (both REPLICA IDENTITY DEFAULT and FULL) /
  TRUNCATE on both nullable and NOT NULL schemas.
- Mixed ops in a single transaction preserve order.
- Composite primary keys populate all PK entries correctly.
- Type coverage: Int8/16/32/64, UInt32, Float32/64, Boolean, Date32,
  Time64(ns), Timestamp(us/ns both naive and tz-aware), Binary(bytea),
  Decimal128(38, 2).
- `FieldBuilder::new` rejects Array / List / Interval with actionable
  messages.
- Numeric parser: standard cases, sign handling, NaN/Infinity rejection,
  overscale rejection.
- Hex decoder: round-trip, odd length, invalid digit rejection.
- Unchanged-TOAST during UPDATE errors clearly (regression guard).

Bootstrap + WAL paths share the same type coverage and nullability
semantics so ChangeBatches are structurally identical regardless of
which path produced them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 10 review fixes — scientific notation, LSN seeding, commit-before-ready, arrow deletes

Correctness
- `expand_scientific_notation` no longer goes through `f64` (which loses
  precision and can re-emit scientific notation that breaks the
  `combined.parse::<i128>()` downstream). Now a pure string-based
  expansion that preserves full Decimal128 range and emits structured
  errors for malformed exponents/significands.
- `confirmed_flush` atomic is now seeded with the slot's `consistent_lsn`
  after setup (previously pinned to 0 until the first Commit, which
  made KeepAlive ACKs and `replication_lag_bytes` report wildly inflated
  values at startup).
- Bootstrap now commits the REPEATABLE READ transaction BEFORE yielding
  the ready-signalling envelope. If COMMIT fails the runtime never sees
  `is_dataset_ready=true`, matching durable state.

Docs
- Arrow accelerator row in the engine table: `DELETE` and `TRUNCATE`
  are applied via Arrow's `DeletionTableProvider` (previously claimed
  they were ignored — they weren't). `UPDATE`s still degrade to inserts
  because Arrow lacks `on_conflict` semantics.
- Limitations section reconciled with the fix above — Arrow is
  append-only for updates, but deletes and truncates work.

Integration tests
- `replication.rs` now commits the bootstrap / UPDATE / DELETE envelopes
  (was `let (_committer, ...)` in all three). Exercises the LSN
  acknowledgement path the runtime actually uses.
- Added `#![allow(clippy::expect_used)]` at crate level so CI's
  `-D clippy::expect_used` lint doesn't fire on the test file.

49 tests pass. Runtime builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): resume from confirmed_flush_lsn + TLS/hostname hardening

- slot.rs: on resume, query pg_replication_slots.confirmed_flush_lsn and
  return it as consistent_lsn so the caller seeds its in-memory atomic
  (and metrics) from the server's durable checkpoint instead of the 0
  sentinel. Handles NULL catalog values by falling back to 0.
- config.rs: drop /etc/hostname blocking-I/O fallback; env-var only.
- config.rs: PEM parser returns TruncatedPem on unmatched BEGIN marker
  and callers return EmptyCaBundle if zero certs parse, instead of
  silently loading nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): warn when TLS cert verification is disabled

CodeQL flagged the native-tls danger_accept_invalid_* calls. The code is
correct by design — it implements the libpq sslmode contract, and each
dangerous flag is gated on an explicit user opt-in via pg_sslmode — but
we now emit a tracing::warn each time the connector is built in a
non-verifying mode, so operators see the security posture of their
deployment in the logs.

Production recommendation remains pg_sslmode=verify-full (no dangerous
flags set, chain + hostname both verified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): map sslmode=prefer to plaintext for WAL stream

Previously `SslMode::Prefer` mapped to `TlsConfig::require()` — this
silently strengthened our parsing default into "TLS required" and
broke WAL replication against non-TLS Postgres instances that the
regular connector happily talks to (the setup connection uses
tokio_postgres's real Prefer semantics).

pgwire-replication doesn't expose a safe "try TLS, fall back to
plaintext" path, so the only honest mappings are Disabled or Require.
Matching libpq's "don't block on missing TLS" intent and staying
symmetric with the setup path makes Disabled the right default
— operators who want TLS on replication set Require, VerifyCa, or
VerifyFull explicitly. A tracing::warn already fires when any
non-verifying mode is in effect.

Docs now document per-sslmode transport/verification behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): pass CI pedantic clippy

Resolves the `Rust Lint` CI failure. CI runs clippy with
-Dclippy::pedantic plus -Dclippy::unwrap_used / clone_on_ref_ptr /
allow_attributes; these weren't being caught by the looser local
clippy invocation used during earlier ticks. Changes:

- bootstrap.rs: drop unused `async`, drop the unused `_snapshot_name`
  pub field, replace `expect` with `let-else + unreachable`, use
  `Arc::clone(tz)` instead of `tz.clone()`, remove redundant param
  clone, fold the column-map lookup into the iteration path.
- changes.rs: `append` now takes `Option<&Value>` (ref_option fix);
  `Arc::clone(tz)` for TimestampMicros/Nanos finish; `.expect()` in
  tests (unwrap_used); `unwrap_err()` via `.expect_err` instead of
  `assert!(...is_err())` (assertions_on_result_states); `let-else`
  for FieldBuilder rejection tests (manual_let_else); doc backticks.
- client.rs: fold Begin/Commit/Other into one no-op arm;
  `u64::try_from` for pg_epoch_to_system_time (cast_sign_loss).
- config.rs: merge ParseCa/BuildConnector in `source()` (both
  native_tls::Error — match_same_arms); scoped `#[expect]` on the
  intentional u64→u32 truncation in xxh3_short_hash.
- mod.rs: `#[must_use]` on start_replication_stream;
  `stream_error(&Error)`; scoped `#[expect]` on err_to_stream's
  by-value signature (used as function pointer in map_err).
- resilience.rs: scoped `#[expect]` on the intentional bounded-
  millis casts in jitter().
- slot.rs: scoped `#[expect]` on the intentional u64→u32 truncation
  in format_lsn; drop redundant #[allow(dead_code)] (parse_lsn_or_zero
  is exercised by a test).
- connector-postgres lib.rs: rename `postgres_factory` to `factory`
  (struct_field_names).
- connector-postgres replication.rs: `map_or(true, ...)` →
  `is_none_or(...)`; `is_some_and(...)` for the false-default; method
  path over closure in PK mapping.
- runtime tests: `u16::try_from(port).expect(...)` (cast truncation);
  `r"..."` instead of `r#"..."#` where no quotes embedded; inlined
  format args; `expect_err` idioms.

No behavioral changes. Library test count unchanged (49 passing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 4 review fixes — publication race, async CA read, TRUNCATE comment, insta snapshot

- slot.rs: ensure_publication now routes CREATE/ALTER PUBLICATION
  through a new `ignore_duplicate_object` helper that treats SQLSTATE
  42710 as success. Two replicas racing into publication setup no
  longer fail on the loser.
- config.rs: native_tls_connector is now async and reads sslrootcert
  via tokio::fs::read instead of std::fs::read, so it doesn't block a
  Tokio reactor thread during setup/bootstrap. Callers (slot + bootstrap)
  await the connector; bootstrap moved the call inside the try_stream!
  block with explicit err_to_stream conversion.
- refresh_task/changes.rs: clarified the TRUNCATE comment to say
  *some* DeletionTableProvider impls (DuckDB in particular, with the
  no-op guard at data_components/src/duckdb.rs:137-140) treat empty
  filters as a no-op, which is why we pass `&[lit(true)]` for a uniform
  DELETE WHERE TRUE across engines.
- runtime/tests/postgres/replication_tpch.rs: replaced the
  `insta::assert_snapshot!` call with an inlined `assert_eq!` against
  a `concat!`-built expected string. The integration test no longer
  depends on a separate .snap artifact landing in the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 4 review fixes — tpch test regions, COPY wording, param description

- replication_tpch.rs: expected_regions table now matches the 3-row
  seed (AFRICA, AMERICA, ASIA). The previous copy included EUROPE and
  MIDDLE EAST rows that were never seeded; the assertion would have
  failed deterministically once CI ran the TPC-H integration path.
- bootstrap.rs: BOOTSTRAP_BATCH_SIZE doc updated from "COPY-streaming"
  to "streaming the initial snapshot query" — the path uses `SELECT *`
  with query_raw/portal, not Postgres COPY.
- connector-postgres/src/lib.rs: replication_initial_snapshot
  parameter description changed from "COPY the table's existing
  rows" to "take an initial snapshot of the table's existing rows"
  so the wording matches the actual REPEATABLE READ + SELECT * path.
- docs/features/postgres-replication.md: same wording fix in the
  parameter reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): enable SCRAM auth + correct tpch order count

Two CI integration-test failures on 36f823c:

1. SCRAM authentication — pgwire-replication was pulled in with
   default-features = false, which disables its `scram` feature. All
   postgres:16 containers require SCRAM-SHA-256 auth by default, so
   every replication connect attempt failed with
   `authentication error: SCRAM authentication required but 'scram'
   feature not enabled`. Enable the `scram` feature alongside
   `tls-rustls` in the workspace dep. Fixes:
   - postgres::replication::bootstrap_then_stream_changes
   - postgres::replication::two_replicas_have_independent_slots

2. tpch_postgres_replication_end_to_end: the assertion
   `SELECT count(*) FROM tpch_orders WHERE o_orderdate < '1995-01-01'`
   was compared against 3, but the seed data has 5 matching rows
   (ids 3, 5, 6, 9, 10 with dates in 1992–1994). Corrected the
   expected count to 5 with a comment documenting which seed rows
   qualify.

Local clippy (-Dwarnings -Dclippy::pedantic) clean on data_components
with the new feature flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): switch regions assertion back to insta snapshot

Per PR feedback, use the repo's standard Insta snapshot pattern for the
final region-table assertion instead of an inline assert_eq!. Committed
the generated .snap file so CI runs deterministically without needing
INSTA_UPDATE=1. Values match the 3-row seed (AFRICA, AMERICA, ASIA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 5 review fixes — on_conflict guard, doc wording, comment accuracy

- connector-postgres: `build_changes_stream` now fails fast when the
  engine supports upsert (DuckDB / SQLite / PostgreSQL / Cayenne) but
  the dataset has no `acceleration.on_conflict` entry with an
  `Upsert` variant. Arrow / PartitionedArrow are exempted — they're
  documented as append-only for UPDATEs. Error tells operators exactly
  what to add.
- docs/features/postgres-replication.md: dropped the misleading
  `/etc/hostname` row from the replica-identity sources table. The
  hostname fallback is env-var only (intentionally avoids blocking
  I/O) — matches the actual implementation.
- data_components/postgres_replication/metrics.rs: module doc now
  names the OTel observable pattern correctly:
  `dataset_postgres_<metric_spec_name>` with
  `dataset_postgres_replication_lag_ms` as a concrete example.
- data_components/postgres_replication/slot.rs: dropped the incorrect
  "PG 16+" framing from the `create_logical_slot` comment. The
  `(slot_name, lsn)` return shape of `pg_create_logical_replication_slot`
  has been stable for many versions; the real limitation is that the
  SQL-function path doesn't give us `EXPORT_SNAPSHOT` semantics, hence
  the REPEATABLE READ bootstrap.

Also updated the PR description so its "postgres-replication Cargo
feature" bullet and test-plan commands match the actual build surface
(replication is part of the existing `postgres` feature).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): add dataset-name hash to default slot/publication names

Two long dataset names that share the same first ~48 characters
previously truncated to the same default slot/publication name
because only the instance hash disambiguated the slot suffix. That
means the slot/publication on Postgres would alias across datasets on
the same replica.

Added a 6-char `xxh3` hash of the full dataset name between the
truncated prefix and the instance hash:
  slot: `spice_{truncated}_{ds-hash6}_{instance-hash8}`
  pub:  `spice_{truncated}_{ds-hash6}_pub`

Added `truncated_prefix_collisions_are_disambiguated` regression test
and updated the other naming tests for the new format.

Docs updated (feature doc + PR description) to reflect the new
default shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): port-fallback + NULL confirmed_flush_lsn handling

Two review fixes:

1. runtime/tests/postgres/common.rs — the new
   `start_postgres_docker_container_with_logical_wal` helper no longer
   uses `port.try_into().unwrap_or(15432)`. A port that doesn't fit in
   u16 now returns a clear anyhow error instead of silently binding on
   15432 (which would mask the misconfiguration as a confusing
   connection failure later).

2. data_components slot.rs — do_setup now explicitly distinguishes
   three catalog states for the named slot:
     - None              → no slot, create + bootstrap
     - Some(0)           → slot exists but NULL confirmed_flush_lsn
                           (crash between slot creation and first
                           standby status); warn and treat as
                           bootstrap-required so the accelerator isn't
                           silently left missing historical rows
     - Some(lsn) lsn>0   → normal resume
   Previously Some(0) was treated as a resume with created_fresh=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(postgres-replication): sync default slot/pub names to actual naming logic

The docs and parameter descriptions still referenced the old
`spice_<dataset>_<instance-hash>` shape after the dataset-hash
segment was added in 90013a2. Updated:

- docs/features/postgres-replication.md overview step, "Start the
  runtime" bullet list, `pg_replication_slot` row in the config
  reference, and the sample `pg_replication_slots` SQL output
- connector-postgres parameter descriptions for
  `replication_slot` and `publication`

Operators can now match the actual slot names they'll see in
`pg_replication_slots` when inspecting or decommissioning replicas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): 3 review fixes — on_conflict PK match, count-column strictness, commits in two-replica test

1. connector-postgres replication.rs: tightened the `has_upsert` check.
   Previously we accepted any `Upsert` entry in `acceleration.on_conflict`;
   a misconfigured target on a non-PK column would pass but still let
   UPDATE events append duplicate rows. Now we look up the declared
   `primary_key` directly in the `on_conflict` map and require an
   `Upsert(_)` there. The error message suggests the actual PK column
   instead of a generic placeholder.

2. runtime/tests/postgres/replication_tpch.rs: `wait_for_row_count`
   previously used `unwrap_or_default()` which swallowed schema/type
   regressions (empty batch, unexpected column type, negative count) as
   "still 0 rows" and turned them into a 60-second timeout. Each
   mis-shape now produces an explicit error with the sql + actual type
   in the message.

3. runtime/tests/postgres/replication.rs: `two_replicas_have_independent_slots`
   now calls `env.commit().await?` on both bootstrap and live envelopes
   for both replicas. Previously the test skipped commits, so
   confirmed_flush_lsn never advanced and slot state could drift —
   source of potential WAL-retention / cleanup flakiness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(postgres-replication): move upsert validation after PK resolution

Per review: the missing_upsert_error was computed from
`acceleration.primary_key`/`on_conflict` BEFORE `primary_keys` was
resolved. That meant the extract_primary_keys fallback was effectively
dead for engines that need upsert — a dataset with no declared
`acceleration.primary_key` but a source-table PK would always trip the
error with a useless `<pk>` hint.

Validation now runs inside the try_stream! block after `primary_keys`
is determined, and distinguishes two cases:

1. Declared PK set but `on_conflict` missing the matching Upsert
   entry — error names the PK and suggests the exact fix.
2. Declared PK unset (source-table fallback in play) — error
   explicitly says BOTH `acceleration.primary_key` AND a matching
   `on_conflict` are required, because the accelerator's write path
   consults only the declared config, not the source table's PK.

The error messages now contain a concrete PK name (either the declared
one or the one extracted from the source table) instead of a generic
placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pull pull Bot locked and limited conversation to collaborators Apr 19, 2026
@pull pull Bot added the ⤵️ pull label Apr 19, 2026
@pull pull Bot merged commit 3582d66 into TheRakeshPurohit:trunk Apr 19, 2026
1 of 11 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant