[pull] trunk from spiceai:trunk#753
Merged
Merged
Conversation
* feat(postgres): stream WAL directly to Spice accelerators
Adds a native logical-replication path so a Postgres dataset with
`acceleration.refresh_mode: changes` streams INSERT/UPDATE/DELETE from
WAL into the local accelerator (DuckDB, SQLite, Postgres, Cayenne,
Arrow) without Debezium or Kafka.
Implementation
- New `data_components::postgres_replication` module with pgoutput
decoder, slot/publication setup, REPEATABLE READ snapshot bootstrap,
and an `LsnCommitter` that acknowledges durable LSN back to Postgres
so WAL can be reclaimed.
- `connector-postgres` now implements `supports_changes_stream` /
`changes_stream` behind a new `postgres-replication` Cargo feature
(on by default). Adds `pg_replication_slot`, `pg_publication`,
`pg_replication_initial_snapshot`, `pg_replication_temporary_slot`,
and `pg_replication_status_interval` params.
- Each Spice replica auto-gets a distinct slot name
(`spice_<dataset>_<hash(SPICE_INSTANCE_ID|hostname)>`), so multi-
replica deployments just work; publication is shared.
- Slot's server-side `confirmed_flush_lsn` is the durable checkpoint:
restarts resume without re-bootstrapping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(postgres): expose WAL replication metrics via MetricsProvider
Emits 14 OpenTelemetry observables per replicated dataset under
`dataset_postgres_replication_*`, matching the pattern used by MySQL,
Databricks, and DynamoDB:
Freshness
- replication_lag_ms (gauge, auto)
- replication_lag_bytes (gauge, auto)
- replication_confirmed_flush_lsn (gauge)
- replication_server_wal_end_lsn (gauge)
Throughput
- replication_transactions_total (counter, auto)
- replication_inserts_total (counter, auto)
- replication_updates_total (counter, auto)
- replication_deletes_total (counter, auto)
- replication_truncates_total (counter)
Bootstrap
- replication_bootstrap_rows_total (counter)
- replication_bootstrap_complete (gauge, auto)
Errors
- replication_decode_errors_total (counter, auto)
- replication_schema_mismatch_errors_total (counter, auto)
- replication_recv_errors_total (counter, auto)
Implementation mirrors the DynamoDB Streams pattern: a
MetricsCollector (AtomicU64 counters + RwLock<SystemTime> watermark)
is shared between the replication stream and the connector's
MetricsProvider impl. Docs updated with a full Metrics section
including Prometheus alert/throughput queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(postgres): end-to-end TPC-H replication integration test
New `crates/runtime/tests/postgres/replication_tpch.rs` exercises the
full stack: Postgres docker container (wal_level=logical) →
connector-postgres replication → DuckDB accelerator → SQL queries
through Runtime::datafusion().
Covers a TPC-H-shaped subset (region, nation, customer, orders) with
int4, int8, text, float8, and date columns. The test asserts:
- Initial snapshot bootstrap populates all four accelerated tables.
- INSERT/UPDATE/DELETE propagate through WAL to the accelerator.
- Date and float types round-trip accurately.
- Joins across the replicated tables return consistent results.
- A final pretty-printed region snapshot (insta) guards the row shape.
CI: add `postgres-replication` to the FEATURES list in
.github/workflows/integration.yml so nextest picks it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): address PR review feedback
Correctness
- KeepAlive now ACKs the last *applied* LSN, never the server's wal_end
(was silently advancing the slot and risked losing un-applied changes
across restarts).
- Reject `pg_sslmode` values other than `disable` at setup with an
actionable error, instead of silently falling back to `NoTls` against
TLS-enforced Postgres.
- `Value::Unchanged` (TOAST omitted) and `Value::Binary` now fail the
stream instead of silently coercing to NULL — avoids accelerator
data corruption on UPDATE.
- `parse_lsn` returns `Result` and errors on malformed input; LSN 0 is
no longer a silent fallback.
- Guard the Commit path against publications that contain multiple
tables (would mix relations into one ChangeBatch).
- `replication_lag_ms` now uses pgoutput's `commit_time_micros` (source
commit time) instead of ingest time, so the metric matches its
description and reflects end-to-end freshness.
- Reject `DataType::LargeUtf8` in both the bootstrap and WAL-change
builders (was producing a Utf8 `StringArray` that fails struct
schema validation).
Hygiene
- Prefer `acceleration.primary_key` over the source provider's
`Constraints::PrimaryKey` when routing `ChangeBatch` PKs — matches
what the accelerator write path uses.
- Precompute dataset→relation column mapping once per batch
(O(rows × fields²) → O(rows × fields)).
- Replace `.expect("epoch")` with `unreachable!` in two places.
- Drop `SPICE_INSTANCE_ID` env-var mutation in a unit test (flaky under
concurrent test runs).
- Fix mod.rs doc comment: "COPY-based snapshot" → row streaming.
- Clarify the at-least-once snapshot/WAL boundary and note exported-
snapshot handshake as a follow-up.
- Trim unused deps on connector-postgres (arrow, twox-hash, tracing,
paste, linkme).
Docs
- Update orphaned-slot query to use `inactive_since` (PG14+) instead of
the non-existent `stats_reset`.
- Minimal spicepod example uses `pg_sslmode: disable` with a note about
TLS being a follow-up.
- Limitations section now states TLS is rejected at startup rather than
silently ignored.
Tests: 24 passing (+2 new LSN-parse rejection tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(postgres-replication): resilient reconnect + address review feedback
Resilience
- New `resilience` module with `Backoff` (500ms → 30s with ±20% jitter),
transient-error classifiers for both `pgwire_replication::PgWireError`
and `tokio_postgres::Error` (SQLSTATE 08xxx connection-exception and
57P0x admin-shutdown are retriable), and a generic `retry_async`
wrapper.
- WAL client now wraps its recv loop in an outer reconnect loop:
transient connect/recv failures drop the client, back off, and
reconnect. Relation cache + in-flight txn buffer are reset per
reconnect (Postgres re-sends Relation on demand). Only fatal errors
(auth, slot missing, schema mismatch) surface to the caller.
- Initial connect attempt done upfront so startup misconfig surfaces
immediately; transient failures on that first connect are demoted to
a warning so the dataset comes up once Postgres is reachable.
- `setup_slot_and_publication` wraps its connect+catalog SQL in the
same retry logic (2-minute budget). Idempotent SQL means no harm on
retry.
- New `dataset_postgres_replication_reconnects_total` counter, auto-
registered. Docs section explaining the resilience model + a new
"Resilience" subsection under Operations.
Review feedback (4 threads)
- Bootstrap loop now precomputes the dataset-field → pg-column index
map once from the first row's metadata (O(fields²) on first row,
O(fields) per row thereafter).
- `replication_lag_bytes` unit: "by" → "By" (UCUM for bytes).
- Client overrides pgoutput `Relation.is_key` from the dataset's
declared `acceleration.primary_key` before routing to the change
builder, so REPLICA IDENTITY FULL (which flags every column as key)
no longer explodes `ChangeBatch.primary_keys`.
- Slot-resume path now prepends a zero-row `is_dataset_ready=true`
envelope with a no-op committer, so the dataset becomes ready
without waiting for the first WAL change. Important for quiet
sources on restart.
Tests: 28 passing (+4 new — backoff doubling/cap, transient
classifier, retry_async success and fatal paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): address another round of review feedback
- pgoutput.rs header comment: "Postgres 15+" → "Postgres 10+". pgoutput
has existed since PG 10.
- config.rs: slot + publication names now truncate the sanitized
dataset portion to keep the final identifier within Postgres' 63-byte
NAMEDATALEN limit. Added slot_name_is_truncated_to_postgres_limit and
publication_name_is_truncated_to_postgres_limit tests.
- connector-postgres/replication.rs: fail fast with an actionable
StreamError when no primary key is available from either
acceleration.primary_key or the source provider's constraints —
refresh_mode: changes cannot route UPDATE/DELETE without one.
- bootstrap.rs: module-level comment no longer implies LSN-consistent
snapshot semantics (it's at-least-once; the authoritative note is
on `start_replication_stream`).
- bootstrap.rs: `metrics.mark_bootstrap_complete()` now runs *after*
the REPEATABLE READ transaction's COMMIT succeeds, so
`replication_bootstrap_complete=1` reflects the true durable
bootstrap state (relevant when operators use the metric as a
readiness signal).
Tests: 30 passing (+2 new — slot/publication truncation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(postgres): fold replication into the postgres feature, finish TLS + TRUNCATE + LargeUtf8
Feature consolidation
- No separate `postgres-replication` feature flag. The WAL streaming path
is now always compiled with the `postgres` feature on data_components
and is always on in connector-postgres. Removes the feature entry in
data_components, connector-postgres, and runtime; drops `postgres-
replication` from the CI integration.yml FEATURES list.
- `data_components::postgres_replication` is now gated on `postgres`.
TLS (full implementation)
- New `SslMode` variants: `VerifyCa` and `VerifyFull`, matching libpq
vocabulary. `SslMode::{requires_tls, verifies_certificate,
verifies_hostname}` helpers.
- `pg_sslrootcert` parameter threaded through `ReplicationParams`.
- Setup + bootstrap connections now build a `postgres_native_tls::
MakeTlsConnector` from the configured mode; `disable` still uses
`NoTls`, everything else negotiates TLS with the appropriate
verification level.
- WAL replication (pgwire-replication) wires our `SslMode` to the
crate's `TlsConfig`: disabled / require / verify_ca / verify_full,
with the optional CA path plumbed through.
- Minimal PEM certificate parser that splits a CA bundle into
individual `native_tls::Certificate`s so users can point
`pg_sslrootcert` at a multi-cert file.
- Removes the old `TlsNotSupported` fast-fail in setup.
TRUNCATE (full implementation)
- pgoutput `Truncate` is no longer just a warning. The client enqueues
a synthetic `ChangeOp::Truncate` into the transaction buffer; the
downstream `ChangeBatch` path already groups these into a
`ChangeOperationType::Truncate` sub-batch.
- Runtime's refresh_task/changes.rs gains `process_truncate` which
calls `DeletionTableProvider::delete_from(&[])` to unconditionally
drop all rows in the accelerated table.
- Docs updated: `replication_truncates_total` metric now documented as
"received and applied" rather than "received (skipped)"; Limitations
section no longer flags TRUNCATE.
LargeUtf8 (full implementation)
- FieldBuilder (WAL path) and BootstrapBuilder (snapshot path) now
handle `DataType::LargeUtf8` via `LargeStringBuilder` instead of
erroring. No more "not yet supported" rejection.
Misc
- Docs minimal example uses `pg_sslmode: verify-full` with an optional
`pg_sslrootcert` pointing at a CA PEM.
- Dropped the "No TLS" entry from the Limitations section.
- Workspace deps: added postgres-native-tls 0.5, native-tls 0.2.
30 unit tests pass. Clippy + fmt clean. Runtime tests compile with
`--features postgres,duckdb`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 5 review fixes + trunk merge
- process_truncate now passes `[lit(true)]` as the delete filter instead
of `&[]`. Existing DeletionTableProvider impls (DuckDB/SQLite/Postgres)
intentionally treat empty filters as a no-op to guard against
accidental full-table deletes, so `TRUNCATE` would have been silently
skipped. An always-true literal produces `DELETE FROM t WHERE TRUE`
which is the correct behavior.
- pgoutput TRUNCATE now uses `relation_ids` from the message instead of
picking the first cached relation. Errors clearly on empty or multi-
relation truncates (this replication path requires exactly one table
per publication). Also errors clearly if the referenced relation
hasn't been cached yet.
- Metric description for `replication_truncates_total`: "currently
skipped" → "applied to the accelerator".
- Docs troubleshooting table: "TRUNCATE not yet implemented"
entry replaced with an informational entry describing the new queued-
for-apply log line.
- Docs metric-naming sentence: `dataset_postgres_<metric>` →
`dataset_postgres_replication_<metric>` to match actual series names.
Also includes the clean merge of origin/trunk (2 new commits: OTEL
exporter fix, candle/TEI fork pinning).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(postgres-replication): full pg↔arrow type coverage + nullability fix
Type coverage (matches datafusion-table-providers' postgres mapping)
- Timestamp(Nanosecond) for both timestamp and timestamptz — this was
the critical gap; DTP's read_provider returns Nanosecond precision,
so any real timestamp column would previously fail `FieldBuilder`
construction with "unsupported Arrow data type".
- Binary (bytea) — parse pgoutput `\x...` hex text in WAL path, read
`Vec<u8>` directly in bootstrap.
- Decimal128(precision, scale) for NUMERIC — parse text "123.45" to
i128 with Arrow scale; rejects NaN/Infinity/overscale values with
actionable errors.
- Int8 (pg CHAR), UInt32 (pg OID/XID), Time64(Nanosecond) (pg TIME).
- LargeUtf8 now fully handled (earlier hot-fix had it rejected).
- Arrays / Lists / Intervals rejected with explicit errors pointing
the operator at a cast-to-scalar workaround.
Nullability fix (the DELETE regression)
- `build_change_batch` and bootstrap's `finish_batch` now construct the
internal `data` StructArray with all fields *nullable*, regardless of
the dataset schema's original nullability. This is required because
REPLICA IDENTITY DEFAULT populates only PK columns in a DELETE event;
non-PK columns are sent as null. Building the struct with the
original nullable=false fields would fail StructArray::new validation
and kill the stream. Downstream writes still respect the accelerator's
nullability because `SchemaCastScanExec` re-casts to the strict schema
on INSERT. DELETE doesn't touch non-PK columns anyway.
Comprehensive unit tests (49 passing, up from 30)
- INSERT / UPDATE / DELETE (both REPLICA IDENTITY DEFAULT and FULL) /
TRUNCATE on both nullable and NOT NULL schemas.
- Mixed ops in a single transaction preserve order.
- Composite primary keys populate all PK entries correctly.
- Type coverage: Int8/16/32/64, UInt32, Float32/64, Boolean, Date32,
Time64(ns), Timestamp(us/ns both naive and tz-aware), Binary(bytea),
Decimal128(38, 2).
- `FieldBuilder::new` rejects Array / List / Interval with actionable
messages.
- Numeric parser: standard cases, sign handling, NaN/Infinity rejection,
overscale rejection.
- Hex decoder: round-trip, odd length, invalid digit rejection.
- Unchanged-TOAST during UPDATE errors clearly (regression guard).
Bootstrap + WAL paths share the same type coverage and nullability
semantics so ChangeBatches are structurally identical regardless of
which path produced them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 10 review fixes — scientific notation, LSN seeding, commit-before-ready, arrow deletes
Correctness
- `expand_scientific_notation` no longer goes through `f64` (which loses
precision and can re-emit scientific notation that breaks the
`combined.parse::<i128>()` downstream). Now a pure string-based
expansion that preserves full Decimal128 range and emits structured
errors for malformed exponents/significands.
- `confirmed_flush` atomic is now seeded with the slot's `consistent_lsn`
after setup (previously pinned to 0 until the first Commit, which
made KeepAlive ACKs and `replication_lag_bytes` report wildly inflated
values at startup).
- Bootstrap now commits the REPEATABLE READ transaction BEFORE yielding
the ready-signalling envelope. If COMMIT fails the runtime never sees
`is_dataset_ready=true`, matching durable state.
Docs
- Arrow accelerator row in the engine table: `DELETE` and `TRUNCATE`
are applied via Arrow's `DeletionTableProvider` (previously claimed
they were ignored — they weren't). `UPDATE`s still degrade to inserts
because Arrow lacks `on_conflict` semantics.
- Limitations section reconciled with the fix above — Arrow is
append-only for updates, but deletes and truncates work.
Integration tests
- `replication.rs` now commits the bootstrap / UPDATE / DELETE envelopes
(was `let (_committer, ...)` in all three). Exercises the LSN
acknowledgement path the runtime actually uses.
- Added `#![allow(clippy::expect_used)]` at crate level so CI's
`-D clippy::expect_used` lint doesn't fire on the test file.
49 tests pass. Runtime builds clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): resume from confirmed_flush_lsn + TLS/hostname hardening
- slot.rs: on resume, query pg_replication_slots.confirmed_flush_lsn and
return it as consistent_lsn so the caller seeds its in-memory atomic
(and metrics) from the server's durable checkpoint instead of the 0
sentinel. Handles NULL catalog values by falling back to 0.
- config.rs: drop /etc/hostname blocking-I/O fallback; env-var only.
- config.rs: PEM parser returns TruncatedPem on unmatched BEGIN marker
and callers return EmptyCaBundle if zero certs parse, instead of
silently loading nothing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): warn when TLS cert verification is disabled
CodeQL flagged the native-tls danger_accept_invalid_* calls. The code is
correct by design — it implements the libpq sslmode contract, and each
dangerous flag is gated on an explicit user opt-in via pg_sslmode — but
we now emit a tracing::warn each time the connector is built in a
non-verifying mode, so operators see the security posture of their
deployment in the logs.
Production recommendation remains pg_sslmode=verify-full (no dangerous
flags set, chain + hostname both verified).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): map sslmode=prefer to plaintext for WAL stream
Previously `SslMode::Prefer` mapped to `TlsConfig::require()` — this
silently strengthened our parsing default into "TLS required" and
broke WAL replication against non-TLS Postgres instances that the
regular connector happily talks to (the setup connection uses
tokio_postgres's real Prefer semantics).
pgwire-replication doesn't expose a safe "try TLS, fall back to
plaintext" path, so the only honest mappings are Disabled or Require.
Matching libpq's "don't block on missing TLS" intent and staying
symmetric with the setup path makes Disabled the right default
— operators who want TLS on replication set Require, VerifyCa, or
VerifyFull explicitly. A tracing::warn already fires when any
non-verifying mode is in effect.
Docs now document per-sslmode transport/verification behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): pass CI pedantic clippy
Resolves the `Rust Lint` CI failure. CI runs clippy with
-Dclippy::pedantic plus -Dclippy::unwrap_used / clone_on_ref_ptr /
allow_attributes; these weren't being caught by the looser local
clippy invocation used during earlier ticks. Changes:
- bootstrap.rs: drop unused `async`, drop the unused `_snapshot_name`
pub field, replace `expect` with `let-else + unreachable`, use
`Arc::clone(tz)` instead of `tz.clone()`, remove redundant param
clone, fold the column-map lookup into the iteration path.
- changes.rs: `append` now takes `Option<&Value>` (ref_option fix);
`Arc::clone(tz)` for TimestampMicros/Nanos finish; `.expect()` in
tests (unwrap_used); `unwrap_err()` via `.expect_err` instead of
`assert!(...is_err())` (assertions_on_result_states); `let-else`
for FieldBuilder rejection tests (manual_let_else); doc backticks.
- client.rs: fold Begin/Commit/Other into one no-op arm;
`u64::try_from` for pg_epoch_to_system_time (cast_sign_loss).
- config.rs: merge ParseCa/BuildConnector in `source()` (both
native_tls::Error — match_same_arms); scoped `#[expect]` on the
intentional u64→u32 truncation in xxh3_short_hash.
- mod.rs: `#[must_use]` on start_replication_stream;
`stream_error(&Error)`; scoped `#[expect]` on err_to_stream's
by-value signature (used as function pointer in map_err).
- resilience.rs: scoped `#[expect]` on the intentional bounded-
millis casts in jitter().
- slot.rs: scoped `#[expect]` on the intentional u64→u32 truncation
in format_lsn; drop redundant #[allow(dead_code)] (parse_lsn_or_zero
is exercised by a test).
- connector-postgres lib.rs: rename `postgres_factory` to `factory`
(struct_field_names).
- connector-postgres replication.rs: `map_or(true, ...)` →
`is_none_or(...)`; `is_some_and(...)` for the false-default; method
path over closure in PK mapping.
- runtime tests: `u16::try_from(port).expect(...)` (cast truncation);
`r"..."` instead of `r#"..."#` where no quotes embedded; inlined
format args; `expect_err` idioms.
No behavioral changes. Library test count unchanged (49 passing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 4 review fixes — publication race, async CA read, TRUNCATE comment, insta snapshot
- slot.rs: ensure_publication now routes CREATE/ALTER PUBLICATION
through a new `ignore_duplicate_object` helper that treats SQLSTATE
42710 as success. Two replicas racing into publication setup no
longer fail on the loser.
- config.rs: native_tls_connector is now async and reads sslrootcert
via tokio::fs::read instead of std::fs::read, so it doesn't block a
Tokio reactor thread during setup/bootstrap. Callers (slot + bootstrap)
await the connector; bootstrap moved the call inside the try_stream!
block with explicit err_to_stream conversion.
- refresh_task/changes.rs: clarified the TRUNCATE comment to say
*some* DeletionTableProvider impls (DuckDB in particular, with the
no-op guard at data_components/src/duckdb.rs:137-140) treat empty
filters as a no-op, which is why we pass `&[lit(true)]` for a uniform
DELETE WHERE TRUE across engines.
- runtime/tests/postgres/replication_tpch.rs: replaced the
`insta::assert_snapshot!` call with an inlined `assert_eq!` against
a `concat!`-built expected string. The integration test no longer
depends on a separate .snap artifact landing in the PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 4 review fixes — tpch test regions, COPY wording, param description
- replication_tpch.rs: expected_regions table now matches the 3-row
seed (AFRICA, AMERICA, ASIA). The previous copy included EUROPE and
MIDDLE EAST rows that were never seeded; the assertion would have
failed deterministically once CI ran the TPC-H integration path.
- bootstrap.rs: BOOTSTRAP_BATCH_SIZE doc updated from "COPY-streaming"
to "streaming the initial snapshot query" — the path uses `SELECT *`
with query_raw/portal, not Postgres COPY.
- connector-postgres/src/lib.rs: replication_initial_snapshot
parameter description changed from "COPY the table's existing
rows" to "take an initial snapshot of the table's existing rows"
so the wording matches the actual REPEATABLE READ + SELECT * path.
- docs/features/postgres-replication.md: same wording fix in the
parameter reference table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): enable SCRAM auth + correct tpch order count
Two CI integration-test failures on 36f823c:
1. SCRAM authentication — pgwire-replication was pulled in with
default-features = false, which disables its `scram` feature. All
postgres:16 containers require SCRAM-SHA-256 auth by default, so
every replication connect attempt failed with
`authentication error: SCRAM authentication required but 'scram'
feature not enabled`. Enable the `scram` feature alongside
`tls-rustls` in the workspace dep. Fixes:
- postgres::replication::bootstrap_then_stream_changes
- postgres::replication::two_replicas_have_independent_slots
2. tpch_postgres_replication_end_to_end: the assertion
`SELECT count(*) FROM tpch_orders WHERE o_orderdate < '1995-01-01'`
was compared against 3, but the seed data has 5 matching rows
(ids 3, 5, 6, 9, 10 with dates in 1992–1994). Corrected the
expected count to 5 with a comment documenting which seed rows
qualify.
Local clippy (-Dwarnings -Dclippy::pedantic) clean on data_components
with the new feature flag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): switch regions assertion back to insta snapshot
Per PR feedback, use the repo's standard Insta snapshot pattern for the
final region-table assertion instead of an inline assert_eq!. Committed
the generated .snap file so CI runs deterministically without needing
INSTA_UPDATE=1. Values match the 3-row seed (AFRICA, AMERICA, ASIA).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 5 review fixes — on_conflict guard, doc wording, comment accuracy
- connector-postgres: `build_changes_stream` now fails fast when the
engine supports upsert (DuckDB / SQLite / PostgreSQL / Cayenne) but
the dataset has no `acceleration.on_conflict` entry with an
`Upsert` variant. Arrow / PartitionedArrow are exempted — they're
documented as append-only for UPDATEs. Error tells operators exactly
what to add.
- docs/features/postgres-replication.md: dropped the misleading
`/etc/hostname` row from the replica-identity sources table. The
hostname fallback is env-var only (intentionally avoids blocking
I/O) — matches the actual implementation.
- data_components/postgres_replication/metrics.rs: module doc now
names the OTel observable pattern correctly:
`dataset_postgres_<metric_spec_name>` with
`dataset_postgres_replication_lag_ms` as a concrete example.
- data_components/postgres_replication/slot.rs: dropped the incorrect
"PG 16+" framing from the `create_logical_slot` comment. The
`(slot_name, lsn)` return shape of `pg_create_logical_replication_slot`
has been stable for many versions; the real limitation is that the
SQL-function path doesn't give us `EXPORT_SNAPSHOT` semantics, hence
the REPEATABLE READ bootstrap.
Also updated the PR description so its "postgres-replication Cargo
feature" bullet and test-plan commands match the actual build surface
(replication is part of the existing `postgres` feature).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): add dataset-name hash to default slot/publication names
Two long dataset names that share the same first ~48 characters
previously truncated to the same default slot/publication name
because only the instance hash disambiguated the slot suffix. That
means the slot/publication on Postgres would alias across datasets on
the same replica.
Added a 6-char `xxh3` hash of the full dataset name between the
truncated prefix and the instance hash:
slot: `spice_{truncated}_{ds-hash6}_{instance-hash8}`
pub: `spice_{truncated}_{ds-hash6}_pub`
Added `truncated_prefix_collisions_are_disambiguated` regression test
and updated the other naming tests for the new format.
Docs updated (feature doc + PR description) to reflect the new
default shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): port-fallback + NULL confirmed_flush_lsn handling
Two review fixes:
1. runtime/tests/postgres/common.rs — the new
`start_postgres_docker_container_with_logical_wal` helper no longer
uses `port.try_into().unwrap_or(15432)`. A port that doesn't fit in
u16 now returns a clear anyhow error instead of silently binding on
15432 (which would mask the misconfiguration as a confusing
connection failure later).
2. data_components slot.rs — do_setup now explicitly distinguishes
three catalog states for the named slot:
- None → no slot, create + bootstrap
- Some(0) → slot exists but NULL confirmed_flush_lsn
(crash between slot creation and first
standby status); warn and treat as
bootstrap-required so the accelerator isn't
silently left missing historical rows
- Some(lsn) lsn>0 → normal resume
Previously Some(0) was treated as a resume with created_fresh=false.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(postgres-replication): sync default slot/pub names to actual naming logic
The docs and parameter descriptions still referenced the old
`spice_<dataset>_<instance-hash>` shape after the dataset-hash
segment was added in 90013a2. Updated:
- docs/features/postgres-replication.md overview step, "Start the
runtime" bullet list, `pg_replication_slot` row in the config
reference, and the sample `pg_replication_slots` SQL output
- connector-postgres parameter descriptions for
`replication_slot` and `publication`
Operators can now match the actual slot names they'll see in
`pg_replication_slots` when inspecting or decommissioning replicas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): 3 review fixes — on_conflict PK match, count-column strictness, commits in two-replica test
1. connector-postgres replication.rs: tightened the `has_upsert` check.
Previously we accepted any `Upsert` entry in `acceleration.on_conflict`;
a misconfigured target on a non-PK column would pass but still let
UPDATE events append duplicate rows. Now we look up the declared
`primary_key` directly in the `on_conflict` map and require an
`Upsert(_)` there. The error message suggests the actual PK column
instead of a generic placeholder.
2. runtime/tests/postgres/replication_tpch.rs: `wait_for_row_count`
previously used `unwrap_or_default()` which swallowed schema/type
regressions (empty batch, unexpected column type, negative count) as
"still 0 rows" and turned them into a 60-second timeout. Each
mis-shape now produces an explicit error with the sql + actual type
in the message.
3. runtime/tests/postgres/replication.rs: `two_replicas_have_independent_slots`
now calls `env.commit().await?` on both bootstrap and live envelopes
for both replicas. Previously the test skipped commits, so
confirmed_flush_lsn never advanced and slot state could drift —
source of potential WAL-retention / cleanup flakiness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(postgres-replication): move upsert validation after PK resolution
Per review: the missing_upsert_error was computed from
`acceleration.primary_key`/`on_conflict` BEFORE `primary_keys` was
resolved. That meant the extract_primary_keys fallback was effectively
dead for engines that need upsert — a dataset with no declared
`acceleration.primary_key` but a source-table PK would always trip the
error with a useless `<pk>` hint.
Validation now runs inside the try_stream! block after `primary_keys`
is determined, and distinguishes two cases:
1. Declared PK set but `on_conflict` missing the matching Upsert
entry — error names the PK and suggests the exact fix.
2. Declared PK unset (source-table fallback in play) — error
explicitly says BOTH `acceleration.primary_key` AND a matching
`on_conflict` are required, because the accelerator's write path
consults only the declared config, not the source table's PK.
The error messages now contain a concrete PK name (either the declared
one or the one extracted from the source table) instead of a generic
placeholder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )