pg_lake: streaming writes for INSERT, UPDATE, DELETE, COPY, and iceberg metadata
Adds an opt-in streaming-write feature that pushes bulk-write bytes
directly to pgduck_server via libpq COPY-IN, instead of writing them
to a shared filesystem under $PGDATA/pgsql_tmp and asking pgduck to
read them back. Companion to the pgduck_server RECEIVE protocol.
This decouples pgduck_server's filesystem from postgres's: the two
processes can run on different machines, different containers,
different pods, with no shared mount required.
User-visible:
- New GUC `pg_lake_engine.streaming_writes` (default off). When
enabled, the bulk-write code paths route through the streaming
protocol; when off, behavior is identical to today.
- No SQL syntax changes. INSERT, UPDATE, DELETE, COPY FROM STDIN,
CREATE TABLE iceberg, and iceberg metadata uploads all switch
paths transparently based on the GUC.
What this patch adds, by component:
`pg_lake_engine`:
- `OpenCSVStreamWriter` / `FinishCSVStreamWriter` /
`CSVStreamWriterDestReceiver` — the client-side primitive that
opens a libpq COPY-IN to pgduck_server's RECEIVE sink, hands back
a postgres `DestReceiver` that callers drive with rows, and
finalizes the deferred query when the stream ends.
- `StreamLocalFileToS3` — uses the same RECEIVE protocol to stream
iceberg metadata files (metadata.json, manifest list, manifest)
into pgduck for upload. Replaces the file-based path that needed
pgduck to see those files locally.
- Cooperative wait via `WaitForResult` in the new primitives: a
`WaitLatchOrSocket` + `PQconsumeInput` + `CHECK_FOR_INTERRUPTS`
loop, so `statement_timeout` / SIGINT / postmaster-death actually
fire while the backend is waiting on pgduck. A plain blocking
`PQgetResult` call would not service those interrupts while it waits.
`pg_lake_table` (the FDW):
- `multi_data_file_dest.c` (the INSERT-side `MultiDataFileUploadDestReceiver`):
open a `CSVStreamWriter` instead of a CSV temp file when the GUC
is on. Each rotation opens a new stream; `FlushChildDestReceiver`
finalizes it via `FinishCSVStreamWriter`. The per-rotation writer
lives in the receiver's existing `childContext`, which gets reset
between rotations. The `partition` pointer on each
`DataFileModification` is deep-copied under `parentContext` before
that reset (NULL preserved for unpartitioned tables) so downstream
consumers in `ApplyDataFileModifications` — which runs at PRE_COMMIT
in a different memory-context lifetime — don't dereference into
freed memory.
- `pg_lake_table.c` (the UPDATE/DELETE callsites): the
`deleteStreamWriter` lazy-open in `DeleteSingleRow` opens a new
per-source-file stream for the position-delete records, and
`FinishForeignModify` calls `FinishCSVStreamWriter` to seal it.
The writer is allocated in a dedicated long-lived
`deleteStreamMemoryContext` (created at `create_foreign_modify`
time as a child of the FDW state context) so it survives the
executor's per-tuple resets. This is the same lifetime discipline
`multi_data_file_dest.c` already uses for its INSERT-side writer.
The "all rows deleted, drop the deletion file" optimization in
`WriteDeleteRecord` is gated to the file-based path; for the
streaming path, the writer keeps accumulating rows and the
resulting position-delete file covers all rows in the source —
semantically equivalent under iceberg's position-delete merge.
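
The equivalence claimed for the streaming path can be checked with a small model. This is a standalone sketch, not pg_lake code: the function name is hypothetical, and it assumes 0-based row positions within a single data file. A position-delete set that covers every row leaves zero visible rows, which is the same result a reader sees when the source file is dropped outright.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Count the rows of one data file that stay visible after applying a set of
 * position deletes. Duplicate and out-of-range positions are ignored.
 */
size_t
visible_rows_after_position_deletes(size_t row_count,
                                    const int64_t *deleted_positions,
                                    size_t delete_count)
{
    if (row_count == 0)
        return 0;

    /* One flag per row so a repeated position is only counted once. */
    bool *deleted = calloc(row_count, sizeof(bool));
    if (deleted == NULL)
        abort();

    size_t visible = row_count;
    for (size_t i = 0; i < delete_count; i++)
    {
        int64_t pos = deleted_positions[i];

        if (pos < 0 || (uint64_t) pos >= row_count)
            continue;
        if (!deleted[pos])
        {
            deleted[pos] = true;
            visible--;
        }
    }

    free(deleted);
    return visible;
}
```

With four rows and deletes at positions 0 through 3, the function returns 0, so the merged view is empty either way; the streaming path simply keeps the larger delete file rather than dropping the source file.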
`pg_lake_copy` (the COPY pushdown):
- `copy.c` / `copy_io.c`: `COPY foreign_table FROM STDIN` opens a
`CSVStreamWriter` and pumps the client's CopyData straight through
to pgduck. The non-pushdown path (when COPY can't be pushed down)
still uses the file-based code.
Memory-context discipline:
The streaming writers live across multiple rows and sometimes across
the executor's per-tuple ExprContext reset. Each writer is allocated
in a long-lived context dedicated to that writer's lifetime — for
the INSERT side, the existing `MultiDataFileUploadDestReceiver`
childContext; for the DELETE side, a new
`deleteStreamMemoryContext` on the modify state. This mirrors how
the file-based path already explicitly allocates its
`CreateCSVDestReceiver` "in a long-lived memory context" (per the
comment in create_foreign_modify).
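
The lifetime rule, including the `partition` deep copy made before the child context is reset, can be modeled with a toy allocator. The sketch below is standalone C and does not use PostgreSQL's `MemoryContext` API; `Arena`, `Partition`, its fields, and `copy_partition` are hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy bump allocator standing in for a memory context. */
typedef struct Arena
{
    _Alignas(max_align_t) unsigned char buf[1024];
    size_t used;
} Arena;

void *
arena_alloc(Arena *a, size_t size)
{
    /* Round up so every allocation stays suitably aligned. */
    size_t align = _Alignof(max_align_t);
    size_t rounded = (size + align - 1) / align * align;

    if (a->used + rounded > sizeof(a->buf))
        abort();                /* out of arena space */

    void *p = a->buf + a->used;
    a->used += rounded;
    return p;
}

void
arena_reset(Arena *a)
{
    /* Poison the bytes so a stale pointer reads visibly wrong data. */
    memset(a->buf, 0x7F, sizeof(a->buf));
    a->used = 0;
}

/* Hypothetical partition descriptor with one owned string. */
typedef struct Partition
{
    int   spec_id;
    char *key;
} Partition;

/* Deep-copy src into dst; NULL (unpartitioned) is preserved as NULL. */
Partition *
copy_partition(Arena *dst, const Partition *src)
{
    if (src == NULL)
        return NULL;

    Partition *copy = arena_alloc(dst, sizeof(Partition));
    size_t len = strlen(src->key) + 1;

    copy->spec_id = src->spec_id;
    copy->key = arena_alloc(dst, len);
    memcpy(copy->key, src->key, len);
    return copy;
}
```

Copying into the parent arena before calling `arena_reset` on the child is what keeps the descriptor readable later; a shallow pointer copy would leave `key` pointing into poisoned memory.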
Compatibility:
- Default-off GUC: existing deployments are untouched. No behavior
change unless `pg_lake_engine.streaming_writes = on`.
- The file-based code paths are unchanged. No regressions on
shared-filesystem topologies.
- No SQL surface changes. No changes to data layout or iceberg
catalog representation.
Testing:
Verified end-to-end against a CDC-shape workload (200 mixed
INSERT/UPDATE/DELETE events with COMMIT per event) on a local kind
cluster and on a GKE deployment with pgduck_server running in a
separate pod from postgres. Ten back-to-back 600s soak runs with full
diagnostic instrumentation across the entire writer lifetime
detected zero correctness issues.
Signed-off-by: Tim McLaughlin <tim@gotab.io>