Clamp or error on oversize values written to Iceberg per out_of_range_values by sfc-gh-mslot · Pull Request #371 · Snowflake-Labs/pg_lake

sfc-gh-mslot · 2026-05-27T14:59:38Z

Summary

Three opt-in GUCs (pg_lake_engine.iceberg_max_string_bytes, iceberg_max_binary_bytes, iceberg_max_nested_type_bytes; default 0 = disabled) bound the byte size of values written to Iceberg tables.
Behavior on an oversize value follows the table's out_of_range_values option — same contract as the existing temporal / numeric OOR clamp:
- 'error' (default): raise an error identifying the column / type / byte size / exceeded GUC.
- 'clamp': silently fix up the value (truncate text/bytea, NULL jsonb/json, NULL the whole container for arrays/composites/maps).
Both the per-tuple FDW write path and the SQL pushdown wrapper (around the SELECT used by INSERT..SELECT, COPY FROM, and postgres_scan) honor the policy and produce identical output under 'clamp'.

Motivation

Some downstream consumers of Iceberg tables impose per-column byte caps smaller than what PostgreSQL values in the source can carry. For example, on Snowflake the column-type byte ceilings are:

STRING / VARCHAR: 16 MiB default, up to 128 MiB when declared with an explicit larger length.
BINARY: 8 MiB default, up to 64 MiB.
OBJECT / ARRAY / VARIANT: 128 MiB.

Without a guard, rows whose individual values exceed the target column's cap reach the consumer and surface as opaque "value too long" errors when the consumer ingests them. Operators set the GUCs to whatever the destination column's actual ceiling is.

Behavior

| PG type | Action under `'clamp'` |
| --- | --- |
| `text` / `varchar` / `bpchar` | truncate at UTF-8 character boundary |
| `bytea` | byte-truncate |
| `jsonb` / `json` | replace with NULL (truncating would yield invalid JSON) |
| array / composite / map | NULL the whole container if its measured byte size exceeds the nested-type limit |
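
To illustrate what "truncate at UTF-8 character boundary" means for text values, here is a minimal Python sketch. It is not the code in this PR (the real implementation lives in C and in DuckDB UDFs); the function name `truncate_utf8` is hypothetical, and treating a limit of 0 as "disabled" mirrors the GUC default described above.

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Truncate value so its UTF-8 encoding fits in max_bytes
    without splitting a multi-byte character."""
    data = value.encode("utf-8")
    if max_bytes <= 0 or len(data) <= max_bytes:
        # 0 (or less) is treated as "no limit", like the disabled GUC default.
        return value
    cut = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the byte
    # just past the cut is a continuation byte, the cut would split a
    # character, so back up to the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


if __name__ == "__main__":
    print(truncate_utf8("héllo", 2))  # "é" needs 2 bytes and does not fit after "h"
    print(truncate_utf8("héllo", 3))
```

A plain byte slice (`data[:max_bytes]`) is enough for `bytea`, since binary values have no character boundaries to respect.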

Under 'error' (default), the same conditions raise instead of fixing up the value. Error messages identify the column, the source type, the actual byte size, the exceeded GUC, and a hint pointing at out_of_range_values = 'clamp'.
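
The error-versus-clamp contract can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the PR's C code: `check_or_clamp_bytes` and `OversizeValueError` are hypothetical names, and the message wording is invented for illustration. Only the fields it carries (column, type, byte size, exceeded GUC, hint) come from the description above.

```python
class OversizeValueError(ValueError):
    """Hypothetical error type standing in for the error raised under 'error'."""


def check_or_clamp_bytes(value: bytes, max_bytes: int, policy: str,
                         column: str, guc: str) -> bytes:
    """Apply the out_of_range_values policy to a bytea-like value."""
    if max_bytes <= 0 or len(value) <= max_bytes:
        # Limit disabled (0) or value within bounds: pass through unchanged.
        return value
    if policy == "clamp":
        # bytea is byte-truncated under 'clamp'.
        return value[:max_bytes]
    if policy == "error":
        # Illustrative wording only; the real message text is not shown in the PR.
        raise OversizeValueError(
            f'value in column "{column}" (type bytea) is {len(value)} bytes, '
            f"which exceeds {guc} = {max_bytes}. "
            "Hint: set out_of_range_values = 'clamp' to truncate instead."
        )
    raise ValueError(f"unknown out_of_range_values policy: {policy!r}")


if __name__ == "__main__":
    guc = "pg_lake_engine.iceberg_max_binary_bytes"
    print(check_or_clamp_bytes(b"abcdef", 4, "clamp", "payload", guc))
    try:
        check_or_clamp_bytes(b"abcdef", 4, "error", "payload", guc)
    except OversizeValueError as exc:
        print(exc)
```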

For jsonb, the limit applies to the text-serialized form (what the consumer sees). For arrays/composites the cap is approximated by the varlena content size — Snowflake stores semi-structured data without per-record headers, so on-disk size is a close-enough proxy for the JSON length the consumer will see.
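
As a rough illustration of measuring JSON against its text-serialized form, the Python sketch below serializes the value, counts UTF-8 bytes, and returns `None` (standing in for SQL NULL) when the limit is exceeded. The name `clamp_json` is hypothetical, and `json.dumps` is only an approximation of how PostgreSQL renders `jsonb` as text, so the byte counts may differ from what the real code measures.

```python
import json
from typing import Any, Optional


def json_text_size(value: Any) -> int:
    """Byte length of the value's text-serialized form, encoded as UTF-8."""
    # ensure_ascii=False keeps non-ASCII characters as UTF-8 instead of \u escapes.
    return len(json.dumps(value, ensure_ascii=False).encode("utf-8"))


def clamp_json(value: Any, max_bytes: int) -> Optional[Any]:
    """Return the value unchanged, or None when its serialized form is too large.

    Truncating the serialized text would produce invalid JSON, so the
    'clamp' behavior replaces the whole value instead.
    """
    if max_bytes > 0 and json_text_size(value) > max_bytes:
        return None
    return value


if __name__ == "__main__":
    doc = {"a": 1}
    print(json_text_size(doc))   # size of '{"a": 1}'
    print(clamp_json(doc, 7))    # too large for a 7-byte limit
    print(clamp_json(doc, 8))
```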

Implementation

Per-tuple FDW path (pg_lake_table.c): IcebergSizeCheckOrClampSlotInPlace dispatches on the table's outOfRangePolicy and passes the column name into IcebergSizeClampDatum so the error message can name it.
SQL pushdown path (iceberg_query_validation.c): IcebergWrapQueryWithSizeClampChecks takes the policy and emits either the existing clamp UDFs (iceberg_size_clamp_text / _blob, iceberg_byte_size-driven CASE for nested) or new check UDFs (iceberg_size_check_text / _blob, error() inside CASE for jsonb / nested) which raise on oversize.
DuckDB UDFs (duckdb_pglake/src/duckdb_pglake_extension.cpp): two new iceberg_size_check_* scalar functions throw InvalidInputException with the column name and exceeded GUC.
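
The shape of the SQL pushdown wrapper can be sketched as a small Python string builder. The UDF names (`iceberg_size_clamp_text`, `iceberg_size_check_text`, `iceberg_byte_size`) and the use of DuckDB's built-in `error()` come from the description above, but their argument lists are not shown in this PR text. The signatures, the `src` alias, the error message, and the helpers `wrap_column` / `wrap_query` are all assumptions made for illustration.

```python
from typing import List, Tuple


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(text: str) -> str:
    """Single-quote an SQL string literal, escaping embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def wrap_column(name: str, kind: str, limit: int, policy: str, guc: str) -> str:
    """Build one SELECT-list entry; kind is 'text', 'blob', or 'nested'."""
    if policy not in ("clamp", "error"):
        raise ValueError(f"unknown policy: {policy!r}")
    col = quote_ident(name)
    if limit <= 0:
        return col  # GUC disabled: pass the column through untouched
    if kind in ("text", "blob"):
        if policy == "clamp":
            # Assumed signature: (value, limit)
            return f"iceberg_size_clamp_{kind}({col}, {limit}) AS {col}"
        # Assumed signature: (value, limit, column name, GUC name)
        return (f"iceberg_size_check_{kind}({col}, {limit}, "
                f"{quote_literal(name)}, {quote_literal(guc)}) AS {col}")
    if kind == "nested":
        if policy == "clamp":
            oversize = "NULL"
        else:
            oversize = "error(" + quote_literal(f"column {name} exceeds {guc}") + ")"
        return (f"CASE WHEN iceberg_byte_size({col}) > {limit} "
                f"THEN {oversize} ELSE {col} END AS {col}")
    raise ValueError(f"unknown kind: {kind!r}")


def wrap_query(inner_sql: str, columns: List[Tuple[str, str, int, str]],
               policy: str) -> str:
    """Wrap inner_sql so every listed column is checked or clamped."""
    select_list = ", ".join(
        wrap_column(name, kind, limit, policy, guc)
        for name, kind, limit, guc in columns
    )
    return f"SELECT {select_list} FROM ({inner_sql}) AS src"


if __name__ == "__main__":
    cols = [
        ("body", "text", 100, "pg_lake_engine.iceberg_max_string_bytes"),
        ("tags", "nested", 50, "pg_lake_engine.iceberg_max_nested_type_bytes"),
    ]
    print(wrap_query("SELECT body, tags FROM staging", cols, "clamp"))
    print(wrap_query("SELECT body, tags FROM staging", cols, "error"))
```

Keeping the policy as a parameter of the wrapper, rather than of each UDF, matches the description: the clamp UDFs stay unchanged and the `'error'` mode only swaps which function or CASE branch is emitted.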

Test plan

pg_lake_table/tests/pytests/test_iceberg_size_clamping.py — 13 cases covering both modes (10 clamp + 3 error-default), both paths (per-tuple FDW + SQL pushdown), and all clampable types (text/varchar/bpchar/bytea/jsonb/json/array/composite/int[]/composite-of-bigint).
Existing regressions: test_special_numeric.py and test_iceberg_validation.py still pass — the new code lives alongside IcebergErrorOrClampDatum and re-uses its FDW slot-level entry point.
CI confirmation.

🤖 Generated with Claude Code

Some downstream consumers of Iceberg tables impose per-column byte caps smaller than what PostgreSQL values in the source can carry. For example, on Snowflake the column-type byte ceilings are:

- STRING / VARCHAR: 16 MiB default, up to 128 MiB when declared with an explicit larger length.
- BINARY: 8 MiB default, up to 64 MiB.
- OBJECT / ARRAY / VARIANT: 128 MiB.

This adds three opt-in GUCs (default 0 = disabled) that bound text / binary / nested-type values written to Iceberg:

- pg_lake_engine.iceberg_max_string_bytes
- pg_lake_engine.iceberg_max_binary_bytes
- pg_lake_engine.iceberg_max_nested_type_bytes

The behavior on an oversize value follows the table's out_of_range_values option (same contract as temporal / numeric OOR):

- out_of_range_values = 'error' (default for Iceberg tables): Raise an error identifying the column, type, byte size, and exceeded GUC. Operators see the violation immediately rather than discovering silent data loss downstream.
- out_of_range_values = 'clamp':
  - text / varchar / bpchar: truncated at a UTF-8 character boundary
  - bytea: byte-truncated
  - jsonb / json: NULL when the text-serialized form exceeds the limit (truncating would yield invalid JSON)
  - array / composite / map: NULL the whole container when its measured byte size exceeds the nested-type cap

Both the per-tuple FDW write path (IcebergSizeCheckOrClampSlotInPlace) and the SQL pushdown path (IcebergWrapQueryWithSizeClampChecks, around the SELECT used by INSERT..SELECT, COPY FROM, and snowflake_cdc snapshot via postgres_scan) honor the policy and produce identical clamped output under 'clamp'.

duckdb_pglake adds two new check UDFs (iceberg_size_check_text / _blob) used by the SQL wrapper under 'error'. The existing iceberg_size_clamp_text / _blob and iceberg_byte_size UDFs cover the 'clamp' path; jsonb and nested 'error' cases use DuckDB's built-in error() inside CASE WHEN.

Tests cover both modes via 13 cases in pg_lake_table/tests/pytests/test_iceberg_size_clamping.py.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>

sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch from 608e7d2 to a8ab083 Compare May 27, 2026 15:30

sfc-gh-mslot changed the title ~~pg_lake: clamp string and binary values to per-column byte limits on Iceberg writes~~ Clamp string and binary values to per-column byte limits on Iceberg writes May 27, 2026

sfc-gh-mslot marked this pull request as draft May 27, 2026 16:07

sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch 7 times, most recently from 8756635 to 65794de Compare May 28, 2026 13:20

sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch from 65794de to c1b9b98 Compare May 29, 2026 21:35

sfc-gh-mslot changed the title ~~Clamp string and binary values to per-column byte limits on Iceberg writes~~ Clamp or error on oversize values written to Iceberg per out_of_range_values May 29, 2026

sfc-gh-mslot wants to merge 1 commit into main from marcoslot/clamp-large-values