Clamp or error on oversize values written to Iceberg per out_of_range_values #371

Draft
sfc-gh-mslot wants to merge 1 commit into main from marcoslot/clamp-large-values

@sfc-gh-mslot commented May 27, 2026

Summary

  • Three opt-in GUCs (pg_lake_engine.iceberg_max_string_bytes, pg_lake_engine.iceberg_max_binary_bytes, pg_lake_engine.iceberg_max_nested_type_bytes; default 0 = disabled) bound the byte size of values written to Iceberg tables.
  • Behavior on an oversize value follows the table's out_of_range_values option — same contract as the existing temporal / numeric OOR clamp:
    • 'error' (default): raise an error identifying the column / type / byte size / exceeded GUC.
    • 'clamp': silently fix up the value (truncate text/bytea, NULL jsonb/json, NULL the whole container for arrays/composites/maps).
  • Both the per-tuple FDW write path and the SQL pushdown wrapper (around the SELECT used by INSERT..SELECT, COPY FROM, and postgres_scan) honor the policy and produce identical output under 'clamp'.
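
The contract in the bullets above can be modeled in a few lines. This is a minimal sketch for a bytea value only, not the PR's C implementation: the function name, the exception class, and the exact wording of the message are assumptions made for illustration.

```python
class OversizeValueError(ValueError):
    """Stand-in for the error raised under out_of_range_values = 'error'."""


def apply_binary_limit(value: bytes, limit: int, policy: str,
                       column: str, guc: str) -> bytes:
    """Apply a byte limit to a bytea value according to the table's policy."""
    # A limit of 0 means the GUC is disabled, so nothing is checked.
    if limit == 0 or len(value) <= limit:
        return value
    if policy == "error":
        # Name the column, type, byte size, and exceeded GUC (wording is illustrative).
        raise OversizeValueError(
            f'value for column "{column}" of type bytea is {len(value)} bytes, '
            f"which exceeds {guc} ({limit})"
        )
    if policy == "clamp":
        # bytea is byte-truncated; other types need their own fix-up rule.
        return value[:limit]
    raise ValueError(f"unknown out_of_range_values policy: {policy!r}")
```

Keeping the "disabled" and "fits" cases in one early return means the policy is only consulted for values that actually exceed the limit.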

Motivation

Some downstream consumers of Iceberg tables impose per-column byte caps smaller than what PostgreSQL values in the source can carry. For example, on Snowflake the column-type byte ceilings are:

  • STRING / VARCHAR: 16 MiB default, up to 128 MiB when declared with an explicit larger length.
  • BINARY: 8 MiB default, up to 64 MiB.
  • OBJECT / ARRAY / VARIANT: 128 MiB.

Without a guard, rows containing a value larger than the target column's cap are written successfully and only fail later, as opaque "value too long" errors when the consumer ingests them. Operators should set each GUC to the actual ceiling of the destination column.
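
As a worked example, the default Snowflake ceilings listed above convert to byte counts as follows. This is a sketch that assumes the GUCs take a plain integer number of bytes (as the `_bytes` suffix suggests) and can be changed with `SET`; neither detail is confirmed by this description.

```python
MIB = 1024 * 1024

# Snowflake default ceilings quoted above, converted to bytes.
SNOWFLAKE_DEFAULT_CAPS = {
    "pg_lake_engine.iceberg_max_string_bytes": 16 * MIB,
    "pg_lake_engine.iceberg_max_binary_bytes": 8 * MIB,
    "pg_lake_engine.iceberg_max_nested_type_bytes": 128 * MIB,
}


def guc_statements(caps):
    """Render one SET statement per GUC (assumes integer byte values are accepted)."""
    return [f"SET {name} = {value};" for name, value in caps.items()]


for stmt in guc_statements(SNOWFLAKE_DEFAULT_CAPS):
    print(stmt)
```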

Behavior

| PG type | Action under 'clamp' |
| --- | --- |
| text / varchar / bpchar | truncate at UTF-8 character boundary |
| bytea | byte-truncate |
| jsonb / json | replace with NULL (truncating would yield invalid JSON) |
| array / composite / map | NULL the whole container if its measured byte size exceeds the nested-type limit |
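
Truncating text at a UTF-8 character boundary means backing up from the byte limit until the cut no longer falls inside a multi-byte sequence. The helper below is an illustrative sketch of that technique, not the PR's code; the name `truncate_utf8` is hypothetical and `max_bytes` is assumed to be non-negative.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    cut = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the byte at the cut is
    # one, the cut would split a character, so back up to that character's start.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")
```

For example, "héllo" is six bytes in UTF-8, and a limit of two bytes yields "h" rather than a dangling first byte of "é".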

Under 'error' (default), the same conditions raise instead of fixing up the value. Error messages identify the column, the source type, the actual byte size, the exceeded GUC, and a hint pointing at out_of_range_values = 'clamp'.

For jsonb / json, the limit applies to the text-serialized form, since that is what the consumer sees. For arrays and composites, the size is approximated by the PostgreSQL varlena content size instead of serializing the value: Snowflake stores semi-structured data without per-record headers, so the varlena size is a close-enough proxy for the JSON length the consumer will see.
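
The jsonb rule can be sketched as "serialize, measure, then keep or drop". The snippet below is an approximation under stated assumptions: it uses Python's `json.dumps` as a stand-in for PostgreSQL's jsonb text output, and the function name `clamp_json` is hypothetical.

```python
import json
from typing import Any, Optional


def clamp_json(value: Any, max_bytes: int) -> Optional[Any]:
    """Return value unchanged, or None (SQL NULL) if its serialized form is too large."""
    # Measure the serialized text, since that is what the consumer receives.
    # ensure_ascii=False keeps non-ASCII characters unescaped, so they count
    # as their real UTF-8 byte length.
    serialized = json.dumps(value, ensure_ascii=False)
    if len(serialized.encode("utf-8")) > max_bytes:
        return None  # truncating would leave invalid JSON
    return value
```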

Implementation

  • Per-tuple FDW path (pg_lake_table.c): IcebergSizeCheckOrClampSlotInPlace dispatches on the table's outOfRangePolicy and passes the column name into IcebergSizeClampDatum so the error message can name it.
  • SQL pushdown path (iceberg_query_validation.c): IcebergWrapQueryWithSizeClampChecks takes the policy and emits either the existing clamp UDFs (iceberg_size_clamp_text / _blob, iceberg_byte_size-driven CASE for nested) or new check UDFs (iceberg_size_check_text / _blob, error() inside CASE for jsonb / nested) which raise on oversize.
  • DuckDB UDFs (duckdb_pglake/src/duckdb_pglake_extension.cpp): two new iceberg_size_check_* scalar functions throw InvalidInputException with the column name and exceeded GUC.
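
To make the pushdown path concrete, the sketch below builds the projection expression that a wrapper might emit for one column. The UDF names come from the list above and `error()` is DuckDB's built-in, but the argument lists, the error text, and the helper name `wrap_column` are assumptions for illustration and are not taken from the PR.

```python
def _sql_literal(text: str) -> str:
    """Quote text as a SQL string literal."""
    return "'" + text.replace("'", "''") + "'"


def wrap_column(column: str, kind: str, limit: int, policy: str, guc: str) -> str:
    """Build the wrapped expression for one column.

    kind is 'text', 'blob', or 'nested'; policy is 'clamp' or 'error'.
    All UDF argument lists here are hypothetical.
    """
    quoted = '"' + column.replace('"', '""') + '"'
    if limit == 0:
        return quoted  # GUC disabled: pass the column through untouched
    if kind in ("text", "blob"):
        if policy == "clamp":
            return f"iceberg_size_clamp_{kind}({quoted}, {limit})"
        return (f"iceberg_size_check_{kind}({quoted}, {limit}, "
                f"{_sql_literal(column)}, {_sql_literal(guc)})")
    # Nested types: a CASE driven by iceberg_byte_size, yielding NULL or error().
    if policy == "clamp":
        on_oversize = "NULL"
    else:
        on_oversize = f"error({_sql_literal(f'column {column} exceeds {guc}')})"
    return (f"CASE WHEN iceberg_byte_size({quoted}) > {limit} "
            f"THEN {on_oversize} ELSE {quoted} END")
```

Generating both variants from the same function is one way to keep the 'clamp' and 'error' wrappers structurally identical, so only the oversize branch differs.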

Test plan

  • pg_lake_table/tests/pytests/test_iceberg_size_clamping.py — 13 cases covering both modes (10 clamp + 3 error-default), both paths (per-tuple FDW + SQL pushdown), and all clampable types (text/varchar/bpchar/bytea/jsonb/json/array/composite/int[]/composite-of-bigint).
  • Existing regressions: test_special_numeric.py and test_iceberg_validation.py still pass — the new code lives alongside IcebergErrorOrClampDatum and re-uses its FDW slot-level entry point.
  • CI confirmation.

🤖 Generated with Claude Code

@sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch from 608e7d2 to a8ab083 on May 27, 2026 15:30
@sfc-gh-mslot changed the title from "pg_lake: clamp string and binary values to per-column byte limits on Iceberg writes" to "Clamp string and binary values to per-column byte limits on Iceberg writes" on May 27, 2026
@sfc-gh-mslot marked this pull request as draft on May 27, 2026 16:07
@sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch 7 times, most recently from 8756635 to 65794de, on May 28, 2026 13:20
Some downstream consumers of Iceberg tables impose per-column byte
caps smaller than what PostgreSQL values in the source can carry.
For example, on Snowflake the column-type byte ceilings are:

  - STRING / VARCHAR : 16 MiB default, up to 128 MiB when declared
                       with an explicit larger length.
  - BINARY           : 8 MiB default, up to 64 MiB.
  - OBJECT / ARRAY /
    VARIANT          : 128 MiB.

This adds three opt-in GUCs (default 0 = disabled) that bound text /
binary / nested-type values written to Iceberg:

  pg_lake_engine.iceberg_max_string_bytes
  pg_lake_engine.iceberg_max_binary_bytes
  pg_lake_engine.iceberg_max_nested_type_bytes

The behavior on an oversize value follows the table's
out_of_range_values option (same contract as temporal / numeric OOR):

  out_of_range_values = 'error' (default for Iceberg tables):
    Raise an error identifying the column, type, byte size, and
    exceeded GUC.  Operators see the violation immediately rather
    than discovering silent data loss downstream.

  out_of_range_values = 'clamp':
    text / varchar / bpchar : truncated at a UTF-8 character boundary
    bytea                   : byte-truncated
    jsonb / json            : NULL when the text-serialized form
                              exceeds the limit (truncating would
                              yield invalid JSON)
    array / composite / map : NULL the whole container when its
                              measured byte size exceeds the
                              nested-type cap

Both the per-tuple FDW write path
(IcebergSizeCheckOrClampSlotInPlace) and the SQL pushdown path
(IcebergWrapQueryWithSizeClampChecks, around the SELECT used by
INSERT..SELECT, COPY FROM, and snowflake_cdc snapshot via
postgres_scan) honor the policy and produce identical clamped output
under 'clamp'.

duckdb_pglake adds two new check UDFs (iceberg_size_check_text /
_blob) used by the SQL wrapper under 'error'.  The existing
iceberg_size_clamp_text / _blob and iceberg_byte_size UDFs cover
the 'clamp' path; jsonb and nested 'error' cases use DuckDB's
built-in error() inside CASE WHEN.

Tests cover both modes via 13 cases in
pg_lake_table/tests/pytests/test_iceberg_size_clamping.py.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
@sfc-gh-mslot force-pushed the marcoslot/clamp-large-values branch from 65794de to c1b9b98 on May 29, 2026 21:35
@sfc-gh-mslot changed the title from "Clamp string and binary values to per-column byte limits on Iceberg writes" to "Clamp or error on oversize values written to Iceberg per out_of_range_values" on May 29, 2026