Clamp or error on oversize values written to Iceberg per out_of_range_values #371
Draft
sfc-gh-mslot wants to merge 1 commit into
Some downstream consumers of Iceberg tables impose per-column byte
caps smaller than what PostgreSQL values in the source can carry.
For example, on Snowflake the column-type byte ceilings are:
- STRING / VARCHAR         : 16 MiB default, up to 128 MiB when declared
                             with an explicit larger length.
- BINARY                   : 8 MiB default, up to 64 MiB.
- OBJECT / ARRAY / VARIANT : 128 MiB.
This adds three opt-in GUCs (default 0 = disabled) that bound text /
binary / nested-type values written to Iceberg:
  pg_lake_engine.iceberg_max_string_bytes
  pg_lake_engine.iceberg_max_binary_bytes
  pg_lake_engine.iceberg_max_nested_type_bytes
The behavior on an oversize value follows the table's
out_of_range_values option (same contract as temporal / numeric
out-of-range (OOR) handling):
out_of_range_values = 'error' (default for Iceberg tables):
  Raise an error identifying the column, type, byte size, and
  exceeded GUC. Operators see the violation immediately rather
  than discovering silent data loss downstream.
out_of_range_values = 'clamp':
  text / varchar / bpchar : truncated at a UTF-8 character boundary
  bytea                   : byte-truncated
  jsonb / json            : NULL when the text-serialized form
                            exceeds the limit (truncating would
                            yield invalid JSON)
  array / composite / map : NULL the whole container when its
                            measured byte size exceeds the
                            nested-type cap
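The rules above can be illustrated with a minimal Python sketch. It is
not the code in this PR (which lives in C and in the DuckDB extension);
the names truncate_utf8, apply_size_policy, and OversizeValueError are
hypothetical, and the error text only approximates the fields the real
message is described as carrying. Nested types are left out because
their size is measured on the PostgreSQL side.

```python
from typing import Optional, Union

Value = Union[str, bytes]


class OversizeValueError(ValueError):
    """Hypothetical stand-in for the error raised under 'error'."""


def truncate_utf8(value: str, max_bytes: int) -> str:
    """Truncate so the UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # The byte prefix is valid UTF-8 except for a possibly incomplete
    # trailing sequence; errors="ignore" drops exactly that tail.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def apply_size_policy(
    value: Optional[Value],
    kind: str,             # "text", "binary", or "json" (assumed labels)
    max_bytes: int,        # 0 means the limit is disabled
    policy: str = "error",
    column: str = "col",
) -> Optional[Value]:
    """Apply the 'error' / 'clamp' contract to a single value."""
    if value is None or max_bytes == 0:
        return value
    # For "json", value is assumed to be the text-serialized form.
    size = len(value) if isinstance(value, bytes) else len(value.encode("utf-8"))
    if size <= max_bytes:
        return value
    if policy == "error":
        raise OversizeValueError(
            f"column {column!r} ({kind}): {size} bytes exceeds limit of {max_bytes}"
        )
    if policy != "clamp":
        raise ValueError(f"unknown policy {policy!r}")
    if kind == "text":
        return truncate_utf8(value, max_bytes)
    if kind == "binary":
        return value[:max_bytes]
    if kind == "json":
        return None  # truncating would yield invalid JSON
    raise ValueError(f"unknown kind {kind!r}")
```

For example, truncate_utf8("héllo", 2) returns "h": the two-byte "é"
would straddle the limit, so it is dropped whole rather than split.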
Both the per-tuple FDW write path
(IcebergSizeCheckOrClampSlotInPlace) and the SQL pushdown path
(IcebergWrapQueryWithSizeClampChecks, around the SELECT used by
INSERT..SELECT, COPY FROM, and snowflake_cdc snapshot via
postgres_scan) honor the policy and produce identical clamped output
under 'clamp'.
duckdb_pglake adds two new check UDFs (iceberg_size_check_text /
_blob) used by the SQL wrapper under 'error'. The existing
iceberg_size_clamp_text / _blob and iceberg_byte_size UDFs cover
the 'clamp' path; jsonb and nested 'error' cases use DuckDB's
built-in error() inside CASE WHEN.
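As a rough illustration of what the SQL wrapper might project per
column, here is a Python sketch that builds the expression text. The
UDF names come from this PR, but their argument lists are assumptions
made for the example, as are the helper name wrap_column, the kind
labels, and the error text. Identifier quoting is ignored, and json
and nested columns are treated alike for brevity.

```python
def wrap_column(column: str, kind: str, limit: int, policy: str, guc: str) -> str:
    """Return a SQL expression guarding one column (illustrative only)."""
    if limit == 0:
        # A limit of 0 disables the check, so the column passes through.
        return column

    suffix = {"text": "text", "binary": "blob"}.get(kind)
    if suffix is not None:
        if policy == "clamp":
            # Assumed signature: (value, limit)
            return f"iceberg_size_clamp_{suffix}({column}, {limit})"
        # Assumed signature: (value, limit, column name, GUC name)
        return f"iceberg_size_check_{suffix}({column}, {limit}, '{column}', '{guc}')"

    # json / nested: a CASE driven by iceberg_byte_size, yielding NULL
    # under 'clamp' and DuckDB's error() under 'error'.
    if policy == "clamp":
        on_oversize = "NULL"
    else:
        on_oversize = f"error('value in column {column} exceeds {guc}')"
    return (
        f"CASE WHEN iceberg_byte_size({column}) > {limit} "
        f"THEN {on_oversize} ELSE {column} END"
    )
```

The point of the sketch is the shape of the dispatch: text and binary
go through dedicated UDFs in both modes, while nested values need only
a size probe plus either NULL or error().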
Tests cover both modes via 13 cases in
pg_lake_table/tests/pytests/test_iceberg_size_clamping.py.
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Summary
- Three opt-in GUCs (pg_lake_engine.iceberg_max_string_bytes, iceberg_max_binary_bytes, iceberg_max_nested_type_bytes; default 0 = disabled) bound the byte size of values written to Iceberg tables.
- The behavior on an oversize value follows the table's out_of_range_values option, the same contract as the existing temporal / numeric OOR clamp:
  - 'error' (default): raise an error identifying the column / type / byte size / exceeded GUC.
  - 'clamp': silently fix up the value (truncate text/bytea, NULL jsonb/json, NULL the whole container for arrays/composites/maps).
- Both the per-tuple FDW write path and the SQL pushdown path (around the SELECT used by INSERT..SELECT, COPY FROM, and postgres_scan) honor the policy and produce identical output under 'clamp'.
Motivation
Some downstream consumers of Iceberg tables impose per-column byte caps smaller than what PostgreSQL values in the source can carry. For example, on Snowflake the column-type byte ceilings are:
- STRING / VARCHAR: 16 MiB default, up to 128 MiB when declared with an explicit larger length.
- BINARY: 8 MiB default, up to 64 MiB.
- OBJECT / ARRAY / VARIANT: 128 MiB.
Without a guard, rows whose individual values exceed the target column's cap reach the consumer and surface as opaque "value too long" errors when the consumer ingests them. Operators set the GUCs to whatever the destination column's actual ceiling is.
Behavior
Under 'clamp':
  text / varchar / bpchar : truncated at a UTF-8 character boundary
  bytea                   : byte-truncated
  jsonb / json            : NULL when the text-serialized form exceeds the limit (truncating would yield invalid JSON)
  array / composite / map : NULL the whole container when its measured byte size exceeds the nested-type cap
Under 'error' (default), the same conditions raise instead of fixing up the value. Error messages identify the column, the source type, the actual byte size, the exceeded GUC, and a hint pointing at out_of_range_values = 'clamp'.
For jsonb, the limit applies to the text-serialized form (what the consumer sees). For arrays/composites the cap is approximated by the varlena content size. Snowflake stores semi-structured data without per-record headers, so on-disk size is a close-enough proxy for the JSON length the consumer will see.
Implementation
- Per-tuple FDW write path (pg_lake_table.c): IcebergSizeCheckOrClampSlotInPlace dispatches on the table's outOfRangePolicy and passes the column name into IcebergSizeClampDatum so the error message can name it.
- SQL pushdown path (iceberg_query_validation.c): IcebergWrapQueryWithSizeClampChecks takes the policy and emits either the existing clamp UDFs (iceberg_size_clamp_text / _blob, iceberg_byte_size-driven CASE for nested) or the new check UDFs (iceberg_size_check_text / _blob, error() inside CASE for jsonb / nested), which raise on oversize.
- duckdb_pglake (duckdb_pglake/src/duckdb_pglake_extension.cpp): two new iceberg_size_check_* scalar functions throw InvalidInputException with the column name and exceeded GUC.
Test plan
- pg_lake_table/tests/pytests/test_iceberg_size_clamping.py: 13 cases covering both modes (10 clamp + 3 error-default), both paths (per-tuple FDW + SQL pushdown), and all clampable types (text/varchar/bpchar/bytea/jsonb/json/array/composite/int[]/composite-of-bigint).
- test_special_numeric.py and test_iceberg_validation.py still pass; the new code lives alongside IcebergErrorOrClampDatum and re-uses its FDW slot-level entry point.
🤖 Generated with Claude Code