Commit 8756635
pg_lake: clamp string and binary values to per-column byte limits on Iceberg writes
Some downstream consumers of Iceberg tables impose per-column byte
caps smaller than what PostgreSQL values in the source can carry.
For example, on Snowflake the column-type byte ceilings are:
- STRING / VARCHAR: 16 MiB default, up to 128 MiB when declared
with an explicit larger length.
- BINARY: 8 MiB default, up to 64 MiB.
- OBJECT / ARRAY / VARIANT: 128 MiB.
Without a guard, rows whose individual values exceed the target
column's cap reach the consumer and surface as opaque "value too
long" errors when the consumer ingests them. This adds an opt-in,
GUC-driven clamp that fires consistently across both the per-tuple
FDW path and the DuckDB-pushdown paths (INSERT..SELECT, snapshot
loads via postgres_scan, COPY FROM, etc.).
Three new GUCs (PGC_USERSET, GUC_UNIT_BYTE, default 0 = disabled, no
behavior change unless set):
- pg_lake_engine.iceberg_max_string_bytes governs text, varchar,
bpchar, jsonb, json (per-leaf clamp). text/varchar/bpchar are
truncated at a UTF-8 character boundary; jsonb/json are replaced
with NULL since truncating the serialized form would yield invalid
JSON.
- pg_lake_engine.iceberg_max_binary_bytes governs bytea (byte-truncate).
- pg_lake_engine.iceberg_max_aggregate_bytes governs the on-disk
size of array, composite, and map values that land in OBJECT/
ARRAY/VARIANT columns. The whole container is replaced with NULL
when its varlena exceeds the limit; we deliberately do NOT recurse
into elements/fields to per-leaf-clamp inner strings, because (a)
inner leaves don't have their own column cap on the consumer side
— they're just JSON inside the parent OBJECT/ARRAY/VARIANT — and
(b) when iceberg_max_string_bytes <= iceberg_max_aggregate_bytes,
no inner leaf can exceed the per-leaf limit while staying inside
an under-cap container. Distinct from the string GUC because
consumers usually cap semi-structured types more loosely than
STRING (e.g. on Snowflake: 128 MiB vs. 16 MiB).
Operators set the GUCs to the target column's actual ceiling.
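The per-leaf semantics described for the string and binary GUCs can be sketched in Python. This is a minimal illustration, not the C code in this commit: the function names `clamp_text_utf8`, `clamp_bytes`, and `clamp_json` are hypothetical, and a limit of 0 is treated as "disabled" to mirror the GUC default.

```python
from typing import Optional


def clamp_text_utf8(value: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    if max_bytes <= 0:  # 0 mirrors the "disabled" GUC default
        return value
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    cut = max_bytes
    # raw[cut] is the first dropped byte. If it is a continuation byte
    # (0b10xxxxxx), the cut would split a character, so back up.
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut].decode("utf-8")


def clamp_bytes(value: bytes, max_bytes: int) -> bytes:
    """Plain byte truncation, as described for bytea."""
    if max_bytes <= 0 or len(value) <= max_bytes:
        return value
    return value[:max_bytes]


def clamp_json(serialized: str, max_bytes: int) -> Optional[str]:
    """Return None (NULL) on overflow; truncated JSON would be invalid."""
    if max_bytes > 0 and len(serialized.encode("utf-8")) > max_bytes:
        return None
    return serialized
```

For instance, `clamp_text_utf8("héllo", 2)` backs up past the two-byte `é` and keeps only `h`, while a limit of 3 keeps `hé`.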
Two complementary code paths cover all writers:
1. **Per-tuple FDW path** (INSERT .. VALUES, single-row UPDATE/DELETE
pipelines): IcebergSizeClampSlotInPlace runs in-process after the
existing temporal/numeric clamp and before ExecConstraints.
PgLakeModifyState gains a needsSizeClamping flag computed once at
init via TupleDescNeedsIcebergSizeClamping, so the per-row work is
skipped entirely on tables whose columns cannot trigger size
clamping.
Hot-path optimizations: IcebergSizeClampDatum short-circuits
varlena values whose total on-disk size is comfortably under every
active limit (toast_raw_datum_size against the smallest active
GUC, halved to cover jsonb's binary-vs-text gap), avoiding even
the leaf scalar dispatch on small values. IcebergSizeClampStringScalar
additionally pre-checks jsonb binary size against half the
string limit before falling back to jsonb_out, since text
serialization of typical jsonb data is bounded by a small constant
factor over the binary representation. For arrays/composites/
maps, no deconstruct/deform happens: a single toast_raw_datum_size
compared against iceberg_max_aggregate_bytes decides pass-through
vs. NULL.
2. **DuckDB-pushdown path** (INSERT..SELECT, snapshot/initial-copy
via postgres_scan, COPY FROM, compaction): a new
IcebergWrapQueryWithSizeClampChecks rewriter wraps the inner
SELECT with an outer projection that calls two new DuckDB scalar
UDFs registered in duckdb_pglake (iceberg_size_clamp_text and
iceberg_size_clamp_blob) for lossless leaf truncation, expresses
jsonb/json NULL-on-overflow inline via strlen(::VARCHAR), and
applies aggregate-NULL inline as
`CASE WHEN strlen(<container>::VARCHAR) > <aggregate_max> THEN NULL
ELSE <container> END`. No list_transform / struct_pack inside
containers, mirroring the per-tuple side. The wrapper is hooked
into pg_lake_engine/src/pgduck/write_data.c alongside the existing
out_of_range_values clamp wrapper, so all pushdown writers
(INSERT..SELECT, snapshot via postgres_scan, COPY FROM,
vacuum/compaction) flow through it without per-callsite plumbing.
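The shape of the pushdown wrapper can be illustrated with a small Python string builder. This is a sketch under stated assumptions rather than the rewriter in this commit: the helper name `wrap_with_size_clamp`, the column "kind" labels, the `(value, max_bytes)` argument order of the two UDFs, the `src` alias, and the unquoted identifiers are all hypothetical. Only the UDF names and the `CASE WHEN strlen(...)` form come from the description above.

```python
from typing import List, Tuple


def wrap_with_size_clamp(
    inner_sql: str,
    columns: List[Tuple[str, str]],  # (name, kind); kinds are illustrative
    max_string: int,
    max_binary: int,
    max_aggregate: int,
) -> str:
    """Wrap inner_sql in an outer projection that applies the size clamps."""
    if max_string <= 0 and max_binary <= 0 and max_aggregate <= 0:
        return inner_sql  # all limits disabled: no behavior change

    def null_on_overflow(name: str, limit: int) -> str:
        return (
            f"CASE WHEN strlen({name}::VARCHAR) > {limit} "
            f"THEN NULL ELSE {name} END"
        )

    exprs = []
    for name, kind in columns:
        if kind == "text" and max_string > 0:
            expr = f"iceberg_size_clamp_text({name}, {max_string})"
        elif kind == "blob" and max_binary > 0:
            expr = f"iceberg_size_clamp_blob({name}, {max_binary})"
        elif kind == "json" and max_string > 0:
            expr = null_on_overflow(name, max_string)
        elif kind == "container" and max_aggregate > 0:
            expr = null_on_overflow(name, max_aggregate)
        else:
            exprs.append(name)  # column passes through untouched
            continue
        exprs.append(f"{expr} AS {name}")
    return f"SELECT {', '.join(exprs)} FROM ({inner_sql}) AS src"
```

Containers get a single top-level `CASE` expression and nothing is generated for their elements, which matches the "no list_transform / struct_pack" note.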
Per-leaf truncation output is byte-identical between the two paths.
The aggregate threshold uses different proxies — varlena size on the
PG side, JSON-serialized text length on the SQL side — but both are
within a small constant factor of the consumer-visible size.
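The per-tuple short-circuit and the container decision described for the FDW path reduce to two size comparisons. The Python sketch below is illustrative only: the function names are hypothetical, `raw_size` stands in for what toast_raw_datum_size would report, and the exact comparison operators (`<=` vs. `<`) are assumptions.

```python
from typing import Any, Optional


def smallest_active_limit(*limits: int) -> int:
    """Smallest limit that is set (> 0), or 0 when every limit is disabled."""
    active = [limit for limit in limits if limit > 0]
    return min(active) if active else 0


def can_skip_clamp(
    raw_size: int, max_string: int, max_binary: int, max_aggregate: int
) -> bool:
    """Fast path: skip all work when the value is well under every limit."""
    limit = smallest_active_limit(max_string, max_binary, max_aggregate)
    if limit == 0:
        return True  # nothing is enabled
    # Halved so that jsonb, whose text form can be larger than its binary
    # form, is not skipped while its serialized size is over the limit.
    return raw_size <= limit // 2


def clamp_container(raw_size: int, value: Any, max_aggregate: int) -> Optional[Any]:
    """Whole-container decision: pass through or replace with None (NULL)."""
    if max_aggregate > 0 and raw_size > max_aggregate:
        return None
    return value
```

With a 1000-byte string limit and a 4000-byte aggregate limit, a 100-byte value is skipped outright, a 600-byte value goes through the per-leaf checks, and a 5000-byte container becomes NULL.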
Tests: pg_lake_table/tests/pytests/test_iceberg_size_clamping.py
covers UTF-8 boundary truncation, bytea byte truncation, jsonb/json →
NULL, NOT NULL constraint violation after clamp-to-NULL, container
NULLing on aggregate overflow (text[], composite-of-text, int[],
all-bigint composite), the disabled-by-default no-op path, the
only-one-GUC-set path, INSERT..SELECT pushdown clamping across
text/bytea/jsonb, and INSERT..SELECT pushdown aggregate NULLing of
int[].
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
1 parent 139aeca commit 8756635
11 files changed
Lines changed: 1457 additions & 0 deletions
File tree
- duckdb_pglake/src
- pg_lake_engine
- include/pg_lake/pgduck
- src
- pgduck
- pg_lake_table
- src/fdw
- tests/pytests