
iceberg: add case_sensitive_columns option#4365

Merged
Jeffail merged 3 commits into main from iceberg-column-names
Apr 30, 2026


@Jeffail Jeffail commented Apr 29, 2026

Summary

  • Adds a top-level case_sensitive_columns boolean to the iceberg output (default true, preserving released behaviour). When set to false, column-name matching follows iceberg's recommended case-insensitive convention end-to-end.
  • The flag is plumbed through every site that compares iceberg field names: shredder lookup, schema_metadata traversal, partition spec column references, struct inference (auto-inferred and metadata-driven), top-level table-creation duplicate detection, sink dedup keys for cross-message new fields, and the caseSensitive argument to iceberg-go's UpdateSchema. Ambiguous case-only duplicates in input records are rejected at shredding and creation.
  • Resolves a customer-reported issue where messages with upper-case keys against a table with lower-case columns triggered spurious schema evolution rather than routing into the existing columns.
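
To illustrate the reported issue, here is a minimal sketch of how a lookup keyed on folded names could route an upper-case input key into an existing lower-case column. The `columnIndex` type and its methods are hypothetical names chosen for illustration and are not taken from this PR; lower-casing via `strings.ToLower` is an assumption based on the commit message's mention of keys that "fold to the same lowercase string".

```go
package main

import (
	"fmt"
	"strings"
)

// columnIndex is a hypothetical helper (not from the PR) that maps a lookup
// key to the column name exactly as declared in the table schema.
type columnIndex struct {
	caseSensitive bool
	byKey         map[string]string
}

// newColumnIndex builds the index once per schema. In case-insensitive mode
// the lookup key is the lower-cased column name.
func newColumnIndex(columns []string, caseSensitive bool) columnIndex {
	idx := columnIndex{
		caseSensitive: caseSensitive,
		byKey:         make(map[string]string, len(columns)),
	}
	for _, c := range columns {
		idx.byKey[idx.fold(c)] = c
	}
	return idx
}

// fold returns the name unchanged in case-sensitive mode, lower-cased otherwise.
func (i columnIndex) fold(name string) string {
	if i.caseSensitive {
		return name
	}
	return strings.ToLower(name)
}

// resolve returns the schema column that an input key routes to, if any.
func (i columnIndex) resolve(key string) (string, bool) {
	col, ok := i.byKey[i.fold(key)]
	return col, ok
}

func main() {
	cols := []string{"id", "name"}
	for _, cs := range []bool{true, false} {
		idx := newColumnIndex(cols, cs)
		col, ok := idx.resolve("ID")
		fmt.Printf("case_sensitive_columns=%v: ID -> %q (matched=%v)\n", cs, col, ok)
	}
}
```

With the flag left at `true`, `ID` finds no match, which is the situation that previously led to spurious schema evolution; with `false`, it resolves to the existing `id` column.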

Test plan

  • task fmt
  • task lint (0 issues)
  • task test:unit
  • New unit tests across the shredder, type inferrer/resolver, partition-spec parser, router, and writer sinks — covering both the case-insensitive path and the case-sensitive default
  • Integration tests (internal/impl/iceberg/integration):
    • CaseInsensitiveColumnMatching — pre-created lowercase table, uppercase-keyed messages, no schema evolution, data lands in the right columns
    • CaseInsensitiveCreateTimeDuplicate — record with id and ID rejected at create time
    • CaseInsensitiveNewFieldDedupAcrossBatch — two messages with case-only-different new fields produce exactly one new column
    • CaseInsensitivePartitionSpecCreate — partition spec with uppercase column reference resolves against schema inferred with lowercase keys
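
The behaviour that CaseInsensitiveNewFieldDedupAcrossBatch checks can be sketched roughly as follows. This is an assumption-laden illustration rather than the sink's real code: `newFieldSet` is a hypothetical type, and keeping the first-seen spelling of a field is a choice made here for the example, not something the PR states.

```go
package main

import (
	"fmt"
	"strings"
)

// newFieldSet is a hypothetical collector (not from the PR) for fields that
// are missing from the table and pending schema evolution.
type newFieldSet struct {
	caseSensitive bool
	seen          map[string]struct{}
	names         []string // first-seen spelling of each distinct field
}

func newNewFieldSet(caseSensitive bool) *newFieldSet {
	return &newFieldSet{caseSensitive: caseSensitive, seen: map[string]struct{}{}}
}

// add records a field and reports whether it was new. In case-insensitive
// mode the dedup key is the lower-cased name, so case-only variants collapse.
func (s *newFieldSet) add(name string) bool {
	key := name
	if !s.caseSensitive {
		key = strings.ToLower(name)
	}
	if _, dup := s.seen[key]; dup {
		return false
	}
	s.seen[key] = struct{}{}
	s.names = append(s.names, name)
	return true
}

func main() {
	// Two messages whose only new field differs in case.
	batch := []map[string]any{{"Region": "eu"}, {"region": "us"}}
	for _, cs := range []bool{true, false} {
		set := newNewFieldSet(cs)
		for _, msg := range batch {
			for k := range msg {
				set.add(k)
			}
		}
		fmt.Printf("case_sensitive_columns=%v: %d new column(s)\n", cs, len(set.names))
	}
}
```

In this sketch the case-insensitive run yields a single pending column, while the case-sensitive default treats `Region` and `region` as two distinct fields.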

Adds a new top-level config field that makes column-name matching follow
iceberg's recommended case-insensitive convention end-to-end when set to
false. Defaults to true to preserve released behaviour.

When false, the option is plumbed through:
  - shredder lookup (input keys → schema field names)
  - schema_metadata findCommonField traversal
  - partition spec column references (top-level and nested)
  - struct inference (auto-inferred and metadata-driven), rejecting
    nested keys that fold to the same lowercase string
  - top-level table creation, rejecting case-only-duplicate keys
  - sink dedup keys, so cross-message new fields differing only in
    case fold to a single schema-evolution attempt
  - catalogx UpdateSchema's caseSensitive flag

The shredder errors on ambiguous case-only duplicates (two input keys
mapping to the same schema column).
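
A rough sketch of that ambiguity check, under the assumption that it amounts to lower-casing each key of a record and rejecting collisions. `checkAmbiguousKeys` is a hypothetical function name and the error wording is invented for the example; the PR's shredder may structure this differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// checkAmbiguousKeys is a hypothetical helper (not from the PR). It returns
// an error when two keys of one record fold to the same lower-case name,
// because in case-insensitive mode both would target the same column.
func checkAmbiguousKeys(record map[string]any) error {
	// Sort keys so the error message is deterministic despite map ordering.
	keys := make([]string, 0, len(record))
	for k := range record {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	seen := make(map[string]string, len(keys))
	for _, k := range keys {
		folded := strings.ToLower(k)
		if prev, ok := seen[folded]; ok {
			return fmt.Errorf("ambiguous keys %q and %q differ only by case", prev, k)
		}
		seen[folded] = k
	}
	return nil
}

func main() {
	fmt.Println(checkAmbiguousKeys(map[string]any{"id": 1, "name": "a"}))
	fmt.Println(checkAmbiguousKeys(map[string]any{"id": 1, "ID": 2}))
}
```

Failing loudly here, rather than letting one key overwrite the other, avoids silently dropping one of the two values.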

Adds unit coverage for each touchpoint and four integration tests
covering the column-matching path, create-time duplicate rejection,
cross-message new-field dedup, and case-insensitive partition spec
parsing during table creation.

claude Bot commented Apr 29, 2026

Commits
LGTM

Review
Adds an end-to-end case_sensitive_columns option for the iceberg output, plumbed through the shredder, schema_metadata resolution, partition spec parser, struct inference, table creation, sink dedup, and catalogx UpdateSchema. Defaults to true for backwards compatibility, with explicit ambiguity detection at every layer where case-only duplicates could otherwise cause silent data loss or schema corruption. Unit tests cover each touchpoint and four integration tests exercise the column-matching path, create-time duplicate rejection, cross-message new-field dedup, and case-insensitive partition spec parsing.

LGTM


claude Bot commented Apr 29, 2026

Commits
LGTM

Review
Comprehensive plumbing of the new case_sensitive_columns flag through every iceberg comparison site (shredder, schema_metadata traversal, partition spec parser, type inference/resolver, sink dedup, UpdateSchema). Default true preserves released behaviour. Case-only ambiguity is rejected at the right points. Unit and integration coverage exercise both modes.

LGTM


claude Bot commented Apr 29, 2026

Commits
LGTM

Review
The PR threads a new case_sensitive_columns config flag (default true for backward compatibility) through the iceberg output's shredder, type inferrer, type resolver, partition spec parser, sinks, and catalogx.UpdateSchema. Case-only duplicate detection is added at table creation, struct inference, and shredding, and new-field dedup folds case in case-insensitive mode. Unit and integration coverage is comprehensive across each touchpoint.

LGTM

@Jeffail Jeffail merged commit 757dbcd into main Apr 30, 2026
11 checks passed
@Jeffail Jeffail deleted the iceberg-column-names branch April 30, 2026 07:18