feat: Variant Support by c-thiel · Pull Request #2188 · apache/iceberg-rust

c-thiel · 2026-02-28T14:09:03Z

Which issue does this PR close?

Variant Support.
Arrow value support is currently missing, as I am unsure how we want to extend Literal.

What changes are included in this PR?

Core: Variant Type

crates/iceberg/src/spec/datatypes.rs — new Variant type
crates/iceberg/src/spec/values/literal.rs — Variant literal value
crates/iceberg/src/spec/schema/ — visitor, index, pruning, mod, id reassigner all handle Variant
crates/iceberg/src/spec/table_metadata.rs — metadata support

Avro

crates/iceberg/src/avro/schema.rs — read/write Variant in Avro

Arrow

crates/iceberg/src/arrow/schema.rs — map Variant to Arrow type
crates/iceberg/src/arrow/reader.rs — read Variant from Arrow
crates/iceberg/src/arrow/value.rs — Arrow value conversion
Minor fixes in caching_delete_file_loader.rs and nan_val_cnt_visitor.rs

Parquet

crates/iceberg/src/writer/file_writer/parquet_writer.rs — write Variant columns

Tests & Dev

crates/integration_tests/tests/read_variant.rs — new integration test for reading Variant data
dev/spark/provision.py — Spark provisioning to generate Variant test data

Are these changes tested?

Sure! Even integration tested :)

c-thiel · 2026-03-02T08:59:54Z

        let table_creation = TableCreation::builder()
            .name(name.clone())
            .schema(iceberg_schema)
+            .format_version(format_version)


Before this change, existing tests were rightfully failing, as we used to create a V2 table with a NS timestamp column:
https://github.com/apache/iceberg-rust/actions/runs/22522306915/job/65248930667

The new logic determines the minimum format version the schema requires and uses it, with V2 as the floor. Tables with ns timestamps are therefore now created as V3.
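The version selection described here can be sketched as follows. This is a minimal illustration and not the PR's code: `FormatVersion` is redeclared locally, and `ColumnKind`, `min_version_for` and `table_format_version` are hypothetical names introduced only for this example.

```rust
/// Local stand-in for the crate's `FormatVersion` (an assumption of this sketch).
/// Deriving `Ord` orders the variants by declaration, so V1 < V2 < V3.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

/// Hypothetical, heavily simplified column kinds.
#[derive(Debug, Clone, Copy)]
pub enum ColumnKind {
    Long,
    TimestampMicros,
    TimestampNanos,
}

/// Minimum format version needed to represent one column kind.
fn min_version_for(kind: ColumnKind) -> FormatVersion {
    match kind {
        ColumnKind::TimestampNanos => FormatVersion::V3,
        ColumnKind::Long | ColumnKind::TimestampMicros => FormatVersion::V1,
    }
}

/// Picks the version for a new table: the highest per-column minimum,
/// but never below V2.
pub fn table_format_version(columns: &[ColumnKind]) -> FormatVersion {
    columns
        .iter()
        .copied()
        .map(min_version_for)
        .max()
        .unwrap_or(FormatVersion::V1)
        .max(FormatVersion::V2)
}
```

With this shape, a schema of plain longs still yields V2, while adding a nanosecond timestamp column raises the result to V3.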

c-thiel · 2026-03-18T07:06:01Z

@CTTY @liurenjie1024 @Xuanwo this would be ready for review!

CTTY

Thanks for the feature! Just took a look.

Also the test seems to be failing

c-thiel · 2026-04-10T01:38:10Z

@CTTY ready for another round!

CTTY

Mostly LGTM! Left some minor comments

Co-authored-by: Shawn Chang <yxchang@amazon.com>

c-thiel · 2026-05-13T07:26:07Z

@CTTY ready for another round!

CTTY

LGTM, thanks for working on this!

cc @blackmwk to also take a pass

mbutrovich · 2026-05-29T20:16:07Z

+    }

-        Ok(())
+    fn variant(&mut self, _v: &VariantType) -> Result<Self::T> {


Not really a comment on this line, but: iceberg-java's TypeToMessageType#variant writes the Parquet group with LogicalTypeAnnotation.variantType(VARIANT_SPEC_VERSION). The Rust write path here doesn't add that annotation, so files written by iceberg-rust would carry a plain Struct(Binary, Binary) without the variant logical type marker. The integration tests are read-only against Spark-written data, so it isn't caught. Worth a tracking issue, or already on the roadmap?

This PR covers unshredded variants only, with two follow-ups:

1. Write annotation (your comment): we emit a plain Struct(Binary,Binary) without variantType(...) since variant_experimental is off. This doesn't break Java read-back (it resolves variant by field-id, not the annotation), but I'll track adding it.

2. Shredded reads: a typed_value sub-field was being silently dropped (corrupt data). Added a guard that returns FeatureUnsupported + a test; full shredding reconstruction is a follow-up.

Guard added here:
b702f5a
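The idea behind the shredded-read guard can be sketched roughly as below. This is not the code from the commit: `ReadError` and `check_unshredded_variant` are hypothetical names, and the variant group is modeled simply as a list of child field names.

```rust
/// Hypothetical error type for this sketch; the real crate has its own
/// error kinds, including one for unsupported features.
#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    FeatureUnsupported(String),
}

/// Rejects a variant group that carries a shredded `typed_value` child,
/// instead of silently ignoring that child and returning incomplete data.
pub fn check_unshredded_variant(column: &str, children: &[&str]) -> Result<(), ReadError> {
    if children.iter().any(|child| *child == "typed_value") {
        return Err(ReadError::FeatureUnsupported(format!(
            "variant column `{}` is shredded (has a typed_value field); \
             reading shredded variants is not supported yet",
            column
        )));
    }
    Ok(())
}
```

Failing loudly here trades availability for correctness: a reader that cannot reconstruct shredded values refuses the column rather than returning partial data.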

Regarding 1)
iceberg-rust writes via AsyncArrowWriter, which derives the Parquet schema from the Arrow schema. In parquet 58.1.0, that path only emits the VARIANT annotation when the field carries the parquet_variant_compute::VariantType extension type and variant_experimental is enabled (otherwise logical_type_for_struct is a stub returning None). I couldn't find a public per-field hook to inject the annotation onto a plain Struct(Binary,Binary).

So the real cost is: enable variant_experimental + attach the extension type to the field. Two risks that I see:

Turning on the feature may change how the reader decodes a VARIANT-annotated group (native VariantArray instead of Struct{metadata,value}) — could break the current read path that expects the struct.

New experimental dep surface.

Not sure how we should proceed. I think this maybe should be a separate issue?

I created this for now:
#2546

…g data

# Conflicts: # crates/iceberg/public-api.txt

mbutrovich

Thanks for addressing the feedback @c-thiel! This LGTM! Looking forward to putting some queries through this.

nssalian

lgtm. Thanks for the work @c-thiel!

blackmwk

Thanks @c-thiel for this pr, just finished first round of review. I think this is too large and we may need several other rounds of review before merging.

blackmwk · 2026-06-03T11:59:35Z

+    /// Returns the minimum [`FormatVersion`] required to represent all types in this schema.
+    ///
+    /// Defaults to `FormatVersion::V1` if all types are universally supported.
+    pub fn min_format_version(&self) -> FormatVersion {


This API looks odd to me: it reads like a cheap field accessor, which it is not. I think a better solution is to add a SchemaVisitor implementation that checks whether the schema is compatible with a given FormatVersion.

Agreed on the visitor — I'll move the computation into a SchemaVisitor (T = FormatVersion), which also lets me drop the recursive Type::min_format_version you flagged below. I'd keep a public method returning the value though: datafusion uses it to derive the table's format version (min_format_version().max(V2)), not to check a known one, so a pure is_compatible(fv) predicate wouldn't cover that use. So: visitor under the hood, thin query on top.

Rolling back:
I tried the SchemaVisitor approach first but moved to a small leaf_min_format_version(&Type) helper + iteration over the flattened id_to_field. Three reasons it's cleaner here:

Infallibility vs. the trait. SchemaVisitor returns Result on every method, but this logic can't fail — so it needed .expect("never fails") at the call sites.

Wrong shape for the check: check_format_compatibility wants a shallow, per-field test (each flattened field judged by its own type, so blame lands on the leaf, not its container). The visitor is a recursive tree-folder; reusing it meant calling visit_type on single leaves behind an !is_nested() guard — indirect, and it silently couples that guard to which visitor arms count as "leaf."

Cost for no benefit. The visitor allocated a Vec<FormatVersion> per struct level in min_format_version (the flat fold allocates nothing) and was ~40 lines of 7-method boilerplate for a 5-line rule.

The flat form is also closer to Java, which keeps this as a static MIN_FORMAT_VERSIONS map in Schema iterated over lazyIdToField() — not a visitor. So: one match as the single source of truth, both min_format_version and check_format_compatibility route through it, and the per-field iteration mirrors checkCompatibility directly.

Fixed in ef25e2c
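A rough sketch of the flat form described above, assuming a simplified model: the flattened id-to-field index is reduced to a map from field id to type, and nested types carry no children because every nested field already has its own entry in that map. The `Type` enum and both function bodies are illustrative, not the crate's definitions.

```rust
use std::collections::HashMap;

/// Local stand-in for the crate's `FormatVersion`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

/// Simplified type model. Containers have no children here because the
/// flattened index lists each nested field separately.
pub enum Type {
    Long,
    TimestampNs,
    TimestamptzNs,
    Variant,
    Struct,
    List,
    Map,
}

/// Shallow rule: one match, no recursion into nested types.
pub fn leaf_min_format_version(field_type: &Type) -> FormatVersion {
    match field_type {
        Type::TimestampNs | Type::TimestamptzNs | Type::Variant => FormatVersion::V3,
        Type::Long | Type::Struct | Type::List | Type::Map => FormatVersion::V1,
    }
}

/// Flat fold over the index: no per-level allocation and no `Result`.
pub fn min_format_version(id_to_type: &HashMap<i32, Type>) -> FormatVersion {
    id_to_type
        .values()
        .map(leaf_min_format_version)
        .max()
        .unwrap_or(FormatVersion::V1)
}
```

Because the fold is infallible, callers get a plain `FormatVersion` and need no `.expect(...)`.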

blackmwk · 2026-06-03T12:02:12Z

+    ///   which older readers can't honor. `write_default` only affects newly
+    ///   written rows (physically materialized, read the same at any version), so
+    ///   it is not checked.
+    pub fn check_format_compatibility(&self, format_version: FormatVersion) -> Result<()> {


See the comment below, we should use a visitor for this.

Addressed together with the visitor discussion above — #2188 (comment)

blackmwk · 2026-06-03T12:02:57Z

+                .name_by_field_id(field.id)
+                .unwrap_or(field.name.as_str());
+
+            let min_version = field.field_type.min_format_version();


This is error-prone. Java's approach uses TypeID, but this method will call itself recursively again and again for nested data types.

Switched to Java's shape: a shallow leaf_min_format_version(&Type) (a match, like Java's MIN_FORMAT_VERSIONS TypeID lookup) applied per field over the flattened id_to_field (like lazyIdToField()) — no recursion. This also fixed a real bug the recursion caused: it blamed the container for a nested v3 type.

We have a test for it now.
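To illustrate the blame difference, here is a small self-contained model. It is a sketch under assumptions: the `Type`/`Field` tree, `recursive_min`, `shallow_min` and `blamed_fields` are hypothetical, and the recursive variant is only one plausible reconstruction of the earlier behaviour (here it reports the container in addition to the leaf).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

pub enum Type {
    Long,
    Variant,
    Struct(Vec<Field>),
}

pub struct Field {
    pub name: String,
    pub ty: Type,
}

/// Recursive rule: a struct inherits the highest requirement of its children.
pub fn recursive_min(ty: &Type) -> FormatVersion {
    match ty {
        Type::Long => FormatVersion::V1,
        Type::Variant => FormatVersion::V3,
        Type::Struct(children) => children
            .iter()
            .map(|child| recursive_min(&child.ty))
            .max()
            .unwrap_or(FormatVersion::V1),
    }
}

/// Shallow rule: each field is judged by its own type only.
pub fn shallow_min(ty: &Type) -> FormatVersion {
    match ty {
        Type::Variant => FormatVersion::V3,
        Type::Long | Type::Struct(_) => FormatVersion::V1,
    }
}

/// Visits every field once and records the dotted name of each field whose
/// rule result exceeds `target`.
pub fn blamed_fields(
    fields: &[Field],
    prefix: &str,
    target: FormatVersion,
    rule: fn(&Type) -> FormatVersion,
    out: &mut Vec<String>,
) {
    for field in fields {
        let full_name = if prefix.is_empty() {
            field.name.clone()
        } else {
            format!("{}.{}", prefix, field.name)
        };
        if rule(&field.ty) > target {
            out.push(full_name.clone());
        }
        if let Type::Struct(children) = &field.ty {
            blamed_fields(children, &full_name, target, rule, out);
        }
    }
}
```

For a schema with `id: long` and `payload: struct<v: variant>` checked against V2, the recursive rule flags both `payload` and `payload.v`, while the shallow rule flags only `payload.v`.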

blackmwk · 2026-06-03T12:05:08Z

When you change these implementations, I think we need to add some unit tests to verify that these changes are correct.

Added test_prune_columns_variant — a variant prunes like a primitive leaf: selecting it keeps it (same for full and non-full projection), selecting a sibling drops it. I also mirrored Java's TestTypeUtil variant coverage for the other arms this PR touched: test_reassign_ids_variant (id_reassigner) and test_assign_fresh_ids_variant (schema-evolution id assignment).

Added in 1673a2b
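The "variant prunes like a primitive leaf" behaviour can be modeled in a few lines. This is a toy sketch, not the crate's prune_columns: `Type`, `Field` and `prune` are hypothetical, and the full versus non-full projection distinction is not modeled (selecting a struct id simply keeps the whole struct).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    Primitive,
    Variant,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub id: i32,
    pub name: String,
    pub ty: Type,
}

/// Keeps the fields whose ids are selected. A variant is treated exactly
/// like a primitive: it has no children to descend into.
pub fn prune(fields: &[Field], selected: &[i32]) -> Vec<Field> {
    let mut kept = Vec::new();
    for field in fields {
        match &field.ty {
            Type::Primitive | Type::Variant => {
                if selected.contains(&field.id) {
                    kept.push(field.clone());
                }
            }
            Type::Struct(children) => {
                if selected.contains(&field.id) {
                    // Selecting the struct itself keeps it whole.
                    kept.push(field.clone());
                } else {
                    // Otherwise keep the struct only if a descendant survives.
                    let kept_children = prune(children, selected);
                    if !kept_children.is_empty() {
                        kept.push(Field {
                            id: field.id,
                            name: field.name.clone(),
                            ty: Type::Struct(kept_children),
                        });
                    }
                }
            }
        }
    }
    kept
}
```

Selecting the variant's id keeps it, and selecting only a sibling drops it, which is the property the new test asserts.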

Do you think anything else is missing?

The index_by_id/index_by_name arms are no-ops for variant, so I left those.

I'm not talking about prune_columns only; I mean all other affected parts.

More tests added in 16c81e0 - let me know if you feel something is missing!

blackmwk · 2026-06-03T12:11:23Z

I don't understand why you put these tests in the integration tests. We should gradually remove these integration tests, so we should not add to them unless we have no other choice. These tests are almost all about Arrow readers, so why not put them in the arrow module?

Moved in fd6a91b

I had put these in integration_tests because they were genuinely Spark integration tests (reading real Spark-written variant data end-to-end through the REST catalog), and I wasn't aware we're trying to phase that suite out. Now that I know, I've moved them.

Mitigation: the coverage is now in arrow/reader/projection.rs as self-contained unit tests over synthetic variant Parquet (full scan, variant-only, sibling-only, nested-in-struct) — they drive ArrowReader end-to-end (projection mask + decode), assert the id values and the variant metadata/value bytes round-trip exactly. read_variant.rs is deleted and the variant tables are removed from dev/spark/provision.py, so the PR no longer touches Spark provisioning at all.

The one thing we lose is the real cross-engine interop check. If we want that back later it's a single-commit revert, but I think the unit tests cover the reader logic that actually mattered here.

Longer term a cleaner interop source could be apache/parquet-testing (engine-agnostic, spec-canonical variant vectors) rather than Spark — I'd wire that in when we add variant value decoding / take on the #2546 annotation+shredding work, since that's what the corpus actually exercises.

…jection

c-thiel · 2026-06-04T17:31:00Z

@blackmwk ready for another round!

blackmwk · 2026-06-05T10:40:50Z

 pub const SCHEMA_NAME_DELIMITER: &str = ".";
+/// Minimum format version that allows non-null field default values.
+/// Mirrors Java's `Schema.DEFAULT_VALUES_MIN_FORMAT_VERSION`.
+pub const MIN_FORMAT_VERSION_DEFAULT_VALUES: FormatVersion = FormatVersion::V3;


Suggested change

pub const MIN_FORMAT_VERSION_DEFAULT_VALUES: FormatVersion = FormatVersion::V3;

pub(crate) const DEFAULT_VALUES_MIN_FORMAT_VERSION: FormatVersion = FormatVersion::V3;

This is not a public api.

Applied. That said, I'd lean toward keeping spec-defined version floors like this pub. iceberg-rust is an SDK for empowered users — I don't think we should be overly protective with visibility. Downstream catalogs/engines (Lakekeeper, for us) that gate default-value writes on format version otherwise have to re-declare this constant locally.
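For context, this is roughly what a downstream gate on that constant looks like. A sketch only: `FormatVersion` and the constant are redeclared locally (which is exactly the duplication mentioned above), and `check_initial_default` is a hypothetical helper.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

/// Local re-declaration of the spec floor for non-null field defaults.
pub const DEFAULT_VALUES_MIN_FORMAT_VERSION: FormatVersion = FormatVersion::V3;

/// Rejects a field with an initial default on tables below the floor.
pub fn check_initial_default(
    field_name: &str,
    has_initial_default: bool,
    table_version: FormatVersion,
) -> Result<(), String> {
    if has_initial_default && table_version < DEFAULT_VALUES_MIN_FORMAT_VERSION {
        return Err(format!(
            "field `{}` has an initial default, which requires {:?}, but the table is {:?}",
            field_name, DEFAULT_VALUES_MIN_FORMAT_VERSION, table_version
        ));
    }
    Ok(())
}
```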

blackmwk · 2026-06-05T10:41:47Z

+/// `TimestampNs` / `TimestamptzNs` / `Variant` require v3; everything else (including
+/// nested types, validated per-leaf elsewhere) is valid from v1. Single source of truth
+/// for the type version rules, mirroring Java's `Schema.MIN_FORMAT_VERSIONS`.
+fn leaf_min_format_version(field_type: &Type) -> FormatVersion {


This should be part of Type.

15884b9 moved the shallow per-type rule onto Type::min_format_version - just like it was originally, just with the fixed recursion from the last review.

blackmwk · 2026-06-05T10:49:06Z

+    /// Minimum [`FormatVersion`] required to represent all *types* in this schema.
+    ///
+    /// Types only; for initial-default version floors see [`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {


Suggested change

pub fn min_format_version(&self) -> FormatVersion {

pub fn calc_min_compatible_format(&self) -> FormatVersion {

Also, please change the comments to clarify that this visits the whole schema to compute the result. Or how about we store it as a lazy field in Schema?

15884b9 Renamed to calc_min_compatible_format and doc'd that it walks every field. Skipped the lazy field because: the only caller is datafusion's register_table today — once per CREATE TABLE, never a hot path. Caching a once-per-table O(fields) call would add a field to Schema (and to its Clone/serde/Eq surface) for not much win. Easy to add later, non-breaking, if a hot consumer ever appears.
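For reference, the lazy-field alternative that was skipped could look roughly like this, using `std::sync::OnceLock`. It is a sketch of the option under discussion, not something in the PR: the `Schema` here is a toy struct, and `cached_min_compatible_format` is a hypothetical name.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

pub enum FieldType {
    Long,
    Variant,
}

/// Toy schema: a list of field types plus the extra cache field that the
/// lazy approach would add to the struct.
pub struct Schema {
    field_types: Vec<FieldType>,
    min_format: OnceLock<FormatVersion>,
}

impl Schema {
    pub fn new(field_types: Vec<FieldType>) -> Self {
        Schema { field_types, min_format: OnceLock::new() }
    }

    /// Walks every field on each call: O(fields).
    pub fn calc_min_compatible_format(&self) -> FormatVersion {
        self.field_types
            .iter()
            .map(|field_type| match field_type {
                FieldType::Variant => FormatVersion::V3,
                FieldType::Long => FormatVersion::V1,
            })
            .max()
            .unwrap_or(FormatVersion::V1)
    }

    /// Computes the value on first use and reuses it afterwards.
    pub fn cached_min_compatible_format(&self) -> FormatVersion {
        *self.min_format.get_or_init(|| self.calc_min_compatible_format())
    }
}
```

The cache makes repeated calls O(1), at the price of one more field that every constructor and trait implementation of the struct has to account for.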

blackmwk · 2026-06-05T10:51:00Z

+    /// Returns an error listing every field incompatible with `format_version`.
+    /// Mirrors Java's `Schema.checkCompatibility()`. Two checks per field:
+    ///
+    /// - **Type** — per `leaf_min_format_version`.


Suggested change

/// - **Type** — per `leaf_min_format_version`.

/// - **Type** — Minimum format version required to support that type, without taking nested field types into account.

The leaf_min_format_version is implementation detail, and we should not show it in comments.

blackmwk · 2026-06-05T10:53:34Z

+            if format_version < min_version {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());


This is a bug; we should return an error.
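A minimal sketch of the requested change, assuming the name index is a plain map from field id to qualified name. `SchemaError` and `qualified_name` are hypothetical names for illustration; the crate has its own error type.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct SchemaError(pub String);

/// Looks up the fully qualified name for a field id. A missing entry means
/// the index is inconsistent, so it is reported as an error rather than
/// papered over with the unqualified field name.
pub fn qualified_name<'a>(
    id_to_name: &'a HashMap<i32, String>,
    field_id: i32,
) -> Result<&'a str, SchemaError> {
    id_to_name
        .get(&field_id)
        .map(String::as_str)
        .ok_or_else(|| {
            SchemaError(format!("field id {} is missing from the name index", field_id))
        })
}
```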

blackmwk · 2026-06-05T10:53:44Z

+            {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());


blackmwk · 2026-06-05T12:06:25Z

I'm not talking about prune_columns only; I mean all other affected parts.

- Move the shallow per-type version rule onto `Type::min_format_version` (pub(crate), non-recursive, mirrors Java's MIN_FORMAT_VERSIONS type-id lookup); drop the free `leaf_min_format_version` helper.
- Rename `Schema::min_format_version` -> `calc_min_compatible_format` and doc that it walks every field (it is O(fields), not a cheap getter).
- Make `DEFAULT_VALUES_MIN_FORMAT_VERSION` pub(crate) and rename to match Java; it has no external consumer.
- Error (Unexpected) instead of silently falling back to an unqualified field name when the id is missing from the name index.
- Regenerate iceberg public-api baseline for the above.

Add variant unit tests for the visitor arms the PR touched that produce observable output: iceberg->arrow type, arrow->iceberg value (unsupported), parquet path indexing, schema index-by-id/name, and the Glue/Hive type mappings. Avro was already covered; pure no-op arms are skipped. Also apply rustfmt fixups (import order, format! wrap) from the prior commit.

c-thiel added 2 commits February 28, 2026 15:05

feat: Variant Support

488298d

fix: TableCreation uses correct format version

466b071

c-thiel commented Mar 2, 2026

brgr-s mentioned this pull request Mar 4, 2026

Geo type changes #2019

Closed

Merge branch 'main' into feat/variant-support

a70950e

CTTY reviewed Mar 19, 2026

Comment thread crates/iceberg/src/avro/schema.rs Outdated

Comment thread crates/iceberg/src/arrow/reader.rs Outdated

Comment thread crates/iceberg/src/avro/schema.rs

Comment thread crates/catalog/glue/src/schema.rs Outdated

c-thiel added 2 commits April 10, 2026 02:04

Merge branch 'main' into feat/variant-support

3ca9ebe

add nesting support, add Glue & HMS

4269b7d

Merge apache/main into feat/variant-support

c18ff16

nssalian mentioned this pull request May 6, 2026

EPIC: v3 Support Tracking #2411

Open

Shekharrajak mentioned this pull request May 10, 2026

feat: add PrimitiveType::Variant to iceberg spec #2423

Open

Merge branch 'main' into feat/variant-support

dffb01c

CTTY reviewed May 13, 2026

Comment thread crates/iceberg/src/spec/schema/mod.rs Outdated

Comment thread crates/iceberg/src/spec/datatypes.rs Outdated

Comment thread crates/iceberg/src/spec/table_metadata.rs Outdated

c-thiel and others added 4 commits May 13, 2026 08:33

Improve "invalid schema" error message

7f663e1

Co-authored-by: Shawn Chang <yxchang@amazon.com>

Merge branch 'main' into feat/variant-support

d7d9933

address comments

7d46d0c

Merge branch 'main' into feat/variant-support

1c93191

CTTY approved these changes May 14, 2026

Comment thread crates/iceberg/src/avro/schema.rs Outdated

c-thiel added 3 commits May 20, 2026 12:24

Merge branch 'origin/main' into feat/variant-support

0c341b8

assert variant record schema with let-else

71ef18a

Merge branch 'main' into feat/variant-support

1c54224

dannycjones mentioned this pull request May 29, 2026

Tracking issues of Iceberg Rust 0.10.0 Release #2527

Open

18 tasks

nssalian reviewed May 29, 2026

Comment thread crates/iceberg/src/spec/schema/mod.rs Outdated