WIP: Vortex Iceberg #1

a10y · 2025-03-08T17:48:42Z

Generic reader
Spark reader (row-based)
Spark reader (vectorized)
Writers

I tried to match the existing Iceberg code style and relied on the ORC implementation for inspiration. I think I managed to break down the different layers:

ReaderBuilder provides a mapping between the FileFormat's object model and the Iceberg object model via ReaderFunction<D>, where D is the record type.
ValueReaders are how record entries are extracted from column vectors.
GenericXYZ is how an Iceberg GenericRecord gets read from and written to the appropriate files.
Every format needs both row-based accessors to records (for delete files) and batch-based accessors to records (for data files).
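
A minimal sketch of how these layers could fit together. All types here (`ColumnVector`, `ValueReader`, `ReaderFunction`, and the map-based "generic" record) are simplified, hypothetical stand-ins for illustration, not the actual Iceberg or Vortex interfaces:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReaderLayersSketch {
  // Hypothetical stand-in for one decoded column of a batch.
  static final class ColumnVector {
    final String name;
    final Object[] values;

    ColumnVector(String name, Object... values) {
      this.name = name;
      this.values = values;
    }
  }

  // Hypothetical ValueReader: extracts a single record entry from a column vector.
  interface ValueReader<T> {
    T read(ColumnVector vector, int row);
  }

  // Hypothetical reader function: converts a batch of column vectors into records of type D.
  interface ReaderFunction<D> {
    List<D> read(List<ColumnVector> batch, int numRows);
  }

  // "Generic" layer: here a record is just an ordered map from field name to value.
  static ReaderFunction<Map<String, Object>> genericReader() {
    ValueReader<Object> entryReader = (vector, row) -> vector.values[row];
    return (batch, numRows) -> {
      List<Map<String, Object>> records = new ArrayList<>();
      for (int row = 0; row < numRows; row++) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (ColumnVector vector : batch) {
          record.put(vector.name, entryReader.read(vector, row));
        }
        records.add(record);
      }
      return records;
    };
  }

  public static void main(String[] args) {
    List<ColumnVector> batch = Arrays.asList(
        new ColumnVector("id", 1L, 2L),
        new ColumnVector("name", "a", "b"));
    // Row-based access: one record per row of the batch.
    for (Map<String, Object> record : genericReader().read(batch, 2)) {
      System.out.println(record);
    }
  }
}
```

The batch-based path would hand the column vectors to the engine directly instead of materializing one record per row; only the row-based path is sketched here.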

a10y · 2025-03-08T17:49:52Z

gradle/libs.versions.toml

+vortex-jni = { module = "dev.vortex:vortex-jni", version.ref = "vortex" }
+vortex-spark = { module = "dev.vortex:vortex-spark", version.ref = "vortex" }


I have not set up Maven Central publishing for these jars, so I am just hosting them out of mavenLocal for the moment.

pvary · 2025-03-10T14:04:53Z

vortex/src/main/java/org/apache/iceberg/vortex/Vortex.java

+    public ReadBuilder withNameMapping(NameMapping newNameMapping) {
+      // TODO(aduffy): is this for field renames? Figure out how to patch this through.
+      return this;
+    }


Iceberg uses column IDs to map the Iceberg table columns to the data file columns.
New data files written by Iceberg contain the Iceberg column ID in the data file metadata.
When files are generated by external writers (for example, in migration use cases for old Hive tables), column-name-based mapping is used instead. This mapping is driven by the NameMapping provided here.
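
A rough sketch of that resolution logic, to make the intent concrete. `FileColumn` and `resolve` are hypothetical names for illustration, and the name mapping is simplified to a plain `Map<String, Integer>` rather than the real `NameMapping` class; the assumption is that the name mapping only applies when the file carries no field IDs at all:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnResolutionSketch {
  // Hypothetical description of one column as stored in a data file.
  static final class FileColumn {
    final String name;
    final Integer fieldId; // null when the writer did not record an Iceberg field ID

    FileColumn(String name, Integer fieldId) {
      this.name = name;
      this.fieldId = fieldId;
    }
  }

  // Returns table field ID -> position of the matching column in the file.
  static Map<Integer, Integer> resolve(
      List<FileColumn> fileColumns, Map<String, Integer> nameToId) {
    boolean fileHasIds = false;
    for (FileColumn column : fileColumns) {
      if (column.fieldId != null) {
        fileHasIds = true;
      }
    }

    Map<Integer, Integer> idToPosition = new LinkedHashMap<>();
    for (int pos = 0; pos < fileColumns.size(); pos++) {
      FileColumn column = fileColumns.get(pos);
      Integer id;
      if (fileHasIds) {
        id = column.fieldId; // file written by Iceberg: trust the embedded IDs
      } else {
        id = nameToId.get(column.name); // external writer: fall back to the name mapping
      }
      if (id != null) {
        idToPosition.put(id, pos);
      }
    }
    return idToPosition;
  }

  public static void main(String[] args) {
    Map<String, Integer> nameToId = new HashMap<>();
    nameToId.put("id", 1);
    nameToId.put("data", 2);

    // A file from an external writer: no IDs, columns in a different order.
    List<FileColumn> external =
        Arrays.asList(new FileColumn("data", null), new FileColumn("id", null));
    System.out.println(resolve(external, nameToId));
  }
}
```

In this sketch, columns that cannot be matched by either route are simply left out of the result, so the caller can treat them as missing.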

pvary · 2025-03-10T14:13:22Z

vortex/src/main/java/org/apache/iceberg/vortex/VortexSchemaWithTypeVisitor.java

+import org.apache.iceberg.types.Types;
+
+public abstract class VortexSchemaWithTypeVisitor<T> {
+  // What is the point of this??


This is the base class used by the engine-specific implementations to map the FileFormat schema to the engine-specific schema. The mapping goes through the intermediate Iceberg schema. The visitor's responsibility is to supply the engine-specific part.
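
To illustrate the pattern, here is a simplified sketch of a visitor that walks an Iceberg schema and a file schema in parallel. The `Node` model, the `Visitor` callbacks, and `DescribeVisitor` are hypothetical and much smaller than the real classes; fields are paired by name here only to keep the example short:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SchemaWithTypeVisitorSketch {
  // Hypothetical minimal type model, used for both the Iceberg and the file schema.
  static final class Node {
    final String name;
    final String primitive; // null for structs
    final List<Node> children;

    Node(String name, String primitive, List<Node> children) {
      this.name = name;
      this.primitive = primitive;
      this.children = children;
    }

    static Node primitive(String name, String type) {
      return new Node(name, type, Collections.<Node>emptyList());
    }

    static Node struct(String name, Node... children) {
      return new Node(name, null, Arrays.asList(children));
    }

    Node child(String childName) {
      for (Node child : children) {
        if (child.name.equals(childName)) {
          return child;
        }
      }
      return null;
    }
  }

  // The engine-specific part: subclasses decide what T to build for each pair of types.
  abstract static class Visitor<T> {
    abstract T struct(Node icebergStruct, Node fileStruct, List<T> fieldResults);

    abstract T primitive(Node icebergPrimitive, Node filePrimitive);
  }

  // The shared part: traverse the Iceberg schema and pair each field with its file type.
  static <T> T visit(Node iceberg, Node file, Visitor<T> visitor) {
    if (iceberg.primitive != null) {
      return visitor.primitive(iceberg, file);
    }
    List<T> results = new ArrayList<>();
    for (Node expected : iceberg.children) {
      Node actual = file == null ? null : file.child(expected.name);
      results.add(visit(expected, actual, visitor));
    }
    return visitor.struct(iceberg, file, results);
  }

  // Example "engine": produces a description of how each field would be read.
  static final class DescribeVisitor extends Visitor<String> {
    @Override
    String struct(Node icebergStruct, Node fileStruct, List<String> fieldResults) {
      return "struct<" + String.join(",", fieldResults) + ">";
    }

    @Override
    String primitive(Node icebergPrimitive, Node filePrimitive) {
      String source = filePrimitive == null ? "missing" : filePrimitive.primitive;
      return icebergPrimitive.name + ":" + icebergPrimitive.primitive + "<-" + source;
    }
  }

  public static void main(String[] args) {
    Node iceberg = Node.struct("root",
        Node.primitive("id", "long"), Node.primitive("name", "string"));
    Node file = Node.struct("root", Node.primitive("id", "i64"));
    System.out.println(visit(iceberg, file, new DescribeVisitor()));
  }
}
```

A real engine-specific visitor would return value readers rather than strings, but the split is the same: the traversal is shared, and only the callbacks differ per engine.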

pvary · 2025-03-10T14:15:14Z

vortex/src/main/java/org/apache/iceberg/vortex/VortexSchemaWithTypeVisitor.java

+    // TODO(aduffy): metadata in Vortex schemas to allow embedding the Iceberg field ID number?
+    //  For now we just use the field index, which might not be right when we have projections...


At a minimum, column-name-based mapping should be used.

pvary and others added 21 commits February 27, 2025 22:31

Proposed API interfaces

a658fa7

Parquet/ORC/Avro implementation of the new reader/writer interfaces

21a0ed1

Implementation of the Generic reader/writer classes

c770d1e

Arrow implementation

bbbfae8

Spark implementation

2e0d171

Flink implementation

19db4be

spotless fix

576fffa

read->readerBuilder

4413321

Move static initializer method to an inner class to avoid classloader issues

8ca5951

Reader API changes and DynMethod registry

ded7fea

Test fix for readers

d7d7cb4

Writer changes with new classes

0193655

Make the code a bit more pretty

271c2c4

Test refactor

d1a720e

Javadoc and some formatting

0204227

Remove new reader, pimp the old one

b9d7ad2

Deprecations/checks and found new places to apply the changes

95cb083

Remove Initializer

0399622

Add iceberg-vortex subproject with dependencies on mavenLocal JARs

ef46b38

Implement generic reader for Vortex

0856cc1

Generic reader JUnit test

9e6e4fe

github-actions bot added BUILD API CORE SPARK labels Mar 8, 2025

a10y commented Mar 8, 2025

pvary reviewed Mar 10, 2025

pvary force-pushed the file_Format_api_without_base branch from 0399622 to 815fff2 Compare March 11, 2025 16:28

pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from c6929b1 to 05309bc Compare May 26, 2025 15:15

pvary force-pushed the file_Format_api_without_base branch from 820cd3e to 0c35110 Compare June 27, 2025 10:35

pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from 16265b6 to a04dec0 Compare July 22, 2025 13:25

pvary force-pushed the file_Format_api_without_base branch 3 times, most recently from f828c3c to eb205c1 Compare August 15, 2025 21:50

pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from a7ce1cb to 40f91b4 Compare August 25, 2025 21:09

pvary force-pushed the file_Format_api_without_base branch from df3d7c4 to 54ff177 Compare October 8, 2025 08:54

pvary force-pushed the file_Format_api_without_base branch 14 times, most recently from 454fa9c to 24ab0ba Compare November 11, 2025 08:45

pvary force-pushed the file_Format_api_without_base branch 4 times, most recently from c61b8ba to 5f91620 Compare November 24, 2025 08:22

pvary force-pushed the file_Format_api_without_base branch from 5f91620 to c3babfd Compare December 3, 2025 13:38
