a10y commented Mar 8, 2025

  • Generic reader
  • Spark reader (row-based)
  • Spark reader (vectorized)
  • Writers

I tried to match the existing Iceberg code style and relied on the ORC implementation for inspiration. I think I managed to break down the different layers:

  • ReaderBuilder provides a mapping between the FileFormat's object model and the Iceberg object model via ReaderFunction<D>, where D is the record type.
  • ValueReaders are how record entries are extracted from column vectors.
  • GenericXYZ is how an Iceberg GenericRecord gets read from and written to the appropriate files.
  • Every format needs both row-based accessors to records (for delete files) and batch-based accessors to records (for data files).

Comment on lines +167 to +168
vortex-jni = { module = "dev.vortex:vortex-jni", version.ref = "vortex" }
vortex-spark = { module = "dev.vortex:vortex-spark", version.ref = "vortex" }
Author

I have not set up Maven Central publishing for these jars, so I am just hosting them out of mavenLocal for the moment.

Comment on lines +104 to +107
public ReadBuilder withNameMapping(NameMapping newNameMapping) {
// TODO(aduffy): is this for field renames? Figure out how to patch this through.
return this;
}
Owner

Iceberg uses the column ID to map the Iceberg table columns to the data file columns.
New data files written by Iceberg contain the Iceberg column ID in the data file metadata.
When files are generated by external writers (for example, in migration use cases for old Hive tables), column-name-based mapping is used instead. This mapping is driven by the NameMapping provided here.
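
The lookup order described above can be sketched roughly as follows. This is a minimal illustration only: `FieldIdResolver` and its plain `Map<String, Integer>` are hypothetical stand-ins, not the Iceberg `NameMapping` API or code from this PR, and the sketch assumes one name per field with no nesting.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper for illustration; not part of Iceberg or this PR. */
final class FieldIdResolver {
  private FieldIdResolver() {}

  /**
   * Resolves the Iceberg field ID for one data file column.
   *
   * @param embeddedId  field ID found in the file metadata, or null when the
   *                    file was produced by an external writer
   * @param columnName  column name as stored in the data file
   * @param nameMapping simplified name-to-ID mapping (assumption: flat schema)
   */
  static OptionalInt resolve(
      Integer embeddedId, String columnName, Map<String, Integer> nameMapping) {
    if (embeddedId != null) {
      // Files written by Iceberg carry the ID, so it takes precedence.
      return OptionalInt.of(embeddedId);
    }
    // Otherwise fall back to the name-based mapping.
    Integer mapped = nameMapping.get(columnName);
    return mapped == null ? OptionalInt.empty() : OptionalInt.of(mapped);
  }
}
```

A column with no embedded ID and no entry in the mapping resolves to an empty result here; how a reader should treat such a column is left open in this sketch.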

import org.apache.iceberg.types.Types;

public abstract class VortexSchemaWithTypeVisitor<T> {
// What is the point of this??
Owner

This is the base class used by the engine-specific implementations to map the FileFormat schema to the engine-specific schema. The mapping is done through the intermediate Iceberg schema. The visitor's responsibility is to bring the engine-specific part.
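
As a rough sketch of that split, the base class can own the traversal while each engine supplies only the callbacks. Everything below is hypothetical and heavily simplified (flat schema, types as strings); `SketchColumn`, `SketchSchemaWithTypeVisitor`, and `DescribingVisitor` are illustrative names, not the classes in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pairing of one column's Iceberg type with its file type. */
final class SketchColumn {
  final String name;
  final String icebergType;
  final String fileType;

  SketchColumn(String name, String icebergType, String fileType) {
    this.name = name;
    this.icebergType = icebergType;
    this.fileType = fileType;
  }
}

/** Simplified base visitor: shared traversal, engine-specific callbacks. */
abstract class SketchSchemaWithTypeVisitor<T> {
  /** Called once per leaf column. */
  protected abstract T primitive(SketchColumn column);

  /** Called once with the results collected for all columns. */
  protected abstract T struct(List<String> names, List<T> fieldResults);

  static <T> T visit(List<SketchColumn> columns, SketchSchemaWithTypeVisitor<T> visitor) {
    List<String> names = new ArrayList<>();
    List<T> results = new ArrayList<>();
    for (SketchColumn column : columns) {
      names.add(column.name);
      results.add(visitor.primitive(column));
    }
    return visitor.struct(names, results);
  }
}

/** Stand-in "engine" whose result type is just a description string. */
final class DescribingVisitor extends SketchSchemaWithTypeVisitor<String> {
  @Override
  protected String primitive(SketchColumn column) {
    return column.name + ": " + column.fileType + " -> " + column.icebergType;
  }

  @Override
  protected String struct(List<String> names, List<String> fieldResults) {
    return "struct<" + String.join(", ", fieldResults) + ">";
  }
}
```

A real engine implementation would presumably return value readers or engine types instead of strings; the string result is only there to keep the sketch self-contained.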

Comment on lines +57 to +58
// TODO(aduffy): metadata in Vortex schemas to allow embedding the Iceberg field ID number?
// For now we just use the field index, which might not be right when we have projections...
Owner

At a minimum, column-name-based mapping should be used.
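
To illustrate why the field index alone is fragile under projection, here is a small sketch that resolves projected columns by name instead. `ProjectionByName` is a hypothetical helper, and the flat lists of column names are an assumption made for brevity.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of Iceberg or this PR. */
final class ProjectionByName {
  private ProjectionByName() {}

  /**
   * Returns, for each projected column, its position in the file schema,
   * or -1 when the file does not contain a column with that name.
   */
  static int[] filePositions(List<String> fileColumns, List<String> projectedColumns) {
    int[] positions = new int[projectedColumns.size()];
    for (int i = 0; i < positions.length; i++) {
      // Look the column up by name rather than trusting its index in the projection.
      positions[i] = fileColumns.indexOf(projectedColumns.get(i));
    }
    return positions;
  }
}
```

With file columns `id, name, ts` and a projection of `ts, id`, reading by projection index would pick file positions 0 and 1 (`id` and `name`), whereas the name-based lookup picks positions 2 and 0.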

pvary force-pushed the file_Format_api_without_base branch from 0399622 to 815fff2 Compare March 11, 2025 16:28
pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from c6929b1 to 05309bc Compare May 26, 2025 15:15
pvary force-pushed the file_Format_api_without_base branch from 820cd3e to 0c35110 Compare June 27, 2025 10:35
pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from 16265b6 to a04dec0 Compare July 22, 2025 13:25
pvary force-pushed the file_Format_api_without_base branch 3 times, most recently from f828c3c to eb205c1 Compare August 15, 2025 21:50
pvary force-pushed the file_Format_api_without_base branch 2 times, most recently from a7ce1cb to 40f91b4 Compare August 25, 2025 21:09
pvary force-pushed the file_Format_api_without_base branch from df3d7c4 to 54ff177 Compare October 8, 2025 08:54
pvary force-pushed the file_Format_api_without_base branch 14 times, most recently from 454fa9c to 24ab0ba Compare November 11, 2025 08:45
pvary force-pushed the file_Format_api_without_base branch 4 times, most recently from c61b8ba to 5f91620 Compare November 24, 2025 08:22
pvary force-pushed the file_Format_api_without_base branch from 5f91620 to c3babfd Compare December 3, 2025 13:38