WIP: Vortex Iceberg #1
base: file_Format_api_without_base
vortex-jni = { module = "dev.vortex:vortex-jni", version.ref = "vortex" }
vortex-spark = { module = "dev.vortex:vortex-spark", version.ref = "vortex" }
I have not set up Maven Central publishing for these jars yet, so for the moment they are just hosted out of mavenLocal.
public ReadBuilder withNameMapping(NameMapping newNameMapping) {
  // TODO(aduffy): is this for field renames? Figure out how to patch this through.
  return this;
}
Iceberg uses column IDs to map the Iceberg table columns to the data file columns.
New data files written by Iceberg contain the Iceberg column ID in the data file metadata.
When files are generated by external writers (for example, migration use cases for old Hive tables), column-name-based mapping is used instead. That mapping is driven by the NameMapping provided here.
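The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not Iceberg's or this PR's actual code: `ColumnResolution`, `FileColumn`, and the plain `Map<String, Integer>` standing in for `NameMapping` are all hypothetical simplifications. It prefers the field ID stored in the file and falls back to the name mapping only when the file carries no ID.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ColumnResolution {
  /** Hypothetical stand-in for a data file column; fieldId is null when the writer stored no Iceberg ID. */
  static final class FileColumn {
    final String name;
    final Integer fieldId;

    FileColumn(String name, Integer fieldId) {
      this.name = name;
      this.fieldId = fieldId;
    }
  }

  /**
   * Resolves the Iceberg field ID for a file column.
   * nameMapping is a simplified stand-in for NameMapping (assumption): file column name -> field ID.
   */
  static Optional<Integer> resolveFieldId(FileColumn column, Map<String, Integer> nameMapping) {
    if (column.fieldId != null) {
      // File written by Iceberg: the ID in the file metadata is authoritative.
      return Optional.of(column.fieldId);
    }
    // File from an external writer: fall back to the name-based mapping.
    return Optional.ofNullable(nameMapping.get(column.name));
  }

  /** Builds field ID -> position in the file, skipping columns that cannot be resolved. */
  static Map<Integer, Integer> positionsByFieldId(
      List<FileColumn> fileColumns, Map<String, Integer> nameMapping) {
    Map<Integer, Integer> positions = new LinkedHashMap<>();
    for (int i = 0; i < fileColumns.size(); i++) {
      final int position = i;
      resolveFieldId(fileColumns.get(i), nameMapping)
          .ifPresent(id -> positions.put(id, position));
    }
    return positions;
  }
}
```

Columns that have neither an embedded ID nor a name-mapping entry are left out of the result, so in this sketch a reader would treat them as absent from the table schema.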
import org.apache.iceberg.types.Types;

public abstract class VortexSchemaWithTypeVisitor<T> {
  // What is the point of this??
This is the base class used by the engine-specific implementations to map the FileFormat schema to the engine-specific schema. The mapping is done through the intermediate Iceberg schema. The visitor's responsibility is to supply the engine-specific part.
// TODO(aduffy): metadata in Vortex schemas to allow embedding the Iceberg field ID number?
// For now we just use the field index, which might not be right when we have projections...
At a minimum, column-name-based mapping should be used.
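To illustrate why the field index alone is fragile under projection, here is a small sketch that resolves projected columns by name instead of by position. `ProjectionLookup` and its method are hypothetical names for illustration only and are not part of Iceberg or Vortex.

```java
import java.util.List;

final class ProjectionLookup {
  /**
   * Returns, for each projected column, its position in the file's column list,
   * or -1 when the file has no column with that name.
   */
  static int[] resolveProjection(List<String> fileColumns, List<String> projectedColumns) {
    int[] positions = new int[projectedColumns.size()];
    for (int i = 0; i < projectedColumns.size(); i++) {
      // Look up by name rather than assuming projected index i == file index i.
      positions[i] = fileColumns.indexOf(projectedColumns.get(i));
    }
    return positions;
  }
}
```

With file columns `id, name, ts` and a projection of `ts, id`, name-based lookup yields file positions 2 and 0, whereas reusing the projection's own indexes (0 and 1) would read `id` and `name` instead.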
I tried to match the existing Iceberg code style and relied on the ORC implementation for inspiration. I think I managed to break down the different layers:
- ReaderBuilder provides a mapping between the FileFormat's object model and the Iceberg object model via ReaderFunction<D>, where D is the record type.
- ValueReaders are how record entries are extracted from column vectors.
- GenericXYZ is how Iceberg GenericRecord gets read from and written to the appropriate files.