-
Notifications
You must be signed in to change notification settings - Fork 749
[GOBBLIN-2230] Evolve iceberg table schema for partition copy #4142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-2230] Evolve iceberg table schema for partition copy #4142
Conversation
7bb0252 to
c2db4e8
Compare
...ain/java/org/apache/gobblin/data/management/copy/iceberg/IcebergOverwritePartitionsStep.java
Show resolved
Hide resolved
c2db4e8 to
c2c98d3
Compare
| this.getSrcIcebergTable().accessTableMetadata().schema(), | ||
| this.partitionColumnName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fetching schema after generating files may lead to corruption in very high concurrency scenario, we should get schema when we get manifest files similar to what done in full table replication.
| * @param partitionValue the partition column value on which data files will be replaced | ||
| */ | ||
| protected void overwritePartition(List<DataFile> dataFiles, String partitionColName, String partitionValue) | ||
| protected void updateSchemaAndPartition(List<DataFile> dataFiles, Schema updatedSchema, String partitionColName, String partitionValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update Java doc here as well
Also name of function maybe changed to updateSchemaAndOverwritePartition ?
|
|
||
| updateSchema(updatedSchema, false); | ||
| overwritePartition(dataFiles, partitionColName, partitionValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be part of a single table transaction commit ?
Is there any issue with using of that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating schema like this is not possible in a single transaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay then lets break this down so that retryer of Overwritepartitionstep doesn't retry schema update again, might need to refactor IcebergOverwritePartitionsStep to call both function sequentially one after other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This partition commit and schema update step should be made similar to what done in full table i.e. even the destination table schema should not have been updated in between the job or this should be documented properly and tested as well
3056fbc to
23b6b15
Compare
| } | ||
|
|
||
| private Retryer<Void> createSchemaUpdateRetryer() { | ||
| Config config = ConfigFactory.parseProperties(this.properties); | ||
| Config retryerOverridesConfig = config.hasPath(IcebergOverwritePartitionsStep.SCHEMA_UPDATE_RETRYER_CONFIG_PREFIX) | ||
| ? config.getConfig(IcebergOverwritePartitionsStep.SCHEMA_UPDATE_RETRYER_CONFIG_PREFIX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this be combined with above function by passing prefix as function param ?
...main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDatasetFinder.java
Show resolved
Hide resolved
|
|
||
| TableOperations tableOps = catalog.newTableOps(tableId); | ||
| IcebergTable icebergTable = new IcebergTable(tableId, | ||
| catalog.newTableOps(tableId), | ||
| tableOps, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, is this change required ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, not required.. it was required for previous version but didn't revert the change completely
23b6b15 to
c188edd
Compare
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Tests
Commits