Delta Kernel Draft PR #729
base: main
<module>xtable-aws</module>
<module>xtable-hive-metastore</module>
<module>xtable-service</module>
<!-- <module>xtable-service</module>-->
This should be added back. Any reason why you had to comment this out?
@vaibhavk1992 I think if you rebase onto the latest main branch, you shouldn't see those failures.
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-kernel-api</artifactId>
  <version>4.0.0</version>
Can you add a property in the root pom called <delta.kernel.version>4.0.0</delta.kernel.version>, instead of using the hardcoded value?
Also curious how you ended up choosing the Delta Kernel version. Is there a specific version that needs to align with the Delta Lake version we have in the repo?
public class DeltaKernelConversionSourceProvider extends ConversionSourceProvider<Long> {
  @Override
  public DeltaKernelConversionSource getConversionSourceInstance(SourceTable sourceTable) {
    Configuration hadoopConf = new Configuration();
Is there any reason why you are creating a new hadoopConf? Can you instead use the hadoopConf from the parent class, similar to what DeltaConversionSourceProvider does?
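A minimal sketch of the suggestion above, written with simplified stand-in types so it compiles on its own. Every class name here (SketchConfiguration, SketchSourceProvider, SketchKernelSource, SketchKernelSourceProvider) is hypothetical, and the init method on the parent is an assumption about how the shared configuration gets set; the real classes in the repo may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for org.apache.hadoop.conf.Configuration, so the sketch compiles alone. */
final class SketchConfiguration {
  private final Map<String, String> values = new HashMap<>();

  void set(String key, String value) {
    values.put(key, value);
  }

  String get(String key) {
    return values.get(key);
  }
}

/** Simplified stand-in for the parent provider: it owns the shared configuration. */
abstract class SketchSourceProvider<S> {
  protected SketchConfiguration hadoopConf;

  /** Assumption: called once by the framework before any source is created. */
  void init(SketchConfiguration hadoopConf) {
    this.hadoopConf = hadoopConf;
  }

  abstract S getConversionSourceInstance(String basePath);
}

/** Hypothetical source that records which configuration it was built with. */
final class SketchKernelSource {
  final String basePath;
  final SketchConfiguration conf;

  SketchKernelSource(String basePath, SketchConfiguration conf) {
    this.basePath = basePath;
    this.conf = conf;
  }
}

final class SketchKernelSourceProvider extends SketchSourceProvider<SketchKernelSource> {
  @Override
  SketchKernelSource getConversionSourceInstance(String basePath) {
    // Reuse the inherited hadoopConf instead of constructing a fresh configuration here,
    // so settings supplied by the caller (credentials, filesystem options) are not lost.
    return new SketchKernelSource(basePath, hadoopConf);
  }
}
```

The point of the design is that a configuration created inside getConversionSourceInstance would silently ignore whatever the caller passed in, while the inherited field carries those settings through.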
return INSTANCE;
}

public InternalSchema toInternalSchema_v2(StructType structType) {
I think we can just call this toInternalSchema, since it's in its own distinct class, right?
// Get schema from Delta Kernel's snapshot
io.delta.kernel.types.StructType schema = snapshot.getSchema();

System.out.println("Kernelschema: " + schema);
[nit] We'll need to remove these in the final version of the PR.
 * Converts between Delta and InternalTable schemas. Some items to be aware of:
 *
 * <ul>
 * <li>Delta schemas are represented as Spark StructTypes which do not have enums so the enum
I think we can leave this file as is, right?
@vaibhavk1992 let's remove the changes to this file. They don't seem necessary
import org.apache.xtable.spi.extractor.ConversionSource;

@Builder
public class DeltaKernelConversionSource implements ConversionSource<Long> {
We will need a full implementation of all the interface methods; otherwise this will fail during the table format sync. Can you refer to the implementation of DeltaConversionSource for these methods?
All the methods have been added; only the commit backlog one is not fully resolved.
xtable-core/src/main/java/org/apache/xtable/delta/DeltaKernelDataFileExtractor.java
<module>xtable-aws</module>
<module>xtable-hive-metastore</module>
<module>xtable-service</module>
<!-- <module>xtable-service</module>-->
Why comment this out?
# limitations under the License.
#
junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.enabled=false
I have run the tests locally with this set to true and they pass. Can we revert this config and see the GH build?
This is great progress @vaibhavk1992, added some comments.
</executions>
<configuration>
  <skip>${skipUTs}</skip>
  <redirectTestOutputToFile>true</redirectTestOutputToFile>
Why revert this?
-
tableBasePath: /Desktop/opensource/iceberg/warehouse/demo/nyc/taxis
tableDataPath: /Desktop/opensource/iceberg/warehouse/demo/nyc/taxis/data
tableBasePath: /Users/vaibhakumar/Desktop/opensource/iceberg/warehouse/demo/nyc/taxis
This one can be reverted too?
xtable-core/src/main/java/org/apache/xtable/delta/DeltaKernelActionsConverter.java
 * @param versionToStartFrom The version to start from.
 */
@Builder
public DeltaKernelIncrementalChangesState(
@vinishjail97 This is where we need the implementation for converting responses from the kernel.
long versionNumberAtLastSyncInstant = snapshot.getVersion();
System.out.println("versionNumberAtLastSyncInstant: " + versionNumberAtLastSyncInstant);
// resetState(0, engine,table);
@vinishjail97 This is the place I have commented out; the call happens inside this method.
@vaibhavk1992 can you write up a summary of next steps and blockers for this feature?
Below is a summary of the differences between the two schemas (Delta vs Kernel), along with what still differs between the two.
Comparison of Schema Responses: Delta Kernel vs Delta Log
This document outlines the differences in schema responses when using Delta Kernel and Delta Log APIs to retrieve changes in a Delta table. The comparison highlights the structure and format of the responses, providing insights into how the two approaches differ.
Delta Kernel Schema Response
When using the
Sample Output
1 row is an object ==> io.delta.kernel.internal.data.ColumnarBatchRow@20c03e47
Key Characteristics
Use Case
This format is suitable for low-level data processing where the focus is on performance and accessing raw data.
Delta Log Schema Response
When using the
This issue is currently blocked. I raised it with the Delta team quite a few times but have had no response.
@vaibhavk1992 I pushed some changes in the latest commit to extract the add and remove files per version.
import org.apache.xtable.model.schema.InternalType;
import org.apache.xtable.schema.SchemaUtils;

public class DeltaKernelSchemaExtractor {
Can we have a unit test for this class?
I have checked the currently running test cases: wherever we use getCurrentSnapshot, we call this class and its methods. So the checks are already in place. @the-other-tim-brown
I'm not asking for a functional test, I am asking for unit testing. We want high coverage with the testing of these key components.
@the-other-tim-brown I have added one unit test, testConvertFromDeltaPartitionSinglePartition. Please confirm whether this looks fine, and I will add the others too.
@Log4j2
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class DeltaKernelPartitionExtractor {
Can we have a unit test for this class?
I have checked the currently running test cases: wherever we use getCurrentSnapshot and build an InternalTable, we use this class and its methods. So the checks are already in place. @the-other-tim-brown
 */
@Log4j2
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class DeltaKernelStatsExtractor {
Similarly here, we could use some unit testing.
I have checked the currently running test cases: the test testInsertsUpsertsAndDeletes uses getTableChangeForCommit, which goes through this class and its methods. So the checks are already in place. @the-other-tim-brown
;
// String tableBasePath = snapshot.dataPath().toUri().toString();
Can this be cleaned up?
this.fields = schema.getFields();

StructType fullSchema = snapshot.getSchema(); // The full table schema
List<String> partitionColumns = snapshot.getPartitionColumnNames(); // List<String>
Remove the comment // List<String>?
StructType fullSchema = snapshot.getSchema(); // The full table schema
List<String> partitionColumns = snapshot.getPartitionColumnNames(); // List<String>

List<StructField> partitionFields_strfld =
Nitpick: Use camelCase for variable names
this.dataFilesIterator =
    Collections
        .emptyIterator(); // Initialize the dataFilesIterator by iterating over the scan files
This assignment does not seem necessary since we assign at line 139
  Instant deltaCommitInstant = Instant.ofEpochMilli(snapshot.getTimestamp(engine));
  return deltaCommitInstant.equals(instant) || deltaCommitInstant.isBefore(instant);
} catch (Exception e) {
  System.err.println(
Use logging instead of print lines
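A minimal sketch of what this could look like. It uses java.util.logging only so the example is self-contained; the classes in this PR are annotated with @Log4j2, so the project logger would be the natural choice there. The class name CommitInstantCheck, the LongSupplier parameter (a stand-in for snapshot.getTimestamp(engine)), and returning false on failure are all assumptions for illustration, not the PR's actual code.

```java
import java.time.Instant;
import java.util.function.LongSupplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CommitInstantCheck {
  private static final Logger LOG = Logger.getLogger(CommitInstantCheck.class.getName());

  private CommitInstantCheck() {}

  /**
   * Returns true when the commit timestamp is at or before the given instant.
   * commitTimestampMillis is a hypothetical stand-in for snapshot.getTimestamp(engine).
   */
  static boolean isCommitAtOrBefore(LongSupplier commitTimestampMillis, Instant instant) {
    try {
      Instant commitInstant = Instant.ofEpochMilli(commitTimestampMillis.getAsLong());
      // Same result as commitInstant.equals(instant) || commitInstant.isBefore(instant).
      return !commitInstant.isAfter(instant);
    } catch (RuntimeException e) {
      // Log through the logger, with the exception attached, instead of System.err.println.
      LOG.log(Level.WARNING, "Failed to read commit timestamp for instant " + instant, e);
      // Assumption: treat a failed lookup as "not at or before".
      return false;
    }
  }
}
```

Passing the exception to the logger keeps the stack trace and lets the log level and destination be controlled by configuration, which a print line cannot do.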
}

@Test
void testConvertFromDeltaPartitionSinglePartition() {
@the-other-tim-brown I have added this unit test. Please confirm whether this looks fine, and I will add the others too.