-
Couldn't load subscription status.
- Fork 1.9k
[KERNEL] Extend getStatsSchema to include collated stats
#5380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
4840c33 to
dc9cba1
Compare
| /** | ||
| * Set of {@link CollationIdentifier}s referenced by this predicate or any of its child | ||
| * expressions | ||
| */ | ||
| private final Set<CollationIdentifier> collationIdentifiers; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have this field. We already have collationIdentifier in Predicate, so having this just increases complexity. Also, getReferencedCollations is only called once in the codebase, so we don't lose much by not persisting its value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments/questions on the src code. Will review the tests in a bit.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/DataSkippingPredicate.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/DataSkippingPredicate.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/DataSkippingPredicate.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/StatsSchemaHelper.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Given a data schema and a set of collation identifiers returns the expected schema for | ||
| * collation-aware statistics columns. This means 1) replace logical names with physical names 2) | ||
| * set nullable=true 3) only keep collated-stats eligible fields (`StringType` fields) | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what we do in delta spark for the stats read schema? Or is there an optimization where we only read the collated stats for columns referenced in the predicate (with that collation)?
Maybe this is a future optimization we could consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can optimize it like that. We can do this for both collated and binary stats. Let's leave this for future optimization so we don't complicate this PR further.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/StatsSchemaHelper.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/skipping/StatsSchemaHelper.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments + pls fix the failing tests. Afterward, LGTM
| } | ||
| } | ||
|
|
||
| test("partition and data skipping - combined pruning on partition and data columns") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - add collation desc to title
| } | ||
| } | ||
|
|
||
| test("data skipping - predicates with SPARK.UTF8_BINARY on data column") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test case over nested string column?
| test("pruneStatsSchema - collated min/max columns") { | ||
| val utf8Lcase = CollationIdentifier.fromString("SPARK.UTF8_LCASE") | ||
| val unicode = CollationIdentifier.fromString("ICU.UNICODE") | ||
| val testSchema = new StructType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - I think you could define vals to make this simpler that can be reused?
val nestedField = ...
val s1Field = ...
val allFields = ... // (s1 + i1 + i2 + nested)
Which Delta project/connector is this regarding?
Description
In this PR,
getStatsSchemais extended to include the collated stats schema for any collation referenced byDataSkippingPredicate(schema example). Also, added E2E tests for collated data skipping.How was this patch tested?
New tests.
Does this PR introduce any user-facing changes?
No.