-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark, API: Enhance hashing efficiency by operating on raw UTF-8 bytes #12657
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xiaoxuandev I think the general fix is great, just had a comment about keeping the existing hash(String)
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/functions/BucketFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkBucketFunction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xiaoxuandev! Just had some non-blocking nits. I'll leave it open for just a bit in case others wanted to review.
@@ -82,7 +82,12 @@ public void testSpecValues() { | |||
Assert.assertEquals( | |||
"Spec example: hash(\"iceberg\") = 1210000089", | |||
1210000089, | |||
new BucketFunction.BucketString().hash("iceberg")); | |||
new BucketFunction.BucketString().hash("iceberg".getBytes(StandardCharsets.UTF_8))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We could just leave this the same as before just to keep the diff smaller.
@@ -74,10 +74,14 @@ public void testSpecValues() { | |||
.as("Spec example: hash(2017-11-16T22:31:08) = -2047944441") | |||
.isEqualTo(-2047944441); | |||
|
|||
assertThat(new BucketFunction.BucketString().hash("iceberg")) | |||
assertThat(new BucketFunction.BucketString().hash("iceberg".getBytes(StandardCharsets.UTF_8))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Same as above, we could just leave this assertion the same as earlier just to keep the diff smaller.
This change removes unnecessary
toString()
conversions by directly processing UTF-8 bytes, improving efficiency and reducing memory allocations.