-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Use short string in Variant when possible #13284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@aihuaxu @RussellSpitzer Please review. |
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/variants/TestPrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/variants/TestPrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/variants/TestPrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/variants/TestSerializedArray.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation may not be correct, added some suggestions to clean up and added what I think is the right impl
for (int i = 0; i < 10_000; i += 1) { | ||
fields.put( | ||
RandomUtil.generateString(10, random), | ||
VariantTestUtil.createString(RandomUtil.generateString(10, random))); | ||
VariantTestUtil.createShortString(RandomUtil.generateString(10, random))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change here? The test should work with the previous method correct? I don't think it's broken using the larger header and since this test is explicitly trying to to make a large object shouldn't we want that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, because assertVariant is assuming that a string this small should be a short string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should clean up the testUtil then to have createString use the correct layout depending on the input string length. Better to not leave in a method that isn't correct.
@@ -442,6 +442,7 @@ public void testString() { | |||
|
|||
assertThat(value.type()).isEqualTo(PhysicalType.STRING); | |||
assertThat(value.get()).isEqualTo("iceberg"); | |||
assertThat(value.sizeInBytes()).isEqualTo(12); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use the assertVariantString here right?
@@ -107,6 +121,15 @@ static SerializedPrimitive createString(String string) { | |||
return SerializedPrimitive.from(buffer, buffer.get(0)); | |||
} | |||
|
|||
/** Creates a short string primitive of max 63 chars to use only 1 header */ | |||
static SerializedShortString createShortString(String string) { | |||
byte[] utf8 = string.getBytes(StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put a precondition in here to make sure no one uses this test method with a string of longer than 63 bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create string above should have a check that the length is 64 or greater
I actually think we shouldn't have a precondition for createString,
createString is valid with short strings as well only shortString is invalid with long strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more here, we probably do need to keep some tests making sure a "short" string still works in a "big string" object
@@ -107,6 +121,15 @@ static SerializedPrimitive createString(String string) { | |||
return SerializedPrimitive.from(buffer, buffer.get(0)); | |||
} | |||
|
|||
/** Creates a short string primitive of max 63 chars to use only 1 header */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, the max length here is 63 bytes right? 0b11111101 >> 2 = 63
And it's bytes and not characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few remaining test nits to clean up but I think this is a good improvement so far.
For followups -
Make sure we have coverage for small - 5 byte header strings
PR for #13282.
Save 4 bytes for short strings stored in Variant by reserving first byte only for header. Regular strings continue to use 1 (header) + 4 (offset sizes) bytes.