Skip to content

Use short string in Variant when possible #13284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

manirajv06
Copy link

PR for #13282.

Save 4 bytes for short strings stored in Variant by reserving first byte only for header. Regular strings continue to use 1 (header) + 4 (offset sizes) bytes.

@github-actions github-actions bot added the core label Jun 9, 2025
@manirajv06
Copy link
Author

@aihuaxu @RussellSpitzer Please review.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation may not be correct, added some suggestions to clean up and added what I think is the right impl

for (int i = 0; i < 10_000; i += 1) {
fields.put(
RandomUtil.generateString(10, random),
VariantTestUtil.createString(RandomUtil.generateString(10, random)));
VariantTestUtil.createShortString(RandomUtil.generateString(10, random)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here? The test should work with the previous method correct? I don't think it's broken using the larger header and since this test is explicitly trying to to make a large object shouldn't we want that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, because assertVariant is assuming that a string this small should be a short string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should clean up the testUtil then to have createString use the correct layout depending on the input string length. Better to not leave in a method that isn't correct.

@@ -442,6 +442,7 @@ public void testString() {

assertThat(value.type()).isEqualTo(PhysicalType.STRING);
assertThat(value.get()).isEqualTo("iceberg");
assertThat(value.sizeInBytes()).isEqualTo(12);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use the assertVariantString here right?

@@ -107,6 +121,15 @@ static SerializedPrimitive createString(String string) {
return SerializedPrimitive.from(buffer, buffer.get(0));
}

/** Creates a short string primitive of max 63 chars to use only 1 header */
static SerializedShortString createShortString(String string) {
byte[] utf8 = string.getBytes(StandardCharsets.UTF_8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put a precondition in here to make sure no one uses this test method with a string of longer than 63 bytes

Copy link
Member

@RussellSpitzer RussellSpitzer Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create string above should have a check that the length is 64 or greater
I actually think we shouldn't have a precondition for createString,

createString is valid with short strings as well only shortString is invalid with long strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more here, we probably do need to keep some tests making sure a "short" string still works in a "big string" object

@@ -107,6 +121,15 @@ static SerializedPrimitive createString(String string) {
return SerializedPrimitive.from(buffer, buffer.get(0));
}

/** Creates a short string primitive of max 63 chars to use only 1 header */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, the max length here is 63 bytes right? 0b11111101 >> 2 = 63
And it's bytes and not characters

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few remaining test nits to clean up but I think this is a good improvement so far.

For followups -
Make sure we have coverage for small - 5 byte header strings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants