Skip to content

MINOR: perf optimization for header serialization and type conversion#21762

Open
mjsax wants to merge 8 commits intoapache:trunkfrom
mjsax:minor-optimize-header-serializer-bytebuffer-no-bytearrayoutputstream
Open

MINOR: perf optimization for header serialization and type conversion#21762
mjsax wants to merge 8 commits intoapache:trunkfrom
mjsax:minor-optimize-header-serializer-bytebuffer-no-bytearrayoutputstream

Conversation

@mjsax
Copy link
Member

@mjsax mjsax commented Mar 15, 2026

This PR replaces the usage of ByteArrayOutputStreams with ByteBuffers.
Some manual benchmarks show a perf improvement for the value-ts-header
and session-header converters of 3x.

mjsax added 2 commits March 14, 2026 23:50
This PR replaces the usage of ByteArrayOutputStreams with ByteBuffers.
Some manual benchmarks show a perf improvement for the value-ts-header
and session-header convertes of 3x.
@mjsax mjsax added streams kip Requires or implements a KIP labels Mar 15, 2026
record.timestampType(),
record.serializedKeySize(),
record.serializedValueSize(),
recordValueWithTimestamp != null ? recordValueWithTimestamp.length : 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stumbled over this by chance -- it's a long standing minor bug... Fixing on the side.

record.timestampType(),
record.serializedKeySize(),
record.serializedValueSize(),
recordValueWithTimestampAndHeaders != null ? recordValueWithTimestampAndHeaders.length : 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same...

record.timestampType(),
record.serializedKeySize(),
record.serializedValueSize(),
recordValueWithHeaders != null ? recordValueWithHeaders.length : 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And one more


private final RecordConverter timestampedValueConverter = rawValueToTimestampedValue();
private final RecordConverter headersValueConverter = rawValueToHeadersValue();
private final RecordConverter sessionValueConverter = rawValueToSessionHeadersValue();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding missing case for session-header-store

{50, 2, 20, 104, 101, 97, 100, 101, 114, 45, 107, 101, 121, 24, 104, 101, 97, 100, 101,
114, 45, 118, 97, 108, 117, 101, 0, 0, 0, 0, 0, 0, 0, 10, 0};
{50, 2, 20, 'h', 'e', 'a', 'd', 'e', 'r', '-', 'k', 'e', 'y', 24, 'h', 'e', 'a', 'd', 'e',
'r', '-', 'v', 'a', 'l', 'u', 'e', 0, 0, 0, 0, 0, 0, 0, 10, value[0]};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a way to make this easier to read :)

int estimatedBufferSize = 5;
for (final Header header : headersArray) {
// adding 5 bytes for varint encoding of header-key length
estimatedBufferSize += 5 + header.key().length();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use key.getBytes(StandardCharsets.UTF_8).length.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to reuse the byte array from key.getBytes(StandardCharsets.UTF_8)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it not be the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. UTF-16 vs UTF-8...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, The serialized format writes UTF-8 bytes, while String.length() counts UTF-16 code units.

*/
static ByteBuffer prepareByteBufferWithSizePrefix(final int prefix, final int bufferSize) {
final ByteBuffer varLengthBuffer = ByteBuffer.allocate(5); // 5 bytes for max varint encoding
ByteUtils.writeVarint(prefix, varLengthBuffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you leverage ByteUtils.sizeOfVarint to avoid creating the temporary buffer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I missed that we have sizeOfVarint -- I was actually considering to maybe add such a helper -- this simplifies things quite a bit.

int estimatedBufferSize = 5;
for (final Header header : headersArray) {
// adding 5 bytes for varint encoding of header-key length
estimatedBufferSize += 5 + header.key().length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to reuse the byte array from key.getBytes(StandardCharsets.UTF_8)


ByteUtils.writeVarint(headersArray.length, out);
// start with 5 bytes for varint encoding of header count
int estimatedBufferSize = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could use sizeOfVarint to get exact size?

int exactBufferSize = ByteUtils.sizeOfVarint(headersArray.length);
    
    for (final Header header : headersArray) {
        final String headerKey = header.key();
        
        int headerKeySize = Utils.utf8Length(headerKey);
        exactBufferSize += ByteUtils.sizeOfVarint(headerKeySize) + headerKeySize;

        final byte[] value = header.value();
        if (value == null) {
            exactBufferSize += ByteUtils.sizeOfVarint(-1);
        } else {
            exactBufferSize += ByteUtils.sizeOfVarint(value.length) + value.length;
        }
    }

@mjsax
Copy link
Member Author

mjsax commented Mar 15, 2026

Thanks @nileshkumar3 @chia7712 -- pushed an update. Needed to restructure some parts of the code...


final static class PreSerializedHeaders {
final int requiredBufferSizeForHeaders;
final byte[][][] serializedHeaders;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 3-dimensional array is a very bad idea... totally kills performance...

public class HeadersSerializerTest {

@Test
public void test() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, 500 million iterations!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I gives me only a benchmark runtime of 20 sec (with the current code)...

PreSerializedHeaders(
final int requiredBufferSizeForHeaders,
final byte[][] rawHeaderKeys,
final byte[][] rawHeaderValued
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rawHeaderValues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kip Requires or implements a KIP streams

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants