[Bug] [Zeta] Fix negative array size exception #10827

Open

lm-ylj wants to merge 6 commits into apache:dev from DobestTech:fix-negative-array-size-exception

Conversation

@lm-ylj
Contributor

@lm-ylj lm-ylj commented Apr 27, 2026

Purpose of this pull request

close #10826

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@github-actions github-actions Bot added the Zeta label Apr 27, 2026
dybyte previously approved these changes Apr 27, 2026
Contributor

@dybyte dybyte left a comment


+1 if CI passes

@nzw921rx
Collaborator

@lm-ylj Thank you for your submission. This is a great fix.

One compatibility concern remains with the current change.

Legacy data with arity <= 127 is encoded as 1 byte, but the new code reads it as 4 bytes using readInt(). This causes the stream position to shift by 3 bytes and breaks deserialization of all subsequent fields.
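
Concretely, the 3-byte shift can be reproduced with plain java.io streams (used here as a stand-in for Hazelcast's ObjectDataOutput/ObjectDataInput; the values are illustrative only):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamShiftSketch {
    public static void main(String[] args) throws IOException {
        // Legacy writer: 1-byte arity (5) followed by a 4-byte field.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(5);
        out.writeInt(42);

        // A new reader that calls readInt() for arity consumes 4 bytes instead of 1,
        // swallowing 3 bytes of the next field and misreading everything after it.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(in.readInt()); // 0x05000000 = 83886080, not 5
    }
}
```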

Impact:

  • This primarily affects users performing rolling upgrades with mixed old/new nodes.
  • New nodes may misread valid legacy records (arity <= 127), leading to deserialization failures, corrupted field decoding, task instability, and possible temporary service unavailability.
  • This is an upgrade-compatibility issue, not just an edge case for large arity values.

Fix scope:

  • Preserve backward compatibility for all valid legacy records, especially arity <= 127.
  • Fix example: on write, keep using writeByte(arity) when arity <= Byte.MAX_VALUE; for arity > Byte.MAX_VALUE, write a special marker first (for example, -1), followed by writeInt(arity). On read, read 1 byte first: if it is non-negative, treat it as the legacy arity; if it is the marker, read the following int as the actual arity.
  • Add compatibility tests for the old-writer -> new-reader path during rolling upgrades, including boundary cases such as 127 and 128.
    
    cc @davidzollo Could you please share your advice?
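
As a sketch of that marker scheme (plain java.io stand-ins for Hazelcast's ObjectDataOutput/ObjectDataInput; the constant name and method names are hypothetical, not the PR's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class ArityCodecSketch {
    // Hypothetical sentinel: a value the legacy writer could never emit as an arity byte.
    static final byte EXTENDED_ARITY_MARKER = -1;

    static void writeArity(DataOutput out, int arity) throws IOException {
        if (arity <= Byte.MAX_VALUE) {
            out.writeByte(arity);                 // legacy 1-byte path, wire format unchanged
        } else {
            out.writeByte(EXTENDED_ARITY_MARKER); // marker first ...
            out.writeInt(arity);                  // ... then the real arity as 4 bytes
        }
    }

    static int readArity(DataInput in) throws IOException {
        byte b = in.readByte();
        if (b >= 0) {
            return b;            // legacy record: the byte is the arity
        }
        return in.readInt();     // extended record: marker seen, read the int
    }

    public static void main(String[] args) throws IOException {
        for (int arity : new int[] {0, 127, 128, 70000}) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            writeArity(new DataOutputStream(bos), arity);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
            System.out.println(arity + " -> " + readArity(in));
        }
    }
}
```

Legacy records stay decodable because a non-negative first byte is unambiguously an arity, while the marker is a byte value the old writer never produced.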


@DanielLeens DanielLeens left a comment


Thanks for chasing the negative-array-size issue. I pulled the latest head locally and rechecked the real cross-task-group row transport path through RecordSerializer.

Runtime path:

Source / Transform / Sink row transport
  -> RecordSerializer.write()
      -> writes type, tableId, rowKind, arity, fields
  -> Hazelcast byte transport
  -> RecordSerializer.read()
      -> reconstructs SeaTunnelRow

The bug is real, but the current fix still changes the on-wire row layout in a backward-incompatible way. Before this PR, arity was encoded in 1 byte. In the current head, the serializer switches to writeInt() / readInt(), so a new reader will consume 4 bytes from an old stream and shift the remaining field offsets. That makes rolling-upgrade / mixed-version clusters unsafe even for valid legacy rows with arity <= 127.

Blocking items:

  1. Please keep backward-compatible row decoding/encoding. The safer shape is to preserve the legacy 1-byte path for legacy-compatible arity values and use a sentinel + int only for larger arity.
  2. Please add compatibility coverage for old-writer -> new-reader, especially around 127 / 128, not only a new-format round trip.
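
A minimal old-writer -> new-reader check along those lines might look like this (plain java.io stand-ins for Hazelcast's interfaces; readArity mirrors the hypothetical sentinel decode shape from item 1):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RollingUpgradeCompatSketch {
    // New-reader decode path (hypothetical shape of the fixed readArity).
    static int readArity(DataInput in) throws IOException {
        byte b = in.readByte();
        return b >= 0 ? b : in.readInt();
    }

    public static void main(String[] args) throws IOException {
        // Old writer: always one signed byte for arity, then the next serialized field.
        ByteArrayOutputStream legacy = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(legacy);
        out.writeByte(127);   // boundary case: largest arity the legacy format can encode
        out.writeInt(0xCAFE); // stand-in for the field that follows arity on the wire

        // New reader must consume exactly 1 byte for arity, leaving offsets intact.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(legacy.toByteArray()));
        System.out.println(readArity(in) + " " + Integer.toHexString(in.readInt()));
    }
}
```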

Because this touches seatunnel-engine serialization, I do not recommend merging the current head until the compatibility path is fixed.

@lm-ylj
Contributor Author

lm-ylj commented Apr 28, 2026

I appreciate your feedback and suggestions. I will address and fix this backward compatibility issue accordingly.


@DanielLeens DanielLeens left a comment


Hi @lm-ylj, thanks for the quick follow-up, and thanks for confirming you plan to address the compatibility concern.

I re-checked the current head locally after your latest reply. There is still no new code on top of commit 9b710c2fe, so the technical conclusion remains unchanged for now.

What this PR is trying to fix is real:

  • User pain: when a SeaTunnelRow has more than 127 fields, the current engine serializer stores arity in a signed byte, and deserialization can end up constructing new SeaTunnelRow(negativeValue), which triggers NegativeArraySizeException.
  • Fix approach in the current head: switch arity from writeByte/readByte to writeInt/readInt, and add a regression test for arity = 128.
  • One-line summary: the PR fixes the overflow case for single-version clusters, but it currently does so by changing the engine wire format in a backward-incompatible way.
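
For reference, the overflow itself is easy to reproduce with plain java.io streams (illustrative values only):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SignedByteOverflowSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeByte(128); // legacy writer truncates arity to one signed byte
        byte arity = new DataInputStream(new ByteArrayInputStream(bos.toByteArray())).readByte();
        // Prints -128: allocating new Object[arity] is what raises NegativeArraySizeException.
        System.out.println(arity);
    }
}
```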

Runtime chain I re-verified locally:

Source / Transform produces SeaTunnelRow
  -> SeaTunnelSourceCollector.collect() [SeaTunnelSourceCollector.java:93-112]
      -> sendRecordToNext(new Record<>(row))
  -> task-group queue publish
      -> RecordEventProducer.onData() [RecordEventProducer.java:29-53]
  -> Hazelcast serializer hook
      -> RecordSerializerHook.createSerializer() [RecordSerializerHook.java:32-35]
      -> RecordSerializer.write() [RecordSerializer.java:40-58]
  -> remote task-group receives bytes
      -> RecordSerializer.read() [RecordSerializer.java:66-92]
      -> RecordEventHandler.handleRecord() [RecordEventHandler.java:52-66]

Because this serializer sits on the real Zeta cross-task-group transport path, compatibility is the blocking concern here.

The main blocker is still:

  1. The current head changes the on-wire layout from 1 byte to 4 bytes for arity, but the new reader does not preserve any legacy decode path. That means a new node reading bytes written by an old node will consume 4 bytes where the old stream only wrote 1, shifting all following field boundaries. This makes rolling upgrades / mixed-version clusters unsafe.

What I recommend:

  • Preferred option: keep the legacy 1-byte encoding for compatible values, and only use a sentinel + int path for larger arity values.
  • Please also add compatibility coverage for old writer -> new reader, especially around the 127 / 128 boundary, not only a new-format round trip.

CI note:

  • I did not see a code-path failure from this PR itself in the current status rollup. The visible failing item is a PR labeler workflow, which does not change the engine compatibility conclusion above.

Conclusion: merge after fixes

Blocking items:

  • Fix the serializer to keep backward-compatible decoding/encoding for historical row bytes.
  • Add old-writer -> new-reader compatibility tests for the boundary cases.

Non-blocking suggestions:

  • No extra non-blocking asks from my side right now. Once the compatibility path is fixed, I’m happy to re-check.

Overall, this is still a worthwhile fix to continue, but because it touches seatunnel-engine serialization, I do not recommend merging the current head until the compatibility path is corrected.

@nzw921rx
Collaborator

@lm-ylj Thank you for the quick fix.

The write path is already deterministic: writeRowArity emits either:

  • a non-negative single byte (0..127), or
  • the extended tuple {-1, MAGIC, int}.

Given that contract, both checks below in readRowArity are effectively unreachable:

if (encodedArity != EXTENDED_ROW_ARITY_MARKER) { ... }
if (extensionMagic != EXTENDED_ROW_ARITY_MAGIC) { ... }

Also, pre-fix records with arity > 127 were already non-recoverable under the legacy encoding, so keeping defensive branches for that path adds noise without practical value.

Suggested simplification:

private int readRowArity(ObjectDataInput in) throws IOException {
    byte b = in.readByte();
    if (b >= 0) {
        return b;
    }
    // Extended encoding written by writeRowArity: {-1, MAGIC, int}
    in.readInt(); // magic
    return in.readInt(); // arity
}

This keeps behavior aligned with the writer invariant and makes the decoding path easier to read and maintain.

@lm-ylj
Contributor Author

lm-ylj commented Apr 28, 2026

> @lm-ylj Thank you for the quick fix.
>
> The write path is already deterministic: writeRowArity emits either:
>
>   • a non-negative single byte (0..127), or
>   • the extended tuple {-1, MAGIC, int}.
>
> Given that contract, both checks below in readRowArity are effectively unreachable:
>
> if (encodedArity != EXTENDED_ROW_ARITY_MARKER) { ... }
> if (extensionMagic != EXTENDED_ROW_ARITY_MAGIC) { ... }
>
> Also, pre-fix records with arity > 127 were already non-recoverable under the legacy encoding, so keeping defensive branches for that path adds noise without practical value.
>
> Suggested simplification:
>
> private int readRowArity(ObjectDataInput in) throws IOException {
>     byte b = in.readByte();
>     if (b >= 0) {
>         return b;
>     }
>     // Extended encoding written by writeRowArity: {-1, MAGIC, int}
>     in.readInt(); // magic
>     return in.readInt(); // arity
> }
>
> This keeps behavior aligned with the writer invariant and makes the decoding path easier to read and maintain.

Updated

@nzw921rx
Collaborator

+1 LGTM



Development

Successfully merging this pull request may close these issues.

[Bug] [Zeta] NegativeArraySizeException in SeaTunnelRow when executing Shuffle task

4 participants