Skip to content

Conversation

@XuQianJin-Stars
Copy link
Contributor

Purpose

Linked issue: close #1974

This PR introduces support for nested ROW type in ARROW, COMPACTED, and INDEXED formats, enabling Fluss to handle complex nested data structures including nested rows and nested arrays.

Brief change log

Core Changes:

  • Added ArrowRowColumnVector to support reading ROW type from Arrow format
  • Added ArrowRowWriter to support writing ROW type to Arrow format
  • Introduced RowColumnVector interface for columnar row operations
  • Added RowSerializer for ROW type serialization/deserialization
  • Enhanced IndexedRowWriter and IndexedRowReader to support nested ROW type
  • Enhanced CompactedRowWriter and CompactedRowReader to support nested ROW type

Type System Enhancements:

  • Extended InternalRow and InternalArray interfaces with getRow() method
  • Updated DataGetters to include ROW type getter
  • Enhanced ArrowUtils to create Arrow column vectors for ROW type

Connector Integration:

  • Added FlinkRowConverter and FlinkArrayConverter for Flink-Fluss type conversion
  • Added PaimonRowAsFlussRow and PaimonArrayAsFlussArray for Paimon-Fluss type conversion
  • Updated existing converters to support nested structures

Test Coverage:

  • Enhanced ArrowReaderWriterTest with nested ROW and nested ARRAY test cases
  • Updated IndexedRowTest and IndexedRowReaderTest for ROW type validation

This change enables Fluss to store and process complex nested data structures, which is essential for advanced analytics and complex data modeling scenarios.

Tests

This PR includes the following unit tests to verify the nested ROW and ARRAY type support in Arrow format:

Unit Tests:

  • ArrowReaderWriterTest#testReaderWriter() - Validates that Arrow reader and writer can correctly handle nested ROW and nested ARRAY types

    • Tests nested ARRAY: ARRAY(ARRAY(STRING))
    • Tests nested ROW: ROW(INT, ROW(INT, STRING, BIGINT), STRING)
    • Verifies null value handling in nested structures
    • Validates correct serialization and deserialization of complex nested types
  • IndexedRowReaderTest - Verifies IndexedRow format support for ROW type read/write operations

  • IndexedRowTest - Validates IndexedRow handling of nested types in various scenarios

Test Coverage:

  • Arrow format nested ROW type read/write
  • Arrow format nested ARRAY type read/write
  • Compacted format ROW type support
  • Indexed format ROW type support
  • Type conversion integration with Flink and Paimon connectors

Test Data:

  • Multi-level nested structures with various primitive types
  • Mixed scenarios of null and non-null values
  • Comprehensive validation of all basic types within nested structures

All test cases pass with mvn clean verify.

API and Format

This change affects the storage format:

  • Extends ARROW format to support nested ROW type structures
  • Extends COMPACTED format to support ROW type serialization
  • Extends INDEXED format to support ROW type serialization
  • No breaking changes to existing API or storage format
  • Backward compatible with existing data

Documentation

This change introduces a new feature (nested ROW type support). Documentation is not required as per user request.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch 3 times, most recently from 0634a97 to b2d938b Compare December 8, 2025 12:33
@binary-signal
Copy link
Contributor

binary-signal commented Dec 16, 2025

@XuQianJin-Stars this PR is a godsend. With a few small tweaks, I was able to get nested rows inside array fields working in a Fluss PK table, with tiering to Paimon and union read enabled all running in flink sql client. Is there a way we could merge my changes into your pull request and combine our efforts?

@XuQianJin-Stars
Copy link
Contributor Author

XuQianJin-Stars commented Dec 17, 2025

@XuQianJin-Stars this PR is a godsend. With a few small tweaks, I was able to get nested rows inside array fields working in a Fluss PK table, with tiering to Paimon and union read enabled all running in flink sql client. Is there a way we could merge my changes into your pull request and combine our efforts?

hi @binary-signal well, sure. You can also wait for this PR to be approved and then submit a PR to improve it.

Copy link
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @XuQianJin-Stars .

I think there are the following 4 problems in this PR.

(1) The current implementation uses AlignedRow format as nested row format. We should use the same row format as the parant row, like if it's CompactedRow, then the nested row (like array<row>) should also use CompactedRow.

(2) The current implementation doesn't support nested row in arrays.

(3) Please add tests!! You can add tests to verify nested row types

  • row<bool, int, long, string, bytes, timestamp_ltz, timestamp_ntz, date, time>
  • array<row<array>>
  • row<array<row>>

in following tests

  • DefaultCompletedFetchTest#testComplexTypeFetch
  • FlinkComplexTypeITCase

(4) the dependencies and licenses changes seems not necessary.

I have appended a commit to fix the (1) and (2) problem. Could you fix the other problems?

<filter>
<artifact>*</artifact>
<excludes>
<exclude>LICENSE*</exclude>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exclude all the LICENSE files?

*
* @param fieldType the field type of the indexed row
*/
public static FieldWriter createFieldWriter(DataType fieldType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicated code block, this can be replaced by BinaryWriter.createValueWriter(fieldType).

Comment on lines 146 to 151
return (flussField) -> FlinkArrayConverter.deserialize(flussDataType, flussField);
case MAP:
// TODO: Add Map type support in future
throw new UnsupportedOperationException("Map type not supported yet");
case ROW:
return (flussField) -> FlinkRowConverter.deserialize(flussDataType, flussField);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I still prefer the original implementation of the conversion which is very clean by using elementGetter and elementConverter. Introducing FlinkArrayConverter and FlinkRowConverter seems unnecessary, and they don't need to implement ArrayData and RowData interfaces.

return new FlinkArrayConverter(flussDataType, flussField).getArrayData();
}

private static Object getFieldValue(InternalArray array, int pos, DataType dataType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to do the element getter again, just use the InternalArray.createElementGetter.

Comment on lines 51 to 76
if (!eleType.isNullable()) {
switch (eleType.getTypeRoot()) {
case BOOLEAN:
return new GenericArrayData(from.toBooleanArray());
case TINYINT:
return new GenericArrayData(from.toByteArray());
case SMALLINT:
return new GenericArrayData(from.toShortArray());
case INTEGER:
case DATE:
case TIME_WITHOUT_TIME_ZONE:
return new GenericArrayData(from.toIntArray());
case BIGINT:
return new GenericArrayData(from.toLongArray());
case FLOAT:
return new GenericArrayData(from.toFloatArray());
case DOUBLE:
return new GenericArrayData(from.toDoubleArray());
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the nullability information in Flink’s DataType is not always reliable. To ensure robustness, we should wrap this code block in a try-catch and fall back to using the element converter if an exception occurs. This guarantees safe execution even when nullability metadata is inaccurate.

<dependency>
<groupId>org.eclipse.collections</groupId>
<artifactId>eclipse-collections</artifactId>
<version>11.1.0</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this, this should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this, this should be removed.

Reason 1: Purpose
Arrow uses eclipse-collections to provide primitive type collections (such as IntList, LongList) to optimize memory usage and performance, avoiding the overhead of Java boxing/unboxing operations.
Reason 2: Why Explicit Declaration is Required
Although the project uses fluss-shaded-arrow, the eclipse-collections dependency was not included during the shading process, therefore it must be explicitly declared as a separate dependency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XuQianJin-Stars , sorry, I didn't find where Arrow uses eclipse-collections. I tried to remove all the license files and pom changes. If the CI passes, I will merge the PR.

"f20",
DataTypes.ARRAY(DataTypes.FLOAT().copy(false))), // vector embedding type
new DataField(
"f21", DataTypes.ARRAY(DataTypes.ARRAY(DataTypes.STRING()))) // nested array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removes the nested array test data? I think they are very useful.

Comment on lines 97 to 109
20,
new GenericArrayData(
new float[] {0.1f, 1.1f, -0.5f, 6.6f, Float.MAX_VALUE, Float.MIN_VALUE}));
genericRowData.setField(
21,
new GenericArrayData(
new GenericArrayData[] {
new GenericArrayData(
new StringData[] {fromString("a"), null, fromString("c")}),
null,
new GenericArrayData(
new StringData[] {fromString("hello"), fromString("world")})
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removes the original array test data? I think they are useful to our test.

new FlinkAsFlussArray(flinkRow.getArray(21).getArray(2))
.toObjectArray(DataTypes.STRING());
assertThat(stringArray2)
.isEqualTo(new BinaryString[] {fromString("hello"), fromString("world")});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. shouldn't remove

@Override
public InternalRow getRow(int pos, int numFields) {
// TODO: Support Row type conversion from Iceberg to Fluss
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw UnsupportedOperationException

@wuchong wuchong force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch from b2d938b to 0d29df6 Compare December 22, 2025 17:17
@XuQianJin-Stars
Copy link
Contributor Author

Thanks for the contribution @XuQianJin-Stars .

I think there are the following 4 problems in this PR.

(1) The current implementation uses AlignedRow format as nested row format. We should use the same row format as the parant row, like if it's CompactedRow, then the nested row (like array<row>) should also use CompactedRow.

(2) The current implementation doesn't support nested row in arrays.

(3) Please add tests!! You can add tests to verify nested row types

  • row<bool, int, long, string, bytes, timestamp_ltz, timestamp_ntz, date, time>
  • array<row>
  • row<array>

in following tests

  • DefaultCompletedFetchTest#testComplexTypeFetch
  • FlinkComplexTypeITCase

(4) the dependencies and licenses changes seems not necessary.

I have appended a commit to fix the (1) and (2) problem. Could you fix the other problems?

well,i will fix the other problems.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch 7 times, most recently from e8bbdf6 to 158b328 Compare December 23, 2025 14:56
@wuchong wuchong force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch from 158b328 to cfc1a30 Compare December 23, 2025 16:46
Copy link
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XuQianJin-Stars , sorry, I didn't find where Arrow uses eclipse-collections. I tried to remove all the license files and pom changes. If the CI passes, I will merge the PR.

@wuchong wuchong force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch from cfc1a30 to 143dac8 Compare December 23, 2025 17:16
…lace dependency on eclipse-collections by arrow
@wuchong wuchong force-pushed the feature/issue-1974-support-nestedrow-arrow-format branch from 143dac8 to a3bc24e Compare December 24, 2025 00:59
@wuchong wuchong merged commit d1ae5b5 into apache:main Dec 24, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support NestedRow type in log table (Arrow row format)

3 participants