[common] Introduce ROW type for ARROW, COMPACTED and INDEXED formats #2079
Conversation
Force-pushed from 0634a97 to b2d938b
@XuQianJin-Stars this PR is a godsend. With a few small tweaks, I was able to get nested rows inside array fields working in a Fluss PK table, with tiering to Paimon and union read enabled, all running in the Flink SQL client. Is there a way we could merge my changes into your pull request and combine our efforts?
Hi @binary-signal, sure. You can also wait for this PR to be approved and then submit a PR to improve it.
Thanks for the contribution @XuQianJin-Stars .
I think there are the following four problems in this PR.
(1) The current implementation uses the AlignedRow format as the nested row format. We should use the same row format as the parent row: if the parent is CompactedRow, then the nested row (e.g. array<row>) should also use CompactedRow.
(2) The current implementation doesn't support nested rows in arrays.
(3) Please add tests! You can add tests to verify nested row types (see the sketch below)
- row<bool, int, long, string, bytes, timestamp_ltz, timestamp_ntz, date, time>
- array<row<array>>
- row<array<row>>
in the following tests:
- DefaultCompletedFetchTest#testComplexTypeFetch
- FlinkComplexTypeITCase
(4) The dependency and license changes seem unnecessary.
I have appended a commit to fix problems (1) and (2). Could you fix the other problems?
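A minimal sketch of how the three suggested nested types could be declared, reusing the DataTypes/DataField style that appears elsewhere in this PR; the import paths, the exact ROW(...) factory signature, and the timestamp type names are assumptions about Fluss's type API:

```java
// Package paths are assumptions; adjust to the actual Fluss modules.
import com.alibaba.fluss.types.DataField;
import com.alibaba.fluss.types.DataType;
import com.alibaba.fluss.types.DataTypes;

public class NestedRowTestTypes {

    // row<bool, int, long, string, bytes, timestamp_ltz, timestamp_ntz, date, time>
    static final DataType FLAT_NESTED_ROW =
            DataTypes.ROW(
                    new DataField("f0", DataTypes.BOOLEAN()),
                    new DataField("f1", DataTypes.INT()),
                    new DataField("f2", DataTypes.BIGINT()),
                    new DataField("f3", DataTypes.STRING()),
                    new DataField("f4", DataTypes.BYTES()),
                    new DataField("f5", DataTypes.TIMESTAMP_LTZ()),
                    new DataField("f6", DataTypes.TIMESTAMP()),
                    new DataField("f7", DataTypes.DATE()),
                    new DataField("f8", DataTypes.TIME()));

    // array<row<array>>
    static final DataType ARRAY_OF_ROW_OF_ARRAY =
            DataTypes.ARRAY(
                    DataTypes.ROW(
                            new DataField("a", DataTypes.ARRAY(DataTypes.INT()))));

    // row<array<row>>
    static final DataType ROW_OF_ARRAY_OF_ROW =
            DataTypes.ROW(
                    new DataField(
                            "arr",
                            DataTypes.ARRAY(
                                    DataTypes.ROW(new DataField("i", DataTypes.INT())))));
}
```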
fluss-client/pom.xml (outdated)
```xml
<filter>
    <artifact>*</artifact>
    <excludes>
        <exclude>LICENSE*</exclude>
```
Why exclude all the LICENSE files?
```java
 *
 * @param fieldType the field type of the indexed row
 */
public static FieldWriter createFieldWriter(DataType fieldType) {
```
This is a duplicated code block; it can be replaced by BinaryWriter.createValueWriter(fieldType).
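A minimal sketch of that replacement, assuming FieldWriter and the ValueWriter returned by BinaryWriter.createValueWriter share a writeValue-style shape (the adapter below is hypothetical):

```java
public static FieldWriter createFieldWriter(DataType fieldType) {
    // Reuse the shared factory instead of duplicating the per-type switch;
    // the ValueWriter method name and argument order are assumptions.
    BinaryWriter.ValueWriter valueWriter = BinaryWriter.createValueWriter(fieldType);
    return (writer, pos, value) -> valueWriter.writeValue(writer, pos, value);
}
```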
```java
        return (flussField) -> FlinkArrayConverter.deserialize(flussDataType, flussField);
    case MAP:
        // TODO: Add Map type support in future
        throw new UnsupportedOperationException("Map type not supported yet");
    case ROW:
        return (flussField) -> FlinkRowConverter.deserialize(flussDataType, flussField);
```
Actually, I still prefer the original implementation of the conversion, which is very clean thanks to the elementGetter and elementConverter. Introducing FlinkArrayConverter and FlinkRowConverter seems unnecessary, and they don't need to implement the ArrayData and RowData interfaces.
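A minimal sketch of that element-wise style, assuming helper names such as createInternalConverter and FlussDeserializationConverter from the surrounding converter code:

```java
private static ArrayData toFlinkArray(InternalArray flussArray, DataType elementType) {
    // Read each element with the framework-provided getter, then convert it
    // with the per-type element converter; no FlinkArrayConverter class needed.
    InternalArray.ElementGetter elementGetter =
            InternalArray.createElementGetter(elementType);
    FlussDeserializationConverter elementConverter = createInternalConverter(elementType);
    Object[] converted = new Object[flussArray.size()];
    for (int i = 0; i < flussArray.size(); i++) {
        Object element = elementGetter.getElementOrNull(flussArray, i);
        converted[i] = element == null ? null : elementConverter.deserialize(element);
    }
    return new GenericArrayData(converted);
}
```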
```java
    return new FlinkArrayConverter(flussDataType, flussField).getArrayData();
}

private static Object getFieldValue(InternalArray array, int pos, DataType dataType) {
```
We don't need to implement the element getter again; just use InternalArray.createElementGetter.
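A minimal sketch of the simplified method, assuming the getElementOrNull signature of the returned getter:

```java
private static Object getFieldValue(InternalArray array, int pos, DataType dataType) {
    // Delegate to the framework's element getter instead of re-deriving
    // the per-type access logic here.
    return InternalArray.createElementGetter(dataType).getElementOrNull(array, pos);
}
```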
```java
if (!eleType.isNullable()) {
    switch (eleType.getTypeRoot()) {
        case BOOLEAN:
            return new GenericArrayData(from.toBooleanArray());
        case TINYINT:
            return new GenericArrayData(from.toByteArray());
        case SMALLINT:
            return new GenericArrayData(from.toShortArray());
        case INTEGER:
        case DATE:
        case TIME_WITHOUT_TIME_ZONE:
            return new GenericArrayData(from.toIntArray());
        case BIGINT:
            return new GenericArrayData(from.toLongArray());
        case FLOAT:
            return new GenericArrayData(from.toFloatArray());
        case DOUBLE:
            return new GenericArrayData(from.toDoubleArray());
    }
}
```
Actually, the nullability information in Flink’s DataType is not always reliable. To ensure robustness, we should wrap this code block in a try-catch and fall back to using the element converter if an exception occurs. This guarantees safe execution even when nullability metadata is inaccurate.
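A minimal sketch of that defensive pattern; convertElementWise is a hypothetical stand-in for the existing element-converter fallback:

```java
if (!eleType.isNullable()) {
    try {
        // Primitive fast path, as in the diff above.
        switch (eleType.getTypeRoot()) {
            case BOOLEAN:
                return new GenericArrayData(from.toBooleanArray());
            // ... remaining primitive cases ...
        }
    } catch (RuntimeException e) {
        // The nullability metadata was inaccurate (e.g. a null element);
        // fall through to the element-wise conversion below.
    }
}
return convertElementWise(from, eleType); // hypothetical fallback helper
```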
fluss-common/pom.xml (outdated)
```xml
<dependency>
    <groupId>org.eclipse.collections</groupId>
    <artifactId>eclipse-collections</artifactId>
    <version>11.1.0</version>
```
We don't need this; it should be removed.
> We don't need this, this should be removed.
Reason 1: Purpose
Arrow uses eclipse-collections to provide primitive-type collections (such as IntList and LongList) that optimize memory usage and performance by avoiding the overhead of Java boxing/unboxing.
Reason 2: Why an Explicit Declaration is Required
Although the project uses fluss-shaded-arrow, the eclipse-collections dependency was not included during shading, so it must be declared explicitly as a separate dependency.
@XuQianJin-Stars, sorry, I couldn't find where Arrow uses eclipse-collections. I removed all the license file and pom changes. If the CI passes, I will merge the PR.
| "f20", | ||
| DataTypes.ARRAY(DataTypes.FLOAT().copy(false))), // vector embedding type | ||
| new DataField( | ||
| "f21", DataTypes.ARRAY(DataTypes.ARRAY(DataTypes.STRING()))) // nested array |
Why remove the nested array test data? I think it is very useful.
```java
        20,
        new GenericArrayData(
                new float[] {0.1f, 1.1f, -0.5f, 6.6f, Float.MAX_VALUE, Float.MIN_VALUE}));
genericRowData.setField(
        21,
        new GenericArrayData(
                new GenericArrayData[] {
                    new GenericArrayData(
                            new StringData[] {fromString("a"), null, fromString("c")}),
                    null,
                    new GenericArrayData(
                            new StringData[] {fromString("hello"), fromString("world")})
                }));
```
Why remove the original array test data? I think it is useful for our tests.
```java
new FlinkAsFlussArray(flinkRow.getArray(21).getArray(2))
        .toObjectArray(DataTypes.STRING());
assertThat(stringArray2)
        .isEqualTo(new BinaryString[] {fromString("hello"), fromString("world")});
```
Ditto; this shouldn't be removed.
```java
@Override
public InternalRow getRow(int pos, int numFields) {
    // TODO: Support Row type conversion from Iceberg to Fluss
    return null;
```
Throw an UnsupportedOperationException instead.
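A minimal sketch of the requested change:

```java
@Override
public InternalRow getRow(int pos, int numFields) {
    // Fail loudly instead of silently returning null for an
    // unimplemented conversion.
    throw new UnsupportedOperationException(
            "ROW type conversion from Iceberg to Fluss is not supported yet");
}
```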
Force-pushed from b2d938b to 0d29df6
Well, I will fix the other problems.
Force-pushed from e8bbdf6 to 158b328
Force-pushed from 158b328 to cfc1a30
Force-pushed from cfc1a30 to 143dac8
…lace dependency on eclipse-collections by arrow
Force-pushed from 143dac8 to a3bc24e
Purpose
Linked issue: close #1974
This PR introduces support for nested ROW type in ARROW, COMPACTED, and INDEXED formats, enabling Fluss to handle complex nested data structures including nested rows and nested arrays.
Brief change log
Core Changes:
- `ArrowRowColumnVector` to support reading ROW type from Arrow format
- `ArrowRowWriter` to support writing ROW type to Arrow format
- `RowColumnVector` interface for columnar row operations
- `RowSerializer` for ROW type serialization/deserialization
- `IndexedRowWriter` and `IndexedRowReader` to support nested ROW type
- `CompactedRowWriter` and `CompactedRowReader` to support nested ROW type

Type System Enhancements:
- `InternalRow` and `InternalArray` interfaces with `getRow()` method
- `DataGetters` to include ROW type getter
- `ArrowUtils` to create Arrow column vectors for ROW type

Connector Integration:
- `FlinkRowConverter` and `FlinkArrayConverter` for Flink-Fluss type conversion
- `PaimonRowAsFlussRow` and `PaimonArrayAsFlussArray` for Paimon-Fluss type conversion

Test Coverage:
- `ArrowReaderWriterTest` with nested ROW and nested ARRAY test cases
- `IndexedRowTest` and `IndexedRowReaderTest` for ROW type validation

This change enables Fluss to store and process complex nested data structures, which is essential for advanced analytics and complex data modeling scenarios.
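As an illustration of the new capability, a nested row value matching the ROW(INT, ROW(INT, STRING, BIGINT), STRING) test type could be built roughly like this; GenericRow and BinaryString.fromString follow the Flink-style API used elsewhere in this PR, and the exact Fluss names are assumptions:

```java
// Inner row: ROW(INT, STRING, BIGINT)
GenericRow inner = new GenericRow(3);
inner.setField(0, 7);
inner.setField(1, BinaryString.fromString("nested"));
inner.setField(2, 100L);

// Outer row: ROW(INT, ROW(...), STRING), with a ROW nested inside a ROW
GenericRow outer = new GenericRow(3);
outer.setField(0, 42);
outer.setField(1, inner);
outer.setField(2, BinaryString.fromString("outer"));
```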
Tests
This PR includes the following unit tests to verify the nested ROW and ARRAY type support in Arrow format:
Unit Tests:
- `ArrowReaderWriterTest#testReaderWriter()` - validates that the Arrow reader and writer can correctly handle nested ROW and nested ARRAY types, including:
  - `ARRAY(ARRAY(STRING))`
  - `ROW(INT, ROW(INT, STRING, BIGINT), STRING)`
- `IndexedRowReaderTest` - verifies IndexedRow format support for ROW type read/write operations
- `IndexedRowTest` - validates IndexedRow handling of nested types in various scenarios

Test Coverage:
Test Data:
All test cases pass with `mvn clean verify`.

API and Format
This change affects the storage format.
Documentation
This change introduces a new feature (nested ROW type support). Documentation is not required as per user request.