
Conversation

@mikedias commented Sep 8, 2025

Issue: #275

Draft PR to add support for Apache Paimon as a source table. I'm sharing this early to collect feedback on whether the approach is right and on a few issues I've encountered. 🙂

Things that are still missing:

  • Comprehensive unit tests: I've relied on ITConversionController to validate the code so far. Once we agree on the approach, I'll cover the implementation with more tests.
  • Incremental sync: I've only implemented the ConversionSource methods for the snapshot sync. Once we agree on the approach, I'll implement the incremental sync methods.
  • Target to Paimon: I've focused only on implementing Paimon as a source. The target implementation will be out of scope for this contribution, if that's okay.

Things where I need help:

Hudi partitions and Paimon buckets

Paimon adds buckets as another folder level within partitions (e.g. partition=2025-09-01/bucket-0/data-file.parquet), whereas Hudi treats the bucket directory as part of the partition value (e.g. 2025-09-01/bucket-0). Looking at the code, the assumption that directory structure == partition values runs pretty deep, so I wonder if there is a way to work around that (one possible direction is sketched below), or whether we should call it out as a limitation between the two formats.
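For illustration, a minimal sketch of that workaround idea, assuming the bucket directory always has the form bucket-N; the class and method names here are hypothetical:

import java.util.regex.Pattern;

// Hedged sketch: drop Paimon's trailing "bucket-N" directory so the remaining
// relative path matches the partition values Hudi expects.
class PaimonPartitionPaths {
  private static final Pattern BUCKET_DIR = Pattern.compile("(^|/)bucket-\\d+$");

  // "partition=2025-09-01/bucket-0" -> "partition=2025-09-01"
  static String stripBucket(String relativeDir) {
    return BUCKET_DIR.matcher(relativeDir).replaceFirst("");
  }
}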

Iceberg Parquet conversion errors

Solved by #749

When executing ITConversionController#testVariousOperations with source=paimon and target=iceberg, I'm facing the following error:

class org.apache.iceberg.shaded.org.apache.arrow.vector.IntVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector 

From debugging, it seems the Parquet reader is trying to read the id int field with the VarWidthReader class instead of IntegerReader, causing the conversion issue.

Disabling vectorization doesn't help; doing so just makes the error surface at the Spark level instead, indicating there is something wrong with the schema conversion for Iceberg that I don't quite understand... 🤔

Thank you so much in advance for your help!

@rahil-c (Contributor) commented Sep 8, 2025

Thanks @mikedias for your contribution!

import java.util.Optional;
import java.util.Properties;
import java.util.UUID;
import java.util.*;
Contributor:

nitpick: generally we try to avoid * imports.
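For illustration, the wildcard would simply be expanded to whichever classes the file actually uses, e.g. (List and Map here are placeholder guesses; the exact set depends on the file):

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.UUID;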

Author:

got it, I'll fix my IntelliJ config!


/** Converts Paimon RowType to XTable InternalSchema. */
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class PaimonSchemaExtractor {
Contributor:

Adding a note here that's likely related to the Iceberg issues you are seeing in the ITs. The output of this does not include the meta fields _KEY_id, _SEQUENCE_NUMBER, _VALUE_KIND, which is likely messing up the type-to-offset mapping in Iceberg. Is there a way to extract the Paimon schema with these fields?

Author:

Thanks @the-other-tim-brown, that was a good insight!

I've added the special fields based on org.apache.paimon.table.SpecialFields helper and now the InternalSchema matches the parquet files.
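For context, a minimal sketch of that physical layout; the PaimonPhysicalSchema class, the id offset, and the placeholder ids are assumptions, and the real constants come from org.apache.paimon.table.SpecialFields:

import java.util.ArrayList;
import java.util.List;
import org.apache.paimon.types.DataField;
import org.apache.paimon.types.DataTypes;

// Hedged sketch: a Paimon primary-key table's data files store _KEY_-prefixed
// copies of the key columns, then _SEQUENCE_NUMBER and _VALUE_KIND, then the
// value columns.
class PaimonPhysicalSchema {
  static final int KEY_FIELD_ID_OFFSET = 1_000_000_000; // placeholder; see SpecialFields

  static List<DataField> withSystemFields(List<DataField> valueFields, List<String> primaryKeys) {
    List<DataField> physical = new ArrayList<>();
    for (DataField field : valueFields) {
      if (primaryKeys.contains(field.name())) {
        physical.add(new DataField(KEY_FIELD_ID_OFFSET + field.id(), "_KEY_" + field.name(), field.type()));
      }
    }
    physical.add(new DataField(Integer.MAX_VALUE - 1, "_SEQUENCE_NUMBER", DataTypes.BIGINT())); // placeholder id
    physical.add(new DataField(Integer.MAX_VALUE - 2, "_VALUE_KIND", DataTypes.TINYINT())); // placeholder id
    physical.addAll(valueFields);
    return physical;
  }
}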


With that, the error changed to:

class org.apache.iceberg.shaded.org.apache.arrow.vector.VarCharVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BitVector

This is still the same class of problem, but it also suggests that the order of fields matters to the Iceberg reader.

I've tried sorting by name and by fieldId, but I'm still getting variations of the same problem... :(

Any more ideas? 😅

Contributor:

Does Paimon attach the field index to the parquet data? The field ordering should match that. We have an option to pass in the index directly to the internal fields as well.

If not, you will not be able to read the data in some vendors like BigQuery and Snowflake.

Author:

I believe that Paimon's DataField.id should be the field index, and I'm passing it to InternalField.fieldId.

Where can we see the field index in the parquet file? Neither parquet-tools meta nor parquet-tools schema has any reference to an index, from what I can tell 🤔
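For reference, the mapping described above might look like this hedged sketch; toInternalSchema is a hypothetical helper for the type conversion:

import org.apache.paimon.types.DataField;
import org.apache.xtable.model.schema.InternalField;

// Carry Paimon's field id through to XTable so downstream formats can
// reproduce the parquet field IDs.
InternalField toInternalField(DataField paimonField) {
  return InternalField.builder()
      .name(paimonField.name())
      .fieldId(paimonField.id())
      .schema(toInternalSchema(paimonField.type())) // hypothetical helper
      .build();
}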

Contributor:

parquet-tools schema gives me the field ID when I run it.
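For reference, parquet-mr prints the field ID after the column name, so the output shape looks roughly like this (hedged example, not taken from this PR's files):

message table {
  optional int32 id = 1;
  optional binary name (STRING) = 2;
}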

Contributor:

I have found a way to get past this issue and am working on a PR for it now: #749

Author:

Amazing, thank you so much @the-other-tim-brown!!! 😄

Contributor:

I have merged that PR, please update your branch to pick up the changes when you get a chance.

Author:

Great, I'll do it and let you know!

Author:

@the-other-tim-brown I've rebased it, and the Iceberg tests are passing now! Thank you so much for finding the root cause and fixing it!


private List<InternalField> primaryKeyFields(
TableSchema paimonSchema, List<InternalField> internalFields) {
List<String> keys = paimonSchema.primaryKeys();
Contributor:

Is it possible for the primary key to be a nested field in Paimon?

Author:

That's a good question, I'll double-check.
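For context, one plausible reading of the excerpt above (hedged; the actual implementation may differ). A name-based lookup like this would only match top-level fields, which is why the nested-key question matters:

private List<InternalField> primaryKeyFields(
    TableSchema paimonSchema, List<InternalField> internalFields) {
  List<String> keys = paimonSchema.primaryKeys();
  return internalFields.stream()
      .filter(field -> keys.contains(field.getName()))
      .collect(Collectors.toList());
}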


return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
throw new RuntimeException(e);
Contributor:

We have some custom exception types like ReadException that may be better here to indicate there is an issue reading the initial state.

Author:

got it, will change!
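For illustration, the suggested change might look like this (hedged: the exact ReadException constructor in org.apache.xtable.exception is assumed to accept a message and a cause):

try {
  return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
  // wrap in XTable's ReadException instead of a bare RuntimeException
  throw new ReadException("Failed to read initial Paimon table state", e);
}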

Comment on lines 225 to 226
<source>11</source>
<target>11</target>
Contributor:

Is this a requirement for Paimon? We're only publishing a Java 8 target currently.

Author (Sep 9, 2025):

I'm getting these compatibility errors when using Java 8 on my machine, so I thought XTable was already on 11 but just wasn't configured for it.

Everything works fine on 11 🤔

Contributor:

Yes, we use Java 11 as the source version when developing, but not as our jar target when publishing.

Author:

Okay, I'll remove this.

I wonder if it can cause problems if developers end up using features from 11 while our target is 8 🤔
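For what it's worth, this exact concern is what javac's release flag addresses: it compiles against the Java 8 API, so 11-only library usage fails at compile time even when building on a newer JDK. A hedged sketch of the Maven configuration (the project's actual pom layout may differ):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- verify against the Java 8 API while building on JDK 11 -->
    <release>8</release>
  </configuration>
</plugin>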

@mikedias force-pushed the mdias/paimon-source-support branch from 62dd21f to 3ca95a7 on October 12, 2025, resolving conflicts in:
  xtable-api/src/main/java/org/apache/xtable/model/storage/TableFormat.java
  xtable-core/src/test/java/org/apache/xtable/GenericTable.java
  xtable-core/src/test/java/org/apache/xtable/ITConversionController.java
.filter(
    // TODO Hudi thinks that Paimon buckets are partition values; not sure how to handle it.
    // Filtering out the partition field in the comparison for now.
    field -> !field.equals(partitionField))
Author:

@the-other-tim-brown do you have any opinions about this?

To restate from the description: Paimon adds buckets as another folder level within partitions (e.g. partition=2025-09-01/bucket-0/data-file.parquet), whereas Hudi treats the bucket directory as part of the partition value (e.g. 2025-09-01/bucket-0). The assumption that directory structure == partition values seems to run pretty deep in the code, so I wonder if there is a way to work around that, or whether we should call it out as a limitation between the two formats.

Contributor:

Is this an issue in the Apache Hudi or XTable repo?

Author:

As far as I can see, it starts in the XTable repo, particularly in HudiPathUtils.getPartitionPath, which is used in many key areas of the codebase.

But I'm not familiar with Hudi enough to determine if that's a limitation there as well.

Contributor:

I will play around with this and see how far I can get.
