
Conversation

@mikedias commented Sep 8, 2025

Issue: #275

Draft PR to add support for Apache Paimon as a source table. I'm sharing this early to collect feedback on whether the approach is right and on a few issues I've encountered. 🙂

Things that are still missing:

  • Comprehensive unit tests: I've relied on ITConversionController to validate the code so far. Once we agree on the approach, I'll cover the implementation with more tests.
  • Incremental sync: I've only implemented the ConversionSource methods for the snapshot sync. Once we agree on the approach, I'll implement the incremental sync methods.
  • Target to Paimon: I've focused only on implementing Paimon as a source. The target implementation will be out of scope for this contribution, if that's okay.

Things where I need help:

Hudi partitions and Paimon buckets

Paimon adds buckets as another folder level within partitions (e.g. partition=2025-09-01/bucket-0/data-file.parquet), whereas Hudi treats the bucket directory as part of the partition value (e.g. 2025-09-01/bucket-0). Looking at the code, the assumption that directory structure == partition values runs pretty deep, so I wonder if there is a way to work around that (one possible direction is sketched below), or whether we should call it out as a limitation between the two formats.
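For illustration, a minimal sketch of that workaround idea, assuming the bucket directory always has the form bucket-N; the class and method names here are hypothetical:

import java.util.regex.Pattern;

// Hedged sketch: drop Paimon's trailing "bucket-N" directory so the remaining
// relative path matches the partition values Hudi expects.
class PaimonPartitionPaths {
  private static final Pattern BUCKET_DIR = Pattern.compile("(^|/)bucket-\\d+$");

  // "partition=2025-09-01/bucket-0" -> "partition=2025-09-01"
  static String stripBucket(String relativeDir) {
    return BUCKET_DIR.matcher(relativeDir).replaceFirst("");
  }
}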

Iceberg Parquet conversion errors

Solved by #749

When executing ITConversionController#testVariousOperations with source=paimon and target=iceberg, I'm facing the following error:

class org.apache.iceberg.shaded.org.apache.arrow.vector.IntVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector 

From debugging, it seems the Parquet reader is trying to read the id int field with the VarWidthReader class instead of IntegerReader, causing the conversion issue.

Disabling vectorization doesn't help; doing so just makes the error surface at the Spark level instead, indicating there is something wrong with the schema conversion for Iceberg that I don't quite understand... 🤔

Thank you so much in advance for your help!

@rahil-c (Contributor) commented Sep 8, 2025

Thanks @mikedias for your contribution!

import java.util.Optional;
import java.util.Properties;
import java.util.UUID;
import java.util.*;
Contributor:

nitpick: generally we try to avoid * imports.
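For illustration, the wildcard would simply be expanded to whichever classes the file actually uses, e.g. (List and Map here are placeholder guesses; the exact set depends on the file):

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.UUID;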

Author:

got it, I'll fix my IntelliJ config!


/** Converts Paimon RowType to XTable InternalSchema. */
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class PaimonSchemaExtractor {
Contributor:

Adding a note here that's likely related to the Iceberg issues you are seeing in the ITs. The output of this does not include the meta fields _KEY_id, _SEQUENCE_NUMBER, _VALUE_KIND, which is likely messing up the type-to-offset mapping in Iceberg. Is there a way to extract the Paimon schema with these fields?

Author:

Thanks @the-other-tim-brown, that was a good insight!

I've added the special fields based on org.apache.paimon.table.SpecialFields helper and now the InternalSchema matches the parquet files.
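For context, a minimal sketch of that physical layout; the PaimonPhysicalSchema class, the id offset, and the placeholder ids are assumptions, and the real constants come from org.apache.paimon.table.SpecialFields:

import java.util.ArrayList;
import java.util.List;
import org.apache.paimon.types.DataField;
import org.apache.paimon.types.DataTypes;

// Hedged sketch: a Paimon primary-key table's data files store _KEY_-prefixed
// copies of the key columns, then _SEQUENCE_NUMBER and _VALUE_KIND, then the
// value columns.
class PaimonPhysicalSchema {
  static final int KEY_FIELD_ID_OFFSET = 1_000_000_000; // placeholder; see SpecialFields

  static List<DataField> withSystemFields(List<DataField> valueFields, List<String> primaryKeys) {
    List<DataField> physical = new ArrayList<>();
    for (DataField field : valueFields) {
      if (primaryKeys.contains(field.name())) {
        physical.add(new DataField(KEY_FIELD_ID_OFFSET + field.id(), "_KEY_" + field.name(), field.type()));
      }
    }
    physical.add(new DataField(Integer.MAX_VALUE - 1, "_SEQUENCE_NUMBER", DataTypes.BIGINT())); // placeholder id
    physical.add(new DataField(Integer.MAX_VALUE - 2, "_VALUE_KIND", DataTypes.TINYINT())); // placeholder id
    physical.addAll(valueFields);
    return physical;
  }
}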


With that, the error changed to:

class org.apache.iceberg.shaded.org.apache.arrow.vector.VarCharVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BitVector

This is still the same class of problem, but it also suggests that the order of fields matters to the Iceberg reader.

I've tried sorting by name and by fieldId, but I'm still getting variations of the same problem... :(

Any more ideas? 😅

Contributor:

Does Paimon attach the field index to the parquet data? The field ordering should match that. We have an option to pass in the index directly to the internal fields as well.

If not, you will not be able to read the data in some vendors like BigQuery and Snowflake.

Author:

I believe that Paimon's DataField.id should be the field index, and I'm passing it to InternalField.fieldId.

Where can we see the field index in the parquet file? Neither parquet-tools meta nor parquet-tools schema has any reference to an index, from what I can tell 🤔
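For reference, the mapping described above might look like this hedged sketch; toInternalSchema is a hypothetical helper for the type conversion:

import org.apache.paimon.types.DataField;
import org.apache.xtable.model.schema.InternalField;

// Carry Paimon's field id through to XTable so downstream formats can
// reproduce the parquet field IDs.
InternalField toInternalField(DataField paimonField) {
  return InternalField.builder()
      .name(paimonField.name())
      .fieldId(paimonField.id())
      .schema(toInternalSchema(paimonField.type())) // hypothetical helper
      .build();
}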

Contributor:

parquet-tools schema gives me the field ID when I run it.
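For reference, parquet-mr prints the field ID after the column name, so the output shape looks roughly like this (hedged example, not taken from this PR's files):

message table {
  optional int32 id = 1;
  optional binary name (STRING) = 2;
}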

Contributor:

I have found a way to get past this issue and am working on a PR for it now: #749

Author:

Amazing, thank you so much @the-other-tim-brown!!! 😄

Contributor:

I have merged that PR, please update your branch to pick up the changes when you get a chance.

Author:

Great, I'll do it and let you know!

Author:

@the-other-tim-brown I've rebased it, and the Iceberg tests are passing now! Thank you so much for finding the root cause and fixing it!


private List<InternalField> primaryKeyFields(
TableSchema paimonSchema, List<InternalField> internalFields) {
List<String> keys = paimonSchema.primaryKeys();
Contributor:

Is it possible for the primary key to be a nested field in Paimon?

Author:

That's a good question, I'll double-check.
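For context, one plausible reading of the excerpt above (hedged; the actual implementation may differ). A name-based lookup like this would only match top-level fields, which is why the nested-key question matters:

private List<InternalField> primaryKeyFields(
    TableSchema paimonSchema, List<InternalField> internalFields) {
  List<String> keys = paimonSchema.primaryKeys();
  return internalFields.stream()
      .filter(field -> keys.contains(field.getName()))
      .collect(Collectors.toList());
}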


return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
throw new RuntimeException(e);
Contributor:

We have some custom exception types like ReadException that may be better here to indicate there is an issue reading the initial state.

Author:

got it, will change!
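For illustration, the suggested change might look like this (hedged: the exact ReadException constructor in org.apache.xtable.exception is assumed to accept a message and a cause):

try {
  return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
  // wrap in XTable's ReadException instead of a bare RuntimeException
  throw new ReadException("Failed to read initial Paimon table state", e);
}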

Comment on lines 225 to 226
<source>11</source>
<target>11</target>
Contributor:

Is this a requirement for Paimon? We're only publishing a Java 8 target currently.

Author (Sep 9, 2025):

I'm getting these compatibility errors when using Java 8 on my machine, so I thought XTable was already on 11 but just wasn't configured for it.

Everything works fine on 11 🤔

Contributor:

Yes, we use Java 11 as the source version when developing, but not as our jar target when publishing.

Author:

Okay, I'll remove this.

I wonder if it can cause problems if developers end up using features from 11 while our target is 8 🤔
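For what it's worth, this exact concern is what javac's release flag addresses: it compiles against the Java 8 API, so 11-only library usage fails at compile time even when building on a newer JDK. A hedged sketch of the Maven configuration (the project's actual pom layout may differ):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- verify against the Java 8 API while building on JDK 11 -->
    <release>8</release>
  </configuration>
</plugin>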

@mikedias force-pushed the mdias/paimon-source-support branch from 62dd21f to 3ca95a7 on October 12, 2025, resolving conflicts in:
  xtable-api/src/main/java/org/apache/xtable/model/storage/TableFormat.java
  xtable-core/src/test/java/org/apache/xtable/GenericTable.java
  xtable-core/src/test/java/org/apache/xtable/ITConversionController.java
.filter(
    // TODO Hudi thinks that Paimon buckets are partition values; not sure how to handle it.
    // Filtering out the partition field in the comparison for now.
    field -> !field.equals(partitionField))
Author:

@the-other-tim-brown do you have any opinions about this?

To restate from the description: Paimon adds buckets as another folder level within partitions (e.g. partition=2025-09-01/bucket-0/data-file.parquet), whereas Hudi treats the bucket directory as part of the partition value (e.g. 2025-09-01/bucket-0). The assumption that directory structure == partition values seems to run pretty deep in the code, so I wonder if there is a way to work around that, or whether we should call it out as a limitation between the two formats.

Contributor:

Is this an issue in the Apache Hudi or XTable repo?

Author:

As far as I can see, it starts in the XTable repo, particularly in HudiPathUtils.getPartitionPath, which is used in many key areas of the codebase.

But I'm not familiar with Hudi enough to determine if that's a limitation there as well.

Contributor:

I will play around with this and see how far I can get.
