[Draft] Paimon Source Support #742
Conversation
Thanks @mikedias for your contribution!
```diff
-import java.util.Optional;
-import java.util.Properties;
-import java.util.UUID;
+import java.util.*;
```
nitpick: generally we try to avoid `*` imports.
got it, I'll fix my IntelliJ config!
```java
/** Converts Paimon RowType to XTable InternalSchema. */
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class PaimonSchemaExtractor {
```
Adding a note here likely related to the issues with Iceberg you are seeing in the ITs. The output of this does not include the meta fields `_KEY_id`, `_SEQUENCE_NUMBER`, and `_VALUE_KIND`. This is likely messing up the type to offset mapping in Iceberg. Is there a way to extract the Paimon schema with these fields?
Thanks @the-other-tim-brown, that was a good insight! I've added the special fields based on the `org.apache.paimon.table.SpecialFields` helper and now the InternalSchema matches the parquet files.
With that, the error changed to:

```
class org.apache.iceberg.shaded.org.apache.arrow.vector.VarCharVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BitVector
```

This indicates that it's still the same class of problem, but also that the order of the fields here matters to the Iceberg reader. I've tried sorting by `name` and by `fieldId`, but I'm still getting variations of the same problem... :( Any more ideas? 😅
Does Paimon attach the field index to the parquet data? The field ordering should match that. We have an option to pass in the index directly to the internal fields as well.
If not, you will not be able to read the data in some vendors like BigQuery and Snowflake.
I believe that Paimon `DataField.id` should be the field index, and I'm passing it to `InternalField.fieldId`. Where can we see the field index in the parquet file? Neither `parquet-tools meta` nor `parquet-tools schema` has any reference to an index, from what I can tell 🤔
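For reference, the mapping is roughly this (a sketch, assuming XTable's `InternalField` builder; `convertType` stands in for the actual type conversion):

```java
// Sketch of the DataField -> InternalField mapping inside PaimonSchemaExtractor.
private InternalField toInternalField(DataField paimonField) {
  return InternalField.builder()
      .name(paimonField.name())
      .fieldId(paimonField.id()) // Paimon's per-column id, used as the field index
      .schema(convertType(paimonField.type()))
      .build();
}
```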
`parquet-tools schema` gives me the field ID when I run it.
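For example (illustrative output, not from this table; the trailing `= N` on each field is the field id):

```
message paimon_table {
  required int32 id = 1;
  optional binary name (STRING) = 2;
}
```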
I have found a way to get past this issue, working on a PR now for it: #749
Amazing, thank you so much @the-other-tim-brown!!! 😄
I have merged that PR, please update your branch to pick up the changes when you get a chance.
Great, I'll do it and let you know!
@the-other-tim-brown I've rebased it, and the Iceberg tests are passing now! Thank you so much for finding the root cause and fixing it!
```java
private List<InternalField> primaryKeyFields(
    TableSchema paimonSchema, List<InternalField> internalFields) {
  List<String> keys = paimonSchema.primaryKeys();
```
Is it possible for the primary key to be a nested field in Paimon?
that is a good question, I'll double check
```java
  return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
  throw new RuntimeException(e);
```
We have some custom exception types like `ReadException` that may be better here to indicate there is an issue reading the initial state.
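Something along these lines (a sketch, assuming `ReadException` takes a message and a cause like the other XTable exceptions):

```java
try {
  return new PaimonConversionSource(paimonTable);
} catch (IOException e) {
  // surface a read failure with XTable's exception type instead of a bare RuntimeException
  throw new ReadException("Failed to initialize Paimon conversion source", e);
}
```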
got it, will change!
xtable-core/pom.xml
```xml
<source>11</source>
<target>11</target>
```
Is this a requirement for Paimon? We're only publishing to a java 8 target currently
Yes, we use Java 11 as the source version when developing, but not as our jar target when publishing.
Okay, I'll remove this.
I wonder if it can cause problems if developers end up using features from 11 and our target is 8 🤔
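One common guard for exactly that is the compiler's `release` option, which, unlike `source`/`target` alone, also checks API usage against the Java 8 platform (a sketch, not necessarily how this repo should wire it):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- fails the build if the code references APIs that don't exist on Java 8 -->
    <release>8</release>
  </configuration>
</plugin>
```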
```java
.filter(
    // TODO Hudi thinks that paimon buckets are partition values, not sure how to handle it
    // filtering out the partition field on the comparison for now
    field -> !field.equals(partitionField))
```
@the-other-tim-brown do you have any opinions about this?

Paimon has buckets as another folder division within partitions (e.g. `partition=2025-09-01/bucket-0/data-file.parquet`); however, Hudi considers the bucket value as part of the partition value (e.g. `2025-09-01/bucket-0`). Looking at the code, it seems the directory structure == partition values assumption is pretty deep, so I wonder if there is a way to work around that, or whether we should call it out as a limitation between the two formats.
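To make the mismatch concrete (illustrative paths):

```
partition=2025-09-01/bucket-0/data-file.parquet
  Paimon: partition value = 2025-09-01, bucket = 0
  Hudi:   partition value = 2025-09-01/bucket-0
```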
Is this an issue in the Apache Hudi or XTable repo?
I will play around with this and see how far I can get.
Issue: #275
Draft PR to add support for Apache Paimon as the source table. This is an early sharing to collect feedback on whether the approach is right and on a few issues I've encountered. 🙂
Things that are still missing:

- Tests: for now I'm using the `ITConversionController` to validate the code. Once we are good with the approach, I'll cover the implementation with more tests.
- Incremental sync: for now I've implemented only the `ConversionSource` methods for the snapshot sync. Once we are good with the approach, I'll implement the incremental sync methods.

Things where I need help:
Hudi partitions and Paimon buckets

Paimon has buckets as another folder division within partitions (e.g. `partition=2025-09-01/bucket-0/data-file.parquet`); however, Hudi considers the bucket value as part of the partition value (e.g. `2025-09-01/bucket-0`). Looking at the code, it seems the directory structure == partition values assumption is pretty deep, so I wonder if there is a way to work around that, or whether we should call it out as a limitation between the two formats.

Iceberg Parquet conversion errors ✅
Solved by #749

When executing `ITConversionController#testVariousOperations` with source=paimon and target=iceberg, I'm facing the following error:

(stack trace screenshot)

From debugging, it seems the Parquet reader is trying to read the `id int` field value with the `VarWidthReader` class instead of the `IntegerReader`, causing the conversion issue. Disabling vectorization doesn't help; doing so makes the error appear at the Spark level, indicating that there is something wrong with the schema conversion for Iceberg that I don't quite understand... 🤔
Thank you so much in advance for your help!