feat: arrow writer #531
Conversation
// PartitionInfo represents a Iceberg partition column with its transform, preserving order
type PartitionInfo struct {
This cannot be in writers.go, as it is specific to Iceberg.
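A minimal sketch of what moving it out could look like, assuming `PartitionInfo` only carries the source column name and its transform (the package placement and field names here are hypothetical, not the PR's code):

```go
// Hypothetical placement: keep the Iceberg-specific type inside the iceberg
// destination package instead of the shared writers.go.
package iceberg

// PartitionInfo pairs a source column with its Iceberg partition transform,
// preserving the order in which the partition fields were declared.
type PartitionInfo struct {
	Field     string // source column name (assumed field name)
	Transform string // e.g. "identity", "day", "bucket[16]" (assumed field name)
}
```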
// setup java client
- func newIcebergClient(config *Config, partitionInfo []PartitionInfo, threadID string, check, upsert bool, destinationDatabase string) (*serverInstance, error) {
+ func newIcebergClient(config *Config, partitionInfo []destination.PartitionInfo, threadID string, check, upsert bool, destinationDatabase string) (*serverInstance, error) {
Why are there changes in this function? It should remain intact.
go.mongodb.org/mongo-driver v1.17.3
golang.org/x/tools v0.30.0
- google.golang.org/grpc v1.71.3
+ google.golang.org/grpc v1.72.0
Do not upgrade the gRPC version; it breaks protobuf handling in some cases.
}

func (i *Iceberg) Check(ctx context.Context) error {
    if i.UseArrowWrites() {
remove
// note: java server parses time from long value which will in milliseconds
func (i *Iceberg) Write(ctx context.Context, records []types.RawRecord) error {
    if i.UseArrowWrites() {
why is this not happening in setup?
}

if r.fileType == "delete" {
    icebergSchemaJSON := fmt.Sprintf(`{"type":"struct","schema-id":0,"fields":[{"id":%d,"name":"_olake_id","required":true,"type":"string"}]}`, r.FieldId)
why "schema-id":0?
record.Release()

sizeSoFar := int64(0)
if r.currentBuffer != nil {
How can this be nil? You are always setting it.
    sizeSoFar += int64(r.currentBuffer.Len())
}
if r.currentWriter != nil {
    sizeSoFar += r.currentWriter.RowGroupTotalBytesWritten()
Let's discuss why we are adding it twice.
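For reference, a tiny sketch of what the two terms appear to represent (names and semantics assumed from the diff, not confirmed): bytes already handed to the Parquet writer's row groups plus bytes still sitting in the in-memory buffer. Whether they double-count depends on whether the buffer is reset on every flush.

```go
package writer

import "bytes"

// estimatedFileSize sketches the size check under discussion: the sum of
// bytes already flushed into parquet row groups and bytes still buffered in
// memory. If the buffer is reset after every flush the two terms are disjoint.
func estimatedFileSize(buf *bytes.Buffer, rowGroupBytesWritten int64) int64 {
	size := rowGroupBytesWritten
	if buf != nil {
		size += int64(buf.Len())
	}
	return size
}
```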
    return fmt.Errorf("failed to write delete arrow record to parquet: %w", err)
}

if uploadData != nil {
Let's handle this in a more proper manner; we are calling the uploadparquetfile function too many times. It can be converted into a helper function, and we should also look at whether we can call it just once at the end.
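One way to do what this comment asks, sketched with invented names and an assumed upload signature (not the PR's code): collect finished files while writing and upload them from a single place, e.g. on close.

```go
package writer

import (
	"context"
	"fmt"
)

// uploadFn stands in for the real parquet upload function; the signature is assumed.
type uploadFn func(ctx context.Context, localPath string) (remotePath string, err error)

// pendingUploads gathers finished local parquet files so they can be uploaded
// in one pass instead of calling the upload function at every call site.
type pendingUploads struct {
	files  []string
	upload uploadFn
}

// add records a finished local parquet file for later upload.
func (p *pendingUploads) add(localPath string) {
	p.files = append(p.files, localPath)
}

// flush uploads everything once, e.g. from Close(), and returns the remote paths.
func (p *pendingUploads) flush(ctx context.Context) ([]string, error) {
	remote := make([]string, 0, len(p.files))
	for _, f := range p.files {
		r, err := p.upload(ctx, f)
		if err != nil {
			return nil, fmt.Errorf("upload %s: %w", f, err)
		}
		remote = append(remote, r)
	}
	p.files = nil
	return remote, nil
}
```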
}

rec, err := arrow_writer.CreateDelArrRecord(deletes, fieldId)
if err != nil {
What about if it's an update operation? How is the delete file being created in the partition?
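For context on the question, a sketch of the usual Iceberg v2 upsert pattern (all names invented, not the PR's code): an update is typically written as an equality delete on the key column plus a fresh data row, both routed to the partition of the affected row.

```go
package writer

// changeEvent is an invented stand-in for an upstream update event.
type changeEvent struct {
	OlakeID string
	Row     map[string]any
}

// equalityDelete carries just the key value that identifies the old row version.
type equalityDelete struct{ OlakeID string }

// splitUpsert turns one update into the two writes Iceberg v2 expects:
// an equality delete for the old version of the key and the new data row.
func splitUpsert(ev changeEvent) (equalityDelete, map[string]any) {
	return equalityDelete{OlakeID: ev.OlakeID}, ev.Row
}
```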
feat: Arrow-writer for Iceberg
This PR introduces an Apache Arrow-based Iceberg writer that writes both `data` and `delete` files into the object store and registers them in the Iceberg table through the Java API by passing their file paths. It thus supports both:

- data file writes (appends)
- delete file writes (equality-deletes)

The current implementation converts Go-typed data into `arrow.Record` and uses the `pqarrow` library to write the Arrow data into Parquet files, flushing each file exactly on reaching the target file size.
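A self-contained sketch of that flow, building an `arrow.Record` from Go values and writing it to Parquet via `pqarrow` with zstd compression. This is not the PR's code: the column names are made up and the import paths assume arrow-go v17, so adjust them to the module version the repo actually pins.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/arrow/array"
	"github.com/apache/arrow/go/v17/arrow/memory"
	"github.com/apache/arrow/go/v17/parquet"
	"github.com/apache/arrow/go/v17/parquet/compress"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func main() {
	pool := memory.NewGoAllocator()

	// Arrow schema for a toy record: a string id and an int64 value column.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "_olake_id", Type: arrow.BinaryTypes.String},
		{Name: "value", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	// Build an arrow.Record from plain Go slices.
	bldr := array.NewRecordBuilder(pool, schema)
	defer bldr.Release()
	bldr.Field(0).(*array.StringBuilder).AppendValues([]string{"a", "b"}, nil)
	bldr.Field(1).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
	rec := bldr.NewRecord()
	defer rec.Release()

	// Write the record to an in-memory Parquet file with zstd, level 1.
	var buf bytes.Buffer
	props := parquet.NewWriterProperties(
		parquet.WithCompression(compress.Codecs.Zstd),
		parquet.WithCompressionLevel(1),
	)
	w, err := pqarrow.NewFileWriter(schema, &buf, props, pqarrow.DefaultWriterProps())
	if err != nil {
		panic(err)
	}
	if err := w.Write(rec); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes of parquet\n", buf.Len())
}
```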
It introduces:

Rolling Writer Support
- `data` file writers
- `delete` file writers

for both partitioned and unpartitioned data, with:
- zstd compression (level 1), etc.

Fanout Partitioning Strategy

Transforms Logic
- `identity`, `year`, `month`, `week`, `day`, `hour`, `bucket`, `truncate`, `void`; all Iceberg transforms supported
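As a point of reference for the transforms logic (a sketch, not the PR's implementation): the Iceberg spec defines `bucket[N]` as the 32-bit Murmur3 hash of the value's serialized bytes, masked to a non-negative integer and taken modulo N, and `truncate[W]` on strings as the first W code points. The sketch below assumes the `github.com/spaolacci/murmur3` package for the hash.

```go
package main

import (
	"fmt"

	"github.com/spaolacci/murmur3"
)

// bucketString applies Iceberg's bucket[N] transform to a string value:
// murmur3_x86_32 of the UTF-8 bytes, masked to non-negative, modulo N.
func bucketString(v string, n uint32) uint32 {
	h := murmur3.Sum32([]byte(v))
	return (h & 0x7fffffff) % n
}

// truncateString applies Iceberg's truncate[W] transform to a string:
// the first W code points of the value.
func truncateString(v string, w int) string {
	r := []rune(v)
	if len(r) <= w {
		return v
	}
	return string(r[:w])
}

func main() {
	fmt.Println(bucketString("order-123", 16)) // deterministic bucket id in [0, 16)
	fmt.Println(truncateString("abcdefgh", 3)) // "abc"
}
```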
How to run it?

In your `destination.json` (while using the CLI), enable this toggle:

As your sync starts, you should see something like this in your logs:
This indicates `OLake` is using the Arrow writer successfully.

Currently supports: