
Conversation

@badalprasadsingh (Collaborator) commented Sep 23, 2025

feat: Arrow-writer for Iceberg

This PR introduces an Apache Arrow-based Iceberg writer that writes both data and delete files to the object store and registers them in the Iceberg table through the Java API, passing their file paths.

It supports both:

  • full-refresh
  • CDC (equality deletes)

The current implementation converts Go-typed data into arrow.Record batches and uses the pqarrow library to write them to Parquet files, flushing each file once it reaches the target file size.
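
As a rough illustration of that flow, here is a minimal, self-contained sketch using the Arrow Go libraries (the v17 import paths, schema, field names, and values are assumptions for the example, not OLake's actual code):

package main

import (
    "bytes"

    "github.com/apache/arrow/go/v17/arrow"
    "github.com/apache/arrow/go/v17/arrow/array"
    "github.com/apache/arrow/go/v17/arrow/memory"
    "github.com/apache/arrow/go/v17/parquet"
    "github.com/apache/arrow/go/v17/parquet/compress"
    "github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func writeOneRecord() error {
    // Arrow schema mirroring the Go values being synced (hypothetical fields).
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "_olake_id", Type: arrow.BinaryTypes.String},
        {Name: "value", Type: arrow.PrimitiveTypes.Int64},
    }, nil)

    // Convert Go values into an arrow.Record.
    b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
    defer b.Release()
    b.Field(0).(*array.StringBuilder).AppendValues([]string{"id-1", "id-2"}, nil)
    b.Field(1).(*array.Int64Builder).AppendValues([]int64{10, 20}, nil)
    rec := b.NewRecord()
    defer rec.Release()

    // Write the record to a Parquet buffer with zstd compression, level 1.
    props := parquet.NewWriterProperties(
        parquet.WithCompression(compress.Codecs.Zstd),
        parquet.WithCompressionLevel(1),
    )
    var buf bytes.Buffer
    w, err := pqarrow.NewFileWriter(schema, &buf, props, pqarrow.DefaultWriterProps())
    if err != nil {
        return err
    }
    if err := w.Write(rec); err != nil {
        return err
    }
    return w.Close() // buf now holds a Parquet file ready for upload
}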

It introduces:

Rolling Writer Support

  • rolling data file writers
  • rolling delete file writers

for both partitioned and unpartitioned data, with:

  • zstd compression (level 1)
  • configurable target file sizes
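
A minimal sketch of the rolling mechanic, with illustrative names rather than OLake's actual API: accumulate bytes per file and roll to a fresh file once the target size is hit.

// rollingWriter is an illustrative sketch: it tracks bytes written and rolls
// to a new file once the configured target file size is reached.
type rollingWriter struct {
    targetFileSize int64
    bytesWritten   int64
    fileIndex      int
}

func (r *rollingWriter) write(recordBytes int64) {
    r.bytesWritten += recordBytes
    if r.bytesWritten >= r.targetFileSize {
        r.flush() // close the current file and start a new one
        r.fileIndex++
        r.bytesWritten = 0
    }
}

func (r *rollingWriter) flush() {
    // In the real writer this would close the pqarrow writer, upload the file
    // to the object store, and record its path for registration in the table.
}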

Fanout Partitioning Strategy

  • keeping multiple files open at the same time (no clustering or sorting required)
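
Conceptually, fanout routing keeps a map from partition key to an open writer and creates one lazily the first time a partition appears. A minimal sketch, reusing the illustrative rollingWriter above:

// fanoutWriter routes each record to the open writer for its partition,
// so records can arrive in any order. Names are illustrative.
type fanoutWriter struct {
    writers map[string]*rollingWriter // partition key -> open rolling writer
}

func (f *fanoutWriter) route(partitionKey string, recordBytes int64) {
    w, ok := f.writers[partitionKey]
    if !ok {
        // open a new rolling writer the first time this partition is seen
        w = &rollingWriter{targetFileSize: 128 << 20} // e.g. 128 MiB
        f.writers[partitionKey] = w
    }
    w.write(recordBytes)
}

Since records for any partition can arrive at any time, no clustering or sorting pass is needed; the trade-off is holding several files open simultaneously.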

Transforms Logic

  • identity, year, month, week, day, hour, bucket, truncate, and void; all Iceberg transforms are supported
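
For reference, hedged sketches of three of these transforms as the Iceberg spec defines them (the murmur3 import is an assumed third-party library; bucket is shown for strings only):

import (
    "time"

    "github.com/spaolacci/murmur3" // assumed library; Iceberg mandates 32-bit x86 Murmur3, seed 0
)

// bucketStr: Iceberg's bucket[N] for strings: hash the UTF-8 bytes with
// Murmur3, clear the sign bit, then take the result modulo N.
func bucketStr(v string, n int32) int32 {
    h := int32(murmur3.Sum32([]byte(v)))
    return (h & 0x7fffffff) % n
}

// yearTransform: the year transform yields years since the Unix epoch (1970).
func yearTransform(t time.Time) int32 {
    return int32(t.UTC().Year() - 1970)
}

// truncateStr: truncate[W] for strings keeps the first W code points.
func truncateStr(v string, w int) string {
    r := []rune(v)
    if len(r) <= w {
        return v
    }
    return string(r[:w])
}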

How to run it?

In your destination.json (when using the CLI), enable this toggle:

"arrow_writes": true

As your sync starts, you should see something like this in your logs:

INFO >>>> Arrow Writer Enabled >>>> >>>> >>>>

This indicates that OLake is using the Arrow writer.

Currently supports:

  • schema-evolution
  • all Iceberg catalogs (Glue, REST, Hadoop, JDBC, etc.)
  • all object stores (S3, ADLS, GCS, S3A, etc.)

@badalprasadsingh marked this pull request as ready for review September 29, 2025 03:15

// PartitionInfo represents a Iceberg partition column with its transform, preserving order
type PartitionInfo struct {
Contributor:
This cannot be in writers.go, since it's specific to Iceberg.


// setup java client
- func newIcebergClient(config *Config, partitionInfo []PartitionInfo, threadID string, check, upsert bool, destinationDatabase string) (*serverInstance, error) {
+ func newIcebergClient(config *Config, partitionInfo []destination.PartitionInfo, threadID string, check, upsert bool, destinationDatabase string) (*serverInstance, error) {
Contributor:
Why are there changes in this function? It should remain intact.

go.mongodb.org/mongo-driver v1.17.3
golang.org/x/tools v0.30.0
- google.golang.org/grpc v1.71.3
+ google.golang.org/grpc v1.72.0
Contributor:
Do not upgrade the gRPC version; it breaks protobuf handling in some cases.

}

func (i *Iceberg) Check(ctx context.Context) error {
if i.UseArrowWrites() {
Contributor:
remove


// note: the Java server parses time from a long value, which will be in milliseconds
func (i *Iceberg) Write(ctx context.Context, records []types.RawRecord) error {
if i.UseArrowWrites() {
Contributor:
Why is this not happening in setup?

}

if r.fileType == "delete" {
icebergSchemaJSON := fmt.Sprintf(`{"type":"struct","schema-id":0,"fields":[{"id":%d,"name":"_olake_id","required":true,"type":"string"}]}`, r.FieldId)
Contributor:
why "schema-id":0?

record.Release()

sizeSoFar := int64(0)
if r.currentBuffer != nil {
Contributor:
How can this be nil? You are always setting it.

sizeSoFar += int64(r.currentBuffer.Len())
}
if r.currentWriter != nil {
sizeSoFar += r.currentWriter.RowGroupTotalBytesWritten()
Contributor:
Let's discuss why we are adding it twice.

return fmt.Errorf("failed to write delete arrow record to parquet: %w", err)
}

if uploadData != nil {
Contributor:
Let's handle this in a more proper manner; we are calling the uploadparquetfile function too many times. It could be extracted into a helper, and we should also check whether we can call it just once at the end.

}

rec, err := arrow_writer.CreateDelArrRecord(deletes, fieldId)
if err != nil {
Contributor:
What about update operations? How is the delete file created in the partition?
