[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154

khandelwal-prateek · 2025-11-11T08:16:35Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-2231

Description

Here are some details about my PR, including screenshots (if applicable):
This PR introduces IcebergFileStreamExtractor, enabling partition-aware file-level streaming from Iceberg tables to any destination (e.g., Azure, S3, HDFS). This is a follow-up PR to the IcebergSource and IcebergFileStreamHelper implementation - [GOBBLIN-2231] Implement IcebergSource to enable copying Iceberg data files to any dest #4146.

Additions

Added IcebergFileStreamExtractor extending FileBasedExtractor.
Implemented partition-aware destination path computation for date-partitioned Iceberg tables.
Integrated with IcebergSource to consume partition metadata from work units.
Builds CopyableFile with both source and destination metadata.

Behavior

Reads partition → path mapping from the work unit.
Computes destination paths while preserving partition structure:
Partitioned tables: <finalDir>/<partitionPath>/<filename>

Example: /data/table1/datepartition=2025-04-01-00/file.parquet

Non-partitioned tables: <finalDir>/<filename>

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

…ceberg to any dest

Copilot

Pull Request Overview

This PR introduces IcebergFileStreamExtractor to enable partition-aware file-level copying from Iceberg tables to any destination (e.g., Azure, S3, HDFS). This is a follow-up to the IcebergSource implementation.

Key Changes:

Added IcebergFileStreamExtractor extending FileBasedExtractor for file streaming mode
Implemented partition-aware destination path computation using metadata from work units
Added comprehensive test coverage for both partitioned and non-partitioned scenarios

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

File	Description
IcebergFileStreamExtractor.java	Main extractor implementation with partition-aware destination path computation and file streaming logic
IcebergFileStreamExtractorTest.java	Comprehensive test suite covering partitioned/non-partitioned files, metadata preservation, and error handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-11T09:32:22Z

...rc/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergFileStreamExtractor.java

+      this.fileToPartitionPathMap = gson.fromJson(partitionPathJson,
+          new TypeToken<Map<String, String>>() {}.getType());


The JSON parsing with gson.fromJson() can throw JsonSyntaxException if the JSON is malformed. This should be wrapped in a try-catch block to provide a more informative error message and prevent the extractor from failing to initialize with an unclear exception.

Copilot · 2025-11-11T09:32:22Z

...rc/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergFileStreamExtractor.java

+    Configuration hadoopConf = new Configuration();
+    FileSystem originFs = sourcePath.getFileSystem(hadoopConf);


A new Configuration and FileSystem instance is created for every file in downloadFile(). This is inefficient as the method is called for each file. Consider reusing the FileSystem instance or using the fsHelper's FileSystem that's already connected. Similar pattern exists at line 122 where another FileSystem is created via WriterUtils.getFsConfiguration().

...est/java/org/apache/gobblin/data/management/copy/iceberg/IcebergFileStreamExtractorTest.java

codecov-commenter · 2025-11-11T14:13:59Z

Codecov Report

❌ Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.56%. Comparing base (8ba6c1f) to head (e01763a).
⚠️ Report is 4 commits behind head on master.

Files with missing lines	Patch %	Lines
...ement/copy/iceberg/IcebergFileStreamExtractor.java	93.47%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #4154      +/-   ##
============================================
+ Coverage     48.96%   51.56%   +2.60%     
+ Complexity    10148     7690    -2458     
============================================
  Files          1912     1402     -510     
  Lines         74708    53289   -21419     
  Branches       8289     5857    -2432     
============================================
- Hits          36580    27481    -9099     
+ Misses        34852    23473   -11379     
+ Partials       3276     2335     -941

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Added IcebergFileStreamExtractor for partition-aware file copy from I…

80b8afe

…ceberg to any dest

Blazer-007 requested a review from Copilot November 11, 2025 09:20

Copilot started reviewing on behalf of Blazer-007 November 11, 2025 09:20 View session

Copilot finished reviewing on behalf of Blazer-007 November 11, 2025 09:23

Copilot AI reviewed Nov 11, 2025

View reviewed changes

khandelwal-prateek added 3 commits November 11, 2025 16:52

Address review comments and fix indentation

7343f6e

Add IcebergFileStreamHelper change

836bfa5

Fix test failure

e01763a

Blazer-007 merged commit 9ad7ad2 into apache:master Nov 11, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154

[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154

Uh oh!

khandelwal-prateek commented Nov 11, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 11, 2025

Uh oh!

Blazer-007 Nov 11, 2025

Uh oh!

Copilot AI Nov 11, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Nov 11, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		this.fileToPartitionPathMap = gson.fromJson(partitionPathJson,
		new TypeToken<Map<String, String>>() {}.getType());

		Configuration hadoopConf = new Configuration();
		FileSystem originFs = sourcePath.getFileSystem(hadoopConf);

[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154

[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154

Uh oh!

Conversation

khandelwal-prateek commented Nov 11, 2025

JIRA

Description

Tests

Commits

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Blazer-007 Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Nov 11, 2025 •

edited

Loading