[GOBBLIN-2231] Added extractor for partition-aware file copy from Iceberg to any dest #4154
Pull Request Overview
This PR introduces IcebergFileStreamExtractor to enable partition-aware file-level copying from Iceberg tables to any destination (e.g., Azure, S3, HDFS). This is a follow-up to the IcebergSource implementation.
Key Changes:
- Added `IcebergFileStreamExtractor` extending `FileBasedExtractor` for file streaming mode
- Implemented partition-aware destination path computation using metadata from work units
- Added comprehensive test coverage for both partitioned and non-partitioned scenarios
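The partition-aware destination path computation listed above can be illustrated with a minimal, standalone sketch. It assumes the layout given in the PR description (`<finalDir>/<partitionPath>/<filename>` for partitioned tables, `<finalDir>/<filename>` otherwise) and a file-to-partition-path map like the `fileToPartitionPathMap` field quoted in the review. The class name `DestinationPathSketch`, the method `destinationFor`, and the example paths are hypothetical and are not taken from the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the code from IcebergFileStreamExtractor. */
final class DestinationPathSketch {

  /**
   * Returns <finalDir>/<partitionPath>/<filename> when the source file has a
   * partition path in the map, and <finalDir>/<filename> otherwise.
   */
  static String destinationFor(String finalDir, String sourceFile,
      Map<String, String> fileToPartitionPathMap) {
    String base = trimSlashes(finalDir, false);
    String fileName = sourceFile.substring(sourceFile.lastIndexOf('/') + 1);
    String partitionPath = fileToPartitionPathMap.get(sourceFile);
    String partition = partitionPath == null ? "" : trimSlashes(partitionPath, true);
    return partition.isEmpty()
        ? base + "/" + fileName
        : base + "/" + partition + "/" + fileName;
  }

  /** Drops trailing slashes, and leading ones too when requested. */
  private static String trimSlashes(String s, boolean leadingToo) {
    int start = 0;
    int end = s.length();
    while (end > start && s.charAt(end - 1) == '/') {
      end--;
    }
    while (leadingToo && start < end && s.charAt(start) == '/') {
      start++;
    }
    return s.substring(start, end);
  }

  public static void main(String[] args) {
    // Hypothetical paths, chosen only for the demonstration.
    String partitioned = "/src/tbl/data/dt=2025-11-11/part-0.parquet";
    String unpartitioned = "/src/tbl/data/part-1.parquet";
    Map<String, String> fileToPartitionPathMap = new HashMap<>();
    fileToPartitionPathMap.put(partitioned, "dt=2025-11-11");

    System.out.println(destinationFor("/dest/final/", partitioned, fileToPartitionPathMap));
    System.out.println(destinationFor("/dest/final", unpartitioned, fileToPartitionPathMap));
  }
}
```

A file missing from the map is treated as non-partitioned here; whether the extractor should instead fail in that case is a design choice this sketch does not settle.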
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| IcebergFileStreamExtractor.java | Main extractor implementation with partition-aware destination path computation and file streaming logic |
| IcebergFileStreamExtractorTest.java | Comprehensive test suite covering partitioned/non-partitioned files, metadata preservation, and error handling |
    this.fileToPartitionPathMap = gson.fromJson(partitionPathJson,
        new TypeToken<Map<String, String>>() {}.getType());
Copilot AI, Nov 11, 2025
The JSON parsing with gson.fromJson() can throw JsonSyntaxException if the JSON is malformed. This should be wrapped in a try-catch block to provide a more informative error message and prevent the extractor from failing to initialize with an unclear exception.
+1
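The wrapping suggested in this thread might look like the sketch below. To stay free of external dependencies it takes the parser as a `Function`, which stands in for the `gson.fromJson(...)` call quoted above; `JsonSyntaxException` is a `RuntimeException`, so catching `RuntimeException` covers it. The class name `PartitionMapParsingSketch`, the method `parsePartitionMap`, the error message, and the choice to treat blank input as an empty map are all assumptions, not the PR's code.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; the parser argument stands in for gson.fromJson. */
final class PartitionMapParsingSketch {

  /**
   * Parses the file-to-partition-path JSON and converts any parser failure
   * into an IOException that says what was being parsed.
   */
  static Map<String, String> parsePartitionMap(String json,
      Function<String, Map<String, String>> parser) throws IOException {
    // Assumption: missing or blank input means "no partition info".
    if (json == null || json.trim().isEmpty()) {
      return Collections.<String, String>emptyMap();
    }
    try {
      Map<String, String> parsed = parser.apply(json);
      return parsed == null ? Collections.<String, String>emptyMap() : parsed;
    } catch (RuntimeException e) {
      throw new IOException("Failed to parse file-to-partition-path JSON ("
          + json.length() + " chars): " + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    // Stand-in parser that always fails, to show the wrapped error.
    Function<String, Map<String, String>> failing = s -> {
      throw new IllegalStateException("malformed input");
    };
    try {
      parsePartitionMap("{not json", failing);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Keeping the original exception as the cause preserves the parser's own diagnostics while the outer message identifies which configuration value was malformed.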
    Configuration hadoopConf = new Configuration();
    FileSystem originFs = sourcePath.getFileSystem(hadoopConf);
Copilot AI, Nov 11, 2025
A new Configuration and FileSystem instance is created for every file in downloadFile(). This is inefficient as the method is called for each file. Consider reusing the FileSystem instance or using the fsHelper's FileSystem that's already connected. Similar pattern exists at line 122 where another FileSystem is created via WriterUtils.getFsConfiguration().
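One way to act on this comment is to open a handle once per source and reuse it across files. The sketch below shows that idea in dependency-free form: the type parameter `FS` stands in for Hadoop's `FileSystem`, and the `opener` function stands in for the `sourcePath.getFileSystem(hadoopConf)` call quoted above. The class name `FileSystemReuseSketch`, the method `forPath`, and the scheme-plus-authority cache key are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Illustrative sketch only; FS stands in for Hadoop's FileSystem. */
final class FileSystemReuseSketch<FS> {

  private final Map<String, FS> handles = new ConcurrentHashMap<>();
  private final Function<URI, FS> opener;

  FileSystemReuseSketch(Function<URI, FS> opener) {
    this.opener = opener;
  }

  /** Returns the cached handle for the path's scheme and authority, opening it on first use. */
  FS forPath(URI path) {
    String key = path.getScheme() + "://" + path.getAuthority();
    return handles.computeIfAbsent(key, k -> opener.apply(path));
  }

  public static void main(String[] args) {
    AtomicInteger opens = new AtomicInteger();
    FileSystemReuseSketch<String> cache =
        new FileSystemReuseSketch<>(uri -> "handle-" + opens.incrementAndGet());

    // Hypothetical URIs: two files on one source, one file on another.
    cache.forPath(URI.create("hdfs://nn1/data/a.parquet"));
    cache.forPath(URI.create("hdfs://nn1/data/b.parquet"));
    cache.forPath(URI.create("s3a://bucket/data/c.parquet"));

    System.out.println("handles opened: " + opens.get());
  }
}
```

In the extractor itself a single field initialized once, or the `fsHelper` file system the comment mentions, may be simpler than a map when all files come from one source.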
Codecov Report
❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| Metric | master | #4154 | +/- |
|---|---|---|---|
| Coverage | 48.96% | 51.56% | +2.60% |
| Complexity | 10148 | 7690 | -2458 |
| Files | 1912 | 1402 | -510 |
| Lines | 74708 | 53289 | -21419 |
| Branches | 8289 | 5857 | -2432 |
| Hits | 36580 | 27481 | -9099 |
| Misses | 34852 | 23473 | -11379 |
| Partials | 3276 | 2335 | -941 |
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Additions
Behavior
Partitioned tables:
<finalDir>/<partitionPath>/<filename>
Non-partitioned tables:
<finalDir>/<filename>
Tests
Commits