Skip to content

Conversation

@khandelwal-prateek
Copy link
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

Additions

  • Added IcebergFileStreamExtractor extending FileBasedExtractor.
  • Implemented partition-aware destination path computation for date-partitioned Iceberg tables.
  • Integrated with IcebergSource to consume partition metadata from work units.
  • Builds CopyableFile with both source and destination metadata.

Behavior

  • Reads partition → path mapping from the work unit.
  • Computes destination paths while preserving partition structure:
  • Partitioned tables: <finalDir>/<partitionPath>/<filename>
Example: /data/table1/datepartition=2025-04-01-00/file.parquet

Non-partitioned tables: <finalDir>/<filename>

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces IcebergFileStreamExtractor to enable partition-aware file-level copying from Iceberg tables to any destination (e.g., Azure, S3, HDFS). This is a follow-up to the IcebergSource implementation.

Key Changes:

  • Added IcebergFileStreamExtractor extending FileBasedExtractor for file streaming mode
  • Implemented partition-aware destination path computation using metadata from work units
  • Added comprehensive test coverage for both partitioned and non-partitioned scenarios

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

File Description
IcebergFileStreamExtractor.java Main extractor implementation with partition-aware destination path computation and file streaming logic
IcebergFileStreamExtractorTest.java Comprehensive test suite covering partitioned/non-partitioned files, metadata preservation, and error handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 69 to 70
this.fileToPartitionPathMap = gson.fromJson(partitionPathJson,
new TypeToken<Map<String, String>>() {}.getType());
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON parsing with gson.fromJson() can throw JsonSyntaxException if the JSON is malformed. This should be wrapped in a try-catch block to provide a more informative error message and prevent the extractor from failing to initialize with an unclear exception.

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines 115 to 116
Configuration hadoopConf = new Configuration();
FileSystem originFs = sourcePath.getFileSystem(hadoopConf);
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new Configuration and FileSystem instance is created for every file in downloadFile(). This is inefficient as the method is called for each file. Consider reusing the FileSystem instance or using the fsHelper's FileSystem that's already connected. Similar pattern exists at line 122 where another FileSystem is created via WriterUtils.getFsConfiguration().

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link

codecov-commenter commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.56%. Comparing base (8ba6c1f) to head (e01763a).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...ement/copy/iceberg/IcebergFileStreamExtractor.java 93.47% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4154      +/-   ##
============================================
+ Coverage     48.96%   51.56%   +2.60%     
+ Complexity    10148     7690    -2458     
============================================
  Files          1912     1402     -510     
  Lines         74708    53289   -21419     
  Branches       8289     5857    -2432     
============================================
- Hits          36580    27481    -9099     
+ Misses        34852    23473   -11379     
+ Partials       3276     2335     -941     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Blazer-007 Blazer-007 merged commit 9ad7ad2 into apache:master Nov 11, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants