Use RapidsFileIO for writing data. #13501

Merged
liurenjie1024 merged 18 commits into NVIDIA:branch-25.12 from liurenjie1024:ray/13471 on Oct 10, 2025

Conversation

@liurenjie1024

Collaborator

Fixes #13471 .

Description

This is blocked by NVIDIA/cudf-spark-jni#3768. This PR implements the output-related interfaces for HadoopFileIO and IcebergFileIO and uses them in ColumnarOutputWriter.
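
To illustrate the general shape of such an output interface, here is a minimal sketch in plain Java. The names `SketchFileIO`, `SketchOutputFile`, and `LocalFileIO` are hypothetical stand-ins, not the actual RapidsFileIO, HadoopFileIO, or IcebergFileIO classes, and the local-filesystem backend is an assumption chosen only so the example runs without Hadoop or Iceberg on the classpath.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Objects;

final class FileIOSketchDemo {
    // Writes text through the abstraction only; no backend-specific calls appear here.
    static long writeText(SketchFileIO io, String location, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = io.newOutputFile(location).create()) {
            out.write(bytes);
        }
        return bytes.length;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("fileio-sketch");
        String target = dir.resolve("part-00000.txt").toString();
        long written = writeText(new LocalFileIO(), target, "hello");
        System.out.println("wrote " + written + " bytes to " + target);
    }
}

// Hypothetical stand-in for a file I/O abstraction; not the real RapidsFileIO API.
interface SketchFileIO {
    SketchOutputFile newOutputFile(String location);
}

// Hypothetical handle for a file that can be created and written.
interface SketchOutputFile {
    String location();

    // Creates the file and returns a stream to it; fails if the file already exists.
    OutputStream create() throws IOException;
}

// Assumed backend for illustration: the local filesystem via java.nio.
final class LocalFileIO implements SketchFileIO {
    @Override
    public SketchOutputFile newOutputFile(String location) {
        Objects.requireNonNull(location, "location can't be null");
        Path path = Paths.get(location);
        return new SketchOutputFile() {
            @Override
            public String location() {
                return location;
            }

            @Override
            public OutputStream create() throws IOException {
                return Files.newOutputStream(
                    path, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            }
        };
    }
}
```

The caller in `writeText` depends only on the two interfaces, so a different backend could be substituted without changing it.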

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Copilot AI review requested due to automatic review settings September 26, 2025 06:11

Copilot AI left a comment

Contributor

Pull Request Overview

This PR implements output functionality for RapidsFileIO to support writing data through a unified file I/O interface, addressing issue #13471. The changes integrate file I/O operations for both Hadoop and Iceberg environments to improve consistency and enable better file system abstraction.

  • Added output stream and output file implementations for both Hadoop and Iceberg file systems
  • Updated all columnar output writers to use RapidsFileIO instead of direct Hadoop filesystem calls
  • Modified constructor signatures across file format writers to accept the new file I/O parameter
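
The constructor change in the last bullet can be pictured with a small sketch. Everything here is hypothetical (`StreamFactory`, `SketchWriter`, `InMemoryStreamFactory`, and the `mem://` location are illustrative names, not classes or paths from this PR): the writer receives its stream source as a constructor argument instead of reaching for a filesystem itself, which also lets a test pass in an in-memory implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class InjectedWriterDemo {
    public static void main(String[] args) throws IOException {
        InMemoryStreamFactory factory = new InMemoryStreamFactory();
        try (SketchWriter writer = new SketchWriter("mem://out/part-0", factory)) {
            writer.writeRow("a,1");
            writer.writeRow("b,2");
        }
        System.out.print(factory.contents("mem://out/part-0"));
    }
}

// Hypothetical interface standing in for the file I/O constructor parameter.
interface StreamFactory {
    OutputStream create(String location) throws IOException;
}

// Hypothetical writer: the stream source is injected, not looked up internally.
final class SketchWriter implements AutoCloseable {
    private final OutputStream out;

    SketchWriter(String location, StreamFactory factory) throws IOException {
        Objects.requireNonNull(factory, "factory can't be null");
        this.out = factory.create(location);
    }

    // Appends one newline-terminated row to the underlying stream.
    void writeRow(String row) throws IOException {
        out.write((row + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}

// In-memory implementation, useful for tests that should not touch a real filesystem.
final class InMemoryStreamFactory implements StreamFactory {
    private final Map<String, ByteArrayOutputStream> files = new LinkedHashMap<>();

    @Override
    public OutputStream create(String location) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        files.put(location, buffer);
        return buffer;
    }

    // Returns everything written to the given location so far, decoded as UTF-8.
    String contents(String location) {
        return new String(files.get(location).toByteArray(), StandardCharsets.UTF_8);
    }
}
```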

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ColumnarOutputWriter.scala | Updated to use RapidsFileIO for creating output streams instead of direct Hadoop filesystem calls |
| GpuFileFormatDataWriter.scala | Added lazy initialization of HadoopFileIO and updated writer factory calls |
| GpuParquetFileFormat.scala | Modified constructor signatures to accept and pass through RapidsFileIO parameter |
| GpuOrcFileFormat.scala | Updated ORC writer constructors to support new file I/O interface |
| GpuHiveFileFormat.scala | Modified Hive format writers to use RapidsFileIO parameter |
| HadoopFileIO.java | Added newOutputFile method implementation for Hadoop file system |
| HadoopOutputFile.java | New implementation providing Hadoop-specific output file operations |
| HadoopOutputStream.java | New wrapper for Hadoop FSDataOutputStream extending RapidsOutputStream |
| IcebergFileIO.java | Added newOutputFile method and minor string formatting change |
| IcebergOutputFile.java | New implementation wrapping Iceberg OutputFile for Rapids compatibility |
| IcebergOutputStream.java | New wrapper for Iceberg PositionOutputStream extending RapidsOutputStream |
| GpuSparkWrite.scala | Added lazy initialization of IcebergFileIO for Iceberg write operations |
| GpuSparkFileWriterFactory.scala | Updated constructor to accept and use IcebergFileIO parameter |
| GpuFileFormatDataWriterSuite.scala | Updated test mocks to handle additional constructor parameters |
| iceberg_append_test.py | Removed test skip condition for remote Iceberg catalog, indicating the issue is now resolved |
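
The two stream wrappers listed above both adapt a backend stream to a common base type. A rough sketch of that wrapper pattern follows, assuming the common type exposes a write position; `SketchPositionStream` and `DelegatingPositionStream` are hypothetical names, the actual RapidsOutputStream API is not shown in this conversation, and a plain `java.io.OutputStream` stands in for the Hadoop and Iceberg streams.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Objects;

final class PositionTrackingDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DelegatingPositionStream out = new DelegatingPositionStream(sink)) {
            out.write(new byte[] {1, 2, 3});
            out.write(4);
            System.out.println("position = " + out.getPos());
        }
    }
}

// Hypothetical common base type; stands in for whatever the real base class requires.
abstract class SketchPositionStream extends OutputStream {
    // Number of bytes written so far.
    abstract long getPos() throws IOException;
}

// Wraps any OutputStream and counts the bytes forwarded to it.
final class DelegatingPositionStream extends SketchPositionStream {
    private final OutputStream delegate;
    private long position;

    DelegatingPositionStream(OutputStream delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate can't be null");
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        position++;
    }

    @Override
    public void write(byte[] buffer, int offset, int length) throws IOException {
        delegate.write(buffer, offset, length);
        position += length;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    @Override
    long getPos() {
        return position;
    }
}
```

A wrapper around a backend stream that already reports its own position could forward `getPos()` to the delegate instead of counting bytes; the counter here is only needed because a plain `OutputStream` has no such method.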

   */
  public IcebergFileIO(FileIO delegate) {
-    Objects.requireNonNull(delegate, "delegate can't be null");
+    Objects.requireNonNull(delegate, "delegate can't be null!");

Copilot AI Sep 26, 2025

The error message formatting is inconsistent with other files in the codebase. Consider using 'can't be null' without the exclamation mark to match the style used in HadoopOutputFile.java and other files.

Suggested change
- Objects.requireNonNull(delegate, "delegate can't be null!");
+ Objects.requireNonNull(delegate, "delegate can't be null");

Collaborator

Why ignore this issue? The use of exclamation points is inconsistent even within this PR.

Collaborator Author

Fixed.

@liurenjie1024 liurenjie1024 marked this pull request as draft September 26, 2025 06:12
@liurenjie1024 liurenjie1024 marked this pull request as ready for review September 28, 2025 06:31
Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>
@liurenjie1024 liurenjie1024 changed the base branch from branch-25.10 to branch-25.12 September 29, 2025 08:24
@gerashegalov

Collaborator

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Oct 6, 2025
gerashegalov previously approved these changes Oct 7, 2025

@liurenjie1024

Collaborator Author

build

@liurenjie1024

Collaborator Author

Verified locally against s3tables, and it works.

@liurenjie1024

Collaborator Author

build

@liurenjie1024

Collaborator Author

cc @gerashegalov

@liurenjie1024 liurenjie1024 merged commit 86f2903 into NVIDIA:branch-25.12 Oct 10, 2025
59 of 60 checks passed
@liurenjie1024 liurenjie1024 deleted the ray/13471 branch October 10, 2025 07:02

Development

Successfully merging this pull request may close these issues.

[FEA] Use FileIO to write output files

5 participants