Use `RapidsFileIO` for writing data. by liurenjie1024 · Pull Request #13501 · NVIDIA/cudf-spark

liurenjie1024 · 2025-09-26T06:11:48Z

Description

This is blocked by NVIDIA/cudf-spark-jni#3768. This PR implements the output-related interfaces for `HadoopFileIO` and `IcebergFileIO` and uses them in `ColumnarOutputWriter`.
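As a rough illustration of the idea in this description, the sketch below shows a writer that depends only on a file I/O abstraction rather than on a concrete file system. All type names here (`SketchFileIO`, `SketchOutputFile`, `InMemoryFileIO`, `SketchWriter`) are hypothetical stand-ins invented for this example. They are not the actual `RapidsFileIO` API, and the real method signatures may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; these are NOT the real RapidsFileIO types.
interface SketchOutputFile {
  OutputStream create() throws IOException;
}

interface SketchFileIO {
  SketchOutputFile newOutputFile(String path);
}

// In-memory backend, used only so the sketch runs without Hadoop or Iceberg.
class InMemoryFileIO implements SketchFileIO {
  final Map<String, ByteArrayOutputStream> files = new HashMap<>();

  @Override
  public SketchOutputFile newOutputFile(String path) {
    return () -> {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      files.put(path, out);
      return out;
    };
  }
}

// The writer sees only the abstraction, never a concrete file system.
class SketchWriter {
  private final SketchFileIO fileIO;

  SketchWriter(SketchFileIO fileIO) {
    this.fileIO = fileIO;
  }

  void write(String path, String payload) {
    try (OutputStream out = fileIO.newOutputFile(path).create()) {
      out.write(payload.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Swapping `InMemoryFileIO` for a Hadoop-backed or Iceberg-backed implementation would not require any change to `SketchWriter`, which is the kind of decoupling the description is aiming for.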

Checklists

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Copilot

Pull Request Overview

This PR implements output functionality for RapidsFileIO to support writing data through a unified file I/O interface, addressing issue #13471. The changes integrate file I/O operations for both Hadoop and Iceberg environments to improve consistency and enable better file system abstraction.

- Added output stream and output file implementations for both Hadoop and Iceberg file systems
- Updated all columnar output writers to use `RapidsFileIO` instead of direct Hadoop filesystem calls
- Modified constructor signatures across file format writers to accept the new file I/O parameter
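The constructor change and the lazily initialized file I/O instance mentioned in this review can be pictured with the sketch below. It is only an assumption about the general shape: `SketchIO`, `Lazy`, `SketchFormatWriter`, and `SketchWriterFactory` are hypothetical names made up for this example, and the PR itself does this in Scala, where `lazy val` would replace the hand-written memoizing supplier.

```java
import java.util.function.Supplier;

// Hypothetical placeholder for a file I/O handle; not the real RapidsFileIO.
class SketchIO {
}

// Memoizes the first value produced by the wrapped supplier.
class Lazy<T> implements Supplier<T> {
  private final Supplier<T> init;
  private T value;
  private boolean ready;

  Lazy(Supplier<T> init) {
    this.init = init;
  }

  @Override
  public synchronized T get() {
    if (!ready) {
      value = init.get();
      ready = true;
    }
    return value;
  }
}

// A format writer whose constructor now accepts the file I/O handle.
class SketchFormatWriter {
  final String path;
  final SketchIO io;

  SketchFormatWriter(String path, SketchIO io) {
    this.path = path;
    this.io = io;
  }
}

// The factory creates the handle on first use and passes it to every writer.
class SketchWriterFactory {
  private final Lazy<SketchIO> io = new Lazy<>(SketchIO::new);

  SketchFormatWriter newWriter(String path) {
    return new SketchFormatWriter(path, io.get());
  }
}
```

With this shape, all writers produced by one factory share a single I/O handle, and the handle is not constructed until the first writer is requested.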

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.


| File | Description |
| --- | --- |
| `ColumnarOutputWriter.scala` | Updated to use `RapidsFileIO` for creating output streams instead of direct Hadoop filesystem calls |
| `GpuFileFormatDataWriter.scala` | Added lazy initialization of `HadoopFileIO` and updated writer factory calls |
| `GpuParquetFileFormat.scala` | Modified constructor signatures to accept and pass through `RapidsFileIO` parameter |
| `GpuOrcFileFormat.scala` | Updated ORC writer constructors to support new file I/O interface |
| `GpuHiveFileFormat.scala` | Modified Hive format writers to use `RapidsFileIO` parameter |
| `HadoopFileIO.java` | Added `newOutputFile` method implementation for Hadoop file system |
| `HadoopOutputFile.java` | New implementation providing Hadoop-specific output file operations |
| `HadoopOutputStream.java` | New wrapper for Hadoop `FSDataOutputStream` extending `RapidsOutputStream` |
| `IcebergFileIO.java` | Added `newOutputFile` method and minor string formatting change |
| `IcebergOutputFile.java` | New implementation wrapping Iceberg `OutputFile` for Rapids compatibility |
| `IcebergOutputStream.java` | New wrapper for Iceberg `PositionOutputStream` extending `RapidsOutputStream` |
| `GpuSparkWrite.scala` | Added lazy initialization of `IcebergFileIO` for Iceberg write operations |
| `GpuSparkFileWriterFactory.scala` | Updated constructor to accept and use `IcebergFileIO` parameter |
| `GpuFileFormatDataWriterSuite.scala` | Updated test mocks to handle additional constructor parameters |
| `iceberg_append_test.py` | Removed test skip condition for remote Iceberg catalog, indicating the issue is now resolved |
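The stream wrappers listed above follow a delegate pattern: a thin class forwards writes to an underlying stream. The sketch below shows that pattern with plain `java.io` types so it runs without Hadoop or Iceberg on the classpath. `PositionTrackingOutputStream` and `PositionDemo` are hypothetical names for this example; the actual `HadoopOutputStream` and `IcebergOutputStream` extend `RapidsOutputStream`, whose exact methods are not shown in this PR summary, and they would presumably rely on the `getPos()` already provided by `FSDataOutputStream` and `PositionOutputStream` rather than counting bytes themselves.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical wrapper; the real HadoopOutputStream / IcebergOutputStream may differ.
class PositionTrackingOutputStream extends OutputStream {
  private final OutputStream delegate;
  private long position;

  PositionTrackingOutputStream(OutputStream delegate) {
    this.delegate = Objects.requireNonNull(delegate, "delegate can't be null");
  }

  @Override
  public void write(int b) throws IOException {
    delegate.write(b);
    position++;
  }

  @Override
  public void write(byte[] buf, int off, int len) throws IOException {
    delegate.write(buf, off, len);
    position += len;
  }

  @Override
  public void flush() throws IOException {
    delegate.flush();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }

  // Number of bytes forwarded to the delegate so far.
  long getPos() {
    return position;
  }
}

class PositionDemo {
  // Writes the payload plus one trailing byte and reports the final position.
  static long bytesWrittenFor(byte[] payload) {
    try (PositionTrackingOutputStream out =
        new PositionTrackingOutputStream(new ByteArrayOutputStream())) {
      out.write(payload);
      out.write('!');
      return out.getPos();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because `OutputStream.write(byte[])` delegates to the three-argument overload, overriding only the two methods above is enough to count each byte exactly once.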


Copilot · 2025-09-26T06:12:14Z

   */
  public IcebergFileIO(FileIO delegate) {
-    Objects.requireNonNull(delegate, "delegate can't be null");
+    Objects.requireNonNull(delegate, "delegate can't be null!");


The error message formatting is inconsistent with other files in the codebase. Consider using 'can't be null' without the exclamation mark to match the style used in HadoopOutputFile.java and other files.

Suggested change

-    Objects.requireNonNull(delegate, "delegate can't be null!");
+    Objects.requireNonNull(delegate, "delegate can't be null");
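One way to keep such messages uniform, sketched here only as an illustration and not something this PR does, is to build the message in a single helper. `NullChecks.requireArg` is a hypothetical name; the only library call it relies on is `java.util.Objects.requireNonNull(T, String)`, which throws a `NullPointerException` carrying the given message when the value is null.

```java
import java.util.Objects;

class NullChecks {
  // Hypothetical helper: one place defines the wording, so call sites cannot drift.
  static <T> T requireArg(T value, String name) {
    return Objects.requireNonNull(value, name + " can't be null");
  }
}
```

A constructor could then write `this.delegate = NullChecks.requireArg(delegate, "delegate");` and the punctuation question raised in this thread would be settled in one line of code.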

Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>

gerashegalov · 2025-09-29T20:44:23Z

build

gerashegalov · 2025-10-06T18:44:44Z

   */
  public IcebergFileIO(FileIO delegate) {
-    Objects.requireNonNull(delegate, "delegate can't be null");
+    Objects.requireNonNull(delegate, "delegate can't be null!");


Why ignore this issue? The use of exclamation points is inconsistent even within this PR.

liurenjie1024 · 2025-10-09T06:26:46Z

build

liurenjie1024 · 2025-10-09T10:04:05Z

Verified locally against s3tables, and it works.

liurenjie1024 · 2025-10-09T10:04:09Z

build

liurenjie1024 · 2025-10-10T03:06:15Z

cc @gerashegalov

liurenjie1024 added 10 commits September 25, 2025 17:47

- `dba6323` Partial
- `02022a3` Complete
- `2ab5a8c` Fix build
- `b1d3eb1` Fix build
- `04bba3c` Fix import
- `bdfeaba` Fix Output
- `e77f0bc` Fix build
- `143705e` Fix build
- `29c55df` Fix build
- `4384bf3` Remove test skip

Copilot AI review requested due to automatic review settings September 26, 2025 06:11

Copilot AI reviewed Sep 26, 2025

liurenjie1024 marked this pull request as draft September 26, 2025 06:12

liurenjie1024 marked this pull request as ready for review September 28, 2025 06:31

- `a447033` Merge remote-tracking branch 'upstream/branch-25.10' into ray/13471

Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>

liurenjie1024 force-pushed the ray/13471 branch from 274fe46 to a447033 Compare September 28, 2025 06:32

liurenjie1024 requested review from abellina, gerashegalov and res-life September 28, 2025 06:34

liurenjie1024 added 3 commits September 28, 2025 17:08

- `c5a9f80` fix build break
- `37c870e` fix build break
- `6255752` Merge remote-tracking branch 'upstream/branch-25.10' into ray/13471

liurenjie1024 changed the base branch from branch-25.10 to branch-25.12 September 29, 2025 08:24

sameerz added the `task` label (Work required that improves the product but is not user facing) Oct 6, 2025

gerashegalov previously approved these changes Oct 7, 2025

gerashegalov assigned liurenjie1024 Oct 7, 2025

- `00c1572` Merge remote-tracking branch 'upstream/branch-25.12' into ray/13471
- `503d741` Skip set output path

liurenjie1024 dismissed gerashegalov’s stale review via 503d741 October 9, 2025 09:25

liurenjie1024 added 2 commits October 9, 2025 17:28

- `7e5033d` Fix build break
- `7cd078a` Fix test

gerashegalov approved these changes Oct 10, 2025

liurenjie1024 merged commit 86f2903 into NVIDIA:branch-25.12 Oct 10, 2025
59 of 60 checks passed

liurenjie1024 deleted the ray/13471 branch October 10, 2025 07:02

liurenjie1024 merged 18 commits into NVIDIA:branch-25.12 from liurenjie1024:ray/13471

5 participants
