Use RapidsFileIO for writing data. #13501
Pull Request Overview
This PR implements output functionality for RapidsFileIO to support writing data through a unified file I/O interface, addressing issue #13471. The changes integrate file I/O operations for both Hadoop and Iceberg environments to improve consistency and enable better file system abstraction.
- Added output stream and output file implementations for both Hadoop and Iceberg file systems
- Updated all columnar output writers to use RapidsFileIO instead of direct Hadoop filesystem calls
- Modified constructor signatures across file format writers to accept the new file I/O parameter
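As a rough illustration of the pattern these bullets describe, the sketch below shows an output-file abstraction that hands out a position-tracking output stream, plus a writer that depends only on that abstraction. Every name in it (`SketchOutputStream`, `SketchOutputFile`, `DelegatingOutputStream`, `InMemoryOutputFile`, `writeAll`) is a hypothetical stand-in and not the actual RapidsFileIO, Hadoop, or Iceberg API; the in-memory backend exists only so the example is self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Objects;

public class FileIoSketch {

  /** Hypothetical stand-in for a stream type that also reports its write position. */
  abstract static class SketchOutputStream extends OutputStream {
    abstract long getPos();
  }

  /** Hypothetical stand-in for an output file handle returned by a file I/O layer. */
  interface SketchOutputFile {
    SketchOutputStream create() throws IOException;
    String location();
  }

  /** Wraps any OutputStream and counts the bytes that pass through it. */
  static final class DelegatingOutputStream extends SketchOutputStream {
    private final OutputStream delegate;
    private long pos = 0;

    DelegatingOutputStream(OutputStream delegate) {
      this.delegate = Objects.requireNonNull(delegate, "delegate can't be null");
    }

    @Override
    public void write(int b) throws IOException {
      delegate.write(b);
      pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      delegate.write(b, off, len);
      pos += len;
    }

    @Override
    public void flush() throws IOException { delegate.flush(); }

    @Override
    public void close() throws IOException { delegate.close(); }

    @Override
    long getPos() { return pos; }
  }

  /** In-memory backend used only to keep this example self-contained. */
  static final class InMemoryOutputFile implements SketchOutputFile {
    private final String location;
    final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    InMemoryOutputFile(String location) {
      this.location = Objects.requireNonNull(location, "location can't be null");
    }

    @Override
    public SketchOutputStream create() { return new DelegatingOutputStream(buffer); }

    @Override
    public String location() { return location; }
  }

  /** A writer that depends only on the abstraction, not on a concrete file system. */
  static long writeAll(SketchOutputFile file, byte[] data) {
    try (SketchOutputStream out = file.create()) {
      out.write(data);
      out.flush();
      return out.getPos();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    InMemoryOutputFile file = new InMemoryOutputFile("memory://example.bin");
    long written = writeAll(file, new byte[] {1, 2, 3});
    System.out.println("wrote " + written + " bytes to " + file.location());
  }
}
```

Because `writeAll` sees only the interface, a Hadoop-backed or Iceberg-backed implementation could be swapped in without touching the writer, which is the kind of decoupling the constructor-signature changes in this PR appear to aim for.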
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ColumnarOutputWriter.scala | Updated to use RapidsFileIO for creating output streams instead of direct Hadoop filesystem calls |
| GpuFileFormatDataWriter.scala | Added lazy initialization of HadoopFileIO and updated writer factory calls |
| GpuParquetFileFormat.scala | Modified constructor signatures to accept and pass through RapidsFileIO parameter |
| GpuOrcFileFormat.scala | Updated ORC writer constructors to support new file I/O interface |
| GpuHiveFileFormat.scala | Modified Hive format writers to use RapidsFileIO parameter |
| HadoopFileIO.java | Added newOutputFile method implementation for Hadoop file system |
| HadoopOutputFile.java | New implementation providing Hadoop-specific output file operations |
| HadoopOutputStream.java | New wrapper for Hadoop FSDataOutputStream extending RapidsOutputStream |
| IcebergFileIO.java | Added newOutputFile method and minor string formatting change |
| IcebergOutputFile.java | New implementation wrapping Iceberg OutputFile for Rapids compatibility |
| IcebergOutputStream.java | New wrapper for Iceberg PositionOutputStream extending RapidsOutputStream |
| GpuSparkWrite.scala | Added lazy initialization of IcebergFileIO for Iceberg write operations |
| GpuSparkFileWriterFactory.scala | Updated constructor to accept and use IcebergFileIO parameter |
| GpuFileFormatDataWriterSuite.scala | Updated test mocks to handle additional constructor parameters |
| iceberg_append_test.py | Removed test skip condition for remote Iceberg catalog, indicating the issue is now resolved |
       */
      public IcebergFileIO(FileIO delegate) {
    -   Objects.requireNonNull(delegate, "delegate can't be null");
    +   Objects.requireNonNull(delegate, "delegate can't be null!");
The error message formatting is inconsistent with other files in the codebase. Consider using 'can't be null' without the exclamation mark to match the style used in HadoopOutputFile.java and other files.
    -   Objects.requireNonNull(delegate, "delegate can't be null!");
    +   Objects.requireNonNull(delegate, "delegate can't be null");
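One way to avoid this kind of drift, sketched here under the assumption that a shared helper would be acceptable in the codebase, is to build the message in a single place. The `NullCheckSketch` class and its `requireNonNull(value, name)` helper are hypothetical and are not part of this PR; only `java.util.Objects.requireNonNull(T, String)` is a real JDK call.

```java
import java.util.Objects;

public class NullCheckSketch {

  // Hypothetical helper: the message is built in one place, so wording and
  // punctuation stay uniform across every call site.
  static <T> T requireNonNull(T value, String name) {
    return Objects.requireNonNull(value, name + " can't be null");
  }

  // Returns the exception message produced for a null argument with the given name.
  static String messageFor(String name) {
    try {
      requireNonNull(null, name);
      return null; // not reached: a null value always throws
    } catch (NullPointerException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(messageFor("delegate"));
  }
}
```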
Why ignore this issue? The use of exclamation points is inconsistent even within this PR.
Signed-off-by: liurenjie1024 <liurenjie2008@gmail.com>
274fe46 to a447033
build
build
Verified locally against s3tables, and it works.
build
Fixes #13471.
Description
This is blocked by NVIDIA/cudf-spark-jni#3768. In this PR we implemented the output-related interfaces for HadoopFileIO and IcebergFileIO, and used them in ColumnarOutputWriter.
Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)