Expose cuDF Parquet writer row group size configs #14783

Merged
thirtiseven merged 7 commits into NVIDIA:main from thirtiseven:parquet-row-group-size-config-main
May 19, 2026
Conversation

@thirtiseven thirtiseven commented May 12, 2026

Collaborator

Fixes #14782.

Related to #9126.

Description

This PR exposes two internal Spark RAPIDS configs for tuning cuDF Parquet writer row group limits:

  • spark.rapids.sql.format.parquet.writer.rowGroupSizeRows
  • spark.rapids.sql.format.parquet.writer.rowGroupSizeBytes

These configs are cuDF-specific pass-through knobs and are documented as best-effort limits. They are not mapped from Spark's parquet.block.size, because cuDF sizes row groups from uncompressed size estimates and page-fragment boundaries, which does not match the exact parquet.block.size behavior of the Spark CPU writer.
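
As a rough illustration of how the two keys might be supplied, the sketch below collects them and renders spark-submit flags. The key names come from this PR; the values (1,000,000 rows, 64 MiB) are hypothetical and not recommendations, and it assumes a plain number for the bytes config is read as a byte count.

```scala
object RowGroupConfExample {
  // Config keys as named in this PR.
  val RowsKey = "spark.rapids.sql.format.parquet.writer.rowGroupSizeRows"
  val BytesKey = "spark.rapids.sql.format.parquet.writer.rowGroupSizeBytes"

  // Hypothetical targets, chosen only for illustration.
  val settings: Seq[(String, String)] = Seq(
    RowsKey -> "1000000",
    BytesKey -> (64L * 1024 * 1024).toString
  )

  // Render the settings as spark-submit command-line flags.
  def asSubmitFlags(confs: Seq[(String, String)]): String =
    confs.map { case (k, v) => s"--conf $k=$v" }.mkString(" ")
}
```

Because both configs are internal and best-effort, the resulting row groups may not match these targets exactly.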

The implementation wires the options into both the standard GPU Parquet writer and the Hive GPU Parquet writer. When the byte limit is set, the standard Parquet writer factory also uses it for partition flush sizing so concurrent output writer buffering is consistent with the configured row group byte target.
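
The flush-size behavior described above can be pictured as a simple fallback. This is a minimal sketch, not the code in the PR: `DefaultFlushBytes` is a hypothetical stand-in for whatever flush size the factory used before, and the function signature is assumed.

```scala
object FlushSizeSketch {
  // Hypothetical stand-in for the pre-existing flush size; the real default is not stated in this PR.
  val DefaultFlushBytes: Long = 128L * 1024 * 1024

  // Use the configured row group byte target when present, otherwise keep the existing default.
  def partitionFlushSize(rowGroupSizeBytes: Option[Long]): Long =
    rowGroupSizeBytes.getOrElse(DefaultFlushBytes)
}
```

With the config unset the result is unchanged, which matches the claim that the new knobs have no effect unless set.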

This PR also adds a warning when parquet.block.size is set to a non-default value for GPU Parquet writes. The warning explains that the RAPIDS GPU Parquet writer does not apply the Spark CPU writer's row group sizing semantics for parquet.block.size, points users to spark.rapids.sql.format.parquet.write.enabled=false if they require CPU writer behavior, and lists the two internal RAPIDS-specific row group tuning configs for experimentation.
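
A minimal sketch of the unset/default/non-default decision follows. The helper name `parquetBlockSizeWarning` appears in the review summary, but the signature, the message text, and the choice to ignore unparseable values are assumptions made for illustration; 128 MiB is the parquet-mr default block size.

```scala
import scala.util.Try

object BlockSizeWarningSketch {
  // parquet-mr's default row group ("block") size.
  val DefaultBlockSize: Long = 128L * 1024 * 1024

  // Returns a warning only for a parseable, non-default parquet.block.size.
  // Unset or unparseable values yield None (an assumption of this sketch).
  def parquetBlockSizeWarning(blockSize: Option[String]): Option[String] =
    blockSize
      .flatMap(v => Try(v.trim.toLong).toOption)
      .filter(_ != DefaultBlockSize)
      .map { size =>
        s"parquet.block.size=$size is not applied with CPU writer semantics by the " +
          "GPU Parquet writer. Set spark.rapids.sql.format.parquet.write.enabled=false " +
          "for CPU behavior, or experiment with " +
          "spark.rapids.sql.format.parquet.writer.rowGroupSizeRows and " +
          "spark.rapids.sql.format.parquet.writer.rowGroupSizeBytes."
      }
}
```

Returning an `Option[String]` keeps the decision testable without a logger; the caller would log the message on the driver when one is present.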

Tests were added to ParquetWriterSuite to verify that the row-based and byte-based configs affect the written Parquet row groups. The rows test uses an observable row count because cuDF does not split row groups below its page fragment granularity. The suite also covers the parquet.block.size warning helper for unset/default/non-default values.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven changed the title from "[FEA] Expose cuDF Parquet writer row group size configs" to "Expose cuDF Parquet writer row group size configs" May 12, 2026
@thirtiseven thirtiseven self-assigned this May 12, 2026
@thirtiseven

Collaborator Author

@greptile full review

greptile-apps Bot commented May 12, 2026

Contributor

Greptile Summary

This PR exposes two internal cuDF pass-through knobs — spark.rapids.sql.format.parquet.writer.rowGroupSizeRows and spark.rapids.sql.format.parquet.writer.rowGroupSizeBytes — so power users can tune GPU Parquet row-group sizing without falling back to the CPU writer. Both configs are wired into GpuParquetWriter and GpuHiveParquetWriter, partitionFlushSize is updated to honour the byte limit on both write paths, and a driver-side warning is emitted when the unrelated Spark parquet.block.size setting is detected.

  • New RapidsConf keys (PARQUET_WRITER_ROW_GROUP_SIZE_ROWS, PARQUET_WRITER_ROW_GROUP_SIZE_BYTES) are .internal(), createOptional, with input validation, and are passed through to both the standard and Hive GPU Parquet writer paths.
  • partitionFlushSize is now overridden in both GpuParquetFileFormat and GpuHiveParquetFileFormat factory classes to keep concurrent-write partition buffer flushing consistent with the configured byte target.
  • New unit tests verify row-group splitting, flush-size propagation, and the block-size warning logic; the bytes-based splitting test uses hardcoded arithmetic that ties assertions to cuDF's internal per-row estimation, which may be fragile across cuDF versions.
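
The input validation mentioned in the summary (rows must be positive, bytes must be at least 1024) could look roughly like the sketch below. The object, function names, and error messages are hypothetical; only the two bounds come from the review summary.

```scala
object RowGroupConfValidation {
  // Lower bound for the byte limit, as stated in the review summary.
  val MinRowGroupBytes: Long = 1024L

  // The row limit must be a positive integer.
  def validateRows(rows: Int): Either[String, Int] =
    if (rows > 0) Right(rows)
    else Left(s"rowGroupSizeRows must be positive, got $rows")

  // The byte limit must be at least MinRowGroupBytes.
  def validateBytes(bytes: Long): Either[String, Long] =
    if (bytes >= MinRowGroupBytes) Right(bytes)
    else Left(s"rowGroupSizeBytes must be >= $MinRowGroupBytes, got $bytes")
}
```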

Confidence Score: 5/5

Safe to merge; the new code paths add optional pass-through knobs with no effect when unset, both write paths are updated symmetrically, and the warning logic is covered by unit tests.

The functional changes are additive and gated behind optional configs that default to None, so existing Parquet write behaviour is entirely unchanged. Both the standard and Hive GPU writer paths are treated consistently. The only concern is that the bytes-splitting test ties assertions to cuDF's internal per-row estimation math, which could become a flaky failure if cuDF changes how it accounts for page overhead — but this does not affect production correctness.

tests/src/test/scala/com/nvidia/spark/rapids/ParquetWriterSuite.scala — the byte-budget test assertions are fragile; the production files are clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala | Adds two new internal configs PARQUET_WRITER_ROW_GROUP_SIZE_ROWS (integerConf, positive) and PARQUET_WRITER_ROW_GROUP_SIZE_BYTES (bytesConf, ≥1024), both createOptional and correctly documented. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetFileFormat.scala | Adds parquetBlockSizeWarning utility, wires the two new row-group configs into GpuParquetWriter, and updates partitionFlushSize to honour the byte config; logic and resource handling are correct. |
| sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala | Mirrors the standard writer: adds Logging mixin, emits the block-size warning, wires row-group configs into GpuHiveParquetWriter, and overrides partitionFlushSize; parity with GpuParquetFileFormat is correct. |
| tests/src/test/scala/com/nvidia/spark/rapids/ParquetWriterSuite.scala | Adds four new tests; the bytes test hard-codes cuDF per-row estimation math (16 bytes/row, 512-byte overhead constant) that may become fragile if cuDF internals change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[prepareWrite called] --> B{non-default BLOCK_SIZE?}
    B -- yes --> C[logWarning on driver]
    B -- no --> D[continue]
    C --> D
    D --> E[Read rowGroupSizeRows and rowGroupSizeBytes from RapidsConf]
    E --> F[ColumnarOutputWriterFactory]
    F --> G[partitionFlushSize uses rowGroupSizeBytes if set]
    F --> H{writer type}
    H -- standard --> I[GpuParquetWriter]
    H -- hive --> J[GpuHiveParquetWriter]
    I --> K[builder.withRowGroupSizeRows / withRowGroupSizeBytes]
    J --> K
    K --> L[Table.writeParquetChunked]

Reviews (6): Last reviewed commit: "block size warning"

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/ParquetWriterSuite.scala Outdated
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review May 13, 2026 06:30
val rowGroupCounts = getSingleParquetFileRowGroupCounts(spark, writePath)
assert(rowGroupCounts.length > 1, s"Expected multiple row groups, got $rowGroupCounts")
assertResult(10000L) {
  rowGroupCounts.sum
}

Collaborator

Check that each row group stays within the configured #rows and #bytes limits.

Collaborator Author

Done.

Collaborator

Do any of these tests actually check the byte size of the row group that is written? It doesn't look like it?

Collaborator Author

Good catch. Yesterday I avoided a direct totalByteSize <= rowGroupSizeBytes assertion because BlockMetaData.getTotalByteSize includes Parquet page/encoding overhead; with a 1024-byte limit I saw footer sizes like 1064 bytes even though cuDF was honoring the limit based on its uncompressed data-size estimate.

I updated the test to check both now: the estimated data bytes used for cuDF row-group splitting, and the actual footer totalByteSize with a small 512-byte overhead allowance.
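
The two-part check described here can be sketched as follows. This is not the test code from the PR: the `RowGroup` case class and `withinBudget` helper are hypothetical, and the 16 bytes/row estimate and 512-byte allowance are the constants quoted in the review summary for the test's schema.

```scala
object RowGroupBudgetCheck {
  // Hypothetical summary of one row group as read back from the Parquet footer.
  final case class RowGroup(rowCount: Long, totalByteSize: Long)

  // Constants quoted in the review summary; they depend on the test schema and cuDF internals.
  val EstimatedBytesPerRow: Long = 16L
  val OverheadAllowance: Long = 512L

  // A row group passes if its estimated data size is within the limit and
  // its footer totalByteSize is within the limit plus the overhead allowance.
  def withinBudget(rg: RowGroup, limitBytes: Long): Boolean = {
    val estimatedDataBytes = rg.rowCount * EstimatedBytesPerRow
    estimatedDataBytes <= limitBytes &&
      rg.totalByteSize <= limitBytes + OverheadAllowance
  }
}
```

For the case mentioned above, a 1024-byte limit with a 1064-byte footer size passes as long as the row group holds at most 64 rows under the 16 bytes/row estimate.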

@thirtiseven

Collaborator Author

build

@thirtiseven thirtiseven merged commit bdf9542 into NVIDIA:main May 19, 2026
49 checks passed
@thirtiseven thirtiseven deleted the parquet-row-group-size-config-main branch May 19, 2026 05:22
Development

Successfully merging this pull request may close these issues.

[FEA] Expose cuDF Parquet writer row group size controls

3 participants