Support kudo GPU shuffle reads in the plugin by zpuller · Pull Request #13489 · NVIDIA/cudf-spark

zpuller · 2025-09-24T20:41:03Z

Description

This PR adds a disabled-by-default optional feature to perform Kudo deserialization on the GPU. It also adds a unit test to validate this new behavior.

I'm currently planning to merge this and then open a follow-up PR that either enables the behavior by default or enables it conditionally based on internal metrics, depending on the remaining performance testing results.

Checklists

The test query of interest shows a 10% performance gain with this feature enabled.

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2025-10-11T22:13:57Z

build

Signed-off-by: Zach Puller <zpuller@nvidia.com>

binmahone · 2025-10-24T01:35:47Z

Hi @zpuller, can you point out which workload has the 10% improvement, and can you attach a detailed perf report?

binmahone · 2025-10-28T01:35:45Z

I left my comment @zpuller .

Also FYI, we recently enabled kudo GPU write by accident in our customer environment and found that many queries failed at shuffle. We're still investigating why. But that won't be a showstopper for this PR (since all GPU kudo features are disabled by default).

zpuller · 2025-10-28T15:53:33Z

Ok, thanks. I tested the GPU kudo writes against NDS and didn't see any failures or incorrect results, but if there are other failures we can investigate.

greptile-apps

Greptile Overview

Greptile Summary

This PR implements GPU-side Kudo deserialization for shuffle reads, splitting the previous SHUFFLE_KUDO_MODE config into separate read and write modes. The implementation includes safety checks to prevent incompatible configurations (async reads + GPU deserialization), comprehensive refactoring of the shuffle coalesce iterator hierarchy, and proper handling of edge cases like zero-column batches.

Key changes:

Added SHUFFLE_KUDO_READ_MODE config (defaulting to CPU for backward compatibility)
Implemented KudoGpuTableOperator and KudoGpuShuffleCoalesceIterator for GPU deserialization
Refactored iterator hierarchy with new base classes CoalesceIteratorBase and GpuCoalesceIteratorBase
Added resolveKudoMode() logic to prevent conflicts between async reads and GPU deserialization (downgrades to CPU with warning)
Updated all references from old SHUFFLE_KUDO_MODE to SHUFFLE_KUDO_WRITE_MODE
Added comprehensive unit test validating the full GPU shuffle round-trip
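The downgrade rule described for resolveKudoMode() can be sketched as follows. This is a minimal illustration, not the plugin's code: the `KudoMode` type, the tuple return shape, and the warning text are assumptions made for the example, and only the rule itself (GPU mode plus async read falls back to CPU with a warning) comes from the summary above.

```scala
object KudoModeSketch {
  // Hypothetical stand-in for the plugin's read-mode setting.
  sealed trait KudoMode
  case object Cpu extends KudoMode
  case object Gpu extends KudoMode

  // Returns the mode to use plus an optional warning for the caller to log.
  // GPU deserialization combined with async reads is downgraded to CPU.
  def resolveKudoMode(requested: KudoMode, useAsync: Boolean): (KudoMode, Option[String]) =
    (requested, useAsync) match {
      case (Gpu, true) =>
        (Cpu, Some("async shuffle read is enabled; using CPU kudo deserialization instead of GPU"))
      case (mode, _) =>
        (mode, None)
    }
}
```

Returning the warning instead of logging it keeps the sketch free of any logging dependency; the real code is described as logging directly.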

Previous review issues addressed:
All issues from previous review comments have been fixed in subsequent commits:

✅ Doc string corrected to say "deserialize shuffle inputs"
✅ Test now sets both write and read mode to GPU
✅ Zero-column batch case properly handled in KudoGpuTableOperator
✅ Async read incompatibility resolved via resolveKudoMode() logic

Confidence Score: 5/5

This PR is safe to merge with high confidence - all previous issues have been addressed and the implementation includes proper safety checks
The code demonstrates thorough engineering: all previous review comments have been addressed, async read conflicts are handled gracefully with downgrade logic, edge cases like zero-column batches are properly handled, comprehensive testing validates the GPU deserialization path, and the refactoring maintains clear separation between host and GPU paths with appropriate type safety through proper iterator casting
No files require special attention - all previous issues have been resolved and the implementation is complete

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	5/5	Adds new `SHUFFLE_KUDO_READ_MODE` config to control GPU/CPU deserialization, splitting from previous combined `SHUFFLE_KUDO_MODE`
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala	5/5	Updates shuffle coalesce logic to handle GPU-deserialized batches, returning tuple to track whether batches are host-serialized or GPU batches
tests/src/test/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceSuite.scala	5/5	Adds comprehensive test for GPU kudo deserialization path, properly configuring both read and write modes
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	5/5	Major refactoring adding GPU deserialization support: new `KudoGpuTableOperator`, `KudoGpuShuffleCoalesceIterator`, and `GpuCoalesceIteratorBase`, plus safety logic in `CoalesceReadOption.resolveKudoMode()`

Sequence Diagram

sequenceDiagram
    participant User as User Query
    participant Exec as GpuShuffleCoalesceExec
    participant Config as CoalesceReadOption
    participant KudoIter as KudoGpuShuffleCoalesceIterator
    participant Operator as KudoGpuTableOperator
    participant GPU as GPU/cuDF
    
    User->>Exec: Execute shuffle read
    Exec->>Config: Create CoalesceReadOption(conf)
    Config->>Config: resolveKudoMode(kudoMode, useAsync)
    alt useAsync=true && kudoMode=GPU
        Config-->>Config: Downgrade to CPU mode (log warning)
    end
    Config-->>Exec: readOption with resolved mode
    
    alt kudoEnabled && kudoMode=GPU
        Exec->>KudoIter: Create GPU coalesce iterator
        KudoIter->>KudoIter: Buffer KudoSerializedTableColumn batches
        KudoIter->>Operator: concat(columns)
        alt numCols = 0
            Operator-->>KudoIter: Empty ColumnarBatch
        else
            Operator->>GPU: Copy to device memory
            Operator->>GPU: KudoGpuSerializer.assembleFromDeviceRaw()
            GPU-->>Operator: cuDF Table
            Operator-->>KudoIter: ColumnarBatch (GPU)
        end
        KudoIter-->>Exec: GPU ColumnarBatch
    else kudoEnabled && kudoMode=CPU
        Exec->>Exec: Use host deserialization path
    end
    
    Exec-->>User: Return deserialized batches

3 files reviewed, no comments


zpuller · 2025-10-31T22:39:02Z

build

binmahone · 2025-11-03T06:22:17Z

I left my comment @zpuller .

Also FYI, we recently enabled kudo GPU write by accident in our customer environment and found that many queries failed at shuffle. We're still investigating why. But that won't be a showstopper for this PR (since all GPU kudo features are disabled by default).

The known issue is #13663, which is merged.
The other issues seem unrelated to GPU kudo.

So I think this PR is good to go.

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds GPU-based Kudo deserialization to complement the existing CPU-based approach. The feature is disabled by default via the new spark.rapids.shuffle.kudo.serializer.read.mode configuration (defaults to CPU).
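As a rough illustration of opting in, the setting can be expressed as a plain key/value pair. The key is the one named above; the value string `"GPU"` is assumed from the CPU/GPU options mentioned in this thread, and the `KudoReadConfSketch` object is purely hypothetical scaffolding.

```scala
object KudoReadConfSketch {
  // Settings to pass to Spark however the deployment normally supplies confs.
  // The read mode defaults to CPU, so GPU deserialization must be requested explicitly.
  val gpuReadConf: Map[String, String] = Map(
    "spark.rapids.shuffle.kudo.serializer.read.mode" -> "GPU"
  )
}
```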

Key Changes

Configuration split: Renamed SHUFFLE_KUDO_MODE to SHUFFLE_KUDO_WRITE_MODE and added separate SHUFFLE_KUDO_READ_MODE config for independent control of serialization vs deserialization
New iterator classes: Added KudoGpuShuffleCoalesceIterator and GpuGpuShuffleCoalesceIterator to handle GPU deserialization path, producing ColumnarBatch directly on GPU rather than CoalescedHostResult
Type safety: Refactored iterator hierarchy with proper type parameters to distinguish between host (CoalescedHostResult) and GPU (ColumnarBatch) results
Conflict resolution: Added resolveKudoMode() to automatically override GPU mode to CPU when async read is enabled, preventing incompatible configuration combinations
Hash join integration: Updated GpuShuffledHashJoinExec to handle both GPU batches and host results with proper type discrimination via isHostSerialized flag
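The isHostSerialized discrimination in the hash join can be pictured with a small sketch. `HostResult` and `GpuBatch` are hypothetical stand-ins for CoalescedHostResult and a GPU ColumnarBatch, and the function names are invented for the example; only the idea of returning an iterator together with a boolean flag is taken from the summary.

```scala
object JoinInputSketch {
  // Hypothetical stand-ins for CoalescedHostResult and a GPU ColumnarBatch.
  final case class HostResult(rows: Int) extends AutoCloseable { def close(): Unit = () }
  final case class GpuBatch(rows: Int) extends AutoCloseable { def close(): Unit = () }

  // Returns the coalesced input plus a flag: true means the elements are still
  // host-serialized, false means they are already GPU batches.
  def shuffleCoalesceIterator(
      gpuKudoRead: Boolean,
      rowCounts: Seq[Int]): (Iterator[AutoCloseable], Boolean) =
    if (gpuKudoRead) (rowCounts.iterator.map(n => GpuBatch(n)), false)
    else (rowCounts.iterator.map(n => HostResult(n)), true)

  // The consumer casts based on the flag and only "uploads" host results.
  def toGpu(input: (Iterator[AutoCloseable], Boolean)): Iterator[GpuBatch] = {
    val (iter, isHostSerialized) = input
    if (isHostSerialized) {
      iter.map { r =>
        val host = r.asInstanceOf[HostResult]
        try GpuBatch(host.rows) finally host.close()
      }
    } else {
      iter.map(_.asInstanceOf[GpuBatch])
    }
  }
}
```

The casts are safe only because the flag and the element type are produced together in one place, which is the trade-off of erasing the element type to `AutoCloseable`.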

Test Coverage

New GpuShuffleCoalesceSuite validates the GPU deserialization path by performing a full shuffle round-trip with GPU mode enabled and comparing results against expected output.

Confidence Score: 4/5

This PR is safe to merge with minor style improvements recommended
The implementation is well-designed with proper type safety, conflict resolution, and test coverage. Previous critical issues (test configuration, zero-column handling, async/GPU conflicts) have all been addressed. Only one minor style issue found (missing space in doc string). The feature is disabled by default, reducing risk.
No files require special attention - all previously identified issues have been resolved

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	4/5	Adds new `SHUFFLE_KUDO_READ_MODE` config to control GPU/CPU deserialization separately from write mode. Renames `SHUFFLE_KUDO_MODE` to `SHUFFLE_KUDO_WRITE_MODE` and adds helper method `shuffleKudoGpuSerializerReadEnabled`. Minor doc formatting issue found.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	4/5	Major refactoring to support GPU-based Kudo deserialization. Adds new iterator classes (`KudoGpuShuffleCoalesceIterator`, `GpuGpuShuffleCoalesceIterator`), new table operator (`KudoGpuTableOperator`), and refactored base classes. Includes conflict resolution logic between async mode and GPU mode.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala	4/5	Updates shuffle coalesce iterator logic to handle both GPU batches and host results. Renames `getHostShuffleCoalesceIterator` to `getShuffleCoalesceIterator` and returns tuple with boolean flag indicating result type. Properly integrates GPU Kudo mode into hash join execution.
tests/src/test/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceSuite.scala	5/5	New test suite validating GPU Kudo deserialization. Test properly configures both write and read modes to GPU, performs full shuffle round-trip, and validates data correctness. Well-structured test with proper resource management.

Sequence Diagram

sequenceDiagram
    participant Exec as GpuShuffleCoalesceExec
    participant Utils as GpuShuffleCoalesceUtils
    participant ReadOpt as CoalesceReadOption
    participant KudoGpuIter as KudoGpuShuffleCoalesceIterator
    participant TableOp as KudoGpuTableOperator
    participant GpuIter as GpuGpuShuffleCoalesceIterator
    participant GPU as GPU Memory

    Exec->>ReadOpt: apply(conf)
    ReadOpt->>ReadOpt: resolveKudoMode()
    Note over ReadOpt: Overrides GPU to CPU if async enabled
    ReadOpt-->>Exec: CoalesceReadOption

    Exec->>Utils: getGpuShuffleCoalesceIterator()
    
    alt kudoMode == GPU
        Utils->>KudoGpuIter: new KudoGpuShuffleCoalesceIterator()
        Note over KudoGpuIter: Produces ColumnarBatch (GPU)
    else kudoMode == CPU
        Utils->>Utils: new KudoHostShuffleCoalesceIterator()
        Note over Utils: Produces CoalescedHostResult
    end

    alt prefetchFirstBatch
        Utils->>Utils: Buffer first batch
    end

    alt useAsync == true
        Utils->>Utils: new GpuShuffleAsyncCoalesceIterator()
        Note over Utils: Always uses CoalescedHostResult
    else kudoMode == GPU
        Utils->>GpuIter: new GpuGpuShuffleCoalesceIterator()
    else 
        Utils->>Utils: new GpuShuffleCoalesceIterator()
    end

    loop For each shuffle partition
        GpuIter->>KudoGpuIter: next()
        KudoGpuIter->>TableOp: concat(KudoSerializedTableColumn[])
        TableOp->>GPU: Copy to device memory
        TableOp->>GPU: assembleFromDeviceRaw()
        GPU-->>TableOp: GPU Table
        TableOp-->>KudoGpuIter: ColumnarBatch
        KudoGpuIter-->>GpuIter: ColumnarBatch (GPU)
        GpuIter-->>Exec: ColumnarBatch
    end

3 files reviewed, 1 comment


Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

Adds GPU-based Kudo deserialization as an optional disabled-by-default feature for shuffle reads, complementing existing CPU-based deserialization.

Key changes:

Splits SHUFFLE_KUDO_MODE config into separate SHUFFLE_KUDO_WRITE_MODE and SHUFFLE_KUDO_READ_MODE configs to allow independent control
Implements KudoGpuTableOperator and KudoGpuShuffleCoalesceIterator for GPU-side deserialization
Adds conflict resolution to disable async reads when GPU kudo mode is enabled (with warning)
Refactors iterator hierarchy with CoalesceIteratorBase and GpuMetricIteratorBase to reduce duplication
Updates test suite to verify GPU deserialization path

All issues from previous comments have been addressed:

Doc string corrected to say "deserialize shuffle inputs"
Test now sets both write and read modes to GPU
Zero-column batch handling implemented in KudoGpuTableOperator.concat()
Async/GPU mode conflict resolved with resolveUseAsync() function
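In this revision the conflict is resolved in the other direction: GPU mode wins and async read is turned off. A minimal sketch of that rule, with an assumed signature (the real resolveUseAsync() parameters are not shown in this thread) and a caller-supplied `warn` callback standing in for the logger:

```scala
object UseAsyncSketch {
  // Returns the effective async setting. When GPU kudo read is requested and
  // async read is also on, async is disabled and a warning is emitted.
  def resolveUseAsync(gpuKudoRead: Boolean, useAsync: Boolean, warn: String => Unit): Boolean =
    if (gpuKudoRead && useAsync) {
      warn("GPU kudo read mode is incompatible with async shuffle read; disabling async read")
      false
    } else {
      useAsync
    }
}
```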

Confidence Score: 5/5

Safe to merge - all previously identified issues have been resolved
All critical issues from previous review comments have been fixed: doc strings corrected, test properly configured for GPU read mode, zero-column batches handled, and async/GPU mode conflict resolved with proper warning. The implementation follows existing patterns with good abstractions.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	5/5	Adds GPU kudo deserialization support with proper async conflict resolution. Previous comments addressed the async/GPU mode conflict and test coverage.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	5/5	Splits kudo serializer mode into separate read/write configs. Doc string already corrected in previous comments.

Sequence Diagram

sequenceDiagram
    participant Exec as GpuShuffleCoalesceExec
    participant ReadOpt as CoalesceReadOption
    participant Utils as GpuShuffleCoalesceUtils
    participant KudoGpuIter as KudoGpuShuffleCoalesceIterator
    participant KudoGpuOp as KudoGpuTableOperator
    participant MetricIter as GpuColumnarBatchMetricIterator
    
    Exec->>ReadOpt: apply(conf)
    ReadOpt->>ReadOpt: resolveUseAsync(kudoMode, useAsync)
    Note over ReadOpt: If kudoMode=GPU & useAsync=true<br/>logs warning, sets useAsync=false
    ReadOpt-->>Exec: CoalesceReadOption(kudoEnabled, kudoMode, useAsync)
    
    Exec->>Utils: getGpuShuffleCoalesceIterator(iter, targetSize, dataTypes, readOption, metricsMap)
    
    alt kudoEnabled && kudoMode == GPU
        Utils->>KudoGpuIter: new KudoGpuShuffleCoalesceIterator(...)
        KudoGpuIter->>KudoGpuOp: new KudoGpuTableOperator(dataTypes)
        
        alt useAsync == false (enforced by resolveUseAsync)
            Utils->>MetricIter: new GpuColumnarBatchMetricIterator(iter)
            Note over MetricIter: Wraps GPU coalesced batches<br/>with metrics tracking
        end
    else kudoEnabled && kudoMode == CPU
        Utils->>Utils: new KudoHostShuffleCoalesceIterator(...)
        Note over Utils: Host-side kudo deserialization
    else !kudoEnabled
        Utils->>Utils: new HostShuffleCoalesceIterator(...)
        Note over Utils: Traditional JCudf deserialization
    end
    
    Note over KudoGpuIter,KudoGpuOp: GPU Deserialization Path
    KudoGpuIter->>KudoGpuIter: bufferNextBatch()
    KudoGpuIter->>KudoGpuIter: concatenateTablesInGpu()
    KudoGpuIter->>KudoGpuOp: concat(kudoTables[])
    KudoGpuOp->>KudoGpuOp: Copy headers & data to host buffers
    KudoGpuOp->>KudoGpuOp: Copy to device buffers
    KudoGpuOp->>KudoGpuOp: KudoGpuSerializer.assembleFromDeviceRaw()
    KudoGpuOp-->>KudoGpuIter: ColumnarBatch (GPU)
    
    KudoGpuIter-->>MetricIter: ColumnarBatch
    MetricIter->>MetricIter: Track output metrics
    MetricIter-->>Exec: ColumnarBatch (GPU)

2 files reviewed, no comments


Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds GPU-based Kudo deserialization for shuffle reads, allowing the plugin to deserialize shuffle data directly on the GPU rather than on the CPU. The implementation introduces a new KudoGpuShuffleCoalesceIterator and KudoGpuTableOperator that handle GPU-side deserialization, mirroring the existing CPU-based Kudo implementation.

Key changes:

Added KudoBuffers case class to manage paired data/offsets buffers for Kudo format
Introduced KudoGpuTableOperator that performs deserialization on GPU using KudoGpuSerializer.assembleFromDeviceRaw()
Refactored iterator hierarchy with new base classes CoalesceIteratorBase and GpuMetricIteratorBase to share common logic
Added resolveUseAsync() to prevent incompatible async+GPU-kudo configuration (logs warning and disables async)
Created GpuColumnarBatchMetricIterator to track metrics for GPU-deserialized batches
Properly handles edge cases: zero-column batches, resource cleanup, type safety with explicit casts
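The idea of a case class that owns a paired data buffer and offsets buffer can be sketched without any cuDF types. Everything here is hypothetical: `TrackedBuffer` merely records whether it was closed, and `PairedBuffers` is not the real KudoBuffers definition, whose fields are not shown in this thread.

```scala
object KudoBuffersSketch {
  // Hypothetical buffer that only tracks its size and whether it was closed.
  final class TrackedBuffer(val size: Long) extends AutoCloseable {
    private var closed = false
    def isClosed: Boolean = closed
    def close(): Unit = { closed = true }
  }

  // Owns both buffers; closing the pair closes each one, even if the first close throws.
  final case class PairedBuffers(data: TrackedBuffer, offsets: TrackedBuffer)
      extends AutoCloseable {
    def close(): Unit = try data.close() finally offsets.close()
  }

  // Allocates the pair, releasing the first buffer if the second allocation fails.
  def allocate(dataBytes: Long, offsetsBytes: Long): PairedBuffers = {
    val data = new TrackedBuffer(dataBytes)
    try PairedBuffers(data, new TrackedBuffer(offsetsBytes))
    catch { case t: Throwable => data.close(); throw t }
  }
}
```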

The implementation follows the existing patterns in the codebase and includes comprehensive test coverage that validates the full serialization/deserialization round-trip.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation demonstrates excellent engineering practices: proper edge case handling (zero-column batches, async mode conflicts), clean refactoring with well-defined abstractions (base classes for code reuse), appropriate resource management (AutoCloseable, withResource), and comprehensive test coverage validating the full shuffle round-trip. The code correctly prevents the async+GPU-kudo combination through resolveUseAsync(), ensuring type safety throughout the pipeline.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	5/5	Adds GPU-based Kudo deserialization with proper handling of edge cases (zero-column batches, async mode conflicts), well-structured class hierarchy, and appropriate resource management

Sequence Diagram

sequenceDiagram
    participant User as Shuffle Read
    participant Utils as GpuShuffleCoalesceUtils
    participant GPU_Iter as KudoGpuShuffleCoalesceIterator
    participant GPU_Op as KudoGpuTableOperator
    participant Metric_Iter as GpuColumnarBatchMetricIterator
    participant Host_Iter as KudoHostShuffleCoalesceIterator
    participant Host_Op as KudoTableOperator
    
    User->>Utils: getGpuShuffleCoalesceIterator(kudoMode=GPU)
    Utils->>Utils: Check kudoEnabled && kudoMode==GPU
    alt GPU Kudo Mode
        Utils->>GPU_Iter: Create KudoGpuShuffleCoalesceIterator
        GPU_Iter->>GPU_Iter: Collect batches up to targetSize
        GPU_Iter->>GPU_Op: concat(serialized columns)
        GPU_Op->>GPU_Op: Allocate host buffers
        GPU_Op->>GPU_Op: Copy to device buffers
        GPU_Op->>GPU_Op: KudoGpuSerializer.assembleFromDeviceRaw()
        GPU_Op-->>GPU_Iter: ColumnarBatch
        GPU_Iter-->>Utils: Iterator[ColumnarBatch]
        Utils->>Metric_Iter: Wrap in GpuColumnarBatchMetricIterator
        Metric_Iter-->>User: Iterator[ColumnarBatch] with metrics
    else CPU Kudo Mode
        Utils->>Host_Iter: Create KudoHostShuffleCoalesceIterator
        Host_Iter->>Host_Iter: Collect batches up to targetSize
        Host_Iter->>Host_Op: concat(serialized columns)
        Host_Op->>Host_Op: KudoSerializer.mergeOnHost()
        Host_Op-->>Host_Iter: CoalescedHostResult
        Host_Iter-->>Utils: Iterator[CoalescedHostResult]
        Utils-->>User: Transfer to GPU in next stage
    end

1 file reviewed, no comments


Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR implements GPU-side Kudo deserialization for shuffle reads, providing a new execution path that deserializes shuffle data directly on the GPU rather than on the host.

Key Changes

New GPU deserialization path: Introduces KudoGpuTableOperator and KudoGpuShuffleCoalesceIterator that perform Kudo deserialization entirely on the GPU by copying serialized data to device memory and using KudoGpuSerializer.assembleFromDeviceRaw()
Configuration: Adds SHUFFLE_KUDO_READ_MODE config (CPU/GPU) separate from write mode, with proper validation that disables async reads when GPU mode is enabled (they're incompatible)
Refactored iterator hierarchy: Extracts common coalescing logic into CoalesceIteratorBase, with specialized subclasses for host (HostCoalesceIteratorBase) and GPU (GpuCoalesceIteratorBase) processing
Join integration: Updates GpuShuffledHashJoinExec to handle both host and GPU deserialization paths by tracking whether batches are host-serialized vs already on GPU
Testing: Comprehensive new test suite validates end-to-end GPU deserialization with proper configuration
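The shape of the refactored hierarchy, with shared buffering logic in a base class and the output type left to subclasses, can be sketched roughly as below. The class and method names (`CoalesceBase`, `sizeOf`, `concat`) and the use of strings as stand-in batches are assumptions for illustration; the real CoalesceIteratorBase is not reproduced in this thread.

```scala
object CoalesceBaseSketch {
  // Shared logic: gather inputs until the next one would exceed targetSize,
  // then hand the group to a subclass-specific concat.
  abstract class CoalesceBase[In, Out](input: Iterator[In], targetSize: Long)
      extends Iterator[Out] {
    protected def sizeOf(item: In): Long
    protected def concat(items: Seq[In]): Out

    private val buffered = input.buffered

    def hasNext: Boolean = buffered.hasNext

    def next(): Out = {
      if (!buffered.hasNext) throw new NoSuchElementException("no more batches")
      // Always take at least one item, even if it alone exceeds the target.
      val group = scala.collection.mutable.ArrayBuffer(buffered.next())
      var total = sizeOf(group.head)
      while (buffered.hasNext && total + sizeOf(buffered.head) <= targetSize) {
        val item = buffered.next()
        total += sizeOf(item)
        group += item
      }
      concat(group.toSeq)
    }
  }

  // "Host-like" subclass: output stays in the input representation.
  final class ConcatStrings(in: Iterator[String], target: Long)
      extends CoalesceBase[String, String](in, target) {
    protected def sizeOf(item: String): Long = item.length.toLong
    protected def concat(items: Seq[String]): String = items.mkString
  }

  // "GPU-like" subclass: same buffering, different output type.
  final class TotalLength(in: Iterator[String], target: Long)
      extends CoalesceBase[String, Int](in, target) {
    protected def sizeOf(item: String): Long = item.length.toLong
    protected def concat(items: Seq[String]): Int = items.map(_.length).sum
  }
}
```

Making the output a type parameter is what lets the host path produce one result type and the GPU path another while sharing the batching loop.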

Issues Found

All critical issues from previous review comments have been addressed:

✅ Doc string corrected (SHUFFLE_KUDO_READ_MODE now says "deserialize shuffle inputs")
✅ Test now sets both write and read modes to GPU
✅ Async + GPU mode conflict resolved with resolveUseAsync() that disables async and logs warning
✅ Zero-column batches properly handled in KudoGpuTableOperator.concat()

The implementation correctly separates host and GPU paths throughout the codebase with appropriate type handling.

Confidence Score: 4/5

This PR is safe to merge with minor risk - all previously identified critical issues have been resolved
Score of 4 reflects that this is a complex feature addition touching shuffle execution paths, but the implementation is well-structured with proper error handling. All critical bugs from previous reviews (async/GPU incompatibility, test configuration, zero-column handling) have been fixed. The feature is disabled by default which reduces risk. Minor deduction because this introduces a new execution path that should be thoroughly performance-tested before enabling by default.
Pay close attention to GpuShuffleCoalesceExec.scala due to the complex iterator hierarchy refactoring and new GPU deserialization logic

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	4/5	Adds GPU-side Kudo deserialization with new `KudoGpuTableOperator` and `KudoGpuShuffleCoalesceIterator` classes; refactors iterator hierarchy; properly handles async/GPU mode conflicts. Previous critical issues (zero-column handling, test configuration) have been fixed.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala	5/5	Updates join execution to support both host and GPU kudo deserialization paths by returning tuple `(Iterator[AutoCloseable], Boolean)` to distinguish between host and GPU batches; correctly handles type casting based on `isHostSerialized` flag.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala	5/5	Adds new `SHUFFLE_KUDO_READ_MODE` configuration to control Kudo deserialization location (CPU vs GPU); splits previous `shuffleKudoMode` into separate read/write modes; doc strings corrected in recent commits.
tests/src/test/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceSuite.scala	5/5	New test suite with comprehensive test for GPU kudo deserialization end-to-end; correctly configures both `SHUFFLE_KUDO_WRITE_MODE` and `SHUFFLE_KUDO_READ_MODE` to GPU; validates round-trip serialization/deserialization.

Sequence Diagram

sequenceDiagram
    participant Exec as GpuShuffleCoalesceExec
    participant CoalOpt as CoalesceReadOption
    participant KudoGpuIter as KudoGpuShuffleCoalesceIterator
    participant KudoGpuOp as KudoGpuTableOperator
    participant Host as HostMemoryBuffer
    participant Device as DeviceMemoryBuffer
    participant Kudo as KudoGpuSerializer
    participant Output as ColumnarBatch

    Note over Exec,CoalOpt: Configuration Phase
    Exec->>CoalOpt: apply(conf)
    CoalOpt->>CoalOpt: resolveUseAsync()<br/>(disables async if GPU mode)
    CoalOpt-->>Exec: CoalesceReadOption(kudoMode=GPU, useAsync=false)

    Note over Exec,KudoGpuIter: Iterator Creation
    Exec->>KudoGpuIter: new KudoGpuShuffleCoalesceIterator(iter, targetSize, dataTypes)
    KudoGpuIter->>KudoGpuOp: new KudoGpuTableOperator(dataTypes)

    Note over KudoGpuIter,Output: Batch Processing Loop
    loop For each batch
        KudoGpuIter->>KudoGpuIter: bufferNextBatch()<br/>(collect tables up to target size)
        KudoGpuIter->>KudoGpuIter: extractAndUpdateBatch()
        KudoGpuIter->>KudoGpuOp: concat(kudoTables[])
        
        alt numCols == 0
            KudoGpuOp-->>Output: new ColumnarBatch(empty, rowCount)
        else numCols > 0
            KudoGpuOp->>Host: allocate data and offsets buffers
            KudoGpuOp->>Host: copy kudo headers and table data
            KudoGpuOp->>Device: allocate device buffers
            KudoGpuOp->>Device: copyFromHostBuffer()
            KudoGpuOp->>Kudo: assembleFromDeviceRaw(schema, dataDev, offsetsDev)
            Kudo-->>KudoGpuOp: cudf Table
            KudoGpuOp->>Output: GpuColumnVector.from(table, dataTypes)
            KudoGpuOp-->>KudoGpuIter: ColumnarBatch (on GPU)
        end
        
        KudoGpuIter-->>Exec: ColumnarBatch (GPU-side)
    end

1 file reviewed, no comments


Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR implements GPU-side Kudo deserialization for shuffle reads, adding a disabled-by-default feature controlled by spark.rapids.shuffle.kudo.serializer.read.mode.

Key Changes:

Introduces KudoGpuTableOperator that performs deserialization directly on the GPU
Adds KudoGpuShuffleCoalesceIterator and supporting iterator base classes to handle GPU-path coalescing
Creates KudoBuffers case class to manage paired host/device buffers for Kudo format
Implements conflict resolution between async read and GPU kudo mode (GPU mode takes precedence with warning)
Properly handles zero-column batches
Adds comprehensive unit test that validates end-to-end GPU deserialization
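The zero-column case mentioned above can be illustrated with a small sketch: when the schema has no columns there is nothing to assemble on the device, so only the row count needs to survive. The types `SerializedTable` and `Batch` and the `assemble` callback are hypothetical; the real KudoGpuTableOperator.concat() signature is not shown in this thread.

```scala
object ZeroColumnSketch {
  // Hypothetical stand-ins for a serialized kudo table and an output batch.
  final case class SerializedTable(numRows: Int, payload: Array[Byte])
  final case class Batch(numCols: Int, numRows: Int)

  // With zero columns, skip device assembly entirely and return a batch
  // that carries only the summed row count.
  def concat(
      numCols: Int,
      tables: Seq[SerializedTable],
      assemble: Seq[SerializedTable] => Batch): Batch =
    if (numCols == 0) Batch(0, tables.map(_.numRows).sum)
    else assemble(tables)
}
```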

Previous Review Feedback:
Previous comments flagged potential issues with async read compatibility and test configuration. These have been properly addressed:

Async/GPU kudo conflict is handled via resolveUseAsync() which disables async when GPU mode is enabled (GpuShuffleCoalesceExec.scala:117-126)
Test correctly configures both write and read modes to GPU (GpuShuffleCoalesceSuite.scala:108-109)
Zero-column handling is implemented (GpuShuffleCoalesceExec.scala:358-360)

Confidence Score: 4/5

This PR is safe to merge with minor risk - implementation is sound and previous concerns have been addressed
The implementation properly handles GPU deserialization with good separation of concerns. The async/GPU kudo conflict resolution prevents runtime issues. Zero-column edge cases are handled. The comprehensive test validates the full serialization round-trip. Previous reviewer concerns have been properly addressed. Score is 4 instead of 5 because this is a complex feature touching critical shuffle code paths, and while well-implemented, would benefit from additional performance validation in production.
No files require special attention - the implementation is well-structured and previous issues have been resolved

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	4/5	Adds GPU-side Kudo deserialization with new KudoGpuTableOperator and iterator classes. Includes proper async conflict resolution and zero-column handling. Previous reviewer concerns about async compatibility have been addressed.

Sequence Diagram

sequenceDiagram
    participant Client as Shuffle Reader
    participant Utils as GpuShuffleCoalesceUtils
    participant Coalesce as KudoGpuShuffleCoalesceIterator
    participant Op as KudoGpuTableOperator
    participant GPU as GPU Memory/Serializer
    
    Client->>Utils: getGpuShuffleCoalesceIterator(kudoMode=GPU)
    Utils->>Utils: Check readOption.kudoMode
    Utils->>Coalesce: Create KudoGpuShuffleCoalesceIterator
    
    loop For each batch request
        Client->>Coalesce: next()
        Coalesce->>Coalesce: bufferNextBatch() - collect tables up to target size
        Coalesce->>Coalesce: acquireGpuSemaphore()
        Coalesce->>Op: concat(columns)
        
        Op->>Op: Check if numCols == 0
        alt Zero columns
            Op->>Client: Return empty ColumnarBatch
        else Has columns
            Op->>Op: Calculate buffer sizes
            Op->>Op: Allocate host buffers (data + offsets)
            Op->>Op: Serialize headers and data to host
            Op->>GPU: Copy to device buffers
            GPU->>GPU: KudoGpuSerializer.assembleFromDeviceRaw()
            GPU->>Op: Return Table
            Op->>Coalesce: Return ColumnarBatch
            Coalesce->>Client: Return deserialized batch
        end
    end

1 file reviewed, no comments


zpuller · 2025-11-03T22:22:30Z

build

abellina

Looking good, one last one.

…alesceExec.scala Co-authored-by: Alessandro Bellina <abellina@gmail.com>

abellina · 2025-11-04T17:18:58Z

build

greptile-apps

Greptile Overview

Greptile Summary

This PR implements GPU-based Kudo deserialization for shuffle reads, providing a performance optimization that shows a 10% improvement on test queries.

Key Changes:

Added new KudoGpuTableOperator and KudoGpuShuffleCoalesceIterator to perform deserialization directly on GPU
Introduced SHUFFLE_KUDO_READ_MODE configuration (separate from write mode) with CPU/GPU options, defaulting to CPU
Refactored iterator hierarchy: created CoalesceIteratorBase base class and split into HostCoalesceIteratorBase and GpuCoalesceIteratorBase
Added KudoBuffers case class to manage host/device memory buffers for serialization
Implemented conflict resolution between async read and GPU kudo mode (GPU mode takes precedence with warning)
Added GpuMetricIteratorBase and GpuColumnarBatchMetricIterator for proper metrics tracking when GPU deserialization is used
Updated delta-lake modules to use the new separate SHUFFLE_KUDO_WRITE_MODE config

The implementation properly handles the interaction between async reads and GPU kudo mode by disabling async when GPU mode is active (via resolveUseAsync), preventing type safety issues.
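A metrics-tracking wrapper of the kind attributed to GpuColumnarBatchMetricIterator can be sketched as a pass-through iterator. The `Counter` and `Batch` types and the choice of metrics (batch count and row count) are assumptions for the example, not the plugin's actual metric set.

```scala
object MetricIteratorSketch {
  // Hypothetical stand-ins for a metric accumulator and a GPU batch.
  final class Counter {
    private var v = 0L
    def add(n: Long): Unit = { v += n }
    def value: Long = v
  }
  final case class Batch(numRows: Int)

  // Passes batches through unchanged while updating output metrics.
  final class MetricIterator(inner: Iterator[Batch], outBatches: Counter, outRows: Counter)
      extends Iterator[Batch] {
    def hasNext: Boolean = inner.hasNext
    def next(): Batch = {
      val batch = inner.next()
      outBatches.add(1L)
      outRows.add(batch.numRows.toLong)
      batch
    }
  }
}
```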

Previous Review Comments:
All previously identified issues have been addressed in the current code.

Confidence Score: 4/5

This PR is safe to merge with minor risk from complexity
The implementation is well-structured with proper resource management, comprehensive refactoring, and conflict resolution between async and GPU modes. The 10% performance improvement validates the approach. Score is 4/5 rather than 5/5 due to the significant complexity added with new iterator hierarchies and the interaction between multiple configuration modes that could have subtle edge cases in production.
No files require special attention - the refactoring is comprehensive and the previously identified issues have been resolved

Important Files Changed

File Analysis

Filename	Score	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala	4/5	Adds GPU-based Kudo deserialization with proper async handling and comprehensive refactoring of coalesce iterators

Sequence Diagram

sequenceDiagram
    participant App as Spark Application
    participant Exec as GpuShuffleCoalesceExec
    participant ReadOpt as CoalesceReadOption
    participant HostIter as KudoHostShuffleCoalesceIterator
    participant GpuIter as KudoGpuShuffleCoalesceIterator
    participant MetricIter as GpuColumnarBatchMetricIterator
    participant DevMem as GPU Memory

    App->>Exec: executeColumnar
    Exec->>ReadOpt: Create from conf
    ReadOpt->>ReadOpt: resolveUseAsync
    
    alt GPU Kudo Mode and Async Enabled
        ReadOpt->>ReadOpt: Log warning disable async
    end
    
    ReadOpt-->>Exec: ReadOption with resolved settings
    
    alt Kudo Enabled and GPU Mode
        Exec->>GpuIter: Create iterator
        GpuIter->>GpuIter: bufferNextBatch collect tables
        GpuIter->>GpuIter: concatenateTablesInGpu
        GpuIter->>DevMem: Copy to device memory
        GpuIter->>DevMem: assembleFromDeviceRaw
        GpuIter-->>Exec: ColumnarBatch on GPU
        
        alt Sync Mode no async
            Exec->>MetricIter: Wrap for metrics
            MetricIter->>MetricIter: Track metrics only
            MetricIter-->>App: ColumnarBatch
        end
    else Kudo Enabled and CPU Mode
        Exec->>HostIter: Create iterator
        HostIter->>HostIter: bufferNextBatch collect tables
        HostIter->>HostIter: concatenateTablesInHost
        HostIter-->>Exec: CoalescedHostResult
        Exec->>Exec: Transfer to GPU
        Exec-->>App: ColumnarBatch
    end
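
The two branches of the diagram can be approximated with a toy model. This is a sketch under stated assumptions, not the plugin's code: `ReadPath`, `Batch`, `CountingIterator`, and `read` are hypothetical names, and a batch is reduced to a row count plus a flag saying where it lives.

```scala
object CoalescePathSketch {
  // Hypothetical stand-ins; the real iterators work on Kudo tables and GPU buffers.
  sealed trait ReadPath
  case object GpuPath extends ReadPath
  case object HostPath extends ReadPath

  final case class Batch(rows: Int, onGpu: Boolean)

  // Pass-through wrapper that only counts batches and rows,
  // mirroring the "Track metrics only" step in the diagram.
  final class CountingIterator(inner: Iterator[Batch]) extends Iterator[Batch] {
    var batches: Long = 0L
    var rows: Long = 0L
    override def hasNext: Boolean = inner.hasNext
    override def next(): Batch = {
      val b = inner.next()
      batches += 1
      rows += b.rows
      b
    }
  }

  def read(path: ReadPath, rowCounts: Seq[Int]): Iterator[Batch] = path match {
    case GpuPath =>
      // GPU path: batches are already on the device, so only metrics are added.
      new CountingIterator(rowCounts.iterator.map(n => Batch(n, onGpu = true)))
    case HostPath =>
      // Host path: build the result on the host, then transfer it to the GPU.
      rowCounts.iterator
        .map(n => Batch(n, onGpu = false))
        .map(hostBatch => hostBatch.copy(onGpu = true))
  }
}
```

Both paths hand the caller batches that live on the GPU; they differ only in where concatenation happens and in whether a separate transfer step is needed.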

1 file reviewed, no comments

zpuller · 2025-11-04T22:39:17Z

build

zpuller added 2 commits September 24, 2025 13:28

start kudo gpu read impl

181e34d

Signed-off-by: Zach Puller <zpuller@nvidia.com>

separate config for kudo gpu reads

38609da

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller changed the title from "Kudo gpu read" to "Support kudo GPU shuffle reads in the plugin" Sep 24, 2025

zpuller added 5 commits September 24, 2025 15:32

formatting

dba80b0

Signed-off-by: Zach Puller <zpuller@nvidia.com>

fix scala 2.13 build

29a725b

Signed-off-by: Zach Puller <zpuller@nvidia.com>

integrate gpu kudo reads in shuffled hash join

9252362

Signed-off-by: Zach Puller <zpuller@nvidia.com>

change kudo write mode config name

5d305ef

Signed-off-by: Zach Puller <zpuller@nvidia.com>

Merge branch 'branch-25.12' into kudo_gpu_read

d6de781

zpuller changed the base branch from branch-25.10 to branch-25.12 September 29, 2025 16:34

zpuller mentioned this pull request Oct 7, 2025

[BUG] Several NDS queries return incorrect results with gpu kudo reads (shuffle_assemble) NVIDIA/cudf-spark-jni#3811

Closed

zpuller added 8 commits October 11, 2025 10:45

rm comment

b233c81

format

37a3edc

Signed-off-by: Zach Puller <zpuller@nvidia.com>

rename SHUFFLE_KUDO_MODE to SHUFFLE_KUDO_WRITE_MODE

4240938

Signed-off-by: Zach Puller <zpuller@nvidia.com>

refactor CoalesceIterators

786b232

Signed-off-by: Zach Puller <zpuller@nvidia.com>

format

88c70ad

Signed-off-by: Zach Puller <zpuller@nvidia.com>

add unit test

7c05cb4

Signed-off-by: Zach Puller <zpuller@nvidia.com>

fix test

5436d1a

Signed-off-by: Zach Puller <zpuller@nvidia.com>

fix

c94bc6f

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller marked this pull request as ready for review October 11, 2025 22:06

zpuller requested review from abellina and revans2 October 12, 2025 19:03

mv gpu acquire sem to right before kudo gpu concat

e3d48ff

Signed-off-by: Zach Puller <zpuller@nvidia.com>

sameerz added the performance label Oct 22, 2025

thirtiseven mentioned this pull request Oct 27, 2025

[BUG] some hash_aggregate integration tests failed with kudo.serializer.mode=GPU #13664

Closed

zpuller requested a review from binmahone October 27, 2025 18:05

binmahone reviewed Oct 28, 2025

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala

zpuller requested a review from abellina October 31, 2025 22:05

greptile-apps Bot reviewed Oct 31, 2025

pr comment

40699d2

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Nov 3, 2025

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala Outdated

abellina reviewed Nov 3, 2025

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala Outdated

abellina reviewed Nov 3, 2025

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala Outdated

zpuller added 4 commits November 3, 2025 12:14

format

4d55447

Signed-off-by: Zach Puller <zpuller@nvidia.com>

factor out shared GpuMetricIteratorBase

e72ff88

Signed-off-by: Zach Puller <zpuller@nvidia.com>

kudo buffer autocloseable wrapper

7333b0b

Signed-off-by: Zach Puller <zpuller@nvidia.com>

pr comments

9cc8289

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Nov 3, 2025

pr comments

d353f51

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Nov 3, 2025

zpuller requested a review from abellina November 3, 2025 21:12

dont pass T to KudoBuffers

a582b6c

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Nov 3, 2025

format

c2731d6

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Nov 3, 2025

abellina reviewed Nov 4, 2025

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala

Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala

b9bf0b6

Co-authored-by: Alessandro Bellina <abellina@gmail.com>

abellina approved these changes Nov 4, 2025

greptile-apps Bot reviewed Nov 4, 2025

zpuller merged commit 5d85dd1 into NVIDIA:main Nov 5, 2025
60 checks passed

zpuller deleted the kudo_gpu_read branch November 5, 2025 17:21
