[FEA] kudo optimization for empty copies during shuffle_assemble #3020

Open
@nvdbaranec

Description

Benchmarking showed a surprising amount of GPU time spent servicing "empty" copy commands generated during the shuffle step. Fundamentally, we assume every column has three copies to perform: validity, offsets, and data. When there is no work to do for one of these, we still generate a command to copy 0 bytes. The kernel exits immediately for those commands, but with very large splits the overhead can be nontrivial. It wouldn't be too hard to change how the copy batches are generated so that the list is compact and contains no empty entries.
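To make the proposal concrete, here is a minimal host-side sketch of generating a compact copy list. It is an illustration only, not the actual shuffle_assemble code: `copy_command`, `column_buffers`, and `build_compact_copy_batch` are hypothetical names, and the real batches may well be built on the device, where a stream-compaction primitive could play the same role as the `if` below.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical descriptor for a single copy the assemble kernel would service.
struct copy_command {
  const std::uint8_t* src;
  std::uint8_t* dst;
  std::size_t num_bytes;
};

// Hypothetical per-column view: the three copies assumed for every column.
struct column_buffers {
  copy_command validity;
  copy_command offsets;
  copy_command data;
};

// Builds the copy list while skipping zero-byte commands, so the kernel
// never has to launch work for an entry that copies nothing.
std::vector<copy_command> build_compact_copy_batch(
  const std::vector<column_buffers>& columns)
{
  std::vector<copy_command> batch;
  batch.reserve(columns.size() * 3);  // upper bound: three copies per column

  for (const auto& col : columns) {
    for (const copy_command& cmd : {col.validity, col.offsets, col.data}) {
      if (cmd.num_bytes > 0) { batch.push_back(cmd); }
    }
  }
  return batch;
}
```

With this shape, the batch length tracks the number of copies that move data rather than three times the column count, which is where the savings would come from when many columns have no validity or offsets buffer.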
