[FEA] kud0 optimization for empty copies during shuffle_assemble


Benchmarking showed a surprising amount of GPU time spent servicing "empty" copy commands generated during the shuffle step. Fundamentally, we assume every column has 3 copies to perform: validity, offsets, and data. When there is no work to do for one of these, we still generate a command that copies 0 bytes. The kernel exits immediately for those commands, but with very large splits the overhead can be nontrivial. It wouldn't be too hard to change how the copy batches are generated so that the list is compact and contains no empty entries.
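As a rough illustration of the idea, here is a minimal host-side sketch of building a compact batch. Everything in it is hypothetical: `copy_command`, `column_buffers`, and `make_compact_copy_batch` are names made up for this example, not the actual types or functions used by shuffle_assemble, and it assumes the three buffers of a column sit back to back with no padding, which the real layout may not do.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical descriptor for one buffer copy (illustration only).
struct copy_command {
  const std::uint8_t* src;
  std::uint8_t* dst;
  std::size_t num_bytes;
};

// Hypothetical per-column description: byte sizes of the three buffers,
// assumed here to be laid out contiguously as validity, offsets, data.
struct column_buffers {
  const std::uint8_t* src_base;
  std::uint8_t* dst_base;
  std::size_t validity_bytes;
  std::size_t offsets_bytes;
  std::size_t data_bytes;
};

// Build the copy batch, emitting a command only when there are bytes to copy.
// Zero-byte copies are skipped, so the resulting list has no empty entries.
std::vector<copy_command> make_compact_copy_batch(std::vector<column_buffers> const& columns)
{
  std::vector<copy_command> batch;
  batch.reserve(columns.size() * 3);  // upper bound: 3 copies per column
  for (auto const& col : columns) {
    std::size_t pos = 0;  // running byte offset within this column's buffers
    for (std::size_t n : {col.validity_bytes, col.offsets_bytes, col.data_bytes}) {
      if (n > 0) { batch.push_back({col.src_base + pos, col.dst_base + pos, n}); }
      pos += n;
    }
  }
  return batch;
}
```

With this shape, the number of commands handed to the copy kernel tracks the number of non-empty buffers rather than 3 times the column count. Whether filtering at batch-generation time is the right place for it in the real code is a design question this sketch does not settle.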



[FEA] kud0 optimization for empty copies during shuffle_assemble #3020
