Open
Description
Benchmarking showed a surprising amount of GPU time spent servicing "empty" copy commands generated during the shuffle step. Fundamentally, we assume every column has three copies to perform: validity, offsets, and data. When one of those buffers has no work to do, we still generate a command to copy 0 bytes. The kernel drops out immediately for those commands, but in situations where we have very large splits, the accumulated overhead can be nontrivial. It wouldn't be too hard to change how the copy batches are generated so that the list is compact and contains no empty entries.
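
To make the proposal concrete, here is a rough host-side sketch of building a compact batch. It is only an illustration: `CopyCommand`, `ColumnBuffers`, and `build_compact_batch` are hypothetical names, not the types or functions used in the actual shuffle code, and the real batch generation may well happen on the device rather than in a host loop.

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical descriptor for a single buffer copy (assumed for illustration).
struct CopyCommand {
  const void* src;
  void* dst;
  std::size_t num_bytes;
};

// Hypothetical per-column set of the three copies the issue describes:
// validity, offsets, and data. Any of them may be empty (0 bytes).
struct ColumnBuffers {
  CopyCommand validity;
  CopyCommand offsets;
  CopyCommand data;
};

// Build the batch, skipping zero-byte copies so that no empty commands
// ever reach the copy kernel.
std::vector<CopyCommand> build_compact_batch(const std::vector<ColumnBuffers>& columns) {
  std::vector<CopyCommand> batch;
  batch.reserve(columns.size() * 3);  // upper bound: three copies per column
  for (const auto& col : columns) {
    for (const CopyCommand* cmd : {&col.validity, &col.offsets, &col.data}) {
      if (cmd->num_bytes > 0) {
        batch.push_back(*cmd);
      }
    }
  }
  return batch;
}
```

One consequence of compacting is that a command's position in the batch no longer maps to a fixed (column, buffer) slot, so any code that indexes the batch by `column * 3 + buffer` would need to change as well.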