
[BUG] split/slice APIs do not align with partitioning APIs #4607

Open
@jrhemstad


Describe the bug

Partitioning APIs that split a table into n partitions, such as hash_partition or round_robin_partition, return a single table and a vector of n+1 offsets. Each offset points to the beginning of a partition, so the size of any partition i is offsets[i+1] - offsets[i].

For example:

partitioned_table = {7}, {}, {3, 8, 9}, {42};
offsets = [0, 1, 1, 4, 5]
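
To make the offsets convention concrete, here is a minimal host-side sketch in plain C++ that recovers the partition sizes from an offsets vector. The function name `partition_sizes` is made up for illustration and is not part of libcudf; `int` stands in for the library's row index type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the n + 1 offsets returned alongside a partitioned table,
// compute the n partition sizes as offsets[i + 1] - offsets[i].
// Hypothetical helper for illustration only.
std::vector<int> partition_sizes(std::vector<int> const& offsets)
{
  assert(!offsets.empty());  // even zero partitions would carry one offset
  std::vector<int> sizes;
  sizes.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    sizes.push_back(offsets[i + 1] - offsets[i]);
  }
  return sizes;
}
```

For the offsets above, `partition_sizes({0, 1, 1, 4, 5})` gives `{1, 0, 3, 1}`, matching the partitions {7}, {}, {3, 8, 9}, {42}.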

I would expect to be able to trivially pass the output of a partitioning API into an API like split or slice in order to get a vector of zero-copy table_views for each partition.

However, this is not possible because the expected inputs for split or slice are incompatible with the offsets vector returned from a partitioning API.

slice expects a vector of index pairs:

 input:   [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
           {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
 indices: {1, 3, 5, 9, 2, 4, 8, 8}
 output:  [{{12, 14}, {20, 22, 24, 26}, {14, 16}, {}},
           {{52, 54}, {60, 62, 64, 66}, {54, 56}, {}}]
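
The pair convention can be modeled on a single host-side column as follows. This is only a sketch of the semantics: `slice_model` is a hypothetical name, and it copies the selected rows into new vectors, whereas the real slice returns zero-copy views.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the slice semantics for one column.
// `indices` holds consecutive [begin, end) pairs; each pair selects one piece.
// Copies rows for simplicity (the real API is described as returning views).
std::vector<std::vector<int>> slice_model(std::vector<int> const& input,
                                          std::vector<int> const& indices)
{
  assert(indices.size() % 2 == 0);  // indices must come in pairs
  std::vector<std::vector<int>> out;
  out.reserve(indices.size() / 2);
  for (std::size_t i = 0; i < indices.size(); i += 2) {
    int const begin = indices[i];
    int const end   = indices[i + 1];
    assert(0 <= begin && begin <= end && end <= static_cast<int>(input.size()));
    out.emplace_back(input.begin() + begin, input.begin() + end);
  }
  return out;
}
```

Applied to the first column of the example with indices {1, 3, 5, 9, 2, 4, 8, 8}, this yields {12, 14}, {20, 22, 24, 26}, {14, 16}, {}. Note that the pairs may overlap and need not be sorted, which is more general than what a partitioning result requires.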

split expects a vector of the split points:

 input:   {10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
 splits:  {2, 5, 9}
 output:  {{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}}
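
One way to see how the two conventions relate is to expand split points into the equivalent begin/end pairs. The sketch below does that on the host; `splits_to_slice_indices` is an illustrative name, and the explicit `num_rows` parameter is an assumption needed to close the last piece.

```cpp
#include <cassert>
#include <vector>

// Expand split points into the [begin, end) pairs that a slice-style API
// would need. The first piece starts at 0 and the last ends at num_rows.
// Hypothetical helper for illustration only.
std::vector<int> splits_to_slice_indices(std::vector<int> const& splits, int num_rows)
{
  std::vector<int> indices;
  indices.reserve(2 * (splits.size() + 1));
  int begin = 0;
  for (int split : splits) {
    assert(begin <= split && split <= num_rows);  // non-decreasing, in range
    indices.push_back(begin);
    indices.push_back(split);
    begin = split;
  }
  indices.push_back(begin);
  indices.push_back(num_rows);
  return indices;
}
```

For the example, splits {2, 5, 9} on a 10-row input expand to {0, 2, 2, 5, 5, 9, 9, 10}, which selects the four pieces shown in the output above.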

Neither of these is trivially compatible with the output of a partitioning API.

split is the closest. You can obtain the splits vector from the offsets vector by dropping the first and last elements of offsets. However, that is inconvenient, since every caller has to do this conversion by hand.
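
The workaround looks roughly like this on the host. `offsets_to_splits` is a hypothetical helper, not an existing libcudf function, and the resulting vector is what would then be passed as the splits argument.

```cpp
#include <cassert>
#include <vector>

// Convert n + 1 partition offsets into the n - 1 interior split points
// by dropping the leading 0 and the trailing total row count.
// Hypothetical helper for illustration only.
std::vector<int> offsets_to_splits(std::vector<int> const& offsets)
{
  assert(offsets.size() >= 2);  // at least one partition: {0, num_rows}
  return std::vector<int>(offsets.begin() + 1, offsets.end() - 1);
}
```

With the earlier example, offsets [0, 1, 1, 4, 5] become splits {1, 1, 4}, which produce the four partitions {7}, {}, {3, 8, 9}, {42}.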

Expected behavior

There should be an API that accepts the vector of offsets from a partitioning API as-is and returns a vector of zero-copy views, one for each partition.
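
As a rough sketch of the requested behavior, modeled on a plain host vector rather than a real table: the struct `row_range_view` and the function `split_by_offsets` are hypothetical names standing in for a table_view and the proposed API. Nothing here is an existing libcudf signature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning view over a contiguous run of rows (stand-in for a table_view).
struct row_range_view {
  int const* data;
  std::size_t size;
};

// Hypothetical API shape: take the n + 1 offsets unchanged and return
// n views into `rows`, one per partition, without copying any rows.
std::vector<row_range_view> split_by_offsets(std::vector<int> const& rows,
                                             std::vector<int> const& offsets)
{
  assert(!offsets.empty());
  assert(offsets.front() == 0);
  assert(offsets.back() == static_cast<int>(rows.size()));
  std::vector<row_range_view> views;
  views.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);  // offsets must be non-decreasing
    views.push_back({rows.data() + offsets[i],
                     static_cast<std::size_t>(offsets[i + 1] - offsets[i])});
  }
  return views;
}
```

The views alias the input storage, so they are only valid while `rows` is alive. That is the same lifetime contract one would expect from views into a partitioned table.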

    Labels

    bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code.)