Description
Currently, each input partition is split into output_number_partitions (n_out) chunks, which are added to the shuffler. If there are n input partitions in each worker, there will be n*n_out chunks in each worker. With p workers, on average n*n_out/p of these chunks are sent as messages to each remote worker. These messages could potentially be very small.
Concatenating these small chunks into a batch could improve communication performance. Since the shuffler receives metadata and data buffers separately, this concatenation/batching would have to be done outside of cudf utils. The overhead of this approach is increased memory pressure.
This approach is discussed further in this document.
ChunkBatch
A ChunkBatch encapsulates a batch of chunks added to the shuffler, intended to be sent to a particular target worker/rank. Just like a Chunk, a ChunkBatch contains a uid. It also contains two buffers: the concatenated metadata and the concatenated payloads.
The most trivial way to create a batch is to copy each chunk's metadata into the concatenated buffer one after the other, and to do the same for the payloads. This allows us to traverse a batch using a forward iterator. However, random access is not possible, as we wouldn't know the offset of the i-th chunk in the batch.
If we were to add a random access iterator to the batch, we would need more information, such as the prefix sums of the metadata and payload buffer sizes of each chunk.