Add a chunk batch interface #175

Open
@nirandaperera

Description

Currently, each input partition is split into output_number_partitions (n_out) chunks, which are added to the shuffler. If there are n input partitions in each worker, then there will be n*n_out chunks in each worker. With p workers, on average n*n_out/p of these chunks are sent externally to each other worker as individual messages. These messages could potentially be very small.
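As a concrete (hypothetical) illustration: with n = 10 input partitions per worker, n_out = 1,000 output partitions, and p = 100 workers, each worker produces 10 * 1,000 = 10,000 chunks and sends roughly 10,000 / 100 = 100 chunks to each of the other 99 workers, where each chunk holds only about 1/1,000th of an input partition.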

Concatenating these small chunks into a batch could improve communication performance. Since the shuffler receives metadata and data buffers separately, this concatenation/batching will have to be done outside of the cudf utilities. The overhead of this approach would be increased memory pressure.

This approach is discussed further in this document.

A ChunkBatch would encapsulate a batch of chunks added to the shuffler, intended to be sent to a particular target worker/rank. Just like a Chunk, a ChunkBatch will contain a uid. It will then contain two buffers: the concatenated metadata and the concatenated payloads.
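A minimal sketch of what such a batch could look like (all names and types here are assumptions for illustration, not the actual shuffler API):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch only; `Buffer` stands in for whatever metadata/payload
// buffer types the shuffler already uses (host buffer, device buffer, ...).
using Buffer = std::vector<std::uint8_t>;

struct ChunkBatch {
    std::uint64_t uid;                 // unique id of the batch, analogous to a Chunk's uid
    int target_rank;                   // worker/rank this batch is destined for
    std::size_t num_chunks;            // number of chunks packed into the batch
    std::unique_ptr<Buffer> metadata;  // per-chunk metadata, concatenated back-to-back
    std::unique_ptr<Buffer> payload;   // per-chunk payloads, concatenated back-to-back
};
```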
The most trivial way of creating a batch would be to copy the metadata of each chunk onto the concatenated metadata buffer, one after the other, and to do the same for the payloads. This allows us to traverse a batch using a forward iterator. However, random access is not possible, because we would not know the stride (offset) of the i-th chunk in the batch.
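For illustration, forward traversal could look like the sketch below, assuming (hypothetically) that each chunk's metadata region starts with a small header recording its metadata and payload sizes; the real Chunk metadata layout may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed per-chunk header written at the start of each chunk's metadata region.
struct ChunkHeader {
    std::uint64_t metadata_size;  // bytes of chunk metadata following this header
    std::uint64_t payload_size;   // bytes of this chunk's payload in the payload buffer
};

// Forward traversal of the two concatenated buffers. The offset of chunk i
// depends on the sizes of all chunks before it, so without extra bookkeeping
// there is no O(1) jump to an arbitrary chunk.
void for_each_chunk(std::vector<std::uint8_t> const& metadata,
                    std::vector<std::uint8_t> const& payload,
                    std::size_t num_chunks) {
    std::size_t meta_off = 0;
    std::size_t data_off = 0;
    for (std::size_t i = 0; i < num_chunks; ++i) {
        ChunkHeader hdr{};
        std::memcpy(&hdr, metadata.data() + meta_off, sizeof(ChunkHeader));
        meta_off += sizeof(ChunkHeader);

        // chunk i's metadata starts at metadata.data() + meta_off (hdr.metadata_size bytes)
        // chunk i's payload  starts at payload.data()  + data_off (hdr.payload_size bytes)

        meta_off += hdr.metadata_size;
        data_off += hdr.payload_size;
    }
}
```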

If we were to add a random access iterator to the batch, we would need more information, such as the prefix sums of the metadata and payload buffer sizes of the chunks.
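One possible shape for that extra information, sketched with hypothetical names, is an offset table built from those prefix sums when the batch is assembled; it gives O(1) access to the i-th chunk's extents in both buffers:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical offset table: prefix sums of the per-chunk metadata and payload
// sizes, with num_chunks + 1 entries per vector (the last entry is the total size).
struct ChunkBatchIndex {
    std::vector<std::size_t> metadata_offsets;  // metadata_offsets[i] = start of chunk i's metadata
    std::vector<std::size_t> payload_offsets;   // payload_offsets[i]  = start of chunk i's payload
};

// Returns {offset, size} of chunk i within the buffer described by `offsets`.
std::pair<std::size_t, std::size_t> chunk_extent(std::vector<std::size_t> const& offsets,
                                                 std::size_t i) {
    return {offsets[i], offsets[i + 1] - offsets[i]};
}
```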
