Description
Is your feature request related to a problem? Please describe.
It would be useful to have a `pack` function that merges multiple `device_buffer`s into a single `device_buffer`. This helps in situations where reading from one large `device_buffer` is more performant, even though the data ultimately consists of many smaller segments that need to be merged together. Example use cases include sending data with UCX and spilling data from device to host.
Similarly, it would be useful to have an `unpack` function that splits a `device_buffer` into multiple `device_buffer`s. This helps in situations where writing into one large `device_buffer` is more performant, even though the data ultimately consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.
Describe the solution you'd like
For `pack`, it would be nice if it simply took several `device_buffer`s in `vector` form and returned a single one. Additionally, it would be nice if `pack` could recognize when `device_buffer`s are contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if `unpack` is used regularly?). If we allow `pack` to change the order (to benefit from contiguous memory, for example), we may want it to return additional information about where each data segment lives in the larger `device_buffer`.
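To make the shape of `pack` concrete, here is a minimal host-only sketch. Everything in it is an assumption for illustration: `buffer` is a `std::vector<unsigned char>` standing in for `device_buffer` so the example runs without a GPU, and the `packed` struct with its `offsets` and `sizes` fields is a hypothetical way to carry the "where does each segment live" information. This version always copies and preserves input order; a real device implementation would presumably use an asynchronous device copy such as `cudaMemcpyAsync` instead of `std::memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for device_buffer (not the RMM type).
using buffer = std::vector<unsigned char>;

// Hypothetical result type: one contiguous buffer plus segment metadata.
struct packed {
  buffer data;
  std::vector<std::size_t> offsets;  // offsets[i]: where input i starts in data
  std::vector<std::size_t> sizes;    // sizes[i]: length of input i in bytes
};

// Copy every input into a single allocation, preserving input order.
packed pack(std::vector<buffer> const& inputs)
{
  std::size_t total = 0;
  for (auto const& b : inputs) { total += b.size(); }

  packed out;
  out.data.resize(total);
  out.offsets.reserve(inputs.size());
  out.sizes.reserve(inputs.size());

  std::size_t cursor = 0;
  for (auto const& b : inputs) {
    out.offsets.push_back(cursor);
    out.sizes.push_back(b.size());
    // Skip empty inputs so memcpy never sees a null pointer.
    if (!b.empty()) { std::memcpy(out.data.data() + cursor, b.data(), b.size()); }
    cursor += b.size();
  }
  return out;
}
```

Returning the sizes alongside the packed data means the result can be handed straight to an `unpack`-style function on the receiving side, and the offsets would stay meaningful even if a later version reordered segments.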
For `unpack`, it would be nice if it took a single `device_buffer` and a `vector` of `size_t`s describing how to split it, and returned a `vector` of multiple `device_buffer`s. Additionally, it would be nice if `unpack` did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
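One possible way to get a copy-free `unpack` is sketched below, again on the host only. The names are hypothetical: `buffer` is a `std::vector<unsigned char>` stand-in for `device_buffer`, and `buffer_view` is an invented slice type that shares ownership of the parent allocation. Note the trade-off this illustrates: with shared ownership the large allocation is released only when the last slice goes away, so slices cannot truly be freed at different times unless the allocator itself can split an allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for device_buffer (not the RMM type).
using buffer = std::vector<unsigned char>;

// Hypothetical non-copying slice that keeps the parent allocation alive.
struct buffer_view {
  std::shared_ptr<buffer> owner;  // shared ownership of the packed buffer
  std::size_t offset;             // start of this slice within the parent
  std::size_t size;               // length of this slice in bytes
  unsigned char* data() const { return owner->data() + offset; }
};

// Split one buffer into consecutive slices of the given sizes without copying.
std::vector<buffer_view> unpack(buffer&& packed, std::vector<std::size_t> const& sizes)
{
  std::size_t total = 0;
  for (auto s : sizes) { total += s; }
  if (total != packed.size()) {
    throw std::invalid_argument("sizes must sum to the packed buffer size");
  }

  // Moving the vector transfers its storage; no bytes are copied.
  auto owner = std::make_shared<buffer>(std::move(packed));

  std::vector<buffer_view> views;
  views.reserve(sizes.size());
  std::size_t cursor = 0;
  for (auto s : sizes) {
    views.push_back(buffer_view{owner, cursor, s});
    cursor += s;
  }
  return views;
}
```

Whether the slices should be owning `device_buffer`s or lighter-weight views is exactly the open design question here; the sketch only shows that the split itself needs nothing more than pointer arithmetic.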
Describe alternatives you've considered
One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels, hence the suggestion to just use `vector`.
`pack` itself could be implemented by a user simply allocating a larger buffer and copying the data over. It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).
Additional context
Having `unpack` in particular would be helpful for aggregated receives. A natural extension of this would be to have `pack` for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, and thus benefit from the use case it is more honed for. PR ( dask/distributed#3453 ) provides a WIP implementation of aggregated receives for context.
Also, having `pack` would be useful when spilling several `device_buffer`s from device to host, as it would allow us to pack them into one `device_buffer` before transferring ( rapidsai/dask-cuda#250 ). Having `unpack` would help us break up the allocation whenever the object is unspilled.
This need has also come up in downstream contexts ( #3793 ). Maybe they would benefit from an upstream solution as well?