
[FEA] pack/unpack functions to merge/split (multiple) device_buffer(s) #9726

Open
@jakirkham


Is your feature request related to a problem? Please describe.

It would be useful to have a pack function to merge multiple device_buffers into a single device_buffer. This is helpful in situations where reading from one large device_buffer is more performant, but the data ultimately consists of many smaller segments that need to be merged together first. Example use cases include sending data with UCX and spilling data from device to host.

Similarly, it would be useful to have an unpack function to split a device_buffer into multiple device_buffers. This is helpful in situations where writing into one large device_buffer is more performant, but the data ultimately consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.

Describe the solution you'd like

For pack, it would be nice if it simply took several device_buffers in vector form and returned a single one. Additionally, it would be nice if pack could recognize when device_buffers are contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if unpack is used regularly?). If we allow pack to change the order (to benefit from contiguous memory, for example), we may want additional information about where the data segments live in the larger device_buffer.
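To make the requested pack semantics concrete, here is a minimal host-memory sketch in Python. It is only an illustration under stated assumptions: `bytes`/`bytearray` stand in for device_buffers, the signature and the returned `sizes` list are hypothetical, and it always copies (it does not attempt the contiguity detection or reordering discussed above).

```python
from typing import List, Sequence, Tuple


def pack(buffers: Sequence[bytes]) -> Tuple[bytearray, List[int]]:
    """Copy every input buffer into one contiguous allocation.

    Hypothetical signature: returns the merged buffer plus the size of
    each segment, in input order, so the result can be split again later.
    """
    sizes = [len(b) for b in buffers]
    packed = bytearray(sum(sizes))  # one allocation for all segments
    offset = 0
    for buf, size in zip(buffers, sizes):
        packed[offset:offset + size] = buf  # one copy per segment
        offset += size
    return packed, sizes


# Usage: three segments (one of them empty) merged into a single buffer.
packed, sizes = pack([b"abc", b"", b"defgh"])
```

Returning the sizes alongside the merged buffer is one possible way to carry the "where do the segments live" information; if pack were allowed to reorder segments, offsets would likely be needed instead of sizes alone.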

For unpack, it would be nice if it took a single device_buffer and a vector of size_ts giving the split sizes, and returned a vector of multiple device_buffers. Additionally, it would be nice if unpack did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
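The copy-free splitting described above can be sketched on the host with `memoryview` slices. This is an assumption-laden illustration, not a proposed implementation: a `bytearray` stands in for the device_buffer, and the `unpack` signature is hypothetical.

```python
from typing import List, Sequence


def unpack(packed: bytearray, sizes: Sequence[int]) -> List[memoryview]:
    """Split one buffer into consecutive segments without copying.

    Hypothetical signature: `sizes` plays the role of the vector of
    size_ts; each returned memoryview aliases the original storage.
    """
    view = memoryview(packed)
    if sum(sizes) != view.nbytes:
        raise ValueError("sizes must add up to the length of the buffer")
    pieces = []
    offset = 0
    for size in sizes:
        pieces.append(view[offset:offset + size])  # slicing makes no copy
        offset += size
    return pieces


# Usage: the pieces alias the backing buffer, so a write to the backing
# buffer is visible through the corresponding piece.
backing = bytearray(b"abcdefgh")
pieces = unpack(backing, [3, 0, 5])
backing[0] = ord("X")
```

Note that in this sketch every slice keeps the whole backing allocation alive. Freeing segments at different times, as requested in this issue, is the part that would likely need cooperation from the allocator.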

Describe alternatives you've considered

One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels. Hence the suggestion to just use vector.

pack itself could be implemented by a user simply allocating a larger buffer and copying the data over. It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).

Additional context

Having unpack in particular would be helpful for aggregated receives. A natural extension of this would be to have pack for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, and thus benefit from the large-transfer use case that UCX is better tuned for. PR (dask/distributed#3453) provides a WIP implementation of aggregated receives for context.

Also, having pack would be useful when spilling several device_buffers from device to host, as it would allow us to pack them into one device_buffer before transferring (rapidsai/dask-cuda#250). Having unpack would help us break up the allocation whenever the object is unspilled.

This need has also come up in downstream contexts (#3793). Maybe they would benefit from an upstream solution as well?

Labels: 0 - Backlog (in queue waiting for assignment), feature request (new feature or request), libcudf (affects libcudf (C++/CUDA) code)
