[FEA] `pack`/`unpack` functions to merge/split (multiple) `device_buffer`(s)

**Is your feature request related to a problem? Please describe.**

It would be useful to have a `pack` function to merge multiple `device_buffer`s into a single `device_buffer`. This is helpful in situations where reading from one large `device_buffer` is more performant, even though the data ultimately consists of many smaller segments that need to be merged together. Example use cases include sending data with UCX and spilling data from device to host.

Similarly, it would be useful to have an `unpack` function to split a `device_buffer` into multiple `device_buffer`s. This is helpful in situations where writing into one large `device_buffer` is more performant, even though the data ultimately consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.

**Describe the solution you'd like**

For `pack`, it would be nice if it simply took several `device_buffer`s in `vector` form and returned a single one. Additionally, it would be nice if `pack` could recognize when `device_buffer`s are contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if `unpack` is used regularly?). If we allow `pack` to change the order (to benefit from contiguous memory, for example), we may want additional information about where the data segments live in the larger `device_buffer`.
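To make the contiguity idea concrete, here is a rough host-side sketch. It is not RMM code: the `(address, size)` descriptors and the name `contiguous_runs` are hypothetical, and it assumes the caller can already get each buffer's start address and size. It groups buffers into runs that sit back to back in memory, and returns indices so the caller can still tell where each original segment ended up after reordering.

```python
from typing import List, Optional, Tuple

# Hypothetical descriptor for one buffer: (start_address, size_in_bytes).
Segment = Tuple[int, int]


def contiguous_runs(segments: List[Segment]) -> List[List[int]]:
    """Group segment indices into runs that are adjacent in memory.

    Segments are visited in address order, so the result may not follow
    the input order; the indices map each entry back to its input.
    """
    order = sorted(range(len(segments)), key=lambda i: segments[i][0])
    runs: List[List[int]] = []
    prev_end: Optional[int] = None
    for i in order:
        start, size = segments[i]
        if runs and start == prev_end:
            # Starts exactly where the previous segment ended: same run.
            runs[-1].append(i)
        else:
            # Gap (or first segment): start a new run.
            runs.append([i])
        prev_end = start + size
    return runs


# Made-up addresses, purely for illustration.
segments = [(4096, 256), (1024, 512), (1536, 128), (4352, 64)]
runs = contiguous_runs(segments)
```

A run with more than one index is a candidate for skipping the copy; a run with a single index would still need to be copied into the packed buffer. Whether adjacent allocations can actually be treated as one buffer is up to the allocator, which this sketch does not model.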

For `unpack`, it would be nice if it took a single `device_buffer` and a `vector` of `size_t` sizes, and returned a `vector` of `device_buffer`s split at those sizes. Additionally, it would be nice if `unpack` did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
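As a sketch of the no-copy behavior I have in mind, here is a host-side analogue using `memoryview` slices over a `bytearray`. This only stands in for device memory; the `unpack` name and signature below are illustrative and not an existing RMM function.

```python
from typing import List, Sequence


def unpack(packed: memoryview, sizes: Sequence[int]) -> List[memoryview]:
    """Split one buffer into consecutive views of the given sizes, without copying."""
    if any(size < 0 for size in sizes):
        raise ValueError("sizes must be non-negative")
    if sum(sizes) != packed.nbytes:
        raise ValueError("sizes must add up to the size of the packed buffer")
    views: List[memoryview] = []
    offset = 0
    for size in sizes:
        # Slicing a memoryview yields another view onto the same memory.
        views.append(packed[offset:offset + size])
        offset += size
    return views


backing = bytearray(b"aaaabbcccccc")
parts = unpack(memoryview(backing), [4, 2, 6])

# Writing through a view changes the original buffer, showing nothing was copied.
parts[1][:] = b"ZZ"
```

One caveat this makes visible: on the host, every view keeps the whole backing allocation alive, so the pieces cannot be freed independently. Getting that property for `device_buffer`s is probably the part that needs cooperation from the allocator.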

**Describe alternatives you've considered**

One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels, hence the suggestion to just use `vector`.

`pack` itself could be implemented by a user simply allocating a larger buffer and copying each segment over. It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).
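For reference, the user-level workaround looks roughly like this. Again this is a host-side stand-in with `bytes`/`bytearray` in place of `device_buffer`, and the `pack` signature and returned offsets are my own illustration rather than a proposed API.

```python
from typing import List, Sequence, Tuple


def pack(buffers: Sequence[bytes]) -> Tuple[bytearray, List[int]]:
    """Copy several buffers into one new allocation.

    Returns the packed buffer and the starting offset of each input,
    in input order.
    """
    sizes = [len(buf) for buf in buffers]
    packed = bytearray(sum(sizes))  # the extra allocation mentioned above
    offsets: List[int] = []
    offset = 0
    for buf, size in zip(buffers, sizes):
        packed[offset:offset + size] = buf  # one copy per segment
        offsets.append(offset)
        offset += size
    return packed, offsets


packed, offsets = pack([b"aaaa", b"bb", b"cccccc"])
```

This always pays for one allocation plus one copy per segment, which is exactly the cost a contiguity-aware `pack` could sometimes skip.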

**Additional context**

Having `unpack` in particular would be helpful for aggregated receives. A natural extension of this would be to have `pack` for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, and thus benefit from the large-transfer use case it is better tuned for. The PR ( https://github.com/dask/distributed/pull/3453 ) provides a WIP implementation of aggregated receives for context.

Also, having `pack` would be useful when spilling several `device_buffer`s from device to host, as it would allow us to pack them into one `device_buffer` before transferring ( https://github.com/rapidsai/dask-cuda/issues/250 ). Having `unpack` would help us break up the allocation whenever the object is unspilled.

This need has also come up in downstream contexts ( https://github.com/rapidsai/cudf/issues/3793 ). Maybe they would benefit from an upstream solution as well?
