More flexible APIs #166

dutkalex · 2025-06-18T13:48:17Z


Hi everyone!
I’d like to submit an idea for your review: a way to improve the usability of the high-level APIs.

First, let me articulate the problem I want to solve. One of the first milestones for KokkosComm is to be able to send data between processes using traditional point-to-point and collective communication patterns, synchronously and asynchronously, for both contiguous and non-contiguous Kokkos::Views. While implementing the synchronous case is fairly straightforward, the async case raises many questions, especially when sending non-contiguous data:

- What is the most efficient way to send non-contiguous data? Packing and unpacking seems like the right default behavior, but other strategies might be preferable in some cases (Design Constraints/Limitations #28).
- What should be the lifetime of the packing/unpacking buffers? They have to be kept around at least until the communication completes, but some applications commonly reuse these buffers many times to avoid allocating and deallocating over and over again.
- How do we support overlapping communication with computation when packing and unpacking are required? Hiding communication costs by performing other computations in the meantime is one of the most appealing properties of async communication, so inserting a synchronous packing step somewhat defeats the purpose of such constructs and limits the scalability of user code.
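To make the buffer-lifetime question concrete, here is a minimal sketch of a pool that hands out packing buffers and takes them back for reuse. It is plain C++ with `std::vector` standing in for device memory, and every name in it (`PackBufferPool`, `acquire`, `release`) is hypothetical: nothing here is part of KokkosComm or Kokkos.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: none of these names exist in KokkosComm.
// A pool that lends out packing buffers and takes them back for reuse,
// so repeated communications do not allocate every time.
class PackBufferPool {
public:
  // Returns a buffer of `count` elements, reusing a released one when
  // its capacity is large enough.
  std::vector<double> acquire(std::size_t count) {
    for (std::size_t i = 0; i < free_.size(); ++i) {
      if (free_[i].capacity() >= count) {
        std::vector<double> buf = std::move(free_[i]);
        free_.erase(free_.begin() + static_cast<std::ptrdiff_t>(i));
        buf.resize(count);  // no reallocation: capacity already suffices
        ++reuses_;
        return buf;
      }
    }
    ++allocations_;
    return std::vector<double>(count);
  }

  // Hands a buffer back once the communication that used it has completed.
  void release(std::vector<double> buf) { free_.push_back(std::move(buf)); }

  std::size_t allocations() const { return allocations_; }
  std::size_t reuses() const { return reuses_; }

private:
  std::vector<std::vector<double>> free_;
  std::size_t allocations_ = 0;
  std::size_t reuses_ = 0;
};
```

An application that exchanges the same halo every time step would call `acquire` before packing and `release` after the wait, paying for one allocation in total. A different application could swap in a pool with another policy, which is the kind of customization point this post argues for.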

The current "automagic" approach using the KokkosComm::Req<CommSpace> object is therefore incompatible with the level of flexibility required to meet the diversity of user needs. Ideally, we want KokkosComm to empower users to implement the best communication scheme for their problem in a direct and portable way, not force them to conform to a predefined way of doing things that is likely to be suboptimal in many contexts. Otherwise, users will be tempted to call the lower-level internal APIs when they want a different behavior, or will simply write their own communication wrappers to fit their needs.

IMO, the root problem here is that we are attempting to hide the contiguity problem from the user, and that is not the right way to go: this approach allocates and copies data behind the user’s back, which is contrary to the "no implicit deep copies" fundamental design principle of Kokkos. Instead, I argue that we need to treat this whole packing/unpacking process as its own thing in the KokkosComm model, and make it customizable by the users.
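As a rough illustration of what "packing as its own thing" means, the sketch below gathers strided data into a contiguous buffer that the caller provides, and scatters it back. It uses raw pointers instead of Kokkos::Views to stay self-contained. `StridedSpan`, `pack`, and `unpack` are made-up names for this example, not existing KokkosComm functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-contiguous 1D view: `count` elements
// spaced `stride` apart inside a larger allocation.
struct StridedSpan {
  double* data;
  std::size_t count;
  std::size_t stride;
};

// Gathers the strided elements into a caller-provided contiguous buffer.
// The copy is explicit: the caller owns the buffer and decides when to pack.
void pack(const StridedSpan& src, std::vector<double>& buffer) {
  assert(buffer.size() >= src.count);
  for (std::size_t i = 0; i < src.count; ++i) {
    buffer[i] = src.data[i * src.stride];
  }
}

// Scatters a contiguous buffer back into the strided destination.
void unpack(const std::vector<double>& buffer, const StridedSpan& dst) {
  assert(buffer.size() >= dst.count);
  for (std::size_t i = 0; i < dst.count; ++i) {
    dst.data[i * dst.stride] = buffer[i];
  }
}
```

For a 3x4 row-major matrix stored in a flat array `m`, the second column would be `StridedSpan{m.data() + 1, 3, 4}`. Because the buffer is passed in rather than allocated inside `pack`, no hidden allocation or copy takes place.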

That being said, the question of how to manage the lifetime of packing buffers remains unanswered, but I believe we already have the right abstraction in place to deal with this. The KokkosComm::Handle<ExecSpace,CommSpace> naturally predates and outlives the asynchronous communication calls and could be responsible for providing (in a customizable way, depending on the context) the necessary packing/unpacking buffers. Here is a strawman example to illustrate the proposed idea:

// The temporary buffers are acquired from the handle, and opaque tokens are returned to later wait for completion
auto opaque_recv_token = KokkosComm::recv_async(non_contig_recv_data, rank, tag, handle);
auto opaque_pack_token = KokkosComm::pack_async(non_contig_send_data, handle);

// Here we might want to do something else concurrently while packing

// The opaque_pack_token is consumed and a new one is produced for the send step
auto opaque_send_token = KokkosComm::send_async(opaque_pack_token, rank, tag, handle);

// Here we might want to do something else concurrently while communicating

// The opaque_recv_token is consumed and a new one is produced for the unpacking step
auto opaque_unpack_token = KokkosComm::unpack_async(opaque_recv_token, handle);
 
// Here we might want to do something else concurrently while unpacking

// All the tokens are consumed and the temporary buffers are handed back to the handle
KokkosComm::wait_all(opaque_unpack_token, opaque_send_token, handle);

Naming and syntax are completely up for debate; the above example is only meant to illustrate the flexibility and composability of the design: users can open the hood at any point and swap in a custom implementation better suited to their needs. From these simple building blocks, we could then easily define helper functions to bundle operations together when it makes sense to do so (for example, the current solution could easily be implemented in terms of these primitives). This approach also has the added benefit that all the communication calls only have to handle contiguous data, thus simplifying their implementation.
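To show how a bundled helper could sit on top of such primitives, here is a small mock in plain C++. It is synchronous and the "network" is an in-memory mailbox, so it says nothing about real overlap or about MPI. All names (`Handle`, `PackToken`, `SendToken`, `pack_strided`, `send_packed`, `send_strided`) are assumptions made for this sketch and do not match any existing KokkosComm API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// All names are hypothetical; the "transport" is an in-memory mailbox.
struct Handle {
  std::vector<std::vector<double>> mailbox;  // stands in for the network
};

// Move-only token owning the packed buffer. Consuming the token means
// moving from it, so it cannot be sent twice by accident.
class PackToken {
public:
  explicit PackToken(std::vector<double> buf) : buf_(std::move(buf)) {}
  PackToken(PackToken&&) = default;
  PackToken& operator=(PackToken&&) = default;
  PackToken(const PackToken&) = delete;
  PackToken& operator=(const PackToken&) = delete;
  std::vector<double> take() && { return std::move(buf_); }

private:
  std::vector<double> buf_;
};

struct SendToken {
  std::size_t message_index;  // position of the message in the mailbox
};

// Primitive 1: gather every `stride`-th element into a contiguous buffer.
PackToken pack_strided(const std::vector<double>& data, std::size_t stride) {
  assert(stride > 0);
  std::vector<double> buf;
  for (std::size_t i = 0; i < data.size(); i += stride) buf.push_back(data[i]);
  return PackToken(std::move(buf));
}

// Primitive 2: consumes the pack token; it only ever sees contiguous data.
SendToken send_packed(PackToken token, Handle& handle) {
  handle.mailbox.push_back(std::move(token).take());
  return SendToken{handle.mailbox.size() - 1};
}

// Bundled helper: the one-call convenience path, built from the primitives.
SendToken send_strided(const std::vector<double>& data, std::size_t stride,
                       Handle& handle) {
  return send_packed(pack_strided(data, stride), handle);
}
```

Users who are happy with the default call `send_strided`. Users who want their own packing can build a `PackToken` themselves and call `send_packed` directly, without touching the send path.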
