An idea of mine from some send/receive packed_range discussion with @loganharbour late last night, slightly edited:
What if we always do chunked sends conceptually, but:
- We pack any metadata in with the buffer itself, so that we can use a Status to allocate the receive buffer and we don't have to send a size integer separately. To enable better computation+communication interleaving during nonblocking sends, we can avoid packing a total_buffer_size, and just pack a bool in each chunk that says whether subsequent chunks are expected.
- If a probe for the first buffer succeeds, we just let probes for the packed receive return true and we allow "semi" non-blocking receives in the API: when a PostWaitUnpackBuffer handles the first received chunk's buffer, it checks to see if subsequent chunks are expected and it treats those with a receive-and-unpack loop (blocking as a whole, even if we use non-blocking vector receives internally to let receive and unpack steps overlap).
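To make the two points above concrete, here is a rough sketch of the chunk layout and the receive-and-unpack loop. It uses no MPI at all: a `std::deque` stands in for messages in flight, and `Chunk`, `pack_chunks` and `receive_all` are hypothetical names for illustration, not anything in TIMPI. It also assumes the flag occupies one buffer entry at the front of each chunk, which is only one possible layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for one packed message buffer.
using Chunk = std::vector<int>;

// Split `data` into chunks of at most `max_chunk_size` entries each.
// Entry 0 of every chunk is a continuation flag (assumed layout):
// 1 if another chunk follows, 0 if this is the last one.
// No total size is ever packed.
std::deque<Chunk> pack_chunks(const std::vector<int> & data,
                              std::size_t max_chunk_size)
{
  assert(max_chunk_size >= 2); // flag plus at least one payload entry
  const std::size_t payload_per_chunk = max_chunk_size - 1;

  std::deque<Chunk> chunks;
  std::size_t begin = 0;
  do
    {
      const std::size_t end =
        std::min(data.size(), begin + payload_per_chunk);

      Chunk chunk;
      chunk.reserve(end - begin + 1);
      chunk.push_back(end < data.size() ? 1 : 0); // more chunks expected?
      chunk.insert(chunk.end(), data.begin() + begin, data.begin() + end);
      chunks.push_back(std::move(chunk));

      begin = end;
    }
  while (begin < data.size());

  return chunks;
}

// Receive-and-unpack loop: consume the first chunk, then keep going
// for as long as each chunk's flag says another one is expected.
std::vector<int> receive_all(std::deque<Chunk> & in_flight)
{
  std::vector<int> out;
  bool more = true;
  while (more)
    {
      assert(!in_flight.empty());
      Chunk chunk = std::move(in_flight.front());
      in_flight.pop_front();

      more = (chunk[0] != 0);
      out.insert(out.end(), chunk.begin() + 1, chunk.end());
    }
  return out;
}
```

In a real implementation each `pop_front` would be a probe plus a receive sized from the probed Status, and the receive for chunk N+1 could be posted before chunk N is unpacked.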
Then:
- We get all the benefits of the existing chunked case: User code which wants to interleave communication and computation as much as possible within a single send/receive pair can set a moderate max chunk size, and it'll work the way the chunked case does now, so they can begin processing earlier chunks while later chunks are still in transit.
- We improve slightly on the existing chunked case flexibility and performance: at least the first chunk of a receive can be non-blocking, and each subsequent chunk's receive can be started (non-blocking internally even if we're using the blocking user API) before the meat of the prior chunk gets unpacked, in cases where a probe for it succeeds.
- We get all the benefits of the non-chunked case: User code can set a 2e9 max chunk size and it'll work the way the non-chunked case does now, except for one extra buffer entry and one extra if test. There still won't be the latency of waiting for a separate total_size message, and we'll still be able to allocate the incoming buffer based on a probed Status.
- We improve slightly on the existing non-chunked case robustness: If not everything fits in 2e9 entries, and we set a 2e9 max chunk size and try to do a non-blocking receive, then in some cases we end up stupidly twiddling our thumbs while waiting for that next chunk, which sucks ... but right now what we do in that case is timpi_error_msg("Non-blocking packed range sends cannot exceed " << std::numeric_limits<int>::max() << "in size"); so we're still moving in the right direction.
- We improve a lot on the flexibility of performance testing: We can experiment with using smaller max chunk sizes to see if we can get some more communication/computation overlap, and this can even be benchmarked without recompiling because the difference between the chunked and the unchunked API is now just the value of an integer variable, not two different methods with different argument lists.
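As a small illustration of why a single integer is enough to switch between the two behaviours, the sketch below counts how many messages a payload would need for a given max chunk size. `chunks_needed` is a hypothetical helper, and it assumes each chunk spends one entry on the "more chunks expected" flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many messages are needed to send
// `n_entries` payload entries when each message holds at most
// `max_chunk_size` entries, one of which is the continuation flag.
std::uint64_t chunks_needed(std::uint64_t n_entries,
                            std::uint64_t max_chunk_size)
{
  assert(max_chunk_size >= 2); // flag plus at least one payload entry
  const std::uint64_t payload_per_chunk = max_chunk_size - 1;

  // An empty range still sends one chunk, carrying only the flag.
  if (n_entries == 0)
    return 1;

  // Ceiling division.
  return (n_entries + payload_per_chunk - 1) / payload_per_chunk;
}
```

With a 2e9 max chunk size, anything that fits comes out as a single message, which is the non-chunked case; dropping the same variable to something moderate gives the chunked case, so a benchmark can sweep it at run time.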
Publishing because I don't have much time to play with this right now and I don't want to forget about it when that changes, but also because I'd love to know if @friedmud can poke any holes in this idea.