UCP/RMA: Add GET and PUT rendezvous protocols#11482
Conversation
|
CI fix is #11533 |
| /* Remote buffer memory info for RTR_REQ */ | ||
| ucp_memory_info_t remote_mem_info; | ||
|
|
There was a problem hiding this comment.
do we really need it here or can create a separate struct in the union below?
There was a problem hiding this comment.
removed as we should be able to use rkey for that, even with fragment requests.
| * Re-check the key because recursive lookup may have initialized this | ||
| * exact selection already. | ||
| */ | ||
| khiter = kh_get(ucp_proto_select_hash, proto_select->hash, key.u64); |
There was a problem hiding this comment.
added comments above.
| * Relevant for UCP_OP_ID_RNDV_SEND and UCP_OP_ID_RNDV_RECV. */ | ||
| #define UCP_PROTO_SELECT_OP_FLAG_PPLN_FRAG (UCP_PROTO_SELECT_OP_FLAGS_BASE << 1) | ||
|
|
||
|
|
There was a problem hiding this comment.
minor: can avoid this change
| ucp_ep_h ep; | ||
|
|
||
| UCP_WORKER_GET_VALID_EP_BY_ID(&ep, worker, rts->sreq.ep_id, { | ||
| ucp_datatype_iter_cleanup(&recv_req->recv.dt_iter, 1, UCP_DT_MASK_ALL); |
There was a problem hiding this comment.
we need to cleanup here, because now we have internal recv_req (in practice it might be no op as memh is null)
| ucp_datatype_iter_cleanup(&recv_req->recv.dt_iter, 1, UCP_DT_MASK_ALL); | ||
| ucp_proto_rndv_recv_req_complete(recv_req, UCS_ERR_NO_MEMORY); |
There was a problem hiding this comment.
is it unrelated fix?
just for understanding
There was a problem hiding this comment.
afaiu the generic issue was already there, but RMA/RNDV now creates an internal recv_req on this path, so on allocation failure we must complete it with NO_MEMORY to send error ATS and release it.
| ucs_status_t status; | ||
|
|
||
| was_initialized = req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED; | ||
| status = ucp_proto_rndv_rts_request_init(req); |
There was a problem hiding this comment.
do we also need to do it once? Otherwise we may initialize the same request nultiple times, while UCT will be returning NO_RESOURCES
There was a problem hiding this comment.
it is covered by was_initialized, and rts_request_init is protected, but moved was_initialized outside of ucp_proto_put_rndv_init to try match the usual pattern.
| static void | ||
| ucp_proto_put_rndv_query(const ucp_proto_query_params_t *params, | ||
| ucp_proto_query_attr_t *attr) | ||
| { | ||
| ucp_proto_rma_rndv_query(params, attr, UCP_PROTO_RNDV_DESC); | ||
| } | ||
|
|
||
| static void | ||
| ucp_proto_get_rndv_query(const ucp_proto_query_params_t *params, | ||
| ucp_proto_query_attr_t *attr) | ||
| { | ||
| ucp_proto_rma_rndv_query(params, attr, UCP_PROTO_RNDV_DESC); | ||
| } |
There was a problem hiding this comment.
can you have just one common function?
| recv_req->recv.rndv.ep_id = rts->super.sreq.ep_id; | ||
| recv_req->recv.rndv.complete_cb = ucp_rma_rndv_put_recv_complete; |
What?
Add rendezvous-based protocols for PUT and GET:
put/rndvas anRMA_RTSwrapper over regular RNDV receiveget/rndvas a push-only RNDV wrapper using spontaneousRTR_REQ.Why?
Reuse existing rendezvous flows for RMA transfers to support cases where native RMA cannot directly access either the source or destination.
How?
PUT using
put/rndv:GET uses a rndv push flow only, using
get/rndv:Common:
Config for instance:
UCX_MAX_RNDV_RAILSapplies through the reused RNDV data path,UCX_RNDV_SCHEMEis functional forput/rndv.