UCP/RMA: Add GET and PUT rendezvous protocols #11482

Open
tvegas1 wants to merge 32 commits into openucx:master from tvegas1:rma_rndv

@tvegas1 tvegas1 commented May 22, 2026

Contributor

What?

Add rendezvous-based protocols for PUT and GET:

  • put/rndv as an RMA_RTS wrapper over regular RNDV receive
  • get/rndv as a push-only RNDV wrapper using spontaneous RTR_REQ.

Why?

Reuse existing rendezvous flows for RMA transfers to support cases where native RMA cannot directly access either the source or destination.

How?

PUT using put/rndv:

  • RNDV_SEND creation on origin, RMA_RTS tx, RNDV_RECV creation on target

GET uses a rndv push flow only, using get/rndv:

  • RNDV_RECV creation on origin, RTR_REQ tx, RNDV_SEND creation on target, data push and relevant AMs.

Common:

  • RNDV fragmentation, mtype staging, pipeline flow, rkey packing, ATP/ATS handling, request cleanup

Config, for instance:

  • UCX_MAX_RNDV_RAILS applies through the reused RNDV data path; UCX_RNDV_SCHEME is functional for put/rndv.
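
The two flows listed under "How?" can be written down as ordered step tables. The sketch below only restates the wording of the description and is not UCX code: `demo_step_t`, `demo_op_t`, and `demo_flow()` are hypothetical names, and the data push with its related AMs is collapsed into a single step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step labels, named after the wording of the PR description. */
typedef enum {
    DEMO_ORIGIN_CREATE_RNDV_SEND,
    DEMO_ORIGIN_SEND_RMA_RTS,
    DEMO_TARGET_CREATE_RNDV_RECV,
    DEMO_ORIGIN_CREATE_RNDV_RECV,
    DEMO_ORIGIN_SEND_RTR_REQ,
    DEMO_TARGET_CREATE_RNDV_SEND,
    DEMO_TARGET_PUSH_DATA
} demo_step_t;

typedef enum { DEMO_OP_PUT, DEMO_OP_GET } demo_op_t;

/* put/rndv: the origin acts as the rendezvous sender. */
static const demo_step_t demo_put_rndv_steps[] = {
    DEMO_ORIGIN_CREATE_RNDV_SEND,
    DEMO_ORIGIN_SEND_RMA_RTS,
    DEMO_TARGET_CREATE_RNDV_RECV
};

/* get/rndv: the origin acts as the receiver and the target pushes the data. */
static const demo_step_t demo_get_rndv_steps[] = {
    DEMO_ORIGIN_CREATE_RNDV_RECV,
    DEMO_ORIGIN_SEND_RTR_REQ,
    DEMO_TARGET_CREATE_RNDV_SEND,
    DEMO_TARGET_PUSH_DATA
};

/* Return the step table for an operation and store its length in *count. */
static const demo_step_t *demo_flow(demo_op_t op, size_t *count)
{
    assert(count != NULL);
    if (op == DEMO_OP_PUT) {
        *count = sizeof(demo_put_rndv_steps) / sizeof(demo_put_rndv_steps[0]);
        return demo_put_rndv_steps;
    }

    *count = sizeof(demo_get_rndv_steps) / sizeof(demo_get_rndv_steps[0]);
    return demo_get_rndv_steps;
}
```

In both tables the side that owns the source data ends up holding the RNDV_SEND request: the origin for PUT, the target for GET.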

Comment thread src/ucp/rma/rma_rndv.c Outdated
Comment thread src/ucp/rma/rma_rndv.c
Comment thread src/ucp/rma/rma_rndv.c

tvegas1 commented Jun 8, 2026

Contributor Author

CI fix is #11533

Comment thread src/ucp/core/ucp_request.h Outdated
Comment on lines +268 to +270
/* Remote buffer memory info for RTR_REQ */
ucp_memory_info_t remote_mem_info;

Contributor

do we really need it here, or can we create a separate struct in the union below?

Contributor Author

removed as we should be able to use rkey for that, even with fragment requests.
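
For context, the reviewer's alternative (keeping protocol-specific state in its own struct inside a union instead of among the fields shared by every request type) could look roughly like the sketch below. The author ended up removing the field, so this only illustrates the suggestion. It is not the actual `ucp_request_t` layout: `demo_request_t`, `demo_mem_info_t`, and all field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory-info descriptor. */
typedef struct {
    uint8_t type;
    uint8_t sys_dev;
} demo_mem_info_t;

typedef struct {
    uint32_t flags; /* shared by every request type */

    union {
        /* State used only by a plain PUT. */
        struct {
            uint64_t remote_addr;
        } put;

        /* State used only while an RTR_REQ is in flight. */
        struct {
            uint64_t        remote_addr;
            demo_mem_info_t remote_mem_info;
        } rtr_req;
    } u;
} demo_request_t;

/* Initialize the RTR_REQ view of the request. */
static void demo_request_init_rtr_req(demo_request_t *req, uint64_t remote_addr,
                                      uint8_t mem_type)
{
    assert(req != NULL);
    req->flags                          = 0;
    req->u.rtr_req.remote_addr          = remote_addr;
    req->u.rtr_req.remote_mem_info.type = mem_type;
    req->u.rtr_req.remote_mem_info.sys_dev = 0;
}
```

Because the members overlap, the extra field costs space only when the RTR_REQ struct is the largest member of the union, rather than growing every request unconditionally.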

* Re-check the key because recursive lookup may have initialized this
* exact selection already.
*/
khiter = kh_get(ucp_proto_select_hash, proto_select->hash, key.u64);

Contributor

why is it needed?

Contributor Author

added comments above.
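
The re-check discussed in this thread follows a general pattern: when initializing an entry can recursively trigger a lookup of the same key, look the key up again before inserting. The sketch below shows that pattern on a toy linear table. It does not use khash or any UCX type; `demo_select_t`, `demo_lookup()`, and the `depth` parameter that simulates the recursion are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES 8

typedef struct {
    uint64_t key;
    int      value;
} demo_entry_t;

typedef struct {
    demo_entry_t entries[DEMO_MAX_ENTRIES];
    size_t       count;
} demo_select_t;

/* Linear search; returns NULL when the key is absent. */
static demo_entry_t *demo_find(demo_select_t *sel, uint64_t key)
{
    size_t i;

    for (i = 0; i < sel->count; ++i) {
        if (sel->entries[i].key == key) {
            return &sel->entries[i];
        }
    }
    return NULL;
}

static demo_entry_t *demo_insert(demo_select_t *sel, uint64_t key, int value)
{
    assert(sel->count < DEMO_MAX_ENTRIES);
    sel->entries[sel->count].key   = key;
    sel->entries[sel->count].value = value;
    return &sel->entries[sel->count++];
}

/* Look up a key, initializing it on a miss. depth > 0 simulates an
 * initialization step that recursively looks up the same key. */
static demo_entry_t *demo_lookup(demo_select_t *sel, uint64_t key, int depth)
{
    demo_entry_t *entry = demo_find(sel, key);

    if (entry != NULL) {
        return entry;
    }

    if (depth > 0) {
        (void)demo_lookup(sel, key, depth - 1); /* may insert 'key' */
    }

    /* Re-check: the recursive call may have created this exact entry. */
    entry = demo_find(sel, key);
    if (entry != NULL) {
        return entry;
    }

    return demo_insert(sel, key, (int)(key * 2));
}
```

Without the second `demo_find()`, the outer call would insert a duplicate entry for a key that the inner call had already initialized.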

* Relevant for UCP_OP_ID_RNDV_SEND and UCP_OP_ID_RNDV_RECV. */
#define UCP_PROTO_SELECT_OP_FLAG_PPLN_FRAG (UCP_PROTO_SELECT_OP_FLAGS_BASE << 1)


Contributor

minor: can avoid this change

Comment thread src/ucp/rndv/proto_rndv.c
ucp_ep_h ep;

UCP_WORKER_GET_VALID_EP_BY_ID(&ep, worker, rts->sreq.ep_id, {
ucp_datatype_iter_cleanup(&recv_req->recv.dt_iter, 1, UCP_DT_MASK_ALL);

Contributor

is it an unrelated fix?

Contributor Author

we need to clean up here, because now we have an internal recv_req (in practice it might be a no-op, as memh is NULL)

Comment thread src/ucp/rndv/proto_rndv.c
Comment on lines +829 to +830
ucp_datatype_iter_cleanup(&recv_req->recv.dt_iter, 1, UCP_DT_MASK_ALL);
ucp_proto_rndv_recv_req_complete(recv_req, UCS_ERR_NO_MEMORY);

Contributor

is it an unrelated fix?
just for my understanding

Contributor Author

AFAIU the generic issue was already there, but RMA/RNDV now creates an internal recv_req on this path, so on allocation failure we must complete it with UCS_ERR_NO_MEMORY to send an error ATS and release it.
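
The error path described here (clean up the datatype iterator, then complete the internally created receive request with an error so that it is released) can be modeled in isolation. The following is a simplified sketch, not the UCX implementation: every `demo_*` name is hypothetical, the allocator is injected only so the failure can be triggered, and the ATS message is reduced to a status recorded on the request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum {
    DEMO_OK            = 0,
    DEMO_ERR_NO_MEMORY = -1
} demo_status_t;

/* Hypothetical stand-in for an internally created receive request. */
typedef struct {
    int           dt_iter_active; /* datatype iterator still holds state */
    int           completed;      /* request was completed and released */
    demo_status_t status;         /* status that would be reported back */
} demo_recv_req_t;

static void demo_dt_iter_cleanup(demo_recv_req_t *req)
{
    req->dt_iter_active = 0;
}

static void demo_recv_req_complete(demo_recv_req_t *req, demo_status_t status)
{
    req->status    = status;
    req->completed = 1;
}

/* Handle an incoming RTS that needs one extra allocation. On failure the
 * internal request must not leak: clean it up and complete it with an error. */
static demo_status_t demo_handle_rts(demo_recv_req_t *recv_req,
                                     void *(*alloc_fn)(size_t))
{
    void *aux = alloc_fn(64);

    if (aux == NULL) {
        demo_dt_iter_cleanup(recv_req);
        demo_recv_req_complete(recv_req, DEMO_ERR_NO_MEMORY);
        return DEMO_ERR_NO_MEMORY;
    }

    free(aux);
    return DEMO_OK;
}

/* Allocator that always fails, used to exercise the error path. */
static void *demo_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Calling `demo_handle_rts(&req, demo_failing_alloc)` leaves the request completed with `DEMO_ERR_NO_MEMORY`, while passing `malloc` takes the normal path and leaves the request untouched.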

Comment thread src/ucp/rma/rma_rndv.c Outdated
ucs_status_t status;

was_initialized = req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED;
status = ucp_proto_rndv_rts_request_init(req);

Contributor

do we also need to do it once? Otherwise we may initialize the same request multiple times, while UCT will be returning NO_RESOURCES

Contributor Author

it is covered by was_initialized, and rts_request_init is protected, but moved was_initialized outside of ucp_proto_put_rndv_init to try to match the usual pattern.

Comment thread src/ucp/rma/rma_rndv.c
Comment thread src/ucp/rma/rma_rndv.c Outdated
Comment on lines +216 to +228
static void
ucp_proto_put_rndv_query(const ucp_proto_query_params_t *params,
                         ucp_proto_query_attr_t *attr)
{
    ucp_proto_rma_rndv_query(params, attr, UCP_PROTO_RNDV_DESC);
}

static void
ucp_proto_get_rndv_query(const ucp_proto_query_params_t *params,
                         ucp_proto_query_attr_t *attr)
{
    ucp_proto_rma_rndv_query(params, attr, UCP_PROTO_RNDV_DESC);
}

Contributor

can you have just one common function?

Contributor Author

fixed

Comment thread src/ucp/rma/rma_rndv.c
Comment thread src/ucp/rma/rma_rndv.c Outdated
Comment on lines +566 to +567
recv_req->recv.rndv.ep_id = rts->super.sreq.ep_id;
recv_req->recv.rndv.complete_cb = ucp_rma_rndv_put_recv_complete;

Contributor

alignment

Contributor Author

fixed


2 participants