core/api: Introduce fi_rpc #10965
Conversation
Force-pushed from 2e9e484 to 6917074.
Need to examine RPC calls not having an fi_addr_t as input.
Force-pushed from f01c97d to 37e640a.
Force-pushed from 7e37326 to 0d6377a.
include/rdma/fi_eq.h (outdated):

```c
union {
	uint64_t data;
	int timeout;
};
```
This change effectively blocks RPC from using CQ data. That may be undesirable. It forces RPC requests to match against a posted receive buffer, even if the request could be conveyed using only CQ data.
Since RPC requests arrive as untagged messages, they should have the same abilities as untagged messages, including carrying CQ data and supporting multi-recv. It's reasonable to replace 'tag' with 'rpc_id', since RPC requests go through the untagged queues, but timeout should extend the structure rather than overlay the data field. Timeout may also be useful as a completion field in some future change.
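For concreteness, the layout being suggested might look like this — a hypothetical sketch modeled on the existing struct fi_cq_tagged_entry, not the structure this PR defines:

```c
#include <rdma/fabric.h>

/* Hypothetical layout following the suggestion above: keep 'data' for
 * CQ data, rename 'tag' to 'rpc_id', and append 'timeout' as a new
 * field instead of overlaying it on 'data'. Field order mirrors
 * struct fi_cq_tagged_entry; this is not the structure in the PR. */
struct fi_cq_rpc_entry {
	void		*op_context;	/* operation context */
	uint64_t	flags;		/* completion flags, e.g. FI_RPC */
	size_t		len;		/* size of received data */
	void		*buf;		/* receive data buffer */
	uint64_t	data;		/* CQ data, kept rather than reused */
	uint64_t	rpc_id;		/* replaces 'tag' for RPC requests */
	int		timeout;	/* appended, extending the structure */
};
```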
I would expect an RPC request to need more than 64 bits to describe the details of the request, so it is reasonable to require an extra message body for the request. And, unlike the msg/tagged send calls, we don't have a 'data' variant for rpc, so why not go further and remove data from the msg variant as well?
Using the data field for the timeout value greatly simplifies the process of adding RPC support to existing providers.
There's no data variant because it was excluded. I'm questioning that decision. Tagged and untagged messages (and even RMA) support CQ data. Why exclude it from RPC? We don't know how apps might use this API; an RPC request with a command but no input data could easily use just CQ data. Or the input data might be transferred using some other mechanism, since RPC requires landing the request in a posted receive buffer.
The API is fixed nearly forever. It's more important to get it right.
```c
ssize_t (*respmsg)(struct fid_ep *ep, const struct fi_msg_rpc_resp *msg,
		   uint64_t flags);
ssize_t (*discard)(struct fid_ep *ep, uint64_t rpc_id);
};
```
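For illustration, here is a hedged sketch of how a server might drive these two ops. The fi_rpc_discard()/fi_rpc_respmsg() wrapper names and the fi_msg_rpc_resp fields are assumptions extrapolated from the snippet above, not the PR's definitions:

```c
#include <stdbool.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical server-side use of the respmsg/discard ops (sketch only). */
static void serve_or_discard(struct fid_ep *ep, uint64_t rpc_id,
			     void *resp_buf, size_t resp_len, bool ok)
{
	if (!ok) {
		/* Drop the request so the provider can release any
		 * rpc_id tracking state it set up on arrival. */
		fi_rpc_discard(ep, rpc_id);	/* assumed wrapper name */
		return;
	}

	struct iovec iov = { .iov_base = resp_buf, .iov_len = resp_len };
	struct fi_msg_rpc_resp resp = {
		.msg_iov	= &iov,		/* assumed field names */
		.iov_count	= 1,
		.rpc_id		= rpc_id,	/* pairs response with request */
	};
	fi_rpc_respmsg(ep, &resp, 0);		/* assumed wrapper over respmsg */
}
```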
It's worth noting that RPC may be implemented by sending the request through untagged messages but the response using tag matching. Today, untagged and tagged queues are not associated with each other; that may not be the case here. For example, the untagged request may arrive, but its ACK may be lost. The RPC response could then be generated, which would complete the request, even though the request was never ACKed. The result is that the provider must be ready to handle the case where a "failed" send actually completes successfully.
This is, in part, a result of having the request target the untagged message queue, i.e., the request lands in a receive buffer posted using fi_recv(). I don't know that it would help if RPCs had their own posted receive queue.
Do we care about the ACK for receiving the request? The initiator only needs to know if the result comes back. A more specific question: how many completions should an RPC request generate? One for the send and one for the result, or simply one for the result?
The operation is reliable, so the provider will care about getting an ACK so that it can retry sending the request. The response might be usable as an ACK if it is generated quickly enough, but the response comes from the application, not the network, so it's unlikely a response will ever act as an ACK.
The problem is that the request-ACK is generated between the 'message' queues (fi_msg APIs). Even though the 'send' is invoked through the RPC API, it targets the 'recv' message API. The response flows over some other queue (with ACKs on that queue). That other queue isn't identified, but we know it's something other than the 'message' queue, which isn't capable of matching the response with the request.
I would have a single completion for the RPC. The provider just needs to deal with the possibility of a send failing while the request succeeds, or the response arriving before the send completes.
It's odd having RPC requests and responses target different message queues, which are visible to the application. A provider could restrict an endpoint to only supporting RPC, but the API allows mixing RPC with untagged messages, which is somewhat confusing and unlike the other APIs. I don't know if documenting that there's a separate virtual RPC queue is needed, though.
Implement RPC functionality using ofi_op_msg and ofi_op_tagged. The FI_RPC flag is used to distinguish RPC operations from regular msg and tagged operations.

Basic flow:

Client RPC REQ
  * Get a unique RPC tag
  * Post tagged recv w/ FI_RPC, w/ RPC tag as tag
  * Post msg send w/ FI_RPC, w/ (RPC tag, timeout) passed as (tag, data)

Server
  Get the RPC REQ
  * Post msg recv, could use FI_MULTI_RECV flag
  * Get completion w/ FI_RPC. Retrieve (RPC id, timeout) from CQE
  RPC RESP
  * Post tagged send w/ FI_RPC, w/ RPC id as tag

Client
  Get the RPC RESP
  * Get completion for tagged recv. RPC done.

Stale response: If an RPC response can't find a matching entry, it is dropped.

Limitations: Currently the tagged flow is not separated from the RPC flow, so it is advised not to mix the usage of the two interfaces. Future improvements could include using a separate queue for RPC response buffers. The timeout value is not used.

Signed-off-by: Jianxin Xiong <[email protected]>
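To make the client side of this emulation concrete, here is a sketch expressed through the existing tagged/msg calls. fi_trecvmsg() and fi_sendmsg() are real libfabric calls and FI_RPC is the flag this PR adds; next_rpc_id() is a hypothetical helper, and since struct fi_msg has no tag field, a real implementation would carry the rpc_id in the provider-internal op header, as the commit message describes:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

extern uint64_t next_rpc_id(void);	/* hypothetical unique-ID helper */

/* Client side of the emulated flow above (sketch under the stated
 * assumptions, not the provider's actual code). */
static ssize_t rpc_emulated_request(struct fid_ep *ep,
				    const void *req, size_t req_len,
				    void *resp, size_t resp_len,
				    fi_addr_t dest, int timeout, void *ctx)
{
	uint64_t rpc_id = next_rpc_id();
	ssize_t ret;

	/* 1. Pre-post the response buffer as a tagged recv, exact match on
	 *    rpc_id, marked FI_RPC so stale responses can be dropped. */
	struct iovec riov = { .iov_base = resp, .iov_len = resp_len };
	struct fi_msg_tagged trecv = {
		.msg_iov = &riov, .iov_count = 1, .addr = dest,
		.tag = rpc_id, .ignore = 0, .context = ctx,
	};
	ret = fi_trecvmsg(ep, &trecv, FI_RPC);
	if (ret)
		return ret;

	/* 2. Send the request as an untagged msg marked FI_RPC. Per the
	 *    commit message, (rpc_id, timeout) travel as (tag, data);
	 *    fi_msg has no tag field, so the rpc_id would ride in the
	 *    provider-internal header below this API layer. */
	struct iovec siov = { .iov_base = (void *) req, .iov_len = req_len };
	struct fi_msg smsg = {
		.msg_iov = &siov, .iov_count = 1, .addr = dest,
		.context = ctx, .data = (uint64_t) timeout,
	};
	return fi_sendmsg(ep, &smsg, FI_RPC);
}
```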
@shefty Updated with the
Thanks - as for supporting CQ data, I'm actually okay only supporting that through the RPC 'msg' APIs. I know this differs from the other API groups, where we have explicit 'data' APIs. Honestly, if I were to redo those APIs, I'd keep writedata but drop senddata. CQ data is useful for writes, but less so for sends.

I'm still concerned about mixing RPC requests with untagged messages. The problem is that the target provider must be able to 'intercept' the untagged receive, identify it as an RPC, set up the RPC ID tracking, and report the completion properly. It seems cleaner if RPC requests landed in some RPC message queue, virtually separated from untagged messages.

For tcp or rxm, I think these would be handled by adding another receive list to the software shared receive queue, to go along with the untagged and tagged lists. (It may be possible to carve out a tag ID for this, if additional restrictions were in place to match on that ID first prior to checking wildcard matching. But that seems like it would result in confusing code.)

Trying to add RPCs to verbs (over MSG endpoints, if that made sense) would likely result in the endpoint supporting either FI_RPC or FI_MSG, but not both. FI_RPC needs additional protocol. If you wanted to avoid modifying the SRX for tcp or rxm, you could follow a similar approach and not support both FI_RPC and FI_MSG on the same endpoint.
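To illustrate the SRX idea, a rough sketch of the extra receive list. The names here are illustrative, not tcp/rxm's actual structures; dlist_entry comes from libfabric's internal ofi_list.h:

```c
#include <ofi_list.h>	/* libfabric internal header providing dlist_entry */

/* Hypothetical shape of a software shared receive context extended per
 * the suggestion above: a third list alongside untagged and tagged. */
struct srx_sketch {
	struct dlist_entry	msg_queue;	/* untagged receive buffers */
	struct dlist_entry	tag_queue;	/* tagged receive buffers */
	struct dlist_entry	rpc_queue;	/* new: RPC request buffers */
};
```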
As an idea, maybe we could introduce a mode bit for RPC indicating that FI_RPC and FI_MSG share the same receive buffers. Such a mode bit could be introduced later, or defined now, so that apps could code for that possibility. |
@shefty Thanks for the feedback. Would you prefer that I remove the 'data' API calls? I can go either way. For API cleanliness, it might be better to have separate fi_rpc_recv/recvv/recvmsg calls that work almost the same way as the FI_MSG calls but use a different rx queue.
I can go either way on the data calls as well. I was looking at the size of the struct, which was getting large and would get bigger if we added recv calls as well. But there's something to be said for consistency... I agree with adding recv calls to fi_rpc and documenting that requests land in those posted receive buffers. The rpc recv calls should have the same behavior/flags as the fi_msg recv calls (I think).
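If those recv calls are added, mirroring the fi_msg recv family would give prototypes along these lines. The names and exact parameters are assumptions, not part of the PR as posted:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical prototypes mirroring fi_recv/fi_recvv/fi_recvmsg. */
ssize_t fi_rpc_recv(struct fid_ep *ep, void *buf, size_t len, void *desc,
		    fi_addr_t src_addr, void *context);
ssize_t fi_rpc_recvv(struct fid_ep *ep, const struct iovec *iov, void **desc,
		     size_t count, fi_addr_t src_addr, void *context);
ssize_t fi_rpc_recvmsg(struct fid_ep *ep, const struct fi_msg *msg,
		       uint64_t flags);
```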
An RPC transaction consists of the client sending a request to the server, and the server sending back the response into the response buffer supplied by the client when the request was sent.
RPC operations can be built on top of existing point-to-point data transfer operations, including base messaging, tagged messaging, and RMA. One of the implementations uses base messaging for sending the request and tagged messaging with exact match for sending the response. One implication of such an implementation is that response messages should always arrive as expected messages; in other words, if a response cannot find a match, it must be a stale message and can be safely dropped. However, providers currently don't have a good way to obtain this information. As a result, those stale messages have to be kept indefinitely, hogging resources and negatively impacting performance.
Having a dedicated RPC API allows the provider to understand the semantics and perform optimizations when appropriate.
The RPC API consists of a set of calls to send RPC requests, a set of calls to send RPC responses, a call to discard RPC requests, and a new CQ format to deliver RPC request metadata via completion entries. The content of the request and response is defined by the user.
Workflow of an RPC transaction:

Client:
    prepare request buffer
    fi_rpc
Server:
    fi_recv
    get completion (for fi_recv)
    check cqe.flags & FI_RPC
    get rpc_id and timeout from cqe
    process the received buffer ==> prepare response
    fi_rpc_resp
    get completion (for fi_rpc_resp)
Client:
    get completion (for fi_rpc)
    process the received response
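As a minimal sketch of the server side of this workflow: the fi_cq_rpc_entry fields and the fi_rpc_resp() signature below are assumptions modeled on the existing msg calls; only FI_RPC and the step ordering come from the description above:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Server side of the workflow above (hedged sketch; see lead-in). */
static int serve_one_rpc(struct fid_ep *ep, struct fid_cq *cq,
			 void *req_buf, size_t req_len,
			 void *resp_buf, size_t resp_len)
{
	struct fi_cq_rpc_entry cqe;	/* assumed RPC CQ format */
	ssize_t ret;

	ret = fi_recv(ep, req_buf, req_len, NULL, FI_ADDR_UNSPEC, NULL);
	if (ret)
		return (int) ret;

	do {
		ret = fi_cq_read(cq, &cqe, 1);	/* poll for the request */
	} while (ret == -FI_EAGAIN);
	if (ret < 0)
		return (int) ret;

	if (!(cqe.flags & FI_RPC))
		return -FI_EOTHER;	/* not an RPC request */

	/* cqe carries rpc_id and timeout; process cqe.buf/cqe.len into
	 * resp_buf (application-defined), then send the response. */
	return (int) fi_rpc_resp(ep, resp_buf, resp_len, NULL,
				 cqe.rpc_id, NULL);	/* assumed signature */
}
```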
A new primary capability FI_RPC is defined. It is also a flag used in completion entries indicating that the completed receive contains an RPC request.