core/api: Introduce fi_rpc #10965
Conversation
Force-pushed from 2e9e484 to 6917074.
Need to examine RPC calls not having an fi_addr_t as input.
Force-pushed from f01c97d to 37e640a.
Force-pushed from 7e37326 to 0d6377a.
include/rdma/fi_eq.h (outdated):

```c
union {
	uint64_t data;
	int timeout;
};
```
This change effectively blocks RPC from using CQ data. That may be undesirable. It forces RPC requests to match against a posted receive buffer, even if the request could be conveyed using only CQ data.
Since RPC requests arrive as untagged messages, they should have the same abilities as untagged messages, including carrying CQ data and supporting multi-recv. It's reasonable to replace 'tag' with 'rpc_id', since RPC requests go through the untagged queues, but timeout should extend the structure rather than overlay the data field. Timeout may also be useful as a completion field in some future change.
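For concreteness, the layout being suggested might look like this — a hypothetical sketch modeled on the existing struct fi_cq_tagged_entry, not the structure this PR defines:

```c
#include <rdma/fabric.h>

/* Hypothetical layout following the suggestion above: keep 'data' for
 * CQ data, rename 'tag' to 'rpc_id', and append 'timeout' as a new
 * field instead of overlaying it on 'data'. Field order mirrors
 * struct fi_cq_tagged_entry; this is not the structure in the PR. */
struct fi_cq_rpc_entry {
	void		*op_context;	/* operation context */
	uint64_t	flags;		/* completion flags, e.g. FI_RPC */
	size_t		len;		/* size of received data */
	void		*buf;		/* receive data buffer */
	uint64_t	data;		/* CQ data, kept rather than reused */
	uint64_t	rpc_id;		/* replaces 'tag' for RPC requests */
	int		timeout;	/* appended, extending the structure */
};
```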
I would expect an RPC request to need more than 64 bits to describe the details of the request, so it is reasonable to require an extra message body for the request. And, unlike the msg/tagged send calls, we don't have a 'data' variant for rpc, so why not go further and remove data from the msg variant as well?
Using the data field for the timeout value greatly simplifies the process of adding RPC support to existing providers.
There's no data variant because it was excluded. I'm questioning that decision. Tagged and untagged messages (and even RMA) support CQ data. Why exclude it from RPC? We don't know how apps might use this API; an RPC request with a command but no input data could easily use just CQ data. Or the input data might be transferred using some other mechanism, since RPC requires landing the request in a posted receive buffer.
The API is fixed nearly forever. It's more important to get it right.
```c
ssize_t (*respmsg)(struct fid_ep *ep, const struct fi_msg_rpc_resp *msg,
		   uint64_t flags);
ssize_t (*discard)(struct fid_ep *ep, uint64_t rpc_id);
};
```
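For illustration, here is a hedged sketch of how a server might drive these two ops. The fi_rpc_discard()/fi_rpc_respmsg() wrapper names and the fi_msg_rpc_resp fields are assumptions extrapolated from the snippet above, not the PR's definitions:

```c
#include <stdbool.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical server-side use of the respmsg/discard ops (sketch only). */
static void serve_or_discard(struct fid_ep *ep, uint64_t rpc_id,
			     void *resp_buf, size_t resp_len, bool ok)
{
	if (!ok) {
		/* Drop the request so the provider can release any
		 * rpc_id tracking state it set up on arrival. */
		fi_rpc_discard(ep, rpc_id);	/* assumed wrapper name */
		return;
	}

	struct iovec iov = { .iov_base = resp_buf, .iov_len = resp_len };
	struct fi_msg_rpc_resp resp = {
		.msg_iov	= &iov,		/* assumed field names */
		.iov_count	= 1,
		.rpc_id		= rpc_id,	/* pairs response with request */
	};
	fi_rpc_respmsg(ep, &resp, 0);		/* assumed wrapper over respmsg */
}
```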
It's worth noting that RPC may be implemented by sending the request through untagged messages but the response using tag matching. Today, untagged and tagged queues are not associated with each other; that may not be the case here. For example, the untagged request may arrive, but its ACK may be lost. The RPC response could then be generated, which would complete the request, even though the request was never ACKed. The result is that the provider must be ready to handle the case where a "failed" send actually completes successfully.
This is, in part, a result of having the request target the untagged message queue, i.e., the request lands in a receive buffer posted using fi_recv(). I don't know that it would help if RPCs had their own posted receive queue.
Do we care about the ACK for receiving the request? The initiator only needs to know if the result comes back. A more specific question: how many completions should an RPC request generate? One for the send and one for the result, or simply one for the result?
The operation is reliable, so the provider will care about getting an ACK so that it can retry sending the request. The response might be usable as an ACK if it is generated quickly enough, but the response comes from the application, not the network, so it's unlikely a response will ever act as an ACK.
The problem is that the request-ACK is generated between the 'message' queues (fi_msg APIs). Even though the 'send' is invoked through the RPC API, it targets the 'recv' message API. The response flows over some other queue (with ACKs on that queue). That other queue isn't identified, but we know it's something other than the 'message' queue, which isn't capable of matching the response with the request.
I would have a single completion for the RPC. The provider just needs to deal with the possibility of a send failing while the request succeeds, or the response arriving before the send completes.
It's odd having RPC requests and responses target different message queues, which are visible to the application. A provider could restrict an endpoint to only supporting RPC, but the API allows mixing RPC with untagged messages, which is somewhat confusing and unlike the other APIs. I don't know if documenting that there's a separate virtual RPC queue is needed, though.
Implement RPC functionality using ofi_op_msg and ofi_op_tagged. The FI_RPC flag is used to distinguish RPC operations from regular msg and tagged operations.

Basic flow:

Client RPC REQ
  * Get a unique RPC tag
  * Post tagged recv w/ FI_RPC, w/ RPC tag as tag
  * Post msg send w/ FI_RPC, w/ (RPC tag, timeout) passed as (tag, data)

Server
  Get the RPC REQ
  * Post msg recv, could use FI_MULTI_RECV flag
  * Get completion w/ FI_RPC. Retrieve (RPC id, timeout) from CQE
  RPC RESP
  * Post tagged send w/ FI_RPC, w/ RPC id as tag

Client
  Get the RPC RESP
  * Get completion for tagged recv. RPC done.

Stale response: If an RPC response can't find a matching entry, it is dropped.

Limitations: Currently the tagged flow is not separated from the RPC flow, so it is advised not to mix the usage of the two interfaces. Future improvements could include using a separate queue for RPC response buffers. The timeout value is not used.

Signed-off-by: Jianxin Xiong <[email protected]>
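To make the client side of this emulation concrete, here is a sketch expressed through the existing tagged/msg calls. fi_trecvmsg() and fi_sendmsg() are real libfabric calls and FI_RPC is the flag this PR adds; next_rpc_id() is a hypothetical helper, and since struct fi_msg has no tag field, a real implementation would carry the rpc_id in the provider-internal op header, as the commit message describes:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

extern uint64_t next_rpc_id(void);	/* hypothetical unique-ID helper */

/* Client side of the emulated flow above (sketch under the stated
 * assumptions, not the provider's actual code). */
static ssize_t rpc_emulated_request(struct fid_ep *ep,
				    const void *req, size_t req_len,
				    void *resp, size_t resp_len,
				    fi_addr_t dest, int timeout, void *ctx)
{
	uint64_t rpc_id = next_rpc_id();
	ssize_t ret;

	/* 1. Pre-post the response buffer as a tagged recv, exact match on
	 *    rpc_id, marked FI_RPC so stale responses can be dropped. */
	struct iovec riov = { .iov_base = resp, .iov_len = resp_len };
	struct fi_msg_tagged trecv = {
		.msg_iov = &riov, .iov_count = 1, .addr = dest,
		.tag = rpc_id, .ignore = 0, .context = ctx,
	};
	ret = fi_trecvmsg(ep, &trecv, FI_RPC);
	if (ret)
		return ret;

	/* 2. Send the request as an untagged msg marked FI_RPC. Per the
	 *    commit message, (rpc_id, timeout) travel as (tag, data);
	 *    fi_msg has no tag field, so the rpc_id would ride in the
	 *    provider-internal header below this API layer. */
	struct iovec siov = { .iov_base = (void *) req, .iov_len = req_len };
	struct fi_msg smsg = {
		.msg_iov = &siov, .iov_count = 1, .addr = dest,
		.context = ctx, .data = (uint64_t) timeout,
	};
	return fi_sendmsg(ep, &smsg, FI_RPC);
}
```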
@shefty Updated with the
Thanks - as for supporting CQ data, I'm actually okay only supporting that through the RPC 'msg' APIs. I know this differs from the other API groups, where we have explicit 'data' APIs. Honestly, if I were to redo those APIs, I'd keep writedata but drop senddata. CQ data is useful for writes, but less so for sends.

I'm still concerned about mixing RPC requests with untagged messages. The problem is that the target provider must be able to 'intercept' the untagged receive, identify it as an RPC, set up the RPC ID tracking, and report the completion properly. It seems cleaner if RPC requests landed in some RPC message queue, virtually separated from untagged messages.

For tcp or rxm, I think these would be handled by adding another receive list to the software shared receive queue, to go along with the untagged and tagged lists. (It may be possible to carve out a tag ID for this, if additional restrictions were in place to match on that ID first prior to checking wildcard matching. But that seems like it would result in confusing code.)

Trying to add RPCs to verbs (over MSG endpoints, if that made sense) would likely result in the endpoint supporting either FI_RPC or FI_MSG, but not both. FI_RPC needs additional protocol. If you wanted to avoid modifying the SRX for tcp or rxm, you could follow a similar approach and not support both FI_RPC and FI_MSG on the same endpoint.
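To illustrate the SRX idea, a rough sketch of the extra receive list. The names here are illustrative, not tcp/rxm's actual structures; dlist_entry comes from libfabric's internal ofi_list.h:

```c
#include <ofi_list.h>	/* libfabric internal header providing dlist_entry */

/* Hypothetical shape of a software shared receive context extended per
 * the suggestion above: a third list alongside untagged and tagged. */
struct srx_sketch {
	struct dlist_entry	msg_queue;	/* untagged receive buffers */
	struct dlist_entry	tag_queue;	/* tagged receive buffers */
	struct dlist_entry	rpc_queue;	/* new: RPC request buffers */
};
```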
As an idea, maybe we could introduce a mode bit for RPC indicating that FI_RPC and FI_MSG share the same receive buffers. Such a mode bit could be introduced later, or defined now, so that apps could code for that possibility. |
@shefty Thanks for the feedback. Would you prefer that I remove the 'data' API calls? I can go either way. For API cleanliness, it might be better to have separate fi_rpc_recv/recvv/recvmsg calls that work almost the same way as the FI_MSG calls but use a different rx queue.
I can go either way on the data calls as well. I was looking at the size of the struct, which was getting large and would get bigger if we added recv calls as well. But there's something to be said for consistency... I agree with adding recv calls to fi_rpc and documenting that requests land in those posted receive buffers. The rpc recv calls should have the same behavior/flags as the fi_msg recv calls (I think).
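If those recv calls are added, mirroring the fi_msg recv family would give prototypes along these lines. The names and exact parameters are assumptions, not part of the PR as posted:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical prototypes mirroring fi_recv/fi_recvv/fi_recvmsg. */
ssize_t fi_rpc_recv(struct fid_ep *ep, void *buf, size_t len, void *desc,
		    fi_addr_t src_addr, void *context);
ssize_t fi_rpc_recvv(struct fid_ep *ep, const struct iovec *iov, void **desc,
		     size_t count, fi_addr_t src_addr, void *context);
ssize_t fi_rpc_recvmsg(struct fid_ep *ep, const struct fi_msg *msg,
		       uint64_t flags);
```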
An RPC transaction consists of the client sending a request to the server, and the server sending back the response into the response buffer supplied by the client when the request was sent.
RPC operations can be built on top of existing point-to-point data transfer operations, including base messaging, tagged messaging, and RMA. One of the implementations uses base messaging for sending the request and tagged messaging with exact match for sending the response. One implication of such an implementation is that response messages should always arrive as expected messages; in other words, if a response cannot find a match, it must be a stale message and can be safely dropped. However, providers currently don't have a good way to obtain this information. As a result, those stale messages have to be kept indefinitely, hogging resources and negatively impacting performance.
Having a dedicated RPC API allows the provider to understand the semantics and perform optimizations when appropriate.
The RPC API consists of a set of calls to send RPC requests, a set of calls to send RPC responses, a call to discard RPC requests, and a new CQ format to deliver RPC request metadata via completion entries. The content of the request and response is defined by the user.
Workflow of an RPC transaction:

Client:
    prepare request buffer
    fi_rpc
Server:
    fi_recv
    get completion (for fi_recv)
    check cqe.flags & FI_RPC
    get rpc_id and timeout from cqe
    process the received buffer ==> prepare response
    fi_rpc_resp
    get completion (for fi_rpc_resp)
Client:
    get completion (for fi_rpc)
    process the received response
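As a minimal sketch of the server side of this workflow: the fi_cq_rpc_entry fields and the fi_rpc_resp() signature below are assumptions modeled on the existing msg calls; only FI_RPC and the step ordering come from the description above:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Server side of the workflow above (hedged sketch; see lead-in). */
static int serve_one_rpc(struct fid_ep *ep, struct fid_cq *cq,
			 void *req_buf, size_t req_len,
			 void *resp_buf, size_t resp_len)
{
	struct fi_cq_rpc_entry cqe;	/* assumed RPC CQ format */
	ssize_t ret;

	ret = fi_recv(ep, req_buf, req_len, NULL, FI_ADDR_UNSPEC, NULL);
	if (ret)
		return (int) ret;

	do {
		ret = fi_cq_read(cq, &cqe, 1);	/* poll for the request */
	} while (ret == -FI_EAGAIN);
	if (ret < 0)
		return (int) ret;

	if (!(cqe.flags & FI_RPC))
		return -FI_EOTHER;	/* not an RPC request */

	/* cqe carries rpc_id and timeout; process cqe.buf/cqe.len into
	 * resp_buf (application-defined), then send the response. */
	return (int) fi_rpc_resp(ep, resp_buf, resp_len, NULL,
				 cqe.rpc_id, NULL);	/* assumed signature */
}
```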
A new primary capability FI_RPC is defined. It is also a flag used in completion entries indicating that the completed receive contains an RPC request.