Cancel tag RX entry once progress RX sees any failure. #10997
base: main
Conversation
Signed-off-by: Di Wang <[email protected]>
prov/tcp/src/xnet_progress.c
Outdated
xnet_ep_disable(ep, 0, NULL, 0);
xnet_ep_tag_cancel(ep, -ret);
All cleanup should be done in xnet_ep_disable; look at flush_all_queues.
I checked flush_all_queues(), which does not seem to clean up the tag_queue. Actually, I have a question here:
When sending an unexpected msg, if the send succeeds but the following progress_rx fails, should we leave the posted tag recv entry in the tag queue, i.e. let someone else cancel it?
Or should it actively cancel the tag entry as in this patch? @soumagne
Tag matching only applies to RDM endpoints. For tcp, the tag matching is part of the shared receive context implementation. An error on a single connection should not impact the posted tag queue. Only the active tagged message in use should be reported in error, assuming that it's already been matched.
Tag queues are cleaned up when destroying the srx.
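For reference, a minimal sketch of what "clean up the tag queues when destroying the srx" could look like. Only the slist helpers and struct xnet_xfer_entry come from the diff above; the function name and report_canceled() are assumptions for illustration, not the provider's actual code.

/* Hypothetical sketch: flush every posted tagged receive when the shared
 * receive context (srx) is torn down.  report_canceled() is a placeholder
 * for however the provider writes the FI_ECANCELED completion. */
static void flush_srx_tag_queue(struct slist *tag_queue)
{
        struct slist_entry *entry;
        struct xnet_xfer_entry *xfer_entry;

        while (!slist_empty(tag_queue)) {
                entry = slist_remove_head(tag_queue);
                xfer_entry = container_of(entry, struct xnet_xfer_entry, entry);
                report_canceled(xfer_entry, FI_ECANCELED);  /* placeholder */
        }
}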
Thanks for the clarification @shefty. How about the updated patch? The canceling is now done only for tags whose peer address matches that of the ep being disabled.
NB: if the tag recv is not canceled/reported in time, the send op will not complete in time, even though the ep has already been disabled due to the failure.
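As a sketch of the filtering described above (the src_addr field and the cancel_entry() helper are illustrative names; the actual patch may differ), only entries whose source address matches the ep being disabled are pulled off the queue:

/* Hypothetical sketch: cancel only the posted tagged receives that match the
 * peer address of the ep being disabled; keep everything else queued. */
static void cancel_tags_for_peer(struct slist *tag_queue, fi_addr_t peer_addr,
                                 int err)
{
        struct slist tmp;
        struct slist_entry *entry;
        struct xnet_xfer_entry *xfer_entry;

        slist_init(&tmp);
        while (!slist_empty(tag_queue)) {
                entry = slist_remove_head(tag_queue);
                xfer_entry = container_of(entry, struct xnet_xfer_entry, entry);
                if (xfer_entry->src_addr == peer_addr)
                        cancel_entry(xfer_entry, err);   /* report + free */
                else
                        slist_insert_tail(entry, &tmp);  /* keep */
        }
        /* Put the surviving entries back in their original order. */
        while (!slist_empty(&tmp))
                slist_insert_tail(slist_remove_head(&tmp), tag_queue);
}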
The connection can be torn down and reestablished. That's already done in the code today.
The remote peer can also re-form the connection. An RDM endpoint simply has no way of knowing what the application is going to do. Failing outstanding receive operations because of a potential hiccup in communication isn't the right answer.
The ultimate issue is that the application has selected an endpoint and operation which don't align with what it needs. That's why we need the RPC APIs, which provide enough context for the provider to make the correct decision; the proposed change here will break some apps.
The problem with the send getting stuck is something that should be handled. We would need to figure out what exactly happened there.
Thanks for the clarification, @shefty. If I understand correctly, the connection will only be re-established if there is a new send request and the ep is not in the CONNECTED state.
In our current situation, when a remote peer terminates unexpectedly, the local socket may not immediately update its state, so the associated ep can remain in the CONNECTED state, which allows send() operations to succeed, since they only copy the message into the local kernel's send buffer. However, subsequent receive operations (handled by progress_rx) will fail. This failure currently results only in the ep being disabled, rather than triggering a connection re-establishment. As a result, the original client request hangs until there are new requests.
The RPC layer (cart/mercury) does track and retry these RPCs by issuing new requests, but this retry mechanism currently depends on a timeout, which can introduce significant delays (potentially minutes).
An alternative approach would be for the lower I/O layer to signal a distinct error event to the RPC layer upon such a receive failure. This would enable the RPC layer to make a more immediate and informed decision regarding any pending receive operations (identified by tags) associated with that specific ep.
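For illustration, the consumer side of that idea could look roughly like the sketch below, built only on standard calls (fi_cq_readerr, fi_cancel); the rpc_* hooks are hypothetical placeholders for whatever cart/mercury would actually do:

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical upper-layer hooks, not libfabric or mercury API. */
extern void *rpc_lookup_reply_context(void *op_context);
extern void rpc_retry_now(void *op_context);

/* Sketch: drain CQ errors and let the RPC layer react right away instead of
 * waiting for its own timeout to expire. */
static void drain_cq_errors(struct fid_cq *cq, struct fid_ep *ep)
{
        struct fi_cq_err_entry err_entry = { 0 };

        while (fi_cq_readerr(cq, &err_entry, 0) > 0) {
                /* Cancel the tagged recv posted for the reply, if any... */
                void *reply_ctx = rpc_lookup_reply_context(err_entry.op_context);
                if (reply_ctx)
                        fi_cancel(&ep->fid, reply_ctx);
                /* ...and reissue the request immediately. */
                rpc_retry_now(err_entry.op_context);
        }
}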
Correct, the connection will reform if it's not connected. The connection can be re-connected from either side.
If the socket is broken, I would expect send() to fail as well, depending on whether the kernel is aware of the failure yet. If the kernel accepted the send, there's really no way for user space to know if the send made it or not without a SW ack (e.g. delivery complete). For all the local side knows, the send could have reached the target. It's simply a race at which point the connection breaks.
We could introduce AV-related events to report upwards. For example, FI_EUNEXP_COMM, an unexpected communication error. It would be up to the provider to determine what, if anything, constituted an unexpected communication error. For TCP, it could include a broken connection or an error trying to establish one (unsure about the latter).
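If such an event existed, the consumer side might poll for it roughly as below. FI_EUNEXP_COMM is only the name proposed above, not an existing libfabric definition, so the event code here is a made-up placeholder:

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* FI_EUNEXP_COMM does not exist in libfabric today; it is the event proposed
 * in this thread, so the value below is purely a placeholder. */
#define FI_EUNEXP_COMM  (FI_NOTIFY + 100)

/* Sketch: poll the EQ and surface unexpected communication errors so the
 * RPC layer can act on them immediately. */
static void poll_unexpected_comm(struct fid_eq *eq)
{
        struct fi_eq_entry entry;
        uint32_t event;

        if (fi_eq_read(eq, &event, &entry, sizeof(entry), 0) > 0 &&
            event == FI_EUNEXP_COMM) {
                /* entry.context could identify the peer whose connection
                 * broke; cancel and retry anything outstanding toward it. */
        }
}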
Hmm, if we introduce a new type of event (UNEXP_COMM), it would probably do something similar to this patch (fail the related operation, resend the operation, etc.). Though this would be done in the RPC layer (mercury), so other applications should be okay with it. Thanks!
Signed-off-by: Di Wang <[email protected]>
entry = slist_remove_head(&tmp);
slist_insert_head(entry, queue);
xfer_entry = container_of(entry, struct xnet_xfer_entry, entry);
doesn't look like xfer_entry is used after here
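i.e. a sketch of the same loop body with the unused cast dropped:

/* If xfer_entry is never referenced afterwards, the container_of() line can
 * simply be removed. */
entry = slist_remove_head(&tmp);
slist_insert_head(entry, queue);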