In UCX, even with flow control (FC) enabled, why does the client still continuously experience RNR errors? #10580
super-train asked this question in Q&A (unanswered, 0 replies)
Hello everyone,
I am using UCX version 1.12, and I have a question regarding flow control (FC) that has been bothering me for a long time. I would greatly appreciate any help you can provide; thank you very much!
In the UCX 1.12 code, the sender starts with an initial fc_wnd value and two thresholds derived from it: a soft threshold (0.5 * fc_wnd) and a hard threshold (0.2 * fc_wnd). Each time the sender sends I/O, it checks the current fc_wnd. If fc_wnd has dropped to the soft or hard threshold, the sender marks the outgoing message's am_id with the corresponding FC soft request or hard request tag.
When the receiver gets the AM message, it checks whether the am_id carries an FC tag. For a soft request tag, it only sets a flag on the endpoint indicating that it owes the sender a grant to enlarge fc_wnd; the next time the receiver has a message to send back to the sender, that message's am_id carries the grant tag. For a hard request tag, the receiver immediately prepares a grant message and places it at the front of the pending queue, waiting to be sent to the sender.
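To make sure I am describing the mechanism correctly, here is a minimal Python sketch of my understanding. It is not UCX code: the class names, tag names, and the window size are hypothetical, and I am assuming that a grant restores fc_wnd to its initial value. The ratios are the ones I mentioned above.

```python
# Hypothetical tag names for illustration; these are not UCX identifiers.
FC_NONE, FC_SOFT_REQ, FC_HARD_REQ = "none", "soft_req", "hard_req"


class Sender:
    def __init__(self, wnd_size=512, soft_ratio=0.5, hard_ratio=0.2):
        # wnd_size is an assumed example value; the ratios are those quoted in the post.
        self.wnd_size = wnd_size
        self.fc_wnd = wnd_size
        self.soft_thresh = int(wnd_size * soft_ratio)
        self.hard_thresh = int(wnd_size * hard_ratio)

    def send(self):
        """Return the FC tag for the next message, or None if no credits remain."""
        if self.fc_wnd <= 0:
            return None
        if self.fc_wnd == self.hard_thresh:
            tag = FC_HARD_REQ
        elif self.fc_wnd == self.soft_thresh:
            tag = FC_SOFT_REQ
        else:
            tag = FC_NONE
        self.fc_wnd -= 1
        return tag

    def on_grant(self):
        # Assumption of this sketch: a grant restores the window to its initial size.
        self.fc_wnd = self.wnd_size


class Receiver:
    def __init__(self):
        self.grant_pending = False

    def on_recv(self, tag):
        """Return True if a grant should be sent immediately."""
        if tag == FC_SOFT_REQ:
            # Soft request: remember to piggyback a grant on the next outgoing message.
            self.grant_pending = True
            return False
        # Hard request: grant right away. Free receive buffers are never consulted here.
        return tag == FC_HARD_REQ
```

With wnd_size=10, the sixth message carries the soft request and the ninth carries the hard request. Note that on_recv never looks at how many receive buffers are free, which is the behavior my question is about.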
My question is: why does the receiver directly send a response to the sender with permission to expand fc_wnd when it receives a request to enlarge the fc_wnd, without checking if its own receive queue resources are sufficient? This is something I find very difficult to understand.
For example, if the receiver has only 10 resources left in its receive queue but immediately tells the sender to enlarge fc_wnd, the sender will start sending a large number of messages right away and eventually exhaust the receiver's receive queue resources. The sender then encounters RNR errors. This is not a hypothetical situation; it is the issue I am currently facing. Multiple senders are sending messages to a single receiver (essentially multiple clients sending messages to the same server), and the clients continuously generate RNR errors.
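The scenario can be put into numbers with a toy model. This is only a sketch under my own assumptions, not a model of UCX internals: every sender has its own window, all senders share one receive queue, grants are issued unconditionally, and the function name and all parameter values are hypothetical.

```python
def count_rnr(num_senders, wnd_size, rx_queue_len, drain_per_round, rounds):
    """Count sends that arrive while the shared receive queue has no free buffer."""
    free = rx_queue_len
    credits = [wnd_size] * num_senders
    rnr = 0
    for _ in range(rounds):
        for i in range(num_senders):
            # Each sender bursts its whole window.
            while credits[i] > 0:
                credits[i] -= 1
                if free > 0:
                    free -= 1
                else:
                    rnr += 1  # no receive buffer posted: counted as an RNR event
            # The grant is given without checking free buffers.
            credits[i] = wnd_size
        # The receiver reposts some buffers between rounds.
        free = min(rx_queue_len, free + drain_per_round)
    return rnr
```

In this model a single sender with a window of 8 never overruns a queue of 10 buffers, but four such senders already do: count_rnr(4, 8, 10, 10, 1) gives 22, because 32 messages compete for 10 buffers in one round.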
What is the significance and purpose of fc_wnd in this context, and how can I resolve the issue I've described above?
I look forward to any answers. Thank you once again!