Why does the program directly call ucs_fatal to handle failures after ibv_post_send fails? #10575
Unanswered
super-train asked this question in Q&A
Replies: 0 comments
Hello everyone,
I have a question that has been bothering me for a long time. I would greatly appreciate any insights you can provide.
The specific issue is as follows: in UCX 1.12, the RDMA send function uct_rc_verbs_ep_post_send calls ucs_fatal directly when ibv_post_send fails. This terminates the program and prints a stack trace, which is unacceptable in practical applications. The corresponding code is as follows:
```c
static UCS_F_ALWAYS_INLINE void
uct_rc_verbs_ep_post_send(uct_rc_verbs_iface_t* iface, uct_rc_verbs_ep_t* ep,
                          struct ibv_send_wr *wr, int send_flags, int max_log_sge)
{
    struct ibv_send_wr *bad_wr;
    int ret;
}
```
My questions are as follows:
1. Why does it call ucs_fatal directly here? Aborting the process is ungraceful and unacceptable in a practical deployment.
2. Would it be feasible to add a new function, similar to uct_rc_verbs_iface_poll_tx, that handles an ibv_post_send failure the way the polling path handles a failed completion, that is, by invoking iface->super.super.ops->handle_failure(&iface->super.super, &wc[i], status);? If so, which resources would this new post-send failure handler need to clean up?
```c
static UCS_F_ALWAYS_INLINE unsigned
uct_rc_verbs_iface_poll_tx(uct_rc_verbs_iface_t *iface)
{
    uct_rc_verbs_ep_t *ep;
    uint16_t count;
    int i;
    unsigned num_wcs = iface->super.super.config.tx_max_poll;
    struct ibv_wc wc[num_wcs];
    ucs_status_t status;
}
```
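To make the second question above more concrete, the idea could be sketched roughly as follows. This is not UCX code and does not link against libibverbs: every name prefixed with sk_ is hypothetical, sk_mock_post merely stands in for ibv_post_send (which returns 0 on success or an errno value on failure), and sk_count_failure stands in for the handle_failure callback. Treating ENOMEM as a transient "send queue full" condition is an assumption about the provider, not something taken from the UCX sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical status codes standing in for ucs_status_t values. */
typedef enum {
    SK_OK              =  0,
    SK_ERR_NO_RESOURCE = -1, /* transient: caller may retry later */
    SK_ERR_ENDPOINT    = -2  /* permanent: endpoint is unusable */
} sk_status_t;

typedef struct sk_ep sk_ep_t;
typedef int  (*sk_post_fn_t)(sk_ep_t *ep, void *wr);
typedef void (*sk_failure_fn_t)(sk_ep_t *ep, sk_status_t status);

struct sk_ep {
    sk_post_fn_t    post;           /* stand-in for ibv_post_send */
    sk_failure_fn_t handle_failure; /* stand-in for the iface failure callback */
    int             failed;         /* set once; later sends are rejected */
    unsigned        posted;         /* number of successfully posted requests */
    int             mock_ret;       /* value the mock post function returns */
    unsigned        failures;       /* how often handle_failure was invoked */
};

/* Mock post function: returns whatever the test configured. */
static int sk_mock_post(sk_ep_t *ep, void *wr)
{
    (void)wr;
    return ep->mock_ret;
}

/* Mock failure handler: only counts invocations. */
static void sk_count_failure(sk_ep_t *ep, sk_status_t status)
{
    (void)status;
    ep->failures++;
}

/* Post a request and report failure to the caller instead of aborting. */
static sk_status_t sk_ep_post_send(sk_ep_t *ep, void *wr)
{
    int ret;

    assert((ep != NULL) && (ep->post != NULL));

    if (ep->failed) {
        return SK_ERR_ENDPOINT; /* do not touch a failed endpoint again */
    }

    ret = ep->post(ep, wr);
    if (ret == 0) {
        ep->posted++;
        return SK_OK;
    }

    if (ret == ENOMEM) {
        return SK_ERR_NO_RESOURCE; /* assumed transient, endpoint stays usable */
    }

    /* Any other error: mark the endpoint failed and notify the upper layer. */
    ep->failed = 1;
    if (ep->handle_failure != NULL) {
        ep->handle_failure(ep, SK_ERR_ENDPOINT);
    }
    return SK_ERR_ENDPOINT;
}
```

The point of the sketch is only the control flow: the caller receives a status, the failure callback fires once, and subsequent sends on the same endpoint are rejected. Whether the real handle_failure path in UCX can be invoked safely from the send path is exactly what the question asks.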
Additionally, I have checked the source code of ucx-1.18.0, and it also calls ucs_fatal directly when ibv_post_send fails. Why does the implementation consistently terminate the program like this? In SPDK's NVMe and in 3FS, by contrast, an ibv_post_send failure is handled by changing the QP state to unavailable and releasing resources. I'm puzzled about why UCX handles it differently. If you have any insights, please let me know. Thank you!
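The "mark the QP unavailable and release resources" approach mentioned above could look roughly like the sketch below. It is a generic illustration, not code from SPDK, 3FS, or UCX: the sk_conn_* names and the list of outstanding operations are hypothetical, and the real libibverbs step of moving the QP to the error state is only mentioned in a comment.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { SK_QP_READY, SK_QP_UNAVAILABLE } sk_qp_state_t;

/* One in-flight operation; result points to where its status is reported. */
typedef struct sk_op {
    struct sk_op *next;
    int          *result;
} sk_op_t;

typedef struct {
    sk_qp_state_t state;
    sk_op_t      *outstanding; /* singly linked list of in-flight operations */
} sk_conn_t;

/* Track a new in-flight operation; refused once the connection has failed. */
static int sk_conn_track(sk_conn_t *conn, int *result)
{
    sk_op_t *op;

    assert((conn != NULL) && (result != NULL));

    if (conn->state != SK_QP_READY) {
        return -1;
    }

    op = malloc(sizeof(*op));
    if (op == NULL) {
        return -1;
    }

    op->result        = result;
    op->next          = conn->outstanding;
    conn->outstanding = op;
    return 0;
}

/*
 * Called when a post fails: mark the connection unavailable, complete every
 * outstanding operation with an error status, and free it. Returns the number
 * of operations released.
 * With a real QP, this is presumably also where one would call
 * ibv_modify_qp() with qp_state = IBV_QPS_ERR so the hardware flushes the
 * remaining work requests.
 */
static unsigned sk_conn_fail(sk_conn_t *conn, int status)
{
    unsigned released = 0;
    sk_op_t *op;

    assert(conn != NULL);

    conn->state = SK_QP_UNAVAILABLE;
    while ((op = conn->outstanding) != NULL) {
        conn->outstanding = op->next;
        *op->result       = status;
        free(op);
        released++;
    }
    return released;
}
```

In this model the process keeps running after a failed post: callers see an error status on each affected operation, and new operations are refused until the connection is re-established.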