Commit 56cfcf8

UCT/IB/RDMACM: Hold async block over rdma_get_cm_event
The rdmacm CM event handler called rdma_get_cm_event() outside the CM's async block, then took the block only around uct_rdmacm_cm_process_event(). The ep destructor (uct_rdmacm_cm_ep_t cleanup) and other destroy sites hold the same block when calling rdma_destroy_id(), so the synchronization intent was to serialize them with the handler.

The pre-block window let a concurrent rdma_destroy_id() free the cm_id's userspace tracking while the async thread was mid-lookup inside rdma_get_cm_event(), producing a NULL deref at the internal pthread_mutex_lock(&id_priv->mut) call. Observed as a SIGSEGV inside librdmacm during sockaddr error/wireup-failure gtests under multi-threaded workers, where event delivery and ep teardown interleave more often.

Acquire the async block before rdma_get_cm_event() and release it on the error/EAGAIN exit path, so the entire fetch + dispatch is serialized with rdma_destroy_id() callers that hold the same block.

Signed-off-by: NirWolfer <nwolfer@nvidia.com>
1 parent 2e11735 commit 56cfcf8

1 file changed

Lines changed: 6 additions & 1 deletion

src/uct/ib/rdmacm/rdmacm_cm.c

@@ -813,9 +813,15 @@ static void uct_rdmacm_cm_event_handler(int fd, ucs_event_set_types_t events,
     int ret;
 
     for (;;) {
+        /* Hold the async block across rdma_get_cm_event() so a concurrent
+         * rdma_destroy_id() (held under the same block in the ep destructor)
+         * cannot free the cm_id userspace tracking mid-lookup. */
+        UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
+
         /* Fetch an event */
         ret = rdma_get_cm_event(cm->ev_ch, &event);
         if (ret) {
+            UCS_ASYNC_UNBLOCK(uct_rdmacm_cm_get_async(cm));
             /* EAGAIN (in a non-blocking rdma_get_cm_event) means that
              * there are no more events */
             if ((errno != EAGAIN) && (errno != EINTR)) {
@@ -825,7 +831,6 @@ static void uct_rdmacm_cm_event_handler(int fd, ucs_event_set_types_t events,
             return;
         }
 
-        UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
         uct_rdmacm_cm_process_event(cm, event);
         UCS_ASYNC_UNBLOCK(uct_rdmacm_cm_get_async(cm));
     }
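
The locking discipline this change adopts can be sketched in isolation. The following is a minimal model, not UCX or librdmacm code: a plain pthread mutex stands in for the async block, and every `demo_*` name is hypothetical. It only illustrates the shape of the fix (lock before the fetch, unlock on the early-return path, teardown under the same lock); it runs single-threaded and does not attempt to reproduce the race itself.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins: none of these names exist in UCX or librdmacm. */
typedef struct {
    pthread_mutex_t lock;     /* plays the role of the CM async block */
    int             pending;  /* number of queued events */
    int            *id_state; /* per-id tracking; NULL once destroyed */
    int             handled;  /* events dispatched so far */
} demo_cm_t;

/* Models the fetch step: it reads per-id tracking, so the caller must hold
 * cm->lock. Returns 0 on success, -1 with errno = EAGAIN when nothing is
 * available. */
static int demo_get_event(demo_cm_t *cm, int *event)
{
    if ((cm->pending == 0) || (cm->id_state == NULL)) {
        errno = EAGAIN;
        return -1;
    }

    *event = *cm->id_state;
    cm->pending--;
    return 0;
}

/* Models the teardown path: drops the tracking under the same lock. */
static void demo_destroy_id(demo_cm_t *cm)
{
    pthread_mutex_lock(&cm->lock);
    cm->id_state = NULL;
    pthread_mutex_unlock(&cm->lock);
}

/* Handler loop: the lock covers fetch + dispatch and is released on
 * every exit path, including the "no more events" return. */
static void demo_event_handler(demo_cm_t *cm)
{
    int event;

    for (;;) {
        pthread_mutex_lock(&cm->lock);

        if (demo_get_event(cm, &event) != 0) {
            pthread_mutex_unlock(&cm->lock);
            return;
        }

        (void)event;   /* a real handler would dispatch on the event here */
        cm->handled++;
        pthread_mutex_unlock(&cm->lock);
    }
}

/* Runs one scenario and returns how many events were dispatched. */
static int demo_scenario(int pending, int destroy_first)
{
    int       state = 1;
    int       locked;
    demo_cm_t cm;

    pthread_mutex_init(&cm.lock, NULL);
    cm.pending  = pending;
    cm.id_state = &state;
    cm.handled  = 0;

    if (destroy_first) {
        demo_destroy_id(&cm);
    }
    demo_event_handler(&cm);

    /* The handler must have left the lock released. */
    locked = pthread_mutex_trylock(&cm.lock);
    assert(locked == 0);
    pthread_mutex_unlock(&cm.lock);
    pthread_mutex_destroy(&cm.lock);

    return cm.handled;
}
```

In this model, `demo_scenario(3, 0)` dispatches all three queued events, while `demo_scenario(2, 1)` dispatches none because the tracking was dropped before the handler ran. In both cases the lock is free afterwards, which is the property the added unlock on the error/EAGAIN path preserves.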
