Description
Describe the bug
Calling ucp_listener_reject() on a TCP sockcm connection request can lead to a SIGSEGV (NULL function pointer call) on the UCX async thread.
The crash occurs at `uct_cm.c:130`:

```c
cep->server.notify_cb(&cep->super.super, cep->user_data, &notify_args);
```

`cep->server.notify_cb` is NULL because the server endpoint was never fully created — the reject path skips `uct_tcp_sockcm_ep_server_create()`, which is where `UCT_CM_SET_CB` assigns the callback (or defaults it to `ucs_empty_function`).
Root cause analysis:
- Client connects → the listener allocates a `uct_tcp_sockcm_ep_t` via `uct_tcp_sockcm_ep_alloc_and_init()` (the struct is zeroed, so `notify_cb = NULL`)
- `invoke_conn_req_cb` fires → passes the `conn_request` to the user
- User calls `ucp_listener_reject()` → `uct_tcp_listener_reject()` sends a reject message to the client (`hdr->status = UCS_ERR_REJECTED`)
- Client responds → the async thread receives data on the EP's fd
- `uct_tcp_sockcm_ep_server_handle_data_received()` at `tcp_sockcm_ep.c:707-712` sees `UCT_TCP_SOCKCM_EP_DATA_SENT` is set and calls `uct_tcp_sockcm_ep_server_notify_cb(cep, status)`
- This calls `uct_cm_ep_server_conn_notify_cb()`, which dereferences `cep->server.notify_cb`, which is NULL → SIGSEGV
The comment at `tcp_sockcm_ep.c:290` says:

```c
/* the server might not have a valid ep yet. in this case the notify_cb
 * is an empty function */
```

But `notify_cb` is never set to `ucs_empty_function` on the reject path. `UCT_CM_SET_CB` (which defaults it to `ucs_empty_function`) only runs inside `uct_tcp_sockcm_ep_server_create()`, which is reached only from the accept path (`ucp_ep_create`), not the reject path.
Crash backtrace (from gdb on core dump):
```
#0  0x0000000000000000 in ?? ()
#1  uct_cm_ep_server_conn_notify_cb (cep=0x7fb6e00a5dc0, status=...) at base/uct_cm.c:130
#2  uct_tcp_sockcm_ep_server_notify_cb (cep=0x7fb6e00a5dc0) at tcp/tcp_sockcm_ep.c:70
#3  uct_tcp_sockcm_ep_server_handle_data_received (cep=0x7fb6e00a5dc0) at tcp/tcp_sockcm_ep.c:712
#4  uct_tcp_sockcm_ep_handle_data_received (cep=0x7fb6e00a5dc0) at tcp/tcp_sockcm_ep.c:743
#5  uct_tcp_sockcm_ep_recv (cep=0x7fb6e00a5dc0) at tcp/tcp_sockcm_ep.c:834
#6  uct_tcp_sa_data_handler (fd=1270, events=..., arg=0x7fb6e00a5dc0) at tcp/tcp_sockcm.c:113
#7  ucs_async_handler_invoke (handler=0x7fb6e0093430) at async/async.c:268
#8  ucs_async_handler_dispatch (handler=0x7fb6e0093430) at async/async.c:290
#9  ucs_async_dispatch_handlers (...) at async/async.c:322
#10 ucs_async_thread_ev_handler (...) at async/thread.c:88
#11 ucs_event_set_wait (...) at sys/event_set.c:215
#12 ucs_async_thread_func (arg=0x7fb714033800) at async/thread.c:131
```
Steps to Reproduce
Minimal C reproducer — the server creates a listener and rejects incoming connections in a loop under load:

```c
#include <ucp/api/ucp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <arpa/inet.h>

/* Shared state between listener callback and main thread */
static ucp_conn_request_h pending_conn_req = NULL;
static pthread_mutex_t req_mutex = PTHREAD_MUTEX_INITIALIZER;

static void conn_handler_cb(ucp_conn_request_h conn_request, void *arg) {
    pthread_mutex_lock(&req_mutex);
    /* Just store the latest request; in a real app you'd queue them */
    pending_conn_req = conn_request;
    pthread_mutex_unlock(&req_mutex);
}

static void err_handler_cb(void *arg, ucp_ep_h ep, ucs_status_t status) {
    /* Client gets UCS_ERR_REJECTED here */
}

int main(int argc, char **argv) {
    ucp_params_t ucp_params = { .field_mask = UCP_PARAM_FIELD_FEATURES,
                                .features   = UCP_FEATURE_TAG };
    ucp_config_t *config;
    ucp_context_h context;
    ucp_worker_h worker;
    ucp_listener_h listener;

    ucp_config_read(NULL, NULL, &config);
    ucp_init(&ucp_params, config, &context);
    ucp_config_release(config);

    ucp_worker_params_t wparams = { .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
                                    .thread_mode = UCS_THREAD_MODE_SINGLE };
    ucp_worker_create(context, &wparams, &worker);

    /* Create listener on localhost */
    struct sockaddr_in listen_addr = { .sin_family      = AF_INET,
                                       .sin_port        = htons(0),
                                       .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
    ucp_listener_params_t lparams = {
        .field_mask       = UCP_LISTENER_PARAM_FIELD_SOCK_ADDR |
                            UCP_LISTENER_PARAM_FIELD_CONN_HANDLER,
        .sockaddr.addr    = (struct sockaddr *)&listen_addr,
        .sockaddr.addrlen = sizeof(listen_addr),
        .conn_handler     = { .cb = conn_handler_cb, .arg = NULL }
    };
    ucp_listener_create(worker, &lparams, &listener);

    /* Query bound port */
    ucp_listener_attr_t lattr = { .field_mask = UCP_LISTENER_ATTR_FIELD_SOCKADDR };
    ucp_listener_query(listener, &lattr);
    uint16_t port = ntohs(((struct sockaddr_in *)&lattr.sockaddr)->sin_port);
    printf("Listening on port %u\n", port);

    /* Spawn client threads that connect rapidly */
    /* (In our case we had ~100 stress-test cycles with 15 concurrent connects) */
    for (int i = 0; i < 1000; i++) {
        /* Client side: connect */
        struct sockaddr_in dest = { .sin_family      = AF_INET,
                                    .sin_port        = htons(port),
                                    .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
        ucp_ep_params_t ep_params = {
            .field_mask       = UCP_EP_PARAM_FIELD_FLAGS |
                                UCP_EP_PARAM_FIELD_SOCK_ADDR |
                                UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                                UCP_EP_PARAM_FIELD_ERR_HANDLER,
            .flags            = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER,
            .sockaddr.addr    = (struct sockaddr *)&dest,
            .sockaddr.addrlen = sizeof(dest),
            .err_mode         = UCP_ERR_HANDLING_MODE_PEER,
            .err_handler      = { .cb = err_handler_cb, .arg = NULL }
        };
        ucp_ep_h client_ep;
        ucp_ep_create(worker, &ep_params, &client_ep);

        /* Progress until we get a connection request */
        while (1) {
            ucp_worker_progress(worker);
            pthread_mutex_lock(&req_mutex);
            if (pending_conn_req != NULL) {
                /* REJECT the connection — this triggers the bug */
                ucp_listener_reject(listener, pending_conn_req);
                pending_conn_req = NULL;
                pthread_mutex_unlock(&req_mutex);
                break;
            }
            pthread_mutex_unlock(&req_mutex);
        }

        /* Clean up client ep */
        ucp_ep_close_nbx(client_ep, &(ucp_request_param_t){
            .op_attr_mask = UCP_OP_ATTR_FIELD_FLAGS,
            .flags        = UCP_EP_CLOSE_FLAG_FORCE });

        /* Progress to let reject complete */
        for (int j = 0; j < 100; j++) ucp_worker_progress(worker);
    }

    ucp_listener_destroy(listener);
    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```

The crash is probabilistic (~10-20% of runs) and depends on the timing of the async thread processing the client's response after the reject message is sent.
- UCX version: 1.20.0 (release tarball)
- Configure flags: default
- Environment: `UCX_TLS=tcp`
Setup and versions
- OS: Ubuntu 24.04 (noble), x86_64
- CPU: x86-64-v3 (Haswell+)
- Kernel: 6.x
- This is a TCP-only issue (not RDMA). No special hardware required to reproduce.
Suggested fix
Either:

1. Initialize `notify_cb` to `ucs_empty_function` during `uct_tcp_sockcm_ep_alloc_and_init()` (or in the `uct_cm_base_ep_t` init), so it is never NULL — matching the comment at `tcp_sockcm_ep.c:290`.

2. NULL-guard in `uct_cm_ep_server_conn_notify_cb` at `uct_cm.c:130`:

```c
if (cep->server.notify_cb != NULL) {
    cep->server.notify_cb(&cep->super.super, cep->user_data, &notify_args);
}
```

3. Check the EP state in `uct_tcp_sockcm_ep_server_handle_data_received` before calling the notify callback — skip it if `UCT_TCP_SOCKCM_EP_SERVER_REJECT_CALLED` is set.
Additional information
The `ucp_listener_reject` API documentation (`ucp.h:2684-2701`) states it is a valid way to handle a `conn_request`, on equal footing with `ucp_ep_create`. The crash only occurs with the TCP sockcm transport (`UCX_TLS=tcp`). RDMA CM may or may not have the same issue (not tested).