UCP: SGL datatype implementation #11344

Merged
brminich merged 44 commits into openucx:master from michal-shalev:ucp-sgl-put-impl
Jun 1, 2026

@michal-shalev michal-shalev commented Apr 15, 2026

Contributor

What?

UCP SGL (scatter–gather list) datatype: iterators, packing, RMA put integration (proto / put offload), UCT v2 iface capability bits, and gtests.

Why?

Lets callers pass multi-buffer local and remote descriptors for one-sided put in a first-class way.

How?

  • dt_sgl + extensions to the datatype iterator and proto selection for SGL.
  • Put path hooks in put_offload / RMA send and UCT uct_iface_v2 fields for max SGL put segments / zcopy.
  • Tests: invalid-param coverage, optional zcopy functional tests gated on UCT_IFACE_FLAG_V2_PUT_SGL_ZCOPY.

Comment thread src/uct/api/v2/uct_v2.h
Comment thread test/gtest/ucp/ucp_test.cc Outdated
Comment thread src/ucp/proto/proto_common.c Outdated
Comment thread src/ucp/rma/put_offload.c Outdated
Comment thread src/uct/api/v2/uct_v2.h Outdated
Comment thread src/ucp/dt/dt_sgl.c Outdated
Comment thread src/ucp/dt/dt_sgl.h Outdated
Comment thread test/gtest/ucp/test_ucp_rma.cc
Comment thread src/ucp/dt/datatype_iter.c Outdated
Comment thread src/ucp/rma/put_offload.c Outdated

iface_attr_v2.field_mask = UCT_IFACE_ATTR_FIELD_MAX_PUT_SGL_ZCOPY_COUNT;
status = uct_iface_query_v2(
ucp_worker_iface(ep->worker,

Contributor

max_put_sgl_zcopy_count is queried on every send invocation but is static post-init. Consider caching it in the protocol private data during probe, the same way max_iov is handled in other protocols.

Contributor Author

max_put_sgl_zcopy_count is now cached in ucp_proto_multi_lane_priv_t.
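
The caching idea discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical and none of the types mirror the real `ucp_proto_multi_lane_priv_t` or `uct_iface_query_v2` definitions. The point is that a value that stays static after initialization is queried once at probe time and then read from the private data on the send path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a transport interface; not a UCX type. */
typedef struct {
    size_t max_sgl_count; /* static after initialization */
    int    query_calls;   /* counts how often the attribute is queried */
} demo_iface_t;

/* Hypothetical per-protocol private data, filled once at probe time. */
typedef struct {
    size_t max_put_sgl_zcopy_count;
} demo_proto_priv_t;

/* Simulates a relatively costly interface attribute query. */
static size_t demo_iface_query(demo_iface_t *iface)
{
    iface->query_calls++;
    return iface->max_sgl_count;
}

/* Probe: query once and store the result in the private data. */
static void demo_proto_probe(demo_iface_t *iface, demo_proto_priv_t *priv)
{
    priv->max_put_sgl_zcopy_count = demo_iface_query(iface);
}

/* Send path: uses only the cached value; returns the number of chunks. */
static size_t demo_proto_num_chunks(const demo_proto_priv_t *priv,
                                    size_t elem_count)
{
    size_t max_count = priv->max_put_sgl_zcopy_count;

    assert(max_count > 0);
    return (elem_count + max_count - 1) / max_count;
}
```

Calling `demo_proto_num_chunks` any number of times leaves `query_calls` at 1, which is the behavior the reviewer asked for.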

Comment thread src/ucp/rma/put_offload.c Outdated
/* Multiple progress calls, translate memh + rkey per-chunk on stack */
uct_mem_h *uct_memhs;
uct_rkeys = ucs_alloca(elem_count * sizeof(uct_rkey_t));

Contributor

ucs_alloca asserts if the allocation exceeds ~1KB, which fires at ~128 elements (8 bytes × 128). Consider a small fixed threshold and fall back to ucs_malloc for larger counts — similar to what the SIZE_MAX branch below already does.

Contributor

is it safe to use alloca considering that request may end up in pending queue?

Contributor

For ucs_alloca with fallback to malloc you can use ucs_alloc_on_stack (and free with ucs_free_on_stack)

Contributor Author

Done
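
The stack-with-heap-fallback pattern suggested in this thread can be illustrated with a minimal, self-contained sketch. It does not reproduce the UCX `ucs_alloc_on_stack` / `ucs_free_on_stack` macros; the `demo_*` names and the 1024-byte threshold are assumptions chosen for illustration, and the sketch relies on `alloca`, which is a common extension rather than standard C.

```c
#include <alloca.h>
#include <assert.h>
#include <stdlib.h>

#define DEMO_STACK_MAX 1024 /* assumed threshold, for illustration only */

/* Allocate on the stack when small, otherwise on the heap. This has to be a
 * macro: alloca() memory is released when the calling function returns. */
#define demo_alloc_on_stack(_size) \
    (((_size) <= DEMO_STACK_MAX) ? alloca(_size) : malloc(_size))

/* Release only what came from the heap; stack memory needs no free. */
#define demo_free_on_stack(_ptr, _size) \
    do { \
        if ((_size) > DEMO_STACK_MAX) { \
            free(_ptr); \
        } \
    } while (0)

/* Fills a temporary array of 'count' keys and returns their sum. */
static unsigned long demo_sum_keys(size_t count)
{
    size_t size         = count * sizeof(unsigned long);
    unsigned long *keys = demo_alloc_on_stack(size);
    unsigned long sum   = 0;
    size_t i;

    if (keys == NULL) {
        return 0; /* heap allocation failed */
    }

    for (i = 0; i < count; i++) {
        keys[i] = i + 1;
    }
    for (i = 0; i < count; i++) {
        sum += keys[i];
    }

    demo_free_on_stack(keys, size);
    return sum;
}
```

In either branch the temporary array is only valid inside the function that allocated it, so nothing that outlives the call (for example a request left on a pending queue) should keep a pointer to it.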

Comment thread src/ucp/dt/datatype_iter.c
Comment thread src/ucp/dt/datatype_iter.c Outdated
ucp_datatype_iter_detect_mem_info(context, local->buffers[0],
local->lengths[0], dt_iter, param);
if (ENABLE_PARAMS_CHECK && (count > 1)) {
status = ucp_dt_sgl_memtype_check(context, local, count,

Contributor

ucp_dt_sgl_memtype_check is only called under ENABLE_PARAMS_CHECK. In release builds, mixed HOST+GPU buffers pass through silently. Worth noting in the API docs at minimum.

Contributor Author

This check matches the pre-existing IOV path (ucp_datatype_iov_iter_init).
IOV doesn't document this.
I added documentation for it (see ucp_dt_local_sgl_t).
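
To make the release-build behavior concrete, here is a small sketch of a check that is gated on a compile-time flag. All names are hypothetical: `DEMO_ENABLE_PARAMS_CHECK` stands in for `ENABLE_PARAMS_CHECK`, and memory types are passed in as an array instead of being detected from buffers as the real code does. The memory type is taken from element 0, and the remaining elements are compared against it only when the flag is non-zero.

```c
#include <assert.h>
#include <stddef.h>

#ifndef DEMO_ENABLE_PARAMS_CHECK
#define DEMO_ENABLE_PARAMS_CHECK 1 /* assumed to be 0 in release builds */
#endif

typedef enum { DEMO_MEM_HOST, DEMO_MEM_GPU } demo_mem_type_t;

/* Returns 0 if every entry matches 'expected', -1 otherwise. */
static int demo_check_same_mem_type(const demo_mem_type_t *types, size_t count,
                                    demo_mem_type_t expected)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (types[i] != expected) {
            return -1;
        }
    }
    return 0;
}

/* Detects the memory type from element 0; validates the remaining elements
 * only when parameter checks are compiled in. */
static int demo_sgl_init(const demo_mem_type_t *types, size_t count,
                         demo_mem_type_t *detected)
{
    if (count == 0) {
        return -1;
    }

    *detected = types[0];
    if (DEMO_ENABLE_PARAMS_CHECK && (count > 1)) {
        return demo_check_same_mem_type(types + 1, count - 1, types[0]);
    }
    return 0;
}
```

With the flag set to 0 the second branch is compiled out, so a mixed HOST and GPU list would be accepted with the type of the first element, which is the case the reviewer wanted documented.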

Comment thread src/ucp/dt/datatype_iter.c Outdated
break;
case UCP_DATATYPE_SGL:
if (!ucp_memh_is_buffer_in_range(memh, dt_iter->type.sgl.buffers[0],
dt_iter->type.sgl.lengths[0])) {

Contributor

Only buffers[0] is range-checked against the user memh. If the intent is one memh per SGL element, this should loop over all entries.

Contributor Author

ucp_datatype_iter_is_user_memh_valid signature only takes a single memh, so I had to check it from dt_iter.
Fixed + added tests.
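
A per-element range check of the kind requested here could look roughly like the sketch below. It assumes one memory handle per SGL element, and `demo_memh_t` is a hypothetical simplification (base address plus length), not the layout of `ucp_mem_h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registered region; not the UCX ucp_mem_h layout. */
typedef struct {
    uintptr_t address;
    size_t    length;
} demo_memh_t;

/* Returns non-zero if [buf, buf + len) lies inside the registered region. */
static int demo_in_range(const demo_memh_t *memh, const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;

    return (start >= memh->address) && (len <= memh->length) &&
           ((start - memh->address) <= (memh->length - len));
}

/* Checks every element against its own handle. Returns the index of the
 * first element that is out of range, or 'count' if all are valid. */
static size_t demo_sgl_first_invalid(demo_memh_t *const *memhs,
                                     void *const *buffers,
                                     const size_t *lengths, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (!demo_in_range(memhs[i], buffers[i], lengths[i])) {
            return i;
        }
    }
    return count;
}
```

A check limited to element 0 would accept a list whose later elements fall outside their registered regions; looping over all entries reports the first offending index instead.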

Comment thread src/ucp/rma/rma_send.c Outdated
Comment thread src/ucp/rma/rma_send.c Outdated
}

UCP_REQUEST_CHECK_PARAM(param);
UCP_REQUEST_CHECK_PARAM_UNSUPPORTED_REMOTE(param);

Contributor

can we unite it with UCP_REQUEST_CHECK_PARAM?

Contributor Author

Done

Comment thread src/ucp/dt/datatype_iter.c Outdated
return dst_iov_index;
}

ucs_status_t ucp_datatype_sgl_iter_init(ucp_context_h context,

Contributor

seems the common naming is ucp_datatype_iter_*, let's rename accordingly.
ucp_datatype_iov_iter_init is the only exception

Contributor Author

Renamed to ucp_datatype_iter_sgl_init.

Comment thread src/ucp/dt/datatype_iter.c
Comment thread src/ucp/dt/datatype_iter.h Outdated
ucp_mem_h *memhs;
const uint64_t *remote_addrs;
ucp_rkey_h const *rkeys;
int memhs_owned;

Contributor

can we reuse ucp_memh_is_user_memh somehow instead of adding this field?

Contributor Author

Done

Comment thread src/ucp/dt/dt_sgl.c
Comment thread src/ucp/dt/dt_sgl.h Outdated
Comment thread src/ucp/rma/put_am.c Outdated

- if (!ucp_proto_init_check_op(init_params, UCS_BIT(UCP_OP_ID_PUT))) {
+ if (!ucp_proto_init_check_op(init_params, UCS_BIT(UCP_OP_ID_PUT)) ||
+     (init_params->select_param->dt_class == UCP_DATATYPE_SGL)) {

Contributor

should we also restrict selecting all other protocols for SGL datatype?

Contributor Author

Done

Comment thread src/ucp/rma/put_offload.c Outdated
elem_count = ucp_datatype_iter_next_sgl(dt_iter, max_sgl_count, next_iter);

if (max_sgl_count < SIZE_MAX) {
/* Multiple progress calls, translate memh + rkey per-chunk on stack */

Contributor

why multiple?

Contributor Author

Because for cuda_ipc it's a single progress call.
In any case, I simplified the logic here.
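
The "single versus multiple progress calls" distinction can be sketched with a simplified iterator. This is a hypothetical model, not the real `ucp_datatype_iter_next_sgl` signature: each call consumes at most `max_count` elements, so a limit of `SIZE_MAX` finishes in one call while a smaller limit needs several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SGL iterator state: total element count and current index. */
typedef struct {
    size_t count;
    size_t index;
} demo_sgl_iter_t;

/* Returns the size of the next chunk (at most max_count elements) and
 * advances the iterator past it. Returns 0 when nothing is left. */
static size_t demo_sgl_iter_next(demo_sgl_iter_t *iter, size_t max_count)
{
    size_t remaining = iter->count - iter->index;
    size_t chunk     = (remaining < max_count) ? remaining : max_count;

    iter->index += chunk;
    return chunk;
}

/* Counts how many progress calls are needed to consume the whole list. */
static size_t demo_count_progress_calls(size_t count, size_t max_count)
{
    demo_sgl_iter_t iter = {count, 0};
    size_t calls         = 0;

    while (demo_sgl_iter_next(&iter, max_count) > 0) {
        calls++;
    }
    return calls;
}
```

For example, 10 elements with a limit of 4 take three calls (4, 4, 2), whereas an unlimited transport consumes all 10 in one call.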

Comment thread src/ucp/rma/put_offload.c Outdated
uct_ep,
&dt_iter->type.sgl.buffers[start_index],
&dt_iter->type.sgl.lengths[start_index],
NULL,

Contributor

why no memh here?

Contributor Author

cuda_ipc doesn't use memh.

Contributor Author

Now that I think of this again, I decided to always fill the uct_memhs array, mirroring the regular zcopy path, and to make UCP/UCT layering cleaner.

Comment thread src/ucp/rma/rma_send.c Outdated
goto out_unlock;
}
remote = (const ucp_dt_remote_sgl_t *)param->remote;
rkey = remote->rkeys[0];

Contributor

This assumes count > 0, should we guard against it?

Comment thread src/ucp/proto/proto_reconfig.c
Comment thread src/ucp/core/ucp_request.inl Outdated

#define UCP_REQUEST_CHECK_PARAM(_param) \
do { \
UCP_REQUEST_CHECK_PARAM_ALLOW_REMOTE(_param); \

Contributor

seems naming is wrong, as checking remote is below in this macro.
also better to omit ALLOW in the name

Contributor Author

Why wrong?
The _ALLOW_REMOTE suffix is meant to say "this variant allows remote/SGL params" (because it doesn't run the rejection check that UCP_REQUEST_CHECK_PARAM adds).
I renamed it to UCP_REQUEST_CHECK_PARAM_COMMON to make it clearer.

Comment thread src/ucp/core/ucp_ep.c Outdated
void *request = NULL;
ucp_request_t *close_req;

UCP_REQUEST_CHECK_PARAM(param);

Contributor

i'd not add it here as it is not relevant for ep_close

Contributor Author

Done

Comment thread src/ucp/dt/datatype_iter.c
Comment thread src/ucp/dt/datatype_iter.c Outdated
ucp_datatype_iter_detect_mem_info(context, local->buffers[0],
local->lengths[0], dt_iter, param);
if (ENABLE_PARAMS_CHECK && (count > 1)) {
status = ucp_dt_sgl_check_same_mem_info(context, local, count,

Contributor

maybe pass local->buffers and local->lengths separately? Like

Suggested change
- status = ucp_dt_sgl_check_same_mem_info(context, local, count,
+ status = ucp_dt_sgl_check_same_mem_info(context, local->buffers + 1, local->lengths + 1, count - 1,

Contributor Author

Done

Comment thread src/ucp/rma/flush.c Outdated
{
void *request;

UCP_REQUEST_CHECK_PARAM(param);

Contributor

it is not needed here

Contributor Author

Done

Comment thread src/ucp/rma/flush.c Outdated
{
void *request;

UCP_REQUEST_CHECK_PARAM(param);

Contributor

it is not needed here

Contributor Author

Done

Comment thread src/ucp/rma/put_offload.c
Comment thread src/ucp/tag/tag_recv.c
Comment thread test/gtest/common/test_obj_size.cc
@michal-shalev michal-shalev requested a review from brminich April 24, 2026 15:59
@openucx openucx deleted a comment from svc-nixl May 4, 2026
Comment thread src/ucp/core/ucp_request.inl Outdated
Comment thread src/ucp/core/ucp_rkey.c Outdated
Comment thread test/gtest/ucp/test_ucp_rma.cc
Comment thread test/gtest/ucp/ucp_test.cc
@redbrick9


@michal-shalev will ucx_perftest test case ucp_put_bw support sge datatype?

@michal-shalev

Contributor Author

will ucx_perftest test case ucp_put_bw support sge datatype?

Support for SGL datatype in ucx_perftest is planned, but it is outside the scope of this PR and will be addressed separately in the future.

@michal-shalev michal-shalev requested a review from iyastreb May 25, 2026 14:09
@redbrick9


will ucx_perftest test case ucp_put_bw support sge datatype?

Support for SGL datatype in ucx_perftest is planned, but it is outside the scope of this PR and will be addressed separately in the future.

Got it. Thanks.

Comment thread src/ucp/rma/rma_send.c Outdated
if (ucs_unlikely(status != UCS_OK)) {
ret = UCS_STATUS_PTR(status);
goto out_unlock;
}

Contributor

newline after block missing

Contributor Author

Done

Comment thread src/ucp/rma/rma_send.c Outdated
ret = UCS_STATUS_PTR(status);
goto out_unlock;
}
remote = (const ucp_dt_remote_sgl_t *)param->remote;

Contributor

casting from void * is redundant

Contributor Author

Done

Comment thread src/ucp/rma/rma_send.c Outdated


static UCS_F_ALWAYS_INLINE ucs_status_t
ucp_put_sgl_check_params(const void *buffer, size_t count,

Contributor

Should this be a macro, so that we don't have an extra if branch in ucp_put_nbx?
Another alternative is to call this function only when ENABLE_PARAMS_CHECK is set, so that in release mode it's never invoked:
ucp_put_nbx:

if (ENABLE_PARAMS_CHECK) {
    status = ucp_put_sgl_check_params(...);
}

Contributor

it is already a no-op if ENABLE_PARAMS_CHECK is not set (see branch below)

Contributor Author

Changed to a macro to match existing RMA check style as a consistency change, not a performance fix.

Comment thread src/ucp/rma/rma_send.c Outdated
Comment thread src/ucp/rma/rma_send.c Outdated
Comment thread src/ucp/rma/put_offload.c
Comment thread src/ucp/rma/put_offload.c
uct_memhs[i] = (sgl_memhs != NULL) ?
sgl_memhs[start_index + i]->uct[md_index] :
UCT_MEM_HANDLE_NULL;
uct_rkeys[i] = ucp_rkey_get_tl_rkey(

Contributor

Maybe we can validate rkey_index once, and then just assign in a loop?

Contributor Author

if (rkey_index == UCP_NULL_RESOURCE) {
    for (i = 0; i < elem_count; i++) {
        uct_rkeys[i] = UCT_INVALID_RKEY;
        uct_memhs[i] = (sgl_memhs != NULL) ?
                       sgl_memhs[start_index + i]->uct[md_index] :
                       UCT_MEM_HANDLE_NULL;
    }
} else {
    for (i = 0; i < elem_count; i++) {
        uct_rkeys[i] =
            dt_iter->type.sgl.rkeys[start_index + i]->tl_rkey[rkey_index].rkey.rkey;
        uct_memhs[i] = (sgl_memhs != NULL) ?
                       sgl_memhs[start_index + i]->uct[md_index] :
                       UCT_MEM_HANDLE_NULL;
    }
}

Is this what you had in mind?

It would duplicate the memh lines, and since ucp_rkey_get_tl_rkey() is inline, rkey_index is the same every iteration, and elem_count is bounded by max_put_sgl_zcopy_count, so I'd expect the compiler to already optimize the current version.

Comment thread src/ucp/rma/put_offload.c
Comment thread src/ucp/dt/dt_sgl.c
Comment thread src/ucp/rma/put_offload.c
Comment thread src/ucp/rma/rma_send.c
@brminich brminich merged commit d65ffe7 into openucx:master Jun 1, 2026
160 checks passed
@michal-shalev michal-shalev deleted the ucp-sgl-put-impl branch June 1, 2026 11:18
5 participants