UCP: SGL datatype implementation#11344
Conversation
…tialized in put/sgl/offload
dc504a0 to
24b9d43
Compare
0d9e37d to
4c64772
Compare
4c64772 to
7598a51
Compare
|
|
||
| iface_attr_v2.field_mask = UCT_IFACE_ATTR_FIELD_MAX_PUT_SGL_ZCOPY_COUNT; | ||
| status = uct_iface_query_v2( | ||
| ucp_worker_iface(ep->worker, |
There was a problem hiding this comment.
max_put_sgl_zcopy_count is queried on every send invocation but is static post-init. Consider caching it in the protocol private data during probe, the same way max_iov is handled in other protocols.
There was a problem hiding this comment.
max_put_sgl_zcopy_count is now cached in ucp_proto_multi_lane_priv_t.
| /* Multiple progress calls, translate memh + rkey per-chunk on stack */ | ||
| uct_mem_h *uct_memhs; | ||
| uct_rkeys = ucs_alloca(elem_count * sizeof(uct_rkey_t)); | ||
|
|
There was a problem hiding this comment.
ucs_alloca asserts if the allocation exceeds ~1KB, which fires at ~128 elements (8 bytes × 128). Consider a small fixed threshold and fall back to ucs_malloc for larger counts — similar to what the SIZE_MAX branch below already does.
There was a problem hiding this comment.
is it safe to use alloca considering that request may end up in pending queue?
There was a problem hiding this comment.
For ucs_alloca with fallback to malloc you can use ucs_alloc_on_stack (and free with ucs_free_on_stack)
| ucp_datatype_iter_detect_mem_info(context, local->buffers[0], | ||
| local->lengths[0], dt_iter, param); | ||
| if (ENABLE_PARAMS_CHECK && (count > 1)) { | ||
| status = ucp_dt_sgl_memtype_check(context, local, count, |
There was a problem hiding this comment.
ucp_dt_sgl_memtype_check is only called under ENABLE_PARAMS_CHECK. In release builds, mixed HOST+GPU buffers pass through silently. Worth noting in the API docs at minimum.
There was a problem hiding this comment.
This check matches the pre-existing IOV path (ucp_datatype_iov_iter_init).
IOV doesn't document this.
I added a documentation for it (see ucp_dt_local_sgl_t).
| break; | ||
| case UCP_DATATYPE_SGL: | ||
| if (!ucp_memh_is_buffer_in_range(memh, dt_iter->type.sgl.buffers[0], | ||
| dt_iter->type.sgl.lengths[0])) { |
There was a problem hiding this comment.
Only buffers[0] is range-checked against the user memh. If the intent is one memh per SGL element, this should loop over all entries.
There was a problem hiding this comment.
ucp_datatype_iter_is_user_memh_valid signature only takes a single memh, so I had to check it from dt_iter.
Fixed + added tests.
| } | ||
|
|
||
| UCP_REQUEST_CHECK_PARAM(param); | ||
| UCP_REQUEST_CHECK_PARAM_UNSUPPORTED_REMOTE(param); |
There was a problem hiding this comment.
can we unite it with UCP_REQUEST_CHECK_PARAM?
| return dst_iov_index; | ||
| } | ||
|
|
||
| ucs_status_t ucp_datatype_sgl_iter_init(ucp_context_h context, |
There was a problem hiding this comment.
seems the common naming is ucp_datatype_iter_*, lets rename accordingly
ucp_datatype_iov_iter_init is the only exception
There was a problem hiding this comment.
Renamed to ucp_datatype_iter_sgl_init.
| ucp_mem_h *memhs; | ||
| const uint64_t *remote_addrs; | ||
| ucp_rkey_h const *rkeys; | ||
| int memhs_owned; |
There was a problem hiding this comment.
can we reuse ucp_memh_is_user_memh somehow instead of adding this field?
|
|
||
| if (!ucp_proto_init_check_op(init_params, UCS_BIT(UCP_OP_ID_PUT))) { | ||
| if (!ucp_proto_init_check_op(init_params, UCS_BIT(UCP_OP_ID_PUT)) || | ||
| (init_params->select_param->dt_class == UCP_DATATYPE_SGL)) { |
There was a problem hiding this comment.
should we also restrict selecting all other protocols for SGL datatype?
| elem_count = ucp_datatype_iter_next_sgl(dt_iter, max_sgl_count, next_iter); | ||
|
|
||
| if (max_sgl_count < SIZE_MAX) { | ||
| /* Multiple progress calls, translate memh + rkey per-chunk on stack */ |
There was a problem hiding this comment.
Because for cuda_ipc it's a single progress call.
In any case, I simplified the logic here.
| /* Multiple progress calls, translate memh + rkey per-chunk on stack */ | ||
| uct_mem_h *uct_memhs; | ||
| uct_rkeys = ucs_alloca(elem_count * sizeof(uct_rkey_t)); | ||
|
|
There was a problem hiding this comment.
is it safe to use alloca considering that request may end up in pending queue?
| uct_ep, | ||
| &dt_iter->type.sgl.buffers[start_index], | ||
| &dt_iter->type.sgl.lengths[start_index], | ||
| NULL, |
There was a problem hiding this comment.
cuda_ipc doesn't use memh.
There was a problem hiding this comment.
Now that I think of this again, I decided to always fill the uct_memhs array, mirroring the regular zcopy path, and to make UCP/UCT layering cleaner.
906a1f9 to
6934aa2
Compare
| goto out_unlock; | ||
| } | ||
| remote = (const ucp_dt_remote_sgl_t *)param->remote; | ||
| rkey = remote->rkeys[0]; |
There was a problem hiding this comment.
This assumes count > 0, should we guard against it?
|
|
||
| #define UCP_REQUEST_CHECK_PARAM(_param) \ | ||
| do { \ | ||
| UCP_REQUEST_CHECK_PARAM_ALLOW_REMOTE(_param); \ |
There was a problem hiding this comment.
seems naming is wrong, as checking remote is below in this macro.
also better to omit ALLOW in the name
There was a problem hiding this comment.
Why wrong?
The _ALLOW_REMOTE suffix is meant to say "this variant allows remote/SGL params" (because it doesn't run the rejection check that UCP_REQUEST_CHECK_PARAM adds).
I renamed it to UCP_REQUEST_CHECK_PARAM_COMMON to make it clearer.
| void *request = NULL; | ||
| ucp_request_t *close_req; | ||
|
|
||
| UCP_REQUEST_CHECK_PARAM(param); |
There was a problem hiding this comment.
i'd not add it here as it is not relevant for ep_close
| ucp_datatype_iter_detect_mem_info(context, local->buffers[0], | ||
| local->lengths[0], dt_iter, param); | ||
| if (ENABLE_PARAMS_CHECK && (count > 1)) { | ||
| status = ucp_dt_sgl_check_same_mem_info(context, local, count, |
There was a problem hiding this comment.
maybe pass local->buffers and local->lengths separately? Like
| status = ucp_dt_sgl_check_same_mem_info(context, local, count, | |
| status = ucp_dt_sgl_check_same_mem_info(context, local->buffers +1 , local->lengths +1 count - 1, |
| { | ||
| void *request; | ||
|
|
||
| UCP_REQUEST_CHECK_PARAM(param); |
| { | ||
| void *request; | ||
|
|
||
| UCP_REQUEST_CHECK_PARAM(param); |
|
@michal-shalev will ucx_perftest test case ucp_put_bw support sge datatype? |
Support for SGL datatype in ucx_perftest is planned, but it is outside the scope of this PR and will be addressed separately in the future. |
Got it. Thanks. |
| if (ucs_unlikely(status != UCS_OK)) { | ||
| ret = UCS_STATUS_PTR(status); | ||
| goto out_unlock; | ||
| } |
There was a problem hiding this comment.
newline after block missing
| ret = UCS_STATUS_PTR(status); | ||
| goto out_unlock; | ||
| } | ||
| remote = (const ucp_dt_remote_sgl_t *)param->remote; |
There was a problem hiding this comment.
casting from void * is redundant
|
|
||
|
|
||
| static UCS_F_ALWAYS_INLINE ucs_status_t | ||
| ucp_put_sgl_check_params(const void *buffer, size_t count, |
There was a problem hiding this comment.
Should this be a macro, so that we don't have extra if branch in ucp_put_nbx?
Another alternative is to call this function only when ENABLE_PARAMS_CHECK is set, so that in release mode it's never invoked:
ucp_put_nbx:
if (ENABLE_PARAMS_CHECK) {
status = ucp_put_sgl_check_params(...)
}
There was a problem hiding this comment.
it is already no-op if ENABLE_PARAMS is not set (see branch below)
There was a problem hiding this comment.
Changed to a macro to match existing RMA check style as a consistency change, not a performance fix.
| uct_memhs[i] = (sgl_memhs != NULL) ? | ||
| sgl_memhs[start_index + i]->uct[md_index] : | ||
| UCT_MEM_HANDLE_NULL; | ||
| uct_rkeys[i] = ucp_rkey_get_tl_rkey( |
There was a problem hiding this comment.
Maybe we can validate rkey_index once, and then just assign in a loop?
There was a problem hiding this comment.
if (rkey_index == UCP_NULL_RESOURCE) {
for (i = 0; i < elem_count; i++) {
uct_rkeys[i] = UCT_INVALID_RKEY;
uct_memhs[i] = (sgl_memhs != NULL) ?
sgl_memhs[start_index + i]->uct[md_index] :
UCT_MEM_HANDLE_NULL;
}
} else {
for (i = 0; i < elem_count; i++) {
uct_rkeys[i] =
dt_iter->type.sgl.rkeys[start_index + i]->tl_rkey[rkey_index].rkey.rkey;
uct_memhs[i] = (sgl_memhs != NULL) ?
sgl_memhs[start_index + i]->uct[md_index] :
UCT_MEM_HANDLE_NULL;
}
}
Is this what you had in mind?
It would duplicate the memh lines, and since ucp_rkey_get_tl_rkey() is inline, rkey_index is the same every iteration, and elem_count is bounded by max_put_sgl_zcopy_count, so I'd expect the compiler to already optimize the current version.
|
|
||
|
|
||
| static UCS_F_ALWAYS_INLINE ucs_status_t | ||
| ucp_put_sgl_check_params(const void *buffer, size_t count, |
There was a problem hiding this comment.
it is already no-op if ENABLE_PARAMS is not set (see branch below)
What?
UCP SGL (scatter–gather list) datatype: iterators, packing, RMA put integration (proto / put offload), UCT v2 iface capability bits, and gtests.
Why?
Lets callers pass multi-buffer local and remote descriptors for one-sided put in a first-class way.
How?
dt_sgl+ extensions to the datatype iterator and proto selection for SGL.put_offload/ RMA send and UCTuct_iface_v2fields for max SGL put segments / zcopy.UCT_IFACE_FLAG_V2_PUT_SGL_ZCOPY.