michal-shalev
Contributor

What?

Make memh and local_addr optional for counter elements in ucp_device_mem_list_create.

Why?

Counter elements only require remote addressing for atomic operations (ucp_device_counter_inc, or as part of ucp_device_put_multi/put_multi_partial). Requiring local memory registration (memh/local_addr) for these elements is unnecessary overhead.

How?

  • When memh is not provided, detect local_sys_dev by allocating a temporary buffer on the current CUDA context device (similar to ucp_ep_rma_batch_export)
  • Updated tests and documentation
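As a rough illustration of what the change allows, the sketch below fills a counter element with remote addressing only. The types and `SKETCH_*` names are stand-ins invented for this example, not the real UCX definitions; bit 0 for the memh flag is an assumption, while bits 1 through 4 follow the enum quoted later in this review.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in definitions for illustration only; the real ones live in the UCX
 * device API headers. MEMH at bit 0 is an assumption. */
enum {
    SKETCH_ELEM_FIELD_MEMH        = 1u << 0,
    SKETCH_ELEM_FIELD_RKEY        = 1u << 1,
    SKETCH_ELEM_FIELD_LOCAL_ADDR  = 1u << 2,
    SKETCH_ELEM_FIELD_REMOTE_ADDR = 1u << 3,
    SKETCH_ELEM_FIELD_LENGTH      = 1u << 4
};

/* Hypothetical, simplified element layout with opaque handles. */
typedef struct {
    uint64_t field_mask;
    void     *memh;
    void     *rkey;
    void     *local_addr;
    uint64_t remote_addr;
    size_t   length;
} sketch_elem_t;

/* Fill a counter element: no memh and no local_addr are supplied. */
static void sketch_fill_counter(sketch_elem_t *elem, void *rkey,
                                uint64_t remote_addr, size_t length)
{
    /* Zero-initialize so that memh and local_addr default to NULL. */
    memset(elem, 0, sizeof(*elem));
    elem->field_mask  = SKETCH_ELEM_FIELD_RKEY |
                        SKETCH_ELEM_FIELD_REMOTE_ADDR |
                        SKETCH_ELEM_FIELD_LENGTH;
    elem->rkey        = rkey;
    elem->remote_addr = remote_addr;
    elem->length      = length;
}
```

Zero-initializing the element first means the optional fields need no explicit NULL assignments, which is the simplification discussed in the first review thread.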

UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR |
UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
elem.memh = NULL;
elem.local_addr = NULL;

Contributor

These can be the default, which would simplify the if.

Contributor Author

Initialized elem to 0.

return status;
}

if (local_sys_dev == UCS_SYS_DEVICE_ID_UNKNOWN) {

Contributor

Maybe move this check to ucp_device_mem_list_params_check.

Contributor Author

done

*local_md_map = memh->md_map;
*mem_type = memh->mem_type;
} else {
*mem_type = rkey->mem_type;

Contributor

This should be CUDA for now.

Contributor Author

done
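
The fallback agreed on in this thread could look roughly like the following. This is a sketch with stand-in types (`sketch_memh_t`, `sketch_mem_type_t`), not the actual UCX code; leaving `local_md_map` untouched when there is no memh is an assumption, since the quoted hunk does not show what happens to it.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types for illustration; not the real UCX definitions. */
typedef enum {
    SKETCH_MEM_TYPE_HOST,
    SKETCH_MEM_TYPE_CUDA
} sketch_mem_type_t;

typedef struct {
    sketch_mem_type_t mem_type;
    uint64_t          md_map;
} sketch_memh_t;

/* Resolve local memory properties for one element. */
static void sketch_resolve_local(const sketch_memh_t *memh,
                                 uint64_t *local_md_map,
                                 sketch_mem_type_t *mem_type)
{
    if (memh != NULL) {
        /* A memh was provided: take both values from it. */
        *local_md_map = memh->md_map;
        *mem_type     = memh->mem_type;
    } else {
        /* No memh: assume CUDA for now, as requested in the review.
         * local_md_map is left unchanged here (an assumption). */
        *mem_type = SKETCH_MEM_TYPE_CUDA;
    }
}
```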

- UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH = UCS_BIT(4) /**< Length of the local buffer in bytes */
+ UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR = UCS_BIT(2), /**< Local address (optional for counter elements) */
+ UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */
+ UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH = UCS_BIT(4) /**< Length of the local buffer in bytes (optional for counter elements) */

Contributor

memh, local_addr, and length should all be optional.

Contributor Author

length is only optional for the partial variant (ucp_device_put_multi_partial).

  UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY = UCS_BIT(1), /**< Unpacked remote memory key */
- UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR = UCS_BIT(2), /**< Local address */
- UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */
+ UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR = UCS_BIT(2), /**< Local address (optional for counter elements) */

Contributor

Actually, it can also be optional for data elements. Only rkey is required (it is the only field we check in ucp_device_mem_list_params_check); the rest are optional.
We do need a local address for ucp_device_put_multi, but a mem list is not bound to a specific API function.

Contributor Author

updated
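
To make the "only rkey is required" rule concrete, here is a minimal sketch of such a check. It is not the actual body of ucp_device_mem_list_params_check; the element type, the `SKETCH_FIELD_RKEY` flag, and the integer return codes are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag and element type; not the real UCX definitions. */
#define SKETCH_FIELD_RKEY (1u << 1)

typedef struct {
    uint64_t field_mask;
    void     *memh;       /* optional */
    void     *rkey;       /* required */
    void     *local_addr; /* optional */
    uint64_t remote_addr; /* optional */
    size_t   length;      /* optional */
} sketch_elem_t;

/* Returns 0 if every element carries an rkey, -1 otherwise.
 * No other field is validated, mirroring the rule stated above. */
static int sketch_params_check(const sketch_elem_t *elems, size_t count)
{
    size_t i;

    if ((elems == NULL) || (count == 0)) {
        return -1;
    }

    for (i = 0; i < count; ++i) {
        if (!(elems[i].field_mask & SKETCH_FIELD_RKEY) ||
            (elems[i].rkey == NULL)) {
            return -1;
        }
    }

    return 0;
}
```

Under this rule, whether a local address is needed is decided by the operation that later consumes the list, not by list creation.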

elems[i].memh = perf.ucp.send_memh;
elems[i].rkey = perf.ucp.rkey;
elems[i].local_addr = UCS_PTR_BYTE_OFFSET(perf.send_buffer, offset);
bool is_counter = (i == count - 1);

Contributor

        for (size_t i = 0; i < count - 1; ++i) {
            elems[i].field_mask  = UCP_DEVICE_MEM_LIST_ELEM_FIELD_MEMH |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
            elems[i].memh        = perf.ucp.send_memh;
            elems[i].rkey        = perf.ucp.rkey;
            elems[i].local_addr  = UCS_PTR_BYTE_OFFSET(perf.send_buffer, offset);
            elems[i].remote_addr = perf.ucp.remote_addr + offset;
            elems[i].length      = perf.params.msg_size_list[i];
            offset              += elems[i].length;
        }

       
        elems[count - 1].field_mask  = UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY |
                                       UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR |
                                       UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
        elems[count - 1].rkey        = perf.ucp.rkey;
        elems[count - 1].remote_addr = perf.ucp.remote_addr + offset;
        elems[count - 1].length      = ONESIDED_SIGNAL_SIZE;

Contributor Author

We will not always have a counter; see UCX_PERF_CMD_PUT_SINGLE.

Contributor

        size_t data_count = perf.params.msg_size_cnt;
        for (size_t i = 0; i < data_count; ++i) {
            elems[i].field_mask  = UCP_DEVICE_MEM_LIST_ELEM_FIELD_MEMH |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR |
                                   UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
            elems[i].memh        = perf.ucp.send_memh;
            elems[i].rkey        = perf.ucp.rkey;
            elems[i].local_addr  = UCS_PTR_BYTE_OFFSET(perf.send_buffer, offset);
            elems[i].remote_addr = perf.ucp.remote_addr + offset;
            elems[i].length      = perf.params.msg_size_list[i];
            offset              += elems[i].length;
        }

 
        if (m_has_counter) {
            elems[data_count].field_mask  = UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY |
                                            UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR |
                                            UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
            elems[data_count].rkey        = perf.ucp.rkey;
            elems[data_count].remote_addr = perf.ucp.remote_addr + offset;
            elems[data_count].length      = ONESIDED_SIGNAL_SIZE;
        }

Contributor Author

So you're suggesting to set rkey and remote_addr twice to the same value?

Contributor

Otherwise, there will be a condition that is checked at each iteration.

Contributor Author

Fixed
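
The layout settled on in this thread, a branch-free loop over the data elements followed by one optional counter slot, can be sketched as below. The `sketch_slot_t` type and the function name are hypothetical and track only offsets and lengths; the real perftest code fills UCX element structures instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of one list entry. */
typedef struct {
    uint64_t remote_offset; /* offset from the remote base address */
    size_t   length;
    int      is_counter;
} sketch_slot_t;

/* Lay out data_count data slots, then an optional trailing counter slot.
 * Returns the number of slots written. The caller must provide room for
 * data_count + 1 slots when has_counter is set. */
static size_t sketch_layout(sketch_slot_t *slots, const size_t *sizes,
                            size_t data_count, int has_counter,
                            size_t counter_size)
{
    size_t offset = 0;
    size_t i;

    /* Data elements: no per-iteration counter check inside the loop. */
    for (i = 0; i < data_count; ++i) {
        slots[i].remote_offset = offset;
        slots[i].length        = sizes[i];
        slots[i].is_counter    = 0;
        offset                += sizes[i];
    }

    /* The counter, if any, lands right after the last data element. */
    if (has_counter) {
        slots[data_count].remote_offset = offset;
        slots[data_count].length        = counter_size;
        slots[data_count].is_counter    = 1;
    }

    return data_count + (has_counter ? 1 : 0);
}
```

Hoisting the counter case out of the loop costs a few duplicated assignments but keeps the loop body free of a condition that is false on every iteration except the last.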

  /* +1 for the counter */
- size_t count = perf.params.msg_size_cnt + 1;
+ size_t count = perf.params.msg_size_cnt + (m_has_counter ? 1 : 0);
  size_t offset = 0;

Contributor

@rakhmets rakhmets Oct 13, 2025

Not an issue of this PR.
The variable (offset) can be deleted.
