Skip to content

TL/NCCL: add user buffer registration via memmap#1260

Open
wfaderhold21 wants to merge 5 commits intoopenucx:masterfrom
wfaderhold21:topic/nccl-ub
Open

TL/NCCL: add user buffer registration via memmap#1260
wfaderhold21 wants to merge 5 commits intoopenucx:masterfrom
wfaderhold21:topic/nccl-ub

Conversation

@wfaderhold21
Copy link
Collaborator

What

Add support for NCCL user buffer registration via UCC mem map interface

@wfaderhold21
Copy link
Collaborator Author

Tested allgather performance on GAIA with 8 nodes, 8 PPN:
allgather_ubr_bandwidth_clean

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 30, 2026

Greptile Summary

This PR adds NCCL User Buffer Registration (UBR) support to the TL/NCCL component via UCC's mem map interface. Registration is deferred until the first collective that uses the buffer (ncclCommRegister is called lazily in ucc_tl_nccl_init_task), and deregistration happens in ucc_tl_nccl_mem_unmap. The feature is guarded by a compile-time NCCL_HAS_UBR check (NCCL ≥ 2.19) and a runtime ENABLE_UBR ternary config option.

Key concerns:

  • CUDA event leak on UBR failure (tl_nccl_coll.c:357–364 and 394–397): when ucc_tl_nccl_lazy_register_memh returns a non-UCC_OK status, both error paths call ucc_mpool_put(task) without first calling ucc_ec_destroy_event on task->completed. Because task->completed is populated earlier in the same function (when sync_type == SYNC_TYPE_EVENT), this leaks a CUDA event every time a buffer-overflow or OOM error fires during registration.
  • The IMPORT-mode use of memh->address / memh->len (addresses that may be from a remote process and documented as "likely garbage" by the core) was flagged in previous review threads; the inline comment in ucc_tl_nccl_mem_map explains the design rationale (NCCL handles IPC internally), but this should be confirmed with the team.

Confidence Score: 2/5

  • The PR introduces a CUDA event resource leak in the error paths of the UBR lazy-registration block that needs to be fixed before merging.
  • A confirmed CUDA event leak exists when ucc_tl_nccl_lazy_register_memh fails and sync_type == SYNC_TYPE_EVENT. Several other concerns (import-mode address validity, realloc partial-failure handling) were raised in previous review threads and appear partially addressed but not fully resolved. The leak is a correctness issue that would silently exhaust CUDA event resources under repeated error conditions.
  • src/components/tl/nccl/tl_nccl_coll.c requires attention for the CUDA event leak at the UBR error-exit paths (lines 357–364 and 394–397).

Important Files Changed

Filename Overview
src/components/tl/nccl/tl_nccl.h Adds ucc_tl_nccl_memh_data_t struct for lazy UBR handle tracking and ubr_available flag to the context. Clean, no issues.
src/components/tl/nccl/tl_nccl.c Adds ENABLE_UBR ternary config option and minor formatting fix. Clean, no issues.
src/components/tl/nccl/tl_nccl_context.c Implements ucc_tl_nccl_mem_map, ucc_tl_nccl_mem_unmap, and ucc_tl_nccl_memh_pack for lazy UBR. The IMPORT mode stores memh->address/memh->len directly, which are documented by the core as potentially stale after exchange (previously flagged); overall logic is sound for NCCL's IPC model if the design intent is intentional.
src/components/tl/nccl/tl_nccl_coll.c Adds lazy NCCL registration in ucc_tl_nccl_init_task. Two error paths in the UBR block call ucc_mpool_put(task) without first destroying task->completed, leaking the CUDA event when sync_type == SYNC_TYPE_EVENT.

Last reviewed commit: 57c1e9d

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +229 to +244
if (!new_comms) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for registered comms array");
/* Buffer is registered but we can't track it - this is a problem */
return UCC_ERR_NO_MEMORY;
}
m_data->registered_comms = new_comms;

new_handles = (void **)ucc_realloc(
m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
if (!new_handles) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for NCCL handles array");
return UCC_ERR_NO_MEMORY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resource leak: if ucc_realloc for new_comms succeeds but the second ucc_realloc for new_handles fails (line 240-244), the newly allocated new_comms is assigned to m_data->registered_comms (line 236) but the function returns UCC_ERR_NO_MEMORY without deregistering the NCCL buffer. This leaks the NCCL registration created at line 208.

Suggested change
if (!new_comms) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for registered comms array");
/* Buffer is registered but we can't track it - this is a problem */
return UCC_ERR_NO_MEMORY;
}
m_data->registered_comms = new_comms;
new_handles = (void **)ucc_realloc(
m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
if (!new_handles) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for NCCL handles array");
return UCC_ERR_NO_MEMORY;
if (!new_handles) {
/* Failed to grow handles array - must deregister to avoid leak */
ncclCommDeregister(team->nccl_comm, nccl_handle);
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for NCCL handles array");
return UCC_ERR_NO_MEMORY;
}

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +344 to +347
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.src.info.buffer,
coll_args->args.src.info.count *
ucc_dt_size(coll_args->args.src.info.datatype),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer size calculation is incorrect for variable-size collectives (ALLTOALLV, ALLGATHERV, GATHERV, SCATTERV). For these collectives, coll_args->args.src uses the info_v union member with counts* array, not info.count scalar. Accessing .info.count reads the counts pointer as an integer, resulting in incorrect buffer size. Need to check coll_args->args.coll_type and use appropriate buffer size calculation for each collective type.

Suggested change
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.src.info.buffer,
coll_args->args.src.info.count *
ucc_dt_size(coll_args->args.src.info.datatype),
// TODO: Need to calculate correct buffer size based on coll_type
// For variable-size collectives (ALLTOALLV, ALLGATHERV, etc),
// must sum counts array instead of using single count value
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.src.info.buffer,
coll_args->args.src.info.count *
ucc_dt_size(coll_args->args.src.info.datatype),

Comment on lines +368 to +371
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.dst.info.buffer,
coll_args->args.dst.info.count *
ucc_dt_size(coll_args->args.dst.info.datatype),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same buffer size calculation bug for dst buffer - incorrect for variable-size collectives that use info_v.counts* array instead of info.count scalar

Comment on lines +167 to +168
buf_start = (uintptr_t)buffer;
buf_end = buf_start + length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for potential integer overflow when buf_start + length exceeds UINTPTR_MAX. Should validate that length <= UINTPTR_MAX - buf_start before addition to prevent wraparound

Suggested change
buf_start = (uintptr_t)buffer;
buf_end = buf_start + length;
buf_start = (uintptr_t)buffer;
if (length > UINTPTR_MAX - buf_start) {
tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
return UCC_ERR_INVALID_PARAM;
}
buf_end = buf_start + length;

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +433 to +438

/* If UBR is not available/disabled or no TL data, return empty pack */
if (!ctx->ubr_available || !tl_h || !tl_h->tl_data) {
tl_h->packed_size = 0;
*pack_buffer = NULL;
return UCC_OK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null deref on pack

In ucc_tl_nccl_memh_pack, the early-return condition includes !tl_h, but the block still does tl_h->packed_size = 0. If tl_h == NULL this will crash. This can trigger if the iface is ever invoked with a NULL tl_h (the function explicitly allows it in the condition), so the guard should avoid dereferencing tl_h in that path.

Comment on lines +290 to +339
ucc_status_t ucc_tl_nccl_mem_map(
const ucc_base_context_t *context, ucc_mem_map_mode_t mode,
ucc_mem_map_memh_t *memh, ucc_mem_map_tl_t *tl_h)
{
return UCC_ERR_NOT_SUPPORTED;
ucc_tl_nccl_context_t *ctx = ucc_derived_of(context, ucc_tl_nccl_context_t);
ucc_tl_nccl_memh_data_t *m_data;

/* Check if UBR is available and enabled */
if (!ctx->ubr_available) {
tl_debug(
ctx->super.super.lib, "NCCL UBR not available, skipping mem_map");
return UCC_ERR_NOT_SUPPORTED;
}

/* Support both EXPORT and IMPORT modes for global memh */
if (mode != UCC_MEM_MAP_MODE_EXPORT && mode != UCC_MEM_MAP_MODE_IMPORT) {
tl_debug(ctx->super.super.lib,
"NCCL UBR: unsupported mode %d", mode);
return UCC_ERR_NOT_SUPPORTED;
}

/* Reject zero-length buffers */
if (memh->len == 0) {
tl_debug(ctx->super.super.lib,
"NCCL UBR: zero-length buffer, skipping mem_map");
return UCC_ERR_NOT_SUPPORTED;
}

/* Allocate TL-specific memory handle data */
m_data = (ucc_tl_nccl_memh_data_t *)ucc_calloc(
1, sizeof(ucc_tl_nccl_memh_data_t), "tl_nccl_memh_data");
if (!m_data) {
tl_error(
ctx->super.super.lib, "failed to allocate TL memory handle data");
return UCC_ERR_NO_MEMORY;
}

/* Store buffer information - registration will happen lazily on first use */
m_data->address = memh->address;
m_data->length = memh->len;
m_data->registered_comms = NULL;
m_data->nccl_handles = NULL;
m_data->num_comms = 0;
m_data->max_comms = 0;

/* Set TL handle data */
tl_h->tl_data = m_data;
strncpy(tl_h->tl_name, "nccl", UCC_MEM_MAP_TL_NAME_LEN - 1);
tl_h->tl_name[UCC_MEM_MAP_TL_NAME_LEN - 1] = '\0';

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMPORT memh uses garbage

ucc_tl_nccl_mem_map claims to support UCC_MEM_MAP_MODE_IMPORT, but it ignores the packed payload in memh->pack_buffer and instead stores memh->address/memh->len into m_data. In the core import path, those fields are explicitly treated as “likely garbage” for imported handles (src/core/ucc_context.c:1212-1214), so the imported NCCL TL handle will contain invalid address/length and later lazy registration may attempt ncclCommRegister on bogus pointers.

Comment on lines +166 to +175
/* Verify that the entire buffer is within the registered memory region */
buf_start = (uintptr_t)buffer;
buf_end = buf_start + length;
region_start = (uintptr_t)m_data->address;
region_end = region_start + m_data->length;

if (length > (UINTPTR_MAX - buf_start)) {
tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
return UCC_ERR_INVALID_PARAM;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overflow check too late

In ucc_tl_nccl_lazy_register_memh, buf_end is computed as buf_start + length before checking length <= UINTPTR_MAX - buf_start. If the addition overflows, buf_end is already wrapped/UB, and the subsequent bounds check can misbehave. The overflow guard needs to run before computing buf_end.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +327 to +334
/* For NCCL UBR, we only store metadata (address/length) for lazy registration.
* When ncclCommRegister is called later, it stores this metadata locally.
* The NCCL communicator handles IPC handle exchange internally during collective
* operations (via point-to-point proxy calls), so we don't need special IMPORT
* handling. We can use memh->address/memh->len directly in both EXPORT and IMPORT
* modes - the address should be valid in the current process context. */
m_data->address = memh->address;
m_data->length = memh->len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For IMPORT mode, verify whether using memh->address/memh->len directly is correct. The core code treats these as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), suggesting TLs should extract data from pack_buffer. If NCCL's IPC model truly makes local addresses valid for IMPORT mode (different from typical TL behavior), add documentation explaining this design choice. Otherwise, implement extraction from pack_buffer similar to the core's approach in ucc_context.c:1224-1238.

Comment on lines +357 to +364
if (ucc_unlikely(status != UCC_OK)) {
tl_error(
team->context->lib,
"NCCL UBR: lazy_register failed with status %d",
status);
ucc_mpool_put(task);
return status;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA event leak on UBR registration failure

Both UBR error paths (src at line 357 and dst at line 395) call ucc_mpool_put(task) directly after a failed ucc_tl_nccl_lazy_register_memh. By this point in the function, task->completed may already hold a valid CUDA event (created at line 309 when sync_type == UCC_TL_NCCL_COMPLETION_SYNC_TYPE_EVENT). Putting the task back to the pool without destroying the event leaks it.

The fix is to destroy the event before returning, mirroring the cleanup in ucc_tl_nccl_free_task:

Suggested change
if (ucc_unlikely(status != UCC_OK)) {
tl_error(
team->context->lib,
"NCCL UBR: lazy_register failed with status %d",
status);
ucc_mpool_put(task);
return status;
}
if (ucc_unlikely(status != UCC_OK)) {
tl_error(
team->context->lib,
"NCCL UBR: lazy_register failed with status %d",
status);
if (task->completed) {
ucc_ec_destroy_event(task->completed, UCC_EE_CUDA_STREAM);
}
ucc_mpool_put(task);
return status;
}

The same fix is required for the dst-buffer error path at line 394–397.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant