UCC/CORE: Added local rank from topo if not provided by user#1245
UCC/CORE: Added local rank from topo if not provided by user#1245MaayanGadishNvidia wants to merge 1 commit intoopenucx:masterfrom
Conversation
Greptile Overview
|
| Filename | Overview |
|---|---|
| src/components/topo/ucc_topo.h | Adds ucc_topo_node_local_rank() helper function to retrieve local rank from topology |
| src/core/ucc_context.c | Implements automatic local rank computation from topology, with context ID initialization moved earlier and topo double-init protection added |
211c631 to
51f149d
Compare
src/core/ucc_context.c
Outdated
| if (ctx->topo) { | ||
| status = ucc_topo_init(set, ctx->topo, &topo); | ||
| if (UCC_OK != status) { | ||
| ucc_warn("failed to init topo for computing local rank"); | ||
| } else { | ||
| b_params.node_local_id = ucc_topo_node_local_rank(topo); | ||
| ucc_topo_cleanup(topo); | ||
| } | ||
| } |
There was a problem hiding this comment.
When n_oob_eps <= 1, ctx->topo is NULL, so local rank computation is skipped. For single-rank scenarios, b_params.node_local_id remains UCC_ULUNITS_AUTO and gets passed to UCX. Should single-rank cases explicitly set node_local_id = 0?
| if (ctx->topo) { | |
| status = ucc_topo_init(set, ctx->topo, &topo); | |
| if (UCC_OK != status) { | |
| ucc_warn("failed to init topo for computing local rank"); | |
| } else { | |
| b_params.node_local_id = ucc_topo_node_local_rank(topo); | |
| ucc_topo_cleanup(topo); | |
| } | |
| } | |
| if (ctx->topo) { | |
| status = ucc_topo_init(set, ctx->topo, &topo); | |
| if (UCC_OK != status) { | |
| ucc_warn("failed to init topo for computing local rank"); | |
| } else { | |
| b_params.node_local_id = ucc_topo_node_local_rank(topo); | |
| ucc_topo_cleanup(topo); | |
| } | |
| } else if (params->mask & UCC_CONTEXT_PARAM_FIELD_OOB && params->oob.n_oob_eps == 1) { | |
| b_params.node_local_id = 0; | |
| } |
| if (addr_storage->addr_len == 0 ) { | ||
| ucc_free(addr_storage->storage); | ||
| addr_storage->storage = NULL; | ||
| return UCC_OK; | ||
| } |
There was a problem hiding this comment.
Unreachable code: addr_len is set to a fixed non-zero value at line 618, so this check will never be true. This code was copied from ucc_core_addr_exchange where addr_len varies, but here it's always sizeof(ucc_context_id_t).
| ucp_params.estimated_num_eps = params->estimated_num_eps; | ||
| } | ||
|
|
||
| #ifdef HAVE_UCX_NODE_LOCAL_ID |
There was a problem hiding this comment.
Trailing whitespace added.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| if (addr_storage->addr_len == 0 ) { | ||
| ucc_free(addr_storage->storage); | ||
| addr_storage->storage = NULL; | ||
| return UCC_OK; | ||
| } |
There was a problem hiding this comment.
Dead code - addr_len is always set to sizeof(ucc_context_id_t) on line 618, so this check will never be true
| if (addr_storage->addr_len == 0 ) { | |
| ucc_free(addr_storage->storage); | |
| addr_storage->storage = NULL; | |
| return UCC_OK; | |
| } | |
| ucc_assert(addr_storage->storage != NULL); |
a3f3680 to
36a073b
Compare
| goto error_ctx_create; | ||
| } | ||
| } while (status == UCC_INPROGRESS); | ||
| status = ucc_context_topo_init(&ctx->addr_storage, &ctx->topo); |
There was a problem hiding this comment.
Potential double initialization of ctx->topo. When automatic local rank computation runs (line 747), it initializes ctx->topo. If topo_required is also true, line 880 attempts to initialize ctx->topo again without freeing the first allocation, causing a memory leak of the original topo structure and its procs array.
| if (addr_storage->addr_len == 0 ) { | ||
| ucc_free(addr_storage->storage); | ||
| addr_storage->storage = NULL; | ||
| return UCC_OK; | ||
| } |
There was a problem hiding this comment.
Dead code - addr_len is always set to sizeof(ucc_context_id_t) at line 618, so this condition will never be true
Additional Comments (1)
|
44108af to
a32583c
Compare
| if (addr_storage->addr_len == 0 ) { | ||
| ucc_free(addr_storage->storage); | ||
| addr_storage->storage = NULL; | ||
| return UCC_OK; | ||
| } |
There was a problem hiding this comment.
dead code - addr_len is always set to sizeof(ucc_context_id_t) at line 618 (non-zero), so this condition will never be true
while this pattern exists in ucc_core_addr_exchange() where addr_len varies based on allgather results, here it's a fixed non-zero value
a32583c to
c4f6048
Compare
| if (addr_storage->addr_len == 0 ) { | ||
| ucc_free(addr_storage->storage); | ||
| addr_storage->storage = NULL; | ||
| return UCC_OK; | ||
| } |
There was a problem hiding this comment.
dead code - addr_len is always set to sizeof(ucc_context_id_t) at line 618, so this condition will never be true
| if (addr_storage->addr_len == 0 ) { | |
| ucc_free(addr_storage->storage); | |
| addr_storage->storage = NULL; | |
| return UCC_OK; | |
| } | |
| ucc_assert(addr_storage->storage != NULL); |
c4f6048 to
a6456d8
Compare
|
/build |
What
Adding automation for calculate local rank from topo if not provided by user
Why ?
Continue of #1189