UCC/CORE: Added local rank from topo if not provided by user#1245

Open
MaayanGadishNvidia wants to merge 1 commit into openucx:master from MaayanGadishNvidia:NIC_bid_auto

Conversation

@MaayanGadishNvidia

What

Automatically compute the local rank from the topology when the user does not provide it.

Why?

Continuation of #1189.

@greptile-apps
Contributor

greptile-apps bot commented Jan 22, 2026

Greptile Overview

Greptile Summary

This PR implements automatic local rank computation from topology when not provided by the user. The implementation adds a new ucc_core_ctx_id_exchange() function to exchange context IDs for topology initialization, and uses the topology to compute the node local rank before creating TL contexts. The perf tool is updated to rely on automatic computation instead of manual configuration.

Key changes:

  • Added ucc_topo_node_local_rank() helper to retrieve local rank from node subgroup
  • Context ID initialization moved earlier (before local rank computation) to fix previous double-init issue
  • Added !ctx->topo check on line 878 to prevent double topo initialization
  • Automatic local rank computation only triggers when node_local_id == UCC_ULUNITS_AUTO and OOB is available with multiple endpoints
  • Previous review comments have been addressed regarding context ID timing and topo double-init
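
As a rough illustration of what "local rank from topology" means, the sketch below derives a rank's position among the ranks that share its host. It is not the code in this PR, which uses ucc_topo_node_local_rank() on the node subgroup; the function name and the per-rank host-identifier array are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of UCC: given one host identifier per rank
 * (the kind of data an allgather over the OOB could produce), return the
 * position of `my_rank` among the ranks that share its host. */
static size_t sketch_node_local_rank(const uint64_t *host_ids, size_t n_ranks,
                                     size_t my_rank)
{
    size_t local_rank = 0;
    size_t i;

    assert(my_rank < n_ranks);
    /* Count the lower-numbered ranks that live on the same host. */
    for (i = 0; i < my_rank; i++) {
        if (host_ids[i] == host_ids[my_rank]) {
            local_rank++;
        }
    }
    return local_rank;
}
```

With host identifiers {10, 20, 10, 20, 10}, ranks 0, 2 and 4 share a host and would get local ranks 0, 1 and 2.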

Confidence Score: 4/5

  • Safe to merge with minor dead code that should be cleaned up
  • The PR addresses most previous review concerns (context ID double-init, topo double-init). The implementation is sound and follows existing patterns. Score reduced by 1 due to dead code at line 632 that should be removed for code cleanliness, though it doesn't affect functionality.
  • src/core/ucc_context.c contains dead code at line 632 that should be cleaned up

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/components/topo/ucc_topo.h | Adds ucc_topo_node_local_rank() helper function to retrieve local rank from topology |
| src/core/ucc_context.c | Implements automatic local rank computation from topology, with context ID initialization moved earlier and topo double-init protection added |

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

Comment on lines +770 to +778
    if (ctx->topo) {
        status = ucc_topo_init(set, ctx->topo, &topo);
        if (UCC_OK != status) {
            ucc_warn("failed to init topo for computing local rank");
        } else {
            b_params.node_local_id = ucc_topo_node_local_rank(topo);
            ucc_topo_cleanup(topo);
        }
    }

When n_oob_eps <= 1, ctx->topo is NULL, so local rank computation is skipped. For single-rank scenarios, b_params.node_local_id remains UCC_ULUNITS_AUTO and gets passed to UCX. Should single-rank cases explicitly set node_local_id = 0?

Suggested change
-    if (ctx->topo) {
-        status = ucc_topo_init(set, ctx->topo, &topo);
-        if (UCC_OK != status) {
-            ucc_warn("failed to init topo for computing local rank");
-        } else {
-            b_params.node_local_id = ucc_topo_node_local_rank(topo);
-            ucc_topo_cleanup(topo);
-        }
-    }
+    if (ctx->topo) {
+        status = ucc_topo_init(set, ctx->topo, &topo);
+        if (UCC_OK != status) {
+            ucc_warn("failed to init topo for computing local rank");
+        } else {
+            b_params.node_local_id = ucc_topo_node_local_rank(topo);
+            ucc_topo_cleanup(topo);
+        }
+    } else if (params->mask & UCC_CONTEXT_PARAM_FIELD_OOB && params->oob.n_oob_eps == 1) {
+        b_params.node_local_id = 0;
+    }
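
The fallback order this comment argues for can be sketched in isolation as below. This is only an illustration under stated assumptions: SKETCH_AUTO is a stand-in sentinel (the real value of UCC_ULUNITS_AUTO is not shown in this thread), and the function and its parameters are hypothetical rather than UCC APIs.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the "auto" sentinel; the actual UCC_ULUNITS_AUTO value is an
 * unknown here, so this constant is an assumption. */
#define SKETCH_AUTO (ULONG_MAX - 1)

/* Hypothetical resolver: keep a user-supplied ID, use the topology result
 * when one is available, and fall back to 0 for a single-endpoint context. */
static unsigned long sketch_resolve_local_id(unsigned long user_id,
                                             unsigned long n_oob_eps,
                                             int           have_topo,
                                             unsigned long topo_local_rank)
{
    if (user_id != SKETCH_AUTO) {
        return user_id;         /* an explicit user value always wins */
    }
    if (have_topo) {
        return topo_local_rank; /* computed from the topology */
    }
    if (n_oob_eps == 1) {
        return 0;               /* a lone rank is rank 0 on its node */
    }
    return SKETCH_AUTO;         /* left unresolved for the lower layer */
}
```

The last branch keeps the sentinel so that a lower layer can still apply its own default when neither a topology nor a single-endpoint shortcut is available.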

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

Comment on lines +632 to +636
    if (addr_storage->addr_len == 0 ) {
        ucc_free(addr_storage->storage);
        addr_storage->storage = NULL;
        return UCC_OK;
    }

Unreachable code: addr_len is set to a fixed non-zero value at line 618, so this check will never be true. This code was copied from ucc_core_addr_exchange where addr_len varies, but here it's always sizeof(ucc_context_id_t).
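
A minimal sketch of why such a branch cannot run: when the length is assigned from sizeof of a fixed type, the compiler can already prove it is non-zero. The type and function names below are hypothetical stand-ins, not the definitions from ucc_context.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for ucc_context_id_t; the real layout may differ. */
typedef struct {
    uint64_t host_hash;
    uint32_t pid;
} sketch_ctx_id_t;

typedef struct {
    size_t addr_len;
    void  *storage;
} sketch_addr_storage_t;

/* The element size is a compile-time constant and can never be zero. */
_Static_assert(sizeof(sketch_ctx_id_t) > 0, "context id must be non-empty");

/* Allocates room for one ID per endpoint. Returns 0 on success, -1 on
 * allocation failure. No `addr_len == 0` branch is needed here. */
static int sketch_id_storage_alloc(sketch_addr_storage_t *s, size_t n_eps)
{
    s->addr_len = sizeof(sketch_ctx_id_t);
    s->storage  = calloc(n_eps, s->addr_len);
    if (s->storage == NULL) {
        return -1;
    }
    assert(s->addr_len != 0); /* always holds; documents the invariant */
    return 0;
}
```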

        ucp_params.estimated_num_eps = params->estimated_num_eps;
    }

#ifdef HAVE_UCX_NODE_LOCAL_ID

Trailing whitespace added.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 3 comments

Comment on lines +632 to +636
    if (addr_storage->addr_len == 0 ) {
        ucc_free(addr_storage->storage);
        addr_storage->storage = NULL;
        return UCC_OK;
    }

Dead code - addr_len is always set to sizeof(ucc_context_id_t) on line 618, so this check will never be true

Suggested change
-    if (addr_storage->addr_len == 0 ) {
-        ucc_free(addr_storage->storage);
-        addr_storage->storage = NULL;
-        return UCC_OK;
-    }
+    ucc_assert(addr_storage->storage != NULL);

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

            goto error_ctx_create;
        }
    } while (status == UCC_INPROGRESS);
    status = ucc_context_topo_init(&ctx->addr_storage, &ctx->topo);

Potential double initialization of ctx->topo. When automatic local rank computation runs (line 747), it initializes ctx->topo. If topo_required is also true, line 880 attempts to initialize ctx->topo again without freeing the first allocation, causing a memory leak of the original topo structure and its procs array.
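
The usual guard against this kind of leak is to initialize only when the pointer is still NULL. The sketch below shows that pattern with a live-object counter to make the leak observable; every type and function in it is a hypothetical stand-in, not the UCC topo API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a topo object that owns a procs array. */
typedef struct {
    int *procs;
} sketch_topo_t;

static int sketch_live_topos = 0; /* counts objects not yet cleaned up */

static int sketch_topo_init(sketch_topo_t **topo_p, size_t n_procs)
{
    sketch_topo_t *t = malloc(sizeof(*t));

    if (t == NULL) {
        return -1;
    }
    t->procs = calloc(n_procs, sizeof(*t->procs));
    if (t->procs == NULL) {
        free(t);
        return -1;
    }
    sketch_live_topos++;
    *topo_p = t;
    return 0;
}

static void sketch_topo_cleanup(sketch_topo_t *t)
{
    free(t->procs);
    free(t);
    sketch_live_topos--;
}

/* Initialize only if no topo exists yet, so an earlier allocation is never
 * overwritten and lost. */
static int sketch_topo_ensure(sketch_topo_t **topo_p, size_t n_procs)
{
    if (*topo_p != NULL) {
        return 0;
    }
    return sketch_topo_init(topo_p, n_procs);
}
```

Calling sketch_topo_ensure() twice on the same pointer leaves exactly one live object, whereas calling sketch_topo_init() twice would overwrite the first pointer and leak it.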

Comment on lines +632 to +636
    if (addr_storage->addr_len == 0 ) {
        ucc_free(addr_storage->storage);
        addr_storage->storage = NULL;
        return UCC_OK;
    }

Dead code - addr_len is always set to sizeof(ucc_context_id_t) at line 618, so this condition will never be true

@greptile-apps
Contributor

greptile-apps bot commented Jan 29, 2026

Additional Comments (1)

src/core/ucc_context.c
Added components/topo/ucc_topo.h include to support automatic local rank computation from topology

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Comment on lines +632 to +636
    if (addr_storage->addr_len == 0 ) {
        ucc_free(addr_storage->storage);
        addr_storage->storage = NULL;
        return UCC_OK;
    }

Dead code: addr_len is always set to sizeof(ucc_context_id_t) at line 618 (non-zero), so this condition will never be true.

While this pattern exists in ucc_core_addr_exchange(), where addr_len varies based on allgather results, here it is a fixed non-zero value.

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Comment on lines +632 to +636
    if (addr_storage->addr_len == 0 ) {
        ucc_free(addr_storage->storage);
        addr_storage->storage = NULL;
        return UCC_OK;
    }

dead code - addr_len is always set to sizeof(ucc_context_id_t) at line 618, so this condition will never be true

Suggested change
-    if (addr_storage->addr_len == 0 ) {
-        ucc_free(addr_storage->storage);
-        addr_storage->storage = NULL;
-        return UCC_OK;
-    }
+    ucc_assert(addr_storage->storage != NULL);

@janjust
Collaborator

janjust commented Feb 12, 2026

/build

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

2 participants