Conversation
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
|
| Filename | Overview |
|---|---|
| src/components/tl/ucp/tl_ucp_ep.c | Refactored endpoint close to batch all requests concurrently, adds a 60-second timeout to both paths, and uses UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts. The main batch path is well-structured; the sequential fallback covers the allocation-failure case. Minor: the fallback timeout warning omits the count of remaining endpoints (unlike the main path), and the uninitialized status variable noted in prior threads is still present. |
| src/components/tl/ucp/tl_ucp_ep.h | Only the declaration of ucc_tl_ucp_close_eps was reformatted; the continuation-line indentation (12 spaces) does not align with the opening parenthesis and is inconsistent with the declaration immediately above it (already noted in prior review threads). |
Last reviewed commit: 70480e7
|
/build |
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
|
/build |
Additional Comments (1)
do {
ucp_worker_progress(...);
status = ucp_request_check_status(close_req);
} while (status == UCS_INPROGRESS);While memory allocation failure during teardown is rare, the fallback should apply the same deadline for consistency and robustness. Consider capturing the deadline before the fallback loop and breaking (with a warning) if |
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
|
/build |
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
|
|
||
| void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t * worker, | ||
| ucc_tl_ucp_context_t *ctx) | ||
| #define UCC_TL_UCP_EP_CLOSE_TIMEOUT 60.0 |
There was a problem hiding this comment.
The timeout constant is hardcoded as a compile-time #define with no corresponding config registration or environment variable. Other UCC parameters are exposed through the standard UCC config mechanism so users can tune them at run time without recompilation.
On large clusters, 60 seconds may be either too generous (delaying shutdown) or too restrictive (e.g., on high-latency interconnects under load). Consider registering this as a configurable parameter so it can be overridden via environment variable or config at run time.
| int n_inflight; | ||
| int j; | ||
| ucp_ep_h ep; | ||
| ucs_status_t status; |
There was a problem hiding this comment.
The variable is declared without an initializer. In the sequential-fallback path, if timed_out is already 1 on entry to an iteration, the do/while wait-loop is skipped (line 135–150), so status is never assigned before the guard at line 156. While the guard at line 156 if (!timed_out) prevents use of uninitialized status, static analyzers can still flag this as a potential issue. Initializing at declaration removes the ambiguity:
| ucs_status_t status; | |
| ucs_status_t status = UCS_OK; |
| void ucc_tl_ucp_close_eps( | ||
| ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx); |
There was a problem hiding this comment.
The continuation line is indented with 12 leading spaces, which does not align with the opening parenthesis and is inconsistent with the style used for the function declaration above it (lines 41–42). For consistency with the surrounding code style:
| void ucc_tl_ucp_close_eps( | |
| ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx); | |
| void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t *worker, | |
| ucc_tl_ucp_context_t *ctx); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
/build |
nsarka
left a comment
There was a problem hiding this comment.
LGTM aside from my one comment about readability--feel free to take it or leave it.
| "failed to allocate close_reqs, falling back to sequential " | ||
| "close"); | ||
| deadline = ucc_get_time() + UCC_TL_UCP_EP_CLOSE_TIMEOUT; | ||
| ep = get_next_ep_to_close(worker, ctx, &i); |
There was a problem hiding this comment.
For readability, please consider moving the sequential fallback to its own function
| tl_warn( | ||
| ctx->super.super.lib, | ||
| "ep close timed out in sequential " | ||
| "fallback"); | ||
| timed_out = 1; |
There was a problem hiding this comment.
Fallback timeout warning omits remaining-endpoint count
The main batch path logs "ep close timed out, %d requests still in-flight" with a concrete count of how many requests are pending. The sequential fallback just says "ep close timed out in sequential fallback" with no count. After the timeout fires, the loop continues iterating remaining endpoints, so an approximate count of those not yet waited-on would help operators diagnose how many endpoints were left unacknowledged:
tl_warn(
ctx->super.super.lib,
"ep close timed out in sequential fallback, remaining eps not waited");Or, track a counter of skipped endpoints and include it in the message, matching the verbosity of the main path.
| max_eps = worker->eps | ||
| ? (size_t)ctx->super.super.ucc_context->params.oob.n_oob_eps | ||
| : kh_size(worker->ep_hash); |
There was a problem hiding this comment.
max_eps over-counts for array-based worker storage
When worker->eps is non-NULL (OOB/array mode), max_eps is set to n_oob_eps — the total number of context-level OOB endpoints. This is the capacity of the array, not the number of actually connected endpoints. In a large cluster where only a subset of processes have exchanged addresses, most slots in worker->eps will be NULL, so n_reqs will be well below max_eps, but the reqs allocation will be sized for all n_oob_eps entries.
On very large jobs (e.g., 100k-process runs) where a single context closes only a handful of endpoints, this over-allocates a potentially large array during teardown — at precisely the time when memory pressure may already be elevated.
Consider counting actual non-NULL entries in worker->eps before allocating, or using a smaller initial size with reallocation, to avoid an unnecessary large allocation:
if (worker->eps) {
size_t n_oob = ctx->super.super.ucc_context->params.oob.n_oob_eps;
for (size_t k = 0; k < n_oob; k++) {
if (worker->eps[k]) max_eps++;
}
} else {
max_eps = kh_size(worker->ep_hash);
}
What
Refactor
ucc_tl_ucp_close_epsto submit all endpoint close requests concurrently (batch close) instead of closing endpoints one at a time sequentially. Add a 60-second timeout (UCC_TL_UCP_EP_CLOSE_TIMEOUT) to prevent indefinite hangs during teardown. UseUCP_EP_CLOSE_FLAG_FORCEfor non-OOB contexts where graceful flush cannot be coordinated.Fixes CI hangs
Why?
The previous implementation closed endpoints sequentially with no timeout protection. If a peer became unreachable, the
do/whileprogress loop would block indefinitely, hanging the entire teardown process. This was particularly problematic without OOB, where there is no barrier to guarantee all peers are reachable for a graceful flush.How?
ucp_ep_close_nbxcalls up front, then drain them in a single polling loop. This improves teardown performance for large endpoint counts.ucc_get_time(). On timeout, in-flight requests are freed and a warning is logged.UCP_EP_CLOSE_FLAG_FORCEis used to avoid hanging on unreachable peers. With OOB, graceful flush (flags = 0) is safe since a barrier ensures reachability.