
TL/UCP: fix ep close timeout#1271

Open
ikryukov wants to merge 4 commits into openucx:master from ikryukov:fix/tl-ucp-ep-close-timeout

Conversation

@ikryukov
Collaborator

@ikryukov ikryukov commented Mar 3, 2026

What

Refactor ucc_tl_ucp_close_eps to submit all endpoint close requests concurrently (batch close) instead of closing endpoints one at a time sequentially. Add a 60-second timeout (UCC_TL_UCP_EP_CLOSE_TIMEOUT) to prevent indefinite hangs during teardown. Use UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts where graceful flush cannot be coordinated.

Fixes CI hangs

Why?

The previous implementation closed endpoints sequentially with no timeout protection. If a peer became unreachable, the do/while progress loop would block indefinitely, hanging the entire teardown process. This was particularly problematic without OOB, where there is no barrier to guarantee all peers are reachable for a graceful flush.

How?

  • Batch close: Allocate an array of request handles, submit all ucp_ep_close_nbx calls up front, then drain them in a single polling loop. This improves teardown performance for large endpoint counts.
  • Timeout: Both the main batch path and the sequential fallback path enforce a 60-second deadline using ucc_get_time(). On timeout, in-flight requests are freed and a warning is logged.
  • Force close without OOB: Without OOB, peers cannot coordinate shutdown, so UCP_EP_CLOSE_FLAG_FORCE is used to avoid hanging on unreachable peers. With OOB, graceful flush (flags = 0) is safe since a barrier ensures reachability.
  • Allocation fallback: If the request array allocation fails (memory pressure during teardown), the function falls back to sequential close with the same timeout protection, ensuring endpoints are never leaked.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov self-assigned this Mar 3, 2026
@ikryukov ikryukov requested a review from Sergei-Lebedev March 3, 2026 16:06
@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR refactors ucc_tl_ucp_close_eps to submit all endpoint-close requests concurrently (batch mode) and adds a 60-second timeout to both the primary batch path and a sequential fallback path (used when the request-array allocation fails). It also applies UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts to avoid indefinite hangs on unreachable peers, directly fixing CI teardown stalls.

Key changes:

  • Batch close: All ucp_ep_close_nbx calls are dispatched up-front before entering a single polling loop, improving teardown performance at scale.
  • Timeout on both paths: ucc_get_time()-based 60-second deadlines protect both the main batch wait-loop and the sequential fallback.
  • Force-close without OOB: UCP_EP_CLOSE_FLAG_FORCE is used when UCC_TL_CTX_HAS_OOB(ctx) is false, consistent with the existing error-handler setup for non-OOB connections.
  • Allocation fallback: If ucc_calloc fails (memory pressure), sequential close with timeout is used, ensuring endpoints are still visited.

Observations:

  • max_eps for array-based (worker->eps) storage is set to n_oob_eps (the full context population), not the number of actually connected endpoints. On large jobs where only a fraction of slots are non-NULL, this may cause a sizeable over-allocation during teardown — see inline comment.
  • The sequential fallback timeout warning does not report how many endpoints remain unclosed, unlike the main path's "%d requests still in-flight" message — a minor observability regression.
  • UCC_TL_UCP_EP_CLOSE_TIMEOUT is a compile-time #define with no config/environment-variable binding; prior review threads already flagged this as worth exposing via UCC's standard config mechanism.

Confidence Score: 4/5

  • PR is safe to merge; it fixes a real CI hang and handles the main edge cases correctly. Minor issues remain around over-allocation on large jobs and observability gaps.
  • The core logic is sound: batch close, timeout enforcement, force-close semantics, and an allocation fallback are all implemented correctly. No memory leaks or double-frees were found. The max_eps over-allocation on large array-based deployments and the missing count in the fallback timeout warning are the only newly identified issues, and neither causes incorrect behavior. The remaining open items (hardcoded timeout config, status uninitialized at declaration, header indentation) were flagged in prior review threads.
  • Pay attention to src/components/tl/ucp/tl_ucp_ep.c lines 110–115 for the max_eps over-allocation on large OOB deployments.

Important Files Changed

Filename Overview
src/components/tl/ucp/tl_ucp_ep.c Refactored endpoint close to batch all requests concurrently, adds a 60-second timeout to both paths, and uses UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts. The main batch path is well-structured; the sequential fallback covers the allocation-failure case. Minor: the fallback timeout warning omits the count of remaining endpoints (unlike the main path), and the uninitialized status variable noted in prior threads is still present.
src/components/tl/ucp/tl_ucp_ep.h Only the declaration of ucc_tl_ucp_close_eps was reformatted; the continuation-line indentation (12 spaces) does not align with the opening parenthesis and is inconsistent with the declaration immediately above it (already noted in prior review threads).

Last reviewed commit: 70480e7

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 4 comments


@ikryukov
Collaborator Author

ikryukov commented Mar 3, 2026

/build

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov
Collaborator Author

ikryukov commented Mar 3, 2026

/build

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

src/components/tl/ucp/tl_ucp_ep.c, line 153
The fallback sequential path (triggered when reqs allocation fails) lacks the same 60-second timeout that protects the main path. The do/while loop at lines 133–139 will block indefinitely if any endpoint is unreachable:

do {
    ucp_worker_progress(...);
    status = ucp_request_check_status(close_req);
} while (status == UCS_INPROGRESS);

While memory allocation failure during teardown is rare, the fallback should apply the same deadline for consistency and robustness. Consider capturing the deadline before the fallback loop and breaking (with a warning) if ucc_get_time() > deadline inside the do/while, matching the main path behavior.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov requested review from janjust and nsarka March 4, 2026 10:44
@ikryukov
Collaborator Author

ikryukov commented Mar 4, 2026

/build

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t * worker,
                          ucc_tl_ucp_context_t *ctx)
#define UCC_TL_UCP_EP_CLOSE_TIMEOUT 60.0
Contributor


The timeout constant is hardcoded as a compile-time #define with no corresponding config registration or environment variable. Other UCC parameters are exposed through the standard UCC config mechanism so users can tune them at run time without recompilation.

On large clusters, 60 seconds may be either too generous (delaying shutdown) or too restrictive (e.g., on high-latency interconnects under load). Consider registering this as a configurable parameter so it can be overridden via environment variable or config at run time.
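A hypothetical sketch of what such a registration could look like, following the UCS-style config-table convention UCC uses elsewhere. The table name, field name, struct member, and `UCC_CONFIG_TYPE_TIME` usage are all assumptions for illustration, not code from this PR:

```c
/* Hypothetical entry in the TL/UCP context config table; with the usual
 * prefixing this would be overridable via the UCC_TL_UCP_EP_CLOSE_TIMEOUT
 * environment variable instead of requiring recompilation. */
ucc_config_field_t ucc_tl_ucp_context_config_table[] = {
    /* ... existing entries ... */

    {"EP_CLOSE_TIMEOUT", "60.0",
     "Maximum time to wait for endpoint close during context teardown",
     ucc_offsetof(ucc_tl_ucp_context_config_t, ep_close_timeout),
     UCC_CONFIG_TYPE_TIME},

    {NULL}
};
```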

Collaborator


I agree with greptile here

int n_inflight;
int j;
ucp_ep_h ep;
ucs_status_t status;
Contributor


The variable is declared without an initializer. In the sequential-fallback path, if timed_out is already 1 on entry to an iteration, the do/while wait-loop is skipped (line 135–150), so status is never assigned before the guard at line 156. While the guard at line 156 if (!timed_out) prevents use of uninitialized status, static analyzers can still flag this as a potential issue. Initializing at declaration removes the ambiguity:

Suggested change
ucs_status_t status;
ucs_status_t status = UCS_OK;

Comment on lines +44 to +45
void ucc_tl_ucp_close_eps(
            ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx);
Contributor


The continuation line is indented with 12 leading spaces, which does not align with the opening parenthesis and is inconsistent with the style used for the function declaration above it (lines 41–42). For consistency with the surrounding code style:

Suggested change
void ucc_tl_ucp_close_eps(
            ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx);
void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t *worker,
                          ucc_tl_ucp_context_t *ctx);


@ikryukov
Collaborator Author

ikryukov commented Mar 5, 2026

/build

Collaborator

@nsarka nsarka left a comment


LGTM aside from my one comment about readability; feel free to take it or leave it.

"failed to allocate close_reqs, falling back to sequential "
"close");
deadline = ucc_get_time() + UCC_TL_UCP_EP_CLOSE_TIMEOUT;
ep = get_next_ep_to_close(worker, ctx, &i);
Collaborator


For readability, please consider moving the sequential fallback to its own function

Comment on lines +138 to +142
tl_warn(
ctx->super.super.lib,
"ep close timed out in sequential "
"fallback");
timed_out = 1;
Contributor


Fallback timeout warning omits remaining-endpoint count

The main batch path logs "ep close timed out, %d requests still in-flight" with a concrete count of how many requests are pending. The sequential fallback just says "ep close timed out in sequential fallback" with no count. After the timeout fires, the loop continues iterating remaining endpoints, so an approximate count of those not yet waited-on would help operators diagnose how many endpoints were left unacknowledged:

tl_warn(
    ctx->super.super.lib,
    "ep close timed out in sequential fallback, remaining eps not waited");

Or, track a counter of skipped endpoints and include it in the message, matching the verbosity of the main path.

Comment on lines +110 to +112
max_eps = worker->eps
? (size_t)ctx->super.super.ucc_context->params.oob.n_oob_eps
: kh_size(worker->ep_hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_eps over-counts for array-based worker storage

When worker->eps is non-NULL (OOB/array mode), max_eps is set to n_oob_eps — the total number of context-level OOB endpoints. This is the capacity of the array, not the number of actually connected endpoints. In a large cluster where only a subset of processes have exchanged addresses, most slots in worker->eps will be NULL, so n_reqs will be well below max_eps, but the reqs allocation will be sized for all n_oob_eps entries.

On very large jobs (e.g., 100k-process runs) where a single context closes only a handful of endpoints, this over-allocates a potentially large array during teardown — at precisely the time when memory pressure may already be elevated.

Consider counting actual non-NULL entries in worker->eps before allocating, or using a smaller initial size with reallocation, to avoid an unnecessary large allocation:

max_eps = 0;
if (worker->eps) {
    size_t n_oob = ctx->super.super.ucc_context->params.oob.n_oob_eps;
    for (size_t k = 0; k < n_oob; k++) {
        if (worker->eps[k]) max_eps++;
    }
} else {
    max_eps = kh_size(worker->ep_hash);
}
