
TL/UCP: fix ep close timeout#1271

Open
ikryukov wants to merge 4 commits into openucx:master from ikryukov:fix/tl-ucp-ep-close-timeout

Conversation

@ikryukov
Collaborator

@ikryukov ikryukov commented Mar 3, 2026

What

Refactor ucc_tl_ucp_close_eps to submit all endpoint close requests concurrently (batch close) instead of closing endpoints one at a time sequentially. Add a 60-second timeout (UCC_TL_UCP_EP_CLOSE_TIMEOUT) to prevent indefinite hangs during teardown. Use UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts where graceful flush cannot be coordinated.

Fixes CI hangs

Why?

The previous implementation closed endpoints sequentially with no timeout protection. If a peer became unreachable, the do/while progress loop would block indefinitely, hanging the entire teardown process. This was particularly problematic without OOB, where there is no barrier to guarantee all peers are reachable for a graceful flush.

How?

  • Batch close: Allocate an array of request handles, submit all ucp_ep_close_nbx calls up front, then drain them in a single polling loop. This improves teardown performance for large endpoint counts.
  • Timeout: Both the main batch path and the sequential fallback path enforce a 60-second deadline using ucc_get_time(). On timeout, in-flight requests are freed and a warning is logged.
  • Force close without OOB: Without OOB, peers cannot coordinate shutdown, so UCP_EP_CLOSE_FLAG_FORCE is used to avoid hanging on unreachable peers. With OOB, graceful flush (flags = 0) is safe since a barrier ensures reachability.
  • Allocation fallback: If the request array allocation fails (memory pressure during teardown), the function falls back to sequential close with the same timeout protection, ensuring endpoints are never leaked.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov self-assigned this Mar 3, 2026
@ikryukov ikryukov requested a review from Sergei-Lebedev March 3, 2026 16:06
@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR refactors ucc_tl_ucp_close_eps to submit all endpoint-close requests concurrently (batch mode) and adds a 60-second timeout to both the primary batch path and a sequential fallback path (used when the request-array allocation fails). It also applies UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts to avoid indefinite hangs on unreachable peers, directly fixing CI teardown stalls.

Key changes:

  • Batch close: All ucp_ep_close_nbx calls are dispatched up-front before entering a single polling loop, improving teardown performance at scale.
  • Timeout on both paths: ucc_get_time()-based 60-second deadlines protect both the main batch wait-loop and the sequential fallback.
  • Force-close without OOB: UCP_EP_CLOSE_FLAG_FORCE is used when UCC_TL_CTX_HAS_OOB(ctx) is false, consistent with the existing error-handler setup for non-OOB connections.
  • Allocation fallback: If ucc_calloc fails (memory pressure), sequential close with timeout is used, ensuring endpoints are still visited.

Observations:

  • max_eps for array-based (worker->eps) storage is set to n_oob_eps (the full context population), not the number of actually connected endpoints. On large jobs where only a fraction of slots are non-NULL, this may cause a sizeable over-allocation during teardown — see inline comment.
  • The sequential fallback timeout warning does not report how many endpoints remain unclosed, unlike the main path's "%d requests still in-flight" message — a minor observability regression.
  • UCC_TL_UCP_EP_CLOSE_TIMEOUT is a compile-time #define with no config/environment-variable binding; prior review threads already flagged this as worth exposing via UCC's standard config mechanism.

Confidence Score: 4/5

  • PR is safe to merge; it fixes a real CI hang and handles the main edge cases correctly. Minor issues remain around over-allocation on large jobs and observability gaps.
  • The core logic is sound: batch close, timeout enforcement, force-close semantics, and an allocation fallback are all implemented correctly. No memory leaks or double-frees were found. The max_eps over-allocation on large array-based deployments and the missing count in the fallback timeout warning are the only newly identified issues, and neither causes incorrect behavior. The remaining open items (hardcoded timeout config, status uninitialized at declaration, header indentation) were flagged in prior review threads.
  • Pay attention to src/components/tl/ucp/tl_ucp_ep.c lines 110–115 for the max_eps over-allocation on large OOB deployments.

Important Files Changed

Filename Overview
src/components/tl/ucp/tl_ucp_ep.c Refactored endpoint close to batch all requests concurrently, adds a 60-second timeout to both paths, and uses UCP_EP_CLOSE_FLAG_FORCE for non-OOB contexts. The main batch path is well-structured; the sequential fallback covers the allocation-failure case. Minor: the fallback timeout warning omits the count of remaining endpoints (unlike the main path), and the uninitialized status variable noted in prior threads is still present.
src/components/tl/ucp/tl_ucp_ep.h Only the declaration of ucc_tl_ucp_close_eps was reformatted; the continuation-line indentation (12 spaces) does not align with the opening parenthesis and is inconsistent with the declaration immediately above it (already noted in prior review threads).

Last reviewed commit: 70480e7

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 4 comments


@ikryukov
Collaborator Author

ikryukov commented Mar 3, 2026

/build

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov
Collaborator Author

ikryukov commented Mar 3, 2026

/build

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

src/components/tl/ucp/tl_ucp_ep.c, line 153
The fallback sequential path (triggered when reqs allocation fails) lacks the same 60-second timeout that protects the main path. The do/while loop at lines 133–139 will block indefinitely if any endpoint is unreachable:

do {
    ucp_worker_progress(...);
    status = ucp_request_check_status(close_req);
} while (status == UCS_INPROGRESS);

While memory allocation failure during teardown is rare, the fallback should apply the same deadline for consistency and robustness. Consider capturing the deadline before the fallback loop and breaking (with a warning) if ucc_get_time() > deadline inside the do/while, matching the main path behavior.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov requested review from janjust and nsarka March 4, 2026 10:44
@ikryukov
Collaborator Author

ikryukov commented Mar 4, 2026

/build

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t * worker,
                          ucc_tl_ucp_context_t *ctx)
#define UCC_TL_UCP_EP_CLOSE_TIMEOUT 60.0
Contributor


The timeout constant is hardcoded as a compile-time #define with no corresponding config registration or environment variable. Other UCC parameters are exposed through the standard UCC config mechanism so users can tune them at run time without recompilation.

On large clusters, 60 seconds may be either too generous (delaying shutdown) or too restrictive (e.g., on high-latency interconnects under load). Consider registering this as a configurable parameter so it can be overridden via environment variable or config at run time.
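A hypothetical sketch of what such a registration could look like, following the UCS-style config-table convention UCC uses elsewhere. The table name, field name, struct member, and `UCC_CONFIG_TYPE_TIME` usage are all assumptions for illustration, not code from this PR:

```c
/* Hypothetical entry in the TL/UCP context config table; with the usual
 * prefixing this would be overridable via the UCC_TL_UCP_EP_CLOSE_TIMEOUT
 * environment variable instead of requiring recompilation. */
ucc_config_field_t ucc_tl_ucp_context_config_table[] = {
    /* ... existing entries ... */

    {"EP_CLOSE_TIMEOUT", "60.0",
     "Maximum time to wait for endpoint close during context teardown",
     ucc_offsetof(ucc_tl_ucp_context_config_t, ep_close_timeout),
     UCC_CONFIG_TYPE_TIME},

    {NULL}
};
```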

Collaborator


I agree with greptile here

int n_inflight;
int j;
ucp_ep_h ep;
ucs_status_t status;
Contributor


The variable is declared without an initializer. In the sequential-fallback path, if timed_out is already 1 on entry to an iteration, the do/while wait-loop is skipped (line 135–150), so status is never assigned before the guard at line 156. While the guard at line 156 if (!timed_out) prevents use of uninitialized status, static analyzers can still flag this as a potential issue. Initializing at declaration removes the ambiguity:

Suggested change
ucs_status_t status;
ucs_status_t status = UCS_OK;

Comment on lines +44 to +45
void ucc_tl_ucp_close_eps(
            ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx);
Contributor


The continuation line is indented with 12 leading spaces, which does not align with the opening parenthesis and is inconsistent with the style used for the function declaration above it (lines 41–42). For consistency with the surrounding code style:

Suggested change
void ucc_tl_ucp_close_eps(
            ucc_tl_ucp_worker_t *worker, ucc_tl_ucp_context_t *ctx);
void ucc_tl_ucp_close_eps(ucc_tl_ucp_worker_t *worker,
                          ucc_tl_ucp_context_t *ctx);


@ikryukov
Collaborator Author

ikryukov commented Mar 5, 2026

/build

Collaborator

@nsarka nsarka left a comment


LGTM aside from my one comment about readability; feel free to take it or leave it.

"failed to allocate close_reqs, falling back to sequential "
"close");
deadline = ucc_get_time() + UCC_TL_UCP_EP_CLOSE_TIMEOUT;
ep = get_next_ep_to_close(worker, ctx, &i);
Collaborator


For readability, please consider moving the sequential fallback to its own function

Comment on lines +138 to +142
tl_warn(
ctx->super.super.lib,
"ep close timed out in sequential "
"fallback");
timed_out = 1;
Contributor


Fallback timeout warning omits remaining-endpoint count

The main batch path logs "ep close timed out, %d requests still in-flight" with a concrete count of how many requests are pending. The sequential fallback just says "ep close timed out in sequential fallback" with no count. After the timeout fires, the loop continues iterating remaining endpoints, so an approximate count of those not yet waited-on would help operators diagnose how many endpoints were left unacknowledged:

tl_warn(
    ctx->super.super.lib,
    "ep close timed out in sequential fallback, remaining eps not waited");

Or, track a counter of skipped endpoints and include it in the message, matching the verbosity of the main path.

Comment on lines +110 to +112
max_eps = worker->eps
? (size_t)ctx->super.super.ucc_context->params.oob.n_oob_eps
: kh_size(worker->ep_hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_eps over-counts for array-based worker storage

When worker->eps is non-NULL (OOB/array mode), max_eps is set to n_oob_eps — the total number of context-level OOB endpoints. This is the capacity of the array, not the number of actually connected endpoints. In a large cluster where only a subset of processes have exchanged addresses, most slots in worker->eps will be NULL, so n_reqs will be well below max_eps, but the reqs allocation will be sized for all n_oob_eps entries.

On very large jobs (e.g., 100k-process runs) where a single context closes only a handful of endpoints, this over-allocates a potentially large array during teardown — at precisely the time when memory pressure may already be elevated.

Consider counting actual non-NULL entries in worker->eps before allocating, or using a smaller initial size with reallocation, to avoid an unnecessary large allocation:

max_eps = 0;
if (worker->eps) {
    size_t n_oob = ctx->super.super.ucc_context->params.oob.n_oob_eps;
    for (size_t k = 0; k < n_oob; k++) {
        if (worker->eps[k]) max_eps++;
    }
} else {
    max_eps = kh_size(worker->ep_hash);
}
