nccl: expose `Comm::abort` / `get_async_error` and stop `Comm::drop` from panicking by grenade · Pull Request #585 · chelsea0x3b/cudarc

grenade · 2026-06-08T13:06:40Z

Fixes #584.

Adds the pieces needed to recover from a hung or failed collective instead of hanging forever:

Comm::abort(&self): wraps ncclCommAbort. Takes &self so it can be called from a watchdog thread while another thread is blocked inside a collective, which is how you unblock a stuck rank.
Comm::get_async_error(&self) wraps ncclCommGetAsyncError. Returns Ok(()) for ncclSuccess/ncclInProgress and Err only for a real error code, so a watchdog can poll comm health without blocking.
Comm::comm(&self) -> sys::ncclComm_t: raw handle accessor for driving NCCL APIs not yet wrapped here, in the same spirit as CudaStream::cu_stream. Happy to rename this to nccl_comm if you'd preferit to match the cu_* convention more closely.

The second commit fixes a panic in Comm::drop. It used comm_abort(...).expect(...), which has the same panic-in-Drop problem that #277 described for the driver destructors (a panic while unwinding turns into a process abort). It also fires on the normal recovery path: once you abort a comm out of band, the destructor aborts it a second time, ncclCommAbort returns non-success, and the expect panics. Changed it to ignore the result, matching the "don't panic in Drop" approach already used elsewhere in the crate.

Scope is src/nccl/safe.rs only; no changes to the sys or result layers (the underlying functions were already wrapped). Builds and clippys clean under nccl,cuda-12060,dynamic-loading.

I left the existing TODO(thenerdstation) about finalize-then-destroy in place since that's a separate question from the panic.

The safe `Comm` only aborts in `Drop`, which can't help recover a hung or failed collective: the thread blocked inside the collective holds an `Arc<Comm>` clone, so the refcount never reaches zero to trigger Drop. To unblock it you must call `ncclCommAbort` on the live communicator from a *different* thread (NCCL supports exactly this), then rebuild a fresh `Comm` on the surviving ranks. Add three thin methods on `impl Comm`: - `abort(&self)` — `ncclCommAbort` on a live `&self` (callable cross-thread). - `get_async_error(&self)` — `ncclCommGetAsyncError`, so a watchdog can detect a failed collective without blocking (ncclInProgress reads healthy). - `comm(&self) -> ncclComm_t` — raw-handle escape hatch for un-wrapped APIs. All wrap functions already present in `result`/`sys`; no new bindings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A `Drop` impl must never panic. The communicator may already have been aborted out of band — `Comm::abort` exists precisely to abort a live comm from another thread to unblock a hung collective — in which case the abort in `Drop` returns a non-success code. `expect`-ing on it would panic during unwind/teardown (e.g. while dropping the model that owns the aborted comm). Ignore the result instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the Comm struct in src/nccl/safe.rs by changing its Drop implementation to safely ignore abort errors instead of panicking. It also introduces new methods: comm() to access the raw handle, abort() to allow out-of-band communicator aborts, and get_async_error() to poll the asynchronous error state. The review feedback highlights two critical issues: first, a memory safety and thread-safety concern where out-of-band aborts can cause a double-free/use-after-free upon drop, which can be resolved by implementing Send/Sync and using AtomicPtr for the communicator handle; second, a compilation failure on older CUDA versions due to the unconditional use of ncclInProgress in get_async_error(), which requires conditional compilation guards.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comm::abort is meant to be called from a watchdog thread while another thread is blocked inside a collective, but Comm holds a raw ncclComm_t and so was !Send + !Sync — there was no safe way to share it across threads, making the feature unusable as documented. Add unsafe Send/Sync impls: NCCL permits using a communicator (and ncclCommAbort in particular) from multiple threads. That alone exposes a second problem: ncclCommAbort frees the communicator, so the out-of-band abort followed by the abort in Drop would call it twice on the same handle — a use-after-free. Guard both paths with an `aborted` AtomicBool swapped exactly once; whichever of abort()/drop() wins issues the single ncclCommAbort, and abort() becomes idempotent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chelsea0x3b

Please clarify more about thread safety of ncclComm_t

Drop the SAFETY prose that claimed ncclCommAbort is meant to be called from another thread while one is blocked inside a collective. Per NCCL's thread-safety and fault-tolerance docs, a communicator may be operated from only one thread at a time, and aborting safely requires either that no thread is in an NCCL op or that the comm was created non-blocking. The unsafe Send/Sync impls now match the bare style used for the crate's other opaque handle types (CudaStream, CudaBlas, etc.); Send/Sync only govern moving/sharing the handle, and operation-ordering remains the caller's responsibility as elsewhere in the crate. Rewrite the abort() docs to state that real contract, and rename the raw handle accessor comm() -> cu_comm() for consistency with cu_stream, cu_device, cu_ctx, etc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

grenade and others added 2 commits June 8, 2026 13:48

grenade requested a review from chelsea0x3b as a code owner June 8, 2026 13:06

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/nccl/safe.rs

Comment thread src/nccl/safe.rs

chelsea0x3b reviewed Jun 18, 2026

View reviewed changes

Comment thread src/nccl/safe.rs Outdated

Comment thread src/nccl/safe.rs Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

nccl: expose `Comm::abort` / `get_async_error` and stop `Comm::drop` from panicking#585

nccl: expose `Comm::abort` / `get_async_error` and stop `Comm::drop` from panicking#585
grenade wants to merge 4 commits into
chelsea0x3b:mainfrom
grenade:nccl-comm-abort

grenade commented Jun 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

chelsea0x3b left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

grenade commented Jun 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

chelsea0x3b left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants