Skip to content

nccl: expose Comm::abort / get_async_error and stop Comm::drop from panicking#585

Open
grenade wants to merge 4 commits into
chelsea0x3b:mainfrom
grenade:nccl-comm-abort
Open

nccl: expose Comm::abort / get_async_error and stop Comm::drop from panicking#585
grenade wants to merge 4 commits into
chelsea0x3b:mainfrom
grenade:nccl-comm-abort

Conversation

@grenade

@grenade grenade commented Jun 8, 2026

Copy link
Copy Markdown

Fixes #584.

Adds the pieces needed to recover from a hung or failed collective instead of hanging forever:

  • Comm::abort(&self): wraps ncclCommAbort. Takes &self so it can be called from a watchdog thread while another thread is blocked inside a collective, which is how you unblock a stuck rank.
  • Comm::get_async_error(&self) wraps ncclCommGetAsyncError. Returns Ok(()) for ncclSuccess/ncclInProgress and Err only for a real error code, so a watchdog can poll comm health without blocking.
  • Comm::comm(&self) -> sys::ncclComm_t: raw handle accessor for driving NCCL APIs not yet wrapped here, in the same spirit as CudaStream::cu_stream. Happy to rename this to nccl_comm if you'd preferit to match the cu_* convention more closely.

The second commit fixes a panic in Comm::drop. It used comm_abort(...).expect(...), which has the same panic-in-Drop problem that #277 described for the driver destructors (a panic while unwinding turns into a process abort). It also fires on the normal recovery path: once you abort a comm out of band, the destructor aborts it a second time, ncclCommAbort returns non-success, and the expect panics. Changed it to ignore the result, matching the "don't panic in Drop" approach already used elsewhere in the crate.

Scope is src/nccl/safe.rs only; no changes to the sys or result layers (the underlying functions were already wrapped). Builds and clippys clean under nccl,cuda-12060,dynamic-loading.

I left the existing TODO(thenerdstation) about finalize-then-destroy in place since that's a separate question from the panic.

grenade and others added 2 commits June 8, 2026 13:48
The safe `Comm` only aborts in `Drop`, which can't help recover a hung or
failed collective: the thread blocked inside the collective holds an
`Arc<Comm>` clone, so the refcount never reaches zero to trigger Drop. To
unblock it you must call `ncclCommAbort` on the live communicator from a
*different* thread (NCCL supports exactly this), then rebuild a fresh
`Comm` on the surviving ranks.

Add three thin methods on `impl Comm`:
- `abort(&self)` — `ncclCommAbort` on a live `&self` (callable cross-thread).
- `get_async_error(&self)` — `ncclCommGetAsyncError`, so a watchdog can
  detect a failed collective without blocking (ncclInProgress reads healthy).
- `comm(&self) -> ncclComm_t` — raw-handle escape hatch for un-wrapped APIs.

All wrap functions already present in `result`/`sys`; no new bindings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A `Drop` impl must never panic. The communicator may already have been
aborted out of band — `Comm::abort` exists precisely to abort a live comm
from another thread to unblock a hung collective — in which case the
abort in `Drop` returns a non-success code. `expect`-ing on it would
panic during unwind/teardown (e.g. while dropping the model that owns the
aborted comm). Ignore the result instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@grenade grenade requested a review from chelsea0x3b as a code owner June 8, 2026 13:06

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Comm struct in src/nccl/safe.rs by changing its Drop implementation to safely ignore abort errors instead of panicking. It also introduces new methods: comm() to access the raw handle, abort() to allow out-of-band communicator aborts, and get_async_error() to poll the asynchronous error state. The review feedback highlights two critical issues: first, a memory safety and thread-safety concern where out-of-band aborts can cause a double-free/use-after-free upon drop, which can be resolved by implementing Send/Sync and using AtomicPtr for the communicator handle; second, a compilation failure on older CUDA versions due to the unconditional use of ncclInProgress in get_async_error(), which requires conditional compilation guards.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/nccl/safe.rs
Comment thread src/nccl/safe.rs
Comm::abort is meant to be called from a watchdog thread while another
thread is blocked inside a collective, but Comm holds a raw ncclComm_t
and so was !Send + !Sync — there was no safe way to share it across
threads, making the feature unusable as documented. Add unsafe Send/Sync
impls: NCCL permits using a communicator (and ncclCommAbort in
particular) from multiple threads.

That alone exposes a second problem: ncclCommAbort frees the
communicator, so the out-of-band abort followed by the abort in Drop
would call it twice on the same handle — a use-after-free. Guard both
paths with an `aborted` AtomicBool swapped exactly once; whichever of
abort()/drop() wins issues the single ncclCommAbort, and abort() becomes
idempotent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chelsea0x3b chelsea0x3b left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify more about thread safety of ncclComm_t

Comment thread src/nccl/safe.rs Outdated
Comment thread src/nccl/safe.rs Outdated
Drop the SAFETY prose that claimed ncclCommAbort is meant to be called
from another thread while one is blocked inside a collective. Per NCCL's
thread-safety and fault-tolerance docs, a communicator may be operated
from only one thread at a time, and aborting safely requires either that
no thread is in an NCCL op or that the comm was created non-blocking. The
unsafe Send/Sync impls now match the bare style used for the crate's
other opaque handle types (CudaStream, CudaBlas, etc.); Send/Sync only
govern moving/sharing the handle, and operation-ordering remains the
caller's responsibility as elsewhere in the crate.

Rewrite the abort() docs to state that real contract, and rename the raw
handle accessor comm() -> cu_comm() for consistency with cu_stream,
cu_device, cu_ctx, etc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nccl: no way to recover from a hung/failed collective (ncclCommAbort / ncclCommGetAsyncError not exposed)

2 participants