nccl: expose Comm::abort / get_async_error and stop Comm::drop from panicking#585
nccl: expose Comm::abort / get_async_error and stop Comm::drop from panicking#585grenade wants to merge 4 commits into
Comm::abort / get_async_error and stop Comm::drop from panicking#585Conversation
The safe `Comm` only aborts in `Drop`, which can't help recover a hung or failed collective: the thread blocked inside the collective holds an `Arc<Comm>` clone, so the refcount never reaches zero to trigger Drop. To unblock it you must call `ncclCommAbort` on the live communicator from a *different* thread (NCCL supports exactly this), then rebuild a fresh `Comm` on the surviving ranks. Add three thin methods on `impl Comm`: - `abort(&self)` — `ncclCommAbort` on a live `&self` (callable cross-thread). - `get_async_error(&self)` — `ncclCommGetAsyncError`, so a watchdog can detect a failed collective without blocking (ncclInProgress reads healthy). - `comm(&self) -> ncclComm_t` — raw-handle escape hatch for un-wrapped APIs. All wrap functions already present in `result`/`sys`; no new bindings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A `Drop` impl must never panic. The communicator may already have been aborted out of band — `Comm::abort` exists precisely to abort a live comm from another thread to unblock a hung collective — in which case the abort in `Drop` returns a non-success code. `expect`-ing on it would panic during unwind/teardown (e.g. while dropping the model that owns the aborted comm). Ignore the result instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the Comm struct in src/nccl/safe.rs by changing its Drop implementation to safely ignore abort errors instead of panicking. It also introduces new methods: comm() to access the raw handle, abort() to allow out-of-band communicator aborts, and get_async_error() to poll the asynchronous error state. The review feedback highlights two critical issues: first, a memory safety and thread-safety concern where out-of-band aborts can cause a double-free/use-after-free upon drop, which can be resolved by implementing Send/Sync and using AtomicPtr for the communicator handle; second, a compilation failure on older CUDA versions due to the unconditional use of ncclInProgress in get_async_error(), which requires conditional compilation guards.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Comm::abort is meant to be called from a watchdog thread while another thread is blocked inside a collective, but Comm holds a raw ncclComm_t and so was !Send + !Sync — there was no safe way to share it across threads, making the feature unusable as documented. Add unsafe Send/Sync impls: NCCL permits using a communicator (and ncclCommAbort in particular) from multiple threads. That alone exposes a second problem: ncclCommAbort frees the communicator, so the out-of-band abort followed by the abort in Drop would call it twice on the same handle — a use-after-free. Guard both paths with an `aborted` AtomicBool swapped exactly once; whichever of abort()/drop() wins issues the single ncclCommAbort, and abort() becomes idempotent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chelsea0x3b
left a comment
There was a problem hiding this comment.
Please clarify more about thread safety of ncclComm_t
Drop the SAFETY prose that claimed ncclCommAbort is meant to be called from another thread while one is blocked inside a collective. Per NCCL's thread-safety and fault-tolerance docs, a communicator may be operated from only one thread at a time, and aborting safely requires either that no thread is in an NCCL op or that the comm was created non-blocking. The unsafe Send/Sync impls now match the bare style used for the crate's other opaque handle types (CudaStream, CudaBlas, etc.); Send/Sync only govern moving/sharing the handle, and operation-ordering remains the caller's responsibility as elsewhere in the crate. Rewrite the abort() docs to state that real contract, and rename the raw handle accessor comm() -> cu_comm() for consistency with cu_stream, cu_device, cu_ctx, etc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #584.
Adds the pieces needed to recover from a hung or failed collective instead of hanging forever:
Comm::abort(&self): wrapsncclCommAbort. Takes&selfso it can be called from a watchdog thread while another thread is blocked inside a collective, which is how you unblock a stuck rank.Comm::get_async_error(&self)wraps ncclCommGetAsyncError. ReturnsOk(())forncclSuccess/ncclInProgressandErronly for a real error code, so a watchdog can poll comm health without blocking.Comm::comm(&self)->sys::ncclComm_t: raw handle accessor for driving NCCL APIs not yet wrapped here, in the same spirit asCudaStream::cu_stream. Happy to rename this to nccl_comm if you'd preferit to match the cu_* convention more closely.The second commit fixes a panic in
Comm::drop. It usedcomm_abort(...).expect(...), which has the same panic-in-Drop problem that #277 described for the driver destructors (a panic while unwinding turns into a process abort). It also fires on the normal recovery path: once you abort a comm out of band, the destructor aborts it a second time,ncclCommAbortreturns non-success, and theexpectpanics. Changed it to ignore the result, matching the "don't panic in Drop" approach already used elsewhere in the crate.Scope is
src/nccl/safe.rsonly; no changes to the sys or result layers (the underlying functions were already wrapped). Builds and clippys clean under nccl,cuda-12060,dynamic-loading.I left the existing TODO(thenerdstation) about finalize-then-destroy in place since that's a separate question from the panic.