Skip to content

Commit dc2f417

Browse files
grenadeclaude
andcommitted
nccl: correct Comm thread-safety docs and rename accessor to cu_comm
Drop the SAFETY prose that claimed ncclCommAbort is meant to be called from another thread while one is blocked inside a collective. Per NCCL's thread-safety and fault-tolerance docs, a communicator may be operated from only one thread at a time, and aborting safely requires either that no thread is in an NCCL op or that the comm was created non-blocking. The unsafe Send/Sync impls now match the bare style used for the crate's other opaque handle types (CudaStream, CudaBlas, etc.); Send/Sync only govern moving/sharing the handle, and operation-ordering remains the caller's responsibility as elsewhere in the crate. Rewrite the abort() docs to state that real contract, and rename the raw handle accessor comm() -> cu_comm() for consistency with cu_stream, cu_device, cu_ctx, etc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 63327a2 commit dc2f417

1 file changed

Lines changed: 11 additions & 12 deletions

File tree

src/nccl/safe.rs

Lines changed: 11 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,6 @@ pub struct Comm {
2626
world_size: usize,
2727
}
2828

29-
// SAFETY: `ncclComm_t` is an opaque handle that NCCL allows to be used from
30-
// multiple threads. In particular `ncclCommAbort` is explicitly designed to be
31-
// called from a different thread than one blocked inside a collective, which is
32-
// the whole point of [`Comm::abort`]. The single-abort invariant is enforced by
33-
// the `aborted` flag, so sharing a `Comm` across threads is sound.
3429
unsafe impl Send for Comm {}
3530
unsafe impl Sync for Comm {}
3631

@@ -190,18 +185,22 @@ impl Comm {
190185
/// Escape hatch for driving NCCL APIs not yet wrapped by this crate.
191186
/// The handle is only valid for this `Comm`'s lifetime; the returned
192187
/// pointer must not be used after the `Comm` is dropped.
193-
pub fn comm(&self) -> sys::ncclComm_t {
188+
pub fn cu_comm(&self) -> sys::ncclComm_t {
194189
self.comm
195190
}
196191

197192
/// Abort this communicator (`ncclCommAbort`).
198193
///
199-
/// Unlike [`Drop`] (which also aborts), this works on a live `&self`
200-
/// and — crucially — may be called from a *different* thread than one
201-
/// currently blocked inside a collective. That is how NCCL unblocks a
202-
/// hung or failed collective so the surviving ranks can resynchronise.
203-
/// After abort the communicator is unusable; rebuild a fresh `Comm`
204-
/// (e.g. via [`Comm::from_rank`]) to continue.
194+
/// `ncclCommAbort` frees the communicator's resources; afterwards the
195+
/// `Comm` is unusable and a fresh one must be built (e.g. via
196+
/// [`Comm::from_rank`]) to continue.
197+
///
198+
/// NCCL only permits one thread to operate a communicator at a time, so
199+
/// the caller must ensure no other thread is issuing operations on this
200+
/// `Comm` while `abort` runs. Interrupting a thread that is *blocked*
201+
/// inside a collective additionally requires the communicator to have been
202+
/// created non-blocking (so collectives never block in the first place and
203+
/// abort can be issued at any point); see the NCCL fault-tolerance docs.
205204
///
206205
/// Idempotent: aborting an already-aborted `Comm` (or one that will be
207206
/// aborted again by [`Drop`]) is a no-op returning `Ok(())`, since

0 commit comments

Comments
 (0)