Commit dbc425a

grenade and claude committed
nccl: don't panic in Comm::drop when abort returns non-success
A `Drop` impl must never panic. The communicator may already have been aborted out of band (`Comm::abort` exists precisely to abort a live comm from another thread to unblock a hung collective), in which case the abort in `Drop` returns a non-success code. `expect`-ing on it would panic during unwind/teardown (e.g. while dropping the model that owns the aborted comm). Ignore the result instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 4dff0be commit dbc425a

1 file changed

Lines changed: 6 additions & 1 deletion

src/nccl/safe.rs

@@ -56,8 +56,13 @@ fn convert_to_nccl_reduce_op(op: &ReduceOp) -> sys::ncclRedOp_t {
 impl Drop for Comm {
     fn drop(&mut self) {
         // TODO(thenerdstation): Shoule we instead do finalize then destory?
+        //
+        // Ignore the abort result rather than `expect`: a `Drop` must not
+        // panic, and the communicator may already have been aborted out of
+        // band (e.g. via `Comm::abort` to unblock a hung collective), in
+        // which case this second abort returns a non-success code.
         unsafe {
-            result::comm_abort(self.comm).expect("Error when aborting Comm.");
+            let _ = result::comm_abort(self.comm);
         }
     }
 }
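
The pattern in this change can be illustrated with a small, self-contained sketch. It does not use the crate's real types: `MockComm`, `AbortError`, and `abort_then_drop` are hypothetical names invented for the illustration, and the "second abort returns an error" behavior is simulated with a flag, following the description in the commit message rather than any documented NCCL guarantee.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical error type standing in for a non-success status code.
#[derive(Debug, PartialEq)]
struct AbortError;

/// Mock communicator. The shared flag lets the caller observe the state
/// after the value has been dropped.
struct MockComm {
    aborted: Rc<Cell<bool>>,
}

impl MockComm {
    /// The first abort succeeds; any later abort reports an error,
    /// simulating an abort on an already-aborted communicator.
    fn abort(&self) -> Result<(), AbortError> {
        if self.aborted.replace(true) {
            Err(AbortError)
        } else {
            Ok(())
        }
    }
}

impl Drop for MockComm {
    fn drop(&mut self) {
        // Deliberately discard the result. Using `expect` here would panic
        // on an already-aborted comm, and a panic raised while another
        // panic is unwinding aborts the whole process.
        let _ = self.abort();
    }
}

/// Aborts out of band, then drops the comm. Returns the final flag value.
fn abort_then_drop() -> bool {
    let flag = Rc::new(Cell::new(false));
    let comm = MockComm { aborted: Rc::clone(&flag) };

    assert_eq!(comm.abort(), Ok(())); // out-of-band abort succeeds
    assert_eq!(comm.abort(), Err(AbortError)); // a repeated abort fails

    drop(comm); // `Drop` hits the error path and must not panic
    flag.get()
}
```

Calling `abort_then_drop()` completes without panicking even though the abort inside `Drop` fails. With `expect` in place of `let _ =`, the same call would panic at `drop(comm)`.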
