Adding safe Group api to nccl#578
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a Group struct to provide a safe, RAII-based wrapper for NCCL group operations, ensuring group_end is called automatically. It migrates several collective operations to the Group API, leveraging CudaView and SyncOnDrop for improved memory safety and synchronization. Review feedback suggests several safety enhancements, including making Group non-Send to respect NCCL's thread-local constraints, returning a Result from the group constructor to handle potential failures, and adding assertions to validate buffer size invariants across collective operations.
| pub struct Group<'a> { | ||
| comm: &'a Comm, | ||
| syncs: Vec<SyncOnDrop<'a>>, | ||
| } |
There was a problem hiding this comment.
Group must not be Send because NCCL groups are thread-local. According to NCCL documentation, ncclGroupEnd must be called by the same thread that called ncclGroupStart. Since Group implements RAII for these calls, moving a Group to another thread would result in undefined behavior when it is dropped. Adding a PhantomData<*const ()> marker will prevent the struct from being Send.
pub struct Group<'a> {
comm: &'a Comm,
syncs: Vec<SyncOnDrop<'a>>,
_marker: std::marker::PhantomData<*const ()>,
}| pub fn group(&self) -> Group<'_> { | ||
| group_start().unwrap(); | ||
| Group { | ||
| comm: self, | ||
| syncs: Vec::new(), | ||
| } | ||
| } |
There was a problem hiding this comment.
group_start() can fail (e.g., if the maximum number of nested groups is reached). This method should return a Result to allow the caller to handle such errors gracefully instead of panicking via unwrap().
pub fn group(&self) -> Result<Group<'_>, result::NcclError> {
group_start()?;
Ok(Group {
comm: self,
syncs: Vec::new(),
_marker: std::marker::PhantomData,
})
}
Resolves #575