Adding safe Group api to nccl #578

Merged: chelsea0x3b merged 4 commits into main from safe-nccl-group on May 15, 2026
@chelsea0x3b

Owner

Resolves #575

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a Group struct to provide a safe, RAII-based wrapper for NCCL group operations, ensuring group_end is called automatically. It migrates several collective operations to the Group API, leveraging CudaView and SyncOnDrop for improved memory safety and synchronization. Review feedback suggests several safety enhancements, including making Group non-Send to respect NCCL's thread-local constraints, returning a Result from the group constructor to handle potential failures, and adding assertions to validate buffer size invariants across collective operations.
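
The buffer size invariants mentioned in the summary can be illustrated with a small, standalone sketch. The helper names below are hypothetical and are not part of the PR; they only encode the NCCL rules that an all-gather receive buffer holds `world_size` times the send count, and a reduce-scatter send buffer holds `world_size` times the receive count. The review suggests assertions; this sketch returns a `Result` instead so that callers can choose whether to panic.

```rust
/// Hypothetical helper (not from the PR): all-gather requires
/// recv_len == send_len * world_size.
pub fn check_all_gather(
    send_len: usize,
    recv_len: usize,
    world_size: usize,
) -> Result<(), String> {
    match send_len.checked_mul(world_size) {
        Some(expected) if expected == recv_len => Ok(()),
        Some(expected) => Err(format!(
            "all_gather: recv buffer has {} elements, expected {}",
            recv_len, expected
        )),
        None => Err("all_gather: send_len * world_size overflows usize".to_string()),
    }
}

/// Hypothetical helper (not from the PR): reduce-scatter requires
/// send_len == recv_len * world_size.
pub fn check_reduce_scatter(
    send_len: usize,
    recv_len: usize,
    world_size: usize,
) -> Result<(), String> {
    match recv_len.checked_mul(world_size) {
        Some(expected) if expected == send_len => Ok(()),
        Some(expected) => Err(format!(
            "reduce_scatter: send buffer has {} elements, expected {}",
            send_len, expected
        )),
        None => Err("reduce_scatter: recv_len * world_size overflows usize".to_string()),
    }
}
```

A caller that prefers the assertion style could wrap either check in `assert!(check_all_gather(..).is_ok())` before enqueueing the collective.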

Comment thread src/nccl/safe.rs
Comment on lines +475 to +478
pub struct Group<'a> {
    comm: &'a Comm,
    syncs: Vec<SyncOnDrop<'a>>,
}

Contributor

critical

Group must not be Send because NCCL groups are thread-local. According to NCCL documentation, ncclGroupEnd must be called by the same thread that called ncclGroupStart. Since Group implements RAII for these calls, moving a Group to another thread would result in undefined behavior when it is dropped. Adding a PhantomData<*const ()> marker will prevent the struct from being Send.

pub struct Group<'a> {
    comm: &'a Comm,
    syncs: Vec<SyncOnDrop<'a>>,
    _marker: std::marker::PhantomData<*const ()>,
}
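
A minimal, NCCL-free sketch of the pattern described in this comment: an RAII guard that is made non-`Send` with a `PhantomData<*const ()>` marker. The thread-local counter is only a stand-in for NCCL's per-thread group state, and `GroupGuard` and `current_depth` are hypothetical names chosen for illustration, not the crate's API.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

thread_local! {
    // Stand-in for per-thread group nesting state (mock, not NCCL).
    static GROUP_DEPTH: Cell<u32> = Cell::new(0);
}

/// RAII guard: "starts" a group when created, "ends" it when dropped.
pub struct GroupGuard {
    // Raw pointers are neither Send nor Sync, so this zero-sized marker
    // makes GroupGuard !Send and !Sync without storing any data.
    _not_send: PhantomData<*const ()>,
}

impl GroupGuard {
    pub fn new() -> Self {
        // Mock of group_start(): bump this thread's nesting depth.
        GROUP_DEPTH.with(|d| d.set(d.get() + 1));
        GroupGuard { _not_send: PhantomData }
    }
}

impl Drop for GroupGuard {
    fn drop(&mut self) {
        // Mock of group_end(): runs on the thread that owns the guard,
        // which is the creating thread because the guard cannot be sent.
        GROUP_DEPTH.with(|d| d.set(d.get() - 1));
    }
}

/// Nesting depth of the calling thread.
pub fn current_depth() -> u32 {
    GROUP_DEPTH.with(|d| d.get())
}

// The following does not compile, because GroupGuard is not Send:
//
//     let guard = GroupGuard::new();
//     std::thread::spawn(move || drop(guard));
```

Because the marker also removes `Sync`, a shared reference to the guard cannot cross threads either, which is stricter than the comment requires but harmless for a guard of this kind.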

Comment thread src/nccl/safe.rs
Comment on lines +488 to +494
    pub fn group(&self) -> Group<'_> {
        group_start().unwrap();
        Group {
            comm: self,
            syncs: Vec::new(),
        }
    }

Contributor

high

group_start() can fail (e.g., if the maximum number of nested groups is reached). This method should return a Result to allow the caller to handle such errors gracefully instead of panicking via unwrap().

    pub fn group(&self) -> Result<Group<'_>, result::NcclError> {
        group_start()?;
        Ok(Group {
            comm: self,
            syncs: Vec::new(),
            _marker: std::marker::PhantomData,
        })
    }
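
To make the difference concrete, here is a self-contained sketch of a fallible constructor paired with an RAII drop. Everything in it is a mock: `MockComm`, `MockGroup`, `MockNcclError`, and the nesting limit of 2 are assumptions for illustration and do not reflect the crate's types or NCCL's real limits.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

/// Mock error type standing in for the crate's NCCL error.
#[derive(Debug, PartialEq)]
pub struct MockNcclError(pub &'static str);

// Arbitrary limit, chosen only so the failure path is easy to trigger.
const MAX_NESTING: u32 = 2;

thread_local! {
    static DEPTH: Cell<u32> = Cell::new(0);
}

// Mock of group_start(): fails once the nesting limit is reached.
fn mock_group_start() -> Result<(), MockNcclError> {
    DEPTH.with(|d| {
        if d.get() >= MAX_NESTING {
            Err(MockNcclError("group nesting limit reached"))
        } else {
            d.set(d.get() + 1);
            Ok(())
        }
    })
}

// Mock of group_end().
fn mock_group_end() {
    DEPTH.with(|d| d.set(d.get() - 1));
}

pub struct MockComm;

pub struct MockGroup<'a> {
    _comm: &'a MockComm,
    _marker: PhantomData<*const ()>,
}

impl MockComm {
    /// Propagates the start error with `?` instead of panicking.
    /// On failure no guard is built, so no unmatched "end" ever runs.
    pub fn group(&self) -> Result<MockGroup<'_>, MockNcclError> {
        mock_group_start()?;
        Ok(MockGroup { _comm: self, _marker: PhantomData })
    }
}

impl<'a> Drop for MockGroup<'a> {
    fn drop(&mut self) {
        mock_group_end();
    }
}
```

With this shape, a caller can write `let group = comm.group()?;` and decide how to recover, whereas the `unwrap()` version would abort the calling thread on the same error.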

@chelsea0x3b chelsea0x3b merged commit 2b7ace7 into main May 15, 2026
37 checks passed
@chelsea0x3b chelsea0x3b deleted the safe-nccl-group branch May 15, 2026 15:57

Development

Successfully merging this pull request may close these issues.

NCCL group_start()/group_end() are not event aware