Commit e90ebe6

Address PR microsoft#771 review feedback, add pip install and docs
Review feedback (chhwang):
- TorchCommMSCCLPP::init(): replace raw cudaSetDevice with RAII CudaDeviceGuard to restore previous device on return/exception
- TorchCommMSCCLPP::init(): remove redundant cudaGetDevice call, use device_.index() directly for compute capability queries
- Add pip install support via separate mscclpp-torchcomms package with pyproject.toml, scikit-build-core, and auto-discovery of backend .so
- docs/quickstart.md: add tested version table

Review feedback (Copilot bot):
- TorchCommMSCCLPPBootstrap: add "_" delimiter between name and counter in store key to prevent collisions, make counter_ std::atomic<int>
- TorchCommMSCCLPP::finalize(): wrap cudaStreamSynchronize and cudaStreamDestroy with MSCCLPP_CUDATHROW for error surfacing
- All 4 supported collectives: replace tensor.contiguous() with TORCH_CHECK(tensor.is_contiguous()) to prevent silently dropping results for non-contiguous tensors
- CMakeLists.txt: replace manual glog search with find_package(glog REQUIRED) for consistency with codebase conventions

Rename and documentation:
- Rename python/mscclpp_torchcomm to python/mscclpp_torchcomms for consistency with the torchcomms library naming
- Add docs/torchcomms.md: standalone doc covering architecture, algorithm selection, user-defined algorithms, testing, benchmarks, limitations, and troubleshooting
- Slim down quickstart.md TorchComms section to brief snippet + link
- Add torchcomms entry to docs/index.rst
- Add import mscclpp_torchcomms to all test/benchmark files for automatic backend .so discovery (no env var needed)
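
The store-key collision mentioned in the Copilot feedback can be illustrated with a small sketch. It is written in Python rather than the backend's C++, and `make_key` is a hypothetical helper: the real key format in TorchCommMSCCLPPBootstrap is not shown on this page, so the names and format below are assumptions.

```python
def make_key(name: str, counter: int, delimiter: str = "") -> str:
    """Hypothetical stand-in for building a store key from a name and a counter."""
    return f"{name}{delimiter}{counter}"


# Without a delimiter, two different (name, counter) pairs can yield the same key.
collision_a = make_key("comm1", 2)
collision_b = make_key("comm", 12)

# With "_" between the two parts, the same pairs map to distinct keys.
fixed_a = make_key("comm1", 2, "_")
fixed_b = make_key("comm", 12, "_")

print(collision_a == collision_b)  # True
print(fixed_a == fixed_b)          # False
```

Because the counter is always rendered as digits, the last "_" in the key marks the boundary between name and counter, even when the name itself contains underscores.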
1 parent 5ab276f commit e90ebe6

23 files changed

Lines changed: 463 additions & 115 deletions

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -255,5 +255,5 @@ endif()
 
 # TorchComms MSCCL++ backend
 if(MSCCLPP_BUILD_EXT_TORCHCOMMS)
-add_subdirectory(python/mscclpp_torchcomm)
+add_subdirectory(python/mscclpp_torchcomms)
 endif()

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@ You can find the followings from this documentation.
 
 - **Overview:** An overview of MSCCL++ and its features. :doc:`🔗 <overview>`
 - **Quick Start:** A guide to build, install, and run MSCCL++. :doc:`🔗 <quickstart>`
+- **TorchComms:** Using MSCCL++ as a TorchComms backend for PyTorch training. :doc:`🔗 <torchcomms>`
 - **MSCCL++ DSL:** A guide to get started with the MSCCL++ DSL. :doc:`🔗 <dsl>`
 - **Tutorials:** A step-by-step guide for GPU communication using MSCCL++. :doc:`🔗 <tutorials>`
 - **Programming Guide:** Advanced topics and best practices for using MSCCL++. :doc:`🔗 <programming_guide>`
@@ -22,6 +23,7 @@ You can find the followings from this documentation.
 
 overview
 quickstart
+torchcomms
 dsl
 tutorials
 programming_guide

docs/quickstart.md

Lines changed: 4 additions & 56 deletions
@@ -211,72 +211,20 @@ torchrun --nnodes=1 --nproc_per_node=8 your_script.py
 
 MSCCL++ integrates with [TorchComms](https://github.com/meta-pytorch/torchcomms), enabling PyTorch users to use MSCCL++ collectives through the TorchComms API. This is the recommended way to use MSCCL++ in PyTorch training for mixed-backend setups (e.g., MSCCL++ for allreduce, NCCL for broadcast/barrier).
 
-#### Building
-
-Prerequisites: PyTorch, pybind11, and [torchcomms](https://github.com/meta-pytorch/torchcomms) (`pip install --pre torchcomms`).
-
 ```bash
-$ mkdir -p build && cd build
-$ cmake -DCMAKE_BUILD_TYPE=Release \
--DMSCCLPP_BUILD_EXT_TORCHCOMMS=ON \
-..
-$ make -j$(nproc)
-$ cd ..
-```
-
-This produces `_comms_mscclpp.*.so` in the build output. TorchComms discovers MSCCL++ via the `TORCHCOMMS_BACKEND_LIB_PATH_MSCCLPP` environment variable, where `MSCCLPP_BUILD` is your MSCCL++ build directory.
-
-#### Usage
-
-```bash
-$ export TORCHCOMMS_BACKEND_LIB_PATH_MSCCLPP=$MSCCLPP_BUILD/lib/_comms_mscclpp.cpython-*.so
-$ torchrun --nproc_per_node=8 your_script.py
+$ python -m pip install ./python/mscclpp_torchcomms
 ```
 
 ```python
-import torch
 import torchcomms
+import mscclpp_torchcomms # auto-registers the backend
 
-# Create an MSCCL++ communicator
-comm = torchcomms.new_comm("mscclpp", torch.device(f"cuda:{local_rank}"), name="my_comm")
-
-# Run allreduce (MSCCL++ automatically selects the best algorithm)
+comm = torchcomms.new_comm("mscclpp", device, name="my_comm")
 comm.all_reduce(tensor, torchcomms.ReduceOp.SUM, False)
-
-# Cleanup
 comm.finalize()
 ```
 
-#### Supported Collectives
-
-| Collective | Status | Notes |
-|---|---|---|
-| AllReduce | Supported | SUM, MIN. Auto-selects from ~10 native algorithms by message size and topology |
-| AllGather | Supported | Fullmesh algorithms |
-| ReduceScatter | Dispatched | Requires a registered DSL algorithm |
-| AllToAll | Dispatched | Requires a registered DSL algorithm |
-| All others | Not supported | Throws with guidance to use a separate NCCL/RCCL communicator |
-
-#### Environment Variables
-
-| Variable | Description |
-|---|---|
-| `TORCHCOMMS_BACKEND_LIB_PATH_MSCCLPP` | **Required.** Path to the built `_comms_mscclpp.*.so` module |
-
-#### Running Tests
-
-```bash
-$ export TORCHCOMMS_BACKEND_LIB_PATH_MSCCLPP=$MSCCLPP_BUILD/lib/_comms_mscclpp.cpython-*.so
-$ torchrun --nproc_per_node=8 test/torchcomms/test_correctness.py --all
-```
-
-#### Running Benchmarks
-
-```bash
-$ export TORCHCOMMS_BACKEND_LIB_PATH_MSCCLPP=$MSCCLPP_BUILD/lib/_comms_mscclpp.cpython-*.so
-$ torchrun --nproc_per_node=8 test/torchcomms/bench_torchcomms.py --collective allreduce --warmup 100 --iters 200
-$ torchrun --nproc_per_node=8 test/torchcomms/bench_torchcomms.py --collective allgather --warmup 100 --iters 200
-```
+See [TorchComms Integration](torchcomms.md) for full documentation including architecture, algorithm selection, user-defined algorithms, testing, benchmarks, and troubleshooting.
 
 ## Version Tracking
 
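
The slimmed-down quickstart snippet leaves `device` and `tensor` undefined. A fuller sketch follows, assuming a launch through `torchrun` (which sets `LOCAL_RANK`) and reusing only the calls that appear in this diff. The `local_rank` helper, the `main` wrapper, and the tensor shape are illustrative choices, not part of the documented API.

```python
import os


def local_rank(env=None) -> int:
    """Read the per-node rank exported by torchrun; default to 0 (illustrative helper)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def main() -> None:
    # Imported lazily so the helper above works without the GPU stack installed.
    import torch
    import torchcomms
    import mscclpp_torchcomms  # noqa: F401  (imported so the backend registers itself)

    device = torch.device(f"cuda:{local_rank()}")
    comm = torchcomms.new_comm("mscclpp", device, name="my_comm")

    tensor = torch.ones(1024, device=device)  # arbitrary example payload
    # Per the commit message, the collectives now reject non-contiguous tensors.
    assert tensor.is_contiguous()
    comm.all_reduce(tensor, torchcomms.ReduceOp.SUM, False)

    comm.finalize()


# torchrun sets LOCAL_RANK, so main() only runs when launched through it.
if "LOCAL_RANK" in os.environ:
    main()
```

Saved as `your_script.py`, this could be launched with the `torchrun --nproc_per_node=8 your_script.py` command shown in the removed lines of the diff.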
