Commit e90ebe6
Address PR microsoft#771 review feedback, add pip install and docs
Review feedback (chhwang):
- TorchCommMSCCLPP::init(): replace raw cudaSetDevice with RAII
  CudaDeviceGuard to restore previous device on return/exception
- TorchCommMSCCLPP::init(): remove redundant cudaGetDevice call, use
  device_.index() directly for compute capability queries
- Add pip install support via separate mscclpp-torchcomms package with
  pyproject.toml, scikit-build-core, and auto-discovery of backend .so
- docs/quickstart.md: add tested version table
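The RAII device-guard item above can be illustrated with a minimal sketch. This is not the CudaDeviceGuard from the commit: `fake_cuda`, `DeviceGuard`, and `initOn` are hypothetical stand-ins so the example compiles without the CUDA runtime, whereas real code would call cudaGetDevice/cudaSetDevice and check their return codes. It only shows why a guard restores the previous device on both normal return and exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the CUDA runtime's per-thread current device.
namespace fake_cuda {
inline int& deviceSlot() {
  static int device = 0;
  return device;
}
inline int getDevice() { return deviceSlot(); }
inline void setDevice(int device) { deviceSlot() = device; }
}  // namespace fake_cuda

// Illustrative guard: remembers the current device, switches to `device`,
// and restores the remembered device in the destructor. The destructor also
// runs during stack unwinding, so an exception cannot leak the switch.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : previous_(fake_cuda::getDevice()) {
    if (device != previous_) fake_cuda::setDevice(device);
  }
  ~DeviceGuard() { fake_cuda::setDevice(previous_); }

  // Non-copyable: exactly one owner is responsible for the restore.
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int previous_;
};

// Hypothetical init-like function: runs on `device` and may fail part-way.
inline int initOn(int device, bool fail) {
  DeviceGuard guard(device);
  if (fail) throw std::runtime_error("init failed");
  return fake_cuda::getDevice();  // device seen while the guard is active
}
```

With a raw set-device call, the early `throw` would leave the caller on the wrong device; with the guard, the caller's device is unchanged after `initOn` returns or throws.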
Review feedback (Copilot bot):
- TorchCommMSCCLPPBootstrap: add "_" delimiter between name and counter
  in store key to prevent collisions, make counter_ std::atomic<int>
- TorchCommMSCCLPP::finalize(): wrap cudaStreamSynchronize and
  cudaStreamDestroy with MSCCLPP_CUDATHROW for error surfacing
- All 4 supported collectives: replace tensor.contiguous() with
  TORCH_CHECK(tensor.is_contiguous()) to prevent silently dropping
  results for non-contiguous tensors
- CMakeLists.txt: replace manual glog search with find_package(glog
  REQUIRED) for consistency with codebase conventions
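The store-key item above can be illustrated with a small sketch of why a delimiter matters. The exact key format used by TorchCommMSCCLPPBootstrap is not shown in this commit message, so `keyNoDelimiter`, `keyWithDelimiter`, and `KeyGenerator` are hypothetical names that only demonstrate the collision and the atomic counter.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <utility>

// Without a delimiter, ("comm1", 0) and ("comm", 10) both produce "comm10".
inline std::string keyNoDelimiter(const std::string& name, int counter) {
  return name + std::to_string(counter);
}

// With "_" between name and counter, the same pairs produce "comm1_0" and
// "comm_10". For non-negative counters the suffix is digits only, so the
// last "_" always marks where the name ends.
inline std::string keyWithDelimiter(const std::string& name, int counter) {
  return name + "_" + std::to_string(counter);
}

// Hypothetical generator: fetch_add returns the old value and increments in
// one atomic step, so concurrent callers never receive the same counter.
class KeyGenerator {
 public:
  explicit KeyGenerator(std::string name) : name_(std::move(name)) {}

  std::string next() { return keyWithDelimiter(name_, counter_.fetch_add(1)); }

 private:
  std::string name_;
  std::atomic<int> counter_{0};
};
```

A plain `int` counter incremented from several threads is a data race and could hand out the same value twice; the atomic counter removes that possibility without a lock.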
Rename and documentation:
- Rename python/mscclpp_torchcomm to python/mscclpp_torchcomms for
  consistency with the torchcomms library naming
- Add docs/torchcomms.md: standalone doc covering architecture,
  algorithm selection, user-defined algorithms, testing, benchmarks,
  limitations, and troubleshooting
- Slim down quickstart.md TorchComms section to brief snippet + link
- Add torchcomms entry to docs/index.rst
- Add import mscclpp_torchcomms to all test/benchmark files for
  automatic backend .so discovery (no env var needed)

1 parent 5ab276f commit e90ebe6
23 files changed
Lines changed: 463 additions & 115 deletions
File tree
- docs
- python
- mscclpp_torchcomms
- csrc
- mscclpp_torchcomm
- test/torchcomms