Commit f5562fc

Update README and rccl build/tuning guides for rocm-systems repo
1 parent 50fa5ab commit f5562fc

3 files changed

Lines changed: 206 additions & 117 deletions

README.md

Lines changed: 10 additions & 10 deletions
@@ -9,7 +9,7 @@ This repository provides utility scripts to simplify the process of setting up t
 Setting up NCCL or RCCL on Slingshot involves several steps, including downloading source code, configuring dependencies, and compiling libraries. These scripts ameliorate the complexities by:
 
 - Bringing together the lessons learned from a 4 month collaboration between HPE, Nvidia, and CSCS which addressed collective communications performance at scale, performance variability, and workload hangs.
-- Automating the download and build process for [NVIDIA NCCL](https://github.com/NVIDIA/nccl) or [ROCm RCCL](https://github.com/ROCm/rccl), the [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl), and [NCCL Tests](https://github.com/NVIDIA/nccl-tests) or [RCCL Tests](https://github.com/ROCm/rccl-tests) (all optional).
+- Automating the download and build process for [NVIDIA NCCL](https://github.com/NVIDIA/nccl) or [ROCm RCCL](https://github.com/ROCm/rocm-systems) (under `projects/rccl`), the [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl), and [NCCL Tests](https://github.com/NVIDIA/nccl-tests) or [RCCL Tests](https://github.com/ROCm/rocm-systems) (under `projects/rccl-tests`) (all optional).
 - Parameterizing dependency versions like CUDA, ROCm, and Libfabric to make it easier to compose custom experiments with different library versions.
 - The scripts always generate log files, so if you run out of scroll back buffer or there is a subtle difference in the build output, you have a better chance of catching the issue/behavior.
 
@@ -69,7 +69,7 @@ The scripts can be run with no command line arguments, or one can override the d
 | Option | Description | Default |
 |-------------------------------|----------------------------------------------------------------------------------------------|---------------------------------|
 | `-b, --base-dir <path>` | Base directory for builds | Current directory (`pwd`) |
-| `-l, --libfabric-path <path>` | Path to the Libfabric installation | `/opt/cray/libfabric/1.22.0` |
+| `-l, --libfabric-path <path>` | Path to the Libfabric installation | `/opt/cray/libfabric/2.2.0rc1` |
 | `-p, --parallelism <threads>` | Number of threads for parallel builds | 16 |
 | `-n, --nccl-version <version>`| NCCL version to build | `v2.27.7-1` |
 | `-r, --rccl-version <version>`| RCCL version to build | `rocm-6.4.0` |
@@ -136,9 +136,9 @@ Upon successful execution, the following components will be available:
 
 | Component | Path |
 |--------------------------|----------------------------------------------------------------------|
-| RCCL build artifacts | `<base-dir>/rccl/build` |
+| RCCL build artifacts | `<base-dir>/rocm-systems/projects/rccl/build/release` |
 | AWS OFI NCCL plugin | `<base-dir>/aws-ofi-nccl/src/.libs` |
-| RCCL Tests (if built) | `<base-dir>/rccl-tests/build` |
+| RCCL Tests (if built) | `<base-dir>/rocm-systems/projects/rccl-tests/build` |
 
 Additionally, a timestamped log file will be saved in the log directory for debugging/troubleshooting.
 
@@ -177,9 +177,9 @@ srun --ntasks-per-node=4 --cpus-per-task=72 --network=disable_rdzv_get ./all_red
 Setup Environment with build artifacts
 ```
 # Setting up paths to dependencies
-export RCCL_HOME=$(pwd)/rccl/build
+export RCCL_HOME=$(pwd)/rocm-systems/projects/rccl/build/release
 export AWS_OFI_NCCL_HOME=$(pwd)/aws-ofi-nccl/src/.libs
-export RCCL_TESTS_HOME=$(pwd)/rccl-tests/build
+export RCCL_TESTS_HOME=$(pwd)/rocm-systems/projects/rccl-tests/build
 
 export LD_LIBRARY_PATH=$RCCL_HOME:${LD_LIBRARY_PATH}
 export LD_LIBRARY_PATH=$AWS_OFI_NCCL_HOME:${LD_LIBRARY_PATH}
@@ -192,8 +192,8 @@ Setup RCCL - Slingshot variables
 
 Run RCCL-Tests
 ```
-cd <base-dir>/rccl-tests/build
-srun --ntasks-per-node=4 --cpus-per-task=72 --network=disable_rdzv_get ./all_reduce_perf -b 8 -e 4G -f 2
+srun --ntasks-per-node=8 --cpus-per-task=16 --network=disable_rdzv_get \
+all_reduce_perf -b 8 -e 4G -f 2 -g 1
 ```
 
 ## Troubleshooting
@@ -212,9 +212,9 @@ srun --ntasks-per-node=4 --cpus-per-task=72 --network=disable_rdzv_get ./all_red
 
 ## Links/Resources
 - [NVIDIA NCCL](https://github.com/NVIDIA/nccl)
-- [ROCm RCCL](https://github.com/ROCm/rccl)
+- [ROCm RCCL](https://github.com/ROCm/rocm-systems) (under `projects/rccl`)
 - [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl)
 - [NCCL Tests](https://github.com/NVIDIA/nccl-tests)
-- [RCCL Tests](https://github.com/ROCm/rccl-tests)
+- [RCCL Tests](https://github.com/ROCm/rocm-systems) (under `projects/rccl-tests`)
 
 ---
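
The relocated build paths in this diff can be sanity-checked after a build with a small shell helper. The sketch below is not part of the commit or the repository's scripts: `check_artifacts` is a hypothetical name, and the three relative paths are simply copied from the updated component table in the README diff above. It assumes the artifacts are plain directories under the base directory.

```shell
# Hypothetical helper (not part of the repository's scripts): reports which of
# the artifact directories named in the updated README are missing.
check_artifacts() {
    base_dir="${1:-$(pwd)}"
    missing=0
    for rel in \
        rocm-systems/projects/rccl/build/release \
        aws-ofi-nccl/src/.libs \
        rocm-systems/projects/rccl-tests/build
    do
        if [ ! -d "$base_dir/$rel" ]; then
            echo "missing: $base_dir/$rel"
            missing=$((missing + 1))
        fi
    done
    # Succeeds only when every listed directory exists.
    [ "$missing" -eq 0 ]
}
```

Invoked as `check_artifacts <base-dir>`, it prints one line per absent directory and returns a non-zero status if any is missing, which makes a stale `RCCL_HOME` or `RCCL_TESTS_HOME` pointing at the old `rccl/build` layout easier to spot.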
