14 changes: 7 additions & 7 deletions README.md
@@ -9,7 +9,7 @@ This repository provides utility scripts to simplify the process of setting up t
Setting up NCCL or RCCL on Slingshot involves several steps, including downloading source code, configuring dependencies, and compiling libraries. These scripts ameliorate the complexities by:

- Bringing together the lessons learned from a 4-month collaboration between HPE, NVIDIA, and CSCS which addressed collective communications performance at scale, performance variability, and workload hangs.
- Automating the download and build process for [NVIDIA NCCL](https://github.com/NVIDIA/nccl) or [ROCm RCCL](https://github.com/ROCm/rccl), the [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl), and [NCCL Tests](https://github.com/NVIDIA/nccl-tests) or [RCCL Tests](https://github.com/ROCm/rccl-tests) (all optional).
- Automating the download and build process for [NVIDIA NCCL](https://github.com/NVIDIA/nccl) or [ROCm RCCL](https://github.com/ROCm/rocm-systems) (under `projects/rccl`), the [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl), and [NCCL Tests](https://github.com/NVIDIA/nccl-tests) or [RCCL Tests](https://github.com/ROCm/rocm-systems) (under `projects/rccl-tests`) (all optional).
- Parameterizing dependency versions like CUDA, ROCm, and Libfabric to make it easier to compose custom experiments with different library versions.
- The scripts always generate log files, so if you run out of scroll back buffer or there is a subtle difference in the build output, you have a better chance of catching the issue/behavior.

@@ -136,9 +136,9 @@ Upon successful execution, the following components will be available:

| Component | Path |
|--------------------------|----------------------------------------------------------------------|
| RCCL build artifacts | `<base-dir>/rccl/build` |
| RCCL build artifacts | `<base-dir>/rocm-systems/projects/rccl/build/release` |
| AWS OFI NCCL plugin | `<base-dir>/aws-ofi-nccl/src/.libs` |
| RCCL Tests (if built) | `<base-dir>/rccl-tests/build` |
| RCCL Tests (if built) | `<base-dir>/rocm-systems/projects/rccl-tests/build` |

Additionally, a timestamped log file will be saved in the log directory for debugging/troubleshooting.

@@ -177,9 +177,9 @@ srun --ntasks-per-node=4 --cpus-per-task=72 --network=disable_rdzv_get ./all_red
Setup Environment with build artifacts
```
# Setting up paths to dependencies
export RCCL_HOME=$(pwd)/rccl/build
export RCCL_HOME=$(pwd)/rocm-systems/projects/rccl/build/release
export AWS_OFI_NCCL_HOME=$(pwd)/aws-ofi-nccl/src/.libs
export RCCL_TESTS_HOME=$(pwd)/rccl-tests/build
export RCCL_TESTS_HOME=$(pwd)/rocm-systems/projects/rccl-tests/build

export LD_LIBRARY_PATH=$RCCL_HOME:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$AWS_OFI_NCCL_HOME:${LD_LIBRARY_PATH}
@@ -212,9 +212,9 @@ srun --ntasks-per-node=4 --cpus-per-task=72 --network=disable_rdzv_get ./all_red

## Links/Resources
- [NVIDIA NCCL](https://github.com/NVIDIA/nccl)
- [ROCm RCCL](https://github.com/ROCm/rccl)
- [ROCm RCCL](https://github.com/ROCm/rocm-systems) (under `projects/rccl`)
- [AWS OFI NCCL Plugin](https://github.com/aws/aws-ofi-nccl)
- [NCCL Tests](https://github.com/NVIDIA/nccl-tests)
- [RCCL Tests](https://github.com/ROCm/rccl-tests)
- [RCCL Tests](https://github.com/ROCm/rocm-systems) (under `projects/rccl-tests`)

---
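The relocated artifact paths in the table above can be checked after a build with a small helper. This is a minimal sketch, not part of the repository: the function name `check_rccl_artifacts` is hypothetical, and it assumes all three directories are expected, including the rccl-tests one, which the README lists as present only if the tests were built.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): report whether the
# build artifact directories listed in the README table exist under a
# given base directory. Returns 0 if all are present, 1 otherwise.
check_rccl_artifacts() {
    local base_dir="$1"
    local missing=0
    local rel
    for rel in \
        "rocm-systems/projects/rccl/build/release" \
        "aws-ofi-nccl/src/.libs" \
        "rocm-systems/projects/rccl-tests/build"; do
        if [ -d "$base_dir/$rel" ]; then
            echo "found:   $base_dir/$rel"
        else
            echo "missing: $base_dir/$rel"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_rccl_artifacts "$(pwd)"` from the base directory before exporting `RCCL_HOME` and `RCCL_TESTS_HOME` would flag a checkout that still uses the old `rccl/build` layout. If the tests were skipped, the last entry would need to be dropped from the list.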
137 changes: 97 additions & 40 deletions rccl/build_rccl_environment.sh
100644 → 100755
@@ -15,6 +15,9 @@ ROCM_VERSION="rocm-6.4.0"
SKIP_CLONE=false
SKIP_TESTS=false
LOG_DIR="$BASE_DIR/logs"
# rocm-systems is the unified super-repo containing both rccl and rccl-tests
ROCM_SYSTEMS_REPO="https://github.com/ROCm/rocm-systems.git"
AWS_OFI_VERSION="v1.18.0"

# Help
usage() {
@@ -62,11 +65,41 @@ echo "============================="
echo "Build log: $LOG_FILE"
echo "============================="

# Install locations (best-effort paths)
RCCL_HOME="$BASE_DIR/rccl/build"
# The rocm-systems repository is cloned into BASE_DIR/rocm-systems.
# rccl and rccl-tests source live under rocm-systems/projects/.
ROCM_SYSTEMS_DIR="$BASE_DIR/rocm-systems"
RCCL_SRC="$ROCM_SYSTEMS_DIR/projects/rccl"
RCCL_TESTS_SRC="$ROCM_SYSTEMS_DIR/projects/rccl-tests"

# Build output locations
# install.sh builds into build/release inside the source tree
RCCL_HOME="$RCCL_SRC/build/release"
HWLOC_HOME="$BASE_DIR/hwloc"
AWS_OFI_NCCL_HOME="$BASE_DIR/aws-ofi-nccl/src/.libs"
RCCL_TESTS_HOME="$BASE_DIR/rccl-tests/build"
RCCL_TESTS_HOME="$RCCL_TESTS_SRC/build"

# Basic preflight: ROCM_PATH
if [ -z "$ROCM_PATH" ]; then
echo "Warning: ROCM_PATH is not set. Attempting to use /opt/$ROCM_VERSION"
export ROCM_PATH="/opt/$ROCM_VERSION"
fi

# Confirm cmake >= 3.22 (required for --toolchain flag used by rccl/install.sh)
CMAKE_VERSION=$(cmake --version 2>/dev/null | awk 'NR==1{print $3}')
CMAKE_MAJOR=$(echo "$CMAKE_VERSION" | cut -d. -f1)
CMAKE_MINOR=$(echo "$CMAKE_VERSION" | cut -d. -f2)
if [ "${CMAKE_MAJOR:-0}" -lt 3 ] || { [ "${CMAKE_MAJOR:-0}" -eq 3 ] && [ "${CMAKE_MINOR:-0}" -lt 22 ]; }; then
echo "ERROR: cmake >= 3.22 is required (found: ${CMAKE_VERSION:-none})."
echo " Run this script on a compute node: srun -N1 --ntasks=1 $0 [options]"
exit 1
fi

# MPI: prefer CRAY_MPICH_PREFIX, fall back to MPICH_DIR
MPI_PREFIX="${CRAY_MPICH_PREFIX:-${MPICH_DIR:-}}"
if [ -z "$MPI_PREFIX" ]; then
echo "Warning: Neither CRAY_MPICH_PREFIX nor MPICH_DIR is set."
echo " rccl-tests will be built without MPI support."
fi

cat <<EOF
=============================
@@ -81,17 +114,33 @@ Skip rccl-tests: $SKIP_TESTS
=============================
EOF

# Basic preflight
if [ -z "$ROCM_PATH" ]; then
echo "Warning: ROCM_PATH is not set. Attempting to use /opt/$ROCM_VERSION"
export ROCM_PATH="/opt/$ROCM_VERSION"
# ──────────────────────────────────────────────
# Clone rocm-systems (contains both rccl and rccl-tests under projects/)
# ──────────────────────────────────────────────
if [ "$SKIP_CLONE" = false ]; then
if [ ! -d "$ROCM_SYSTEMS_DIR" ]; then
echo "Cloning rocm-systems (this may take a few minutes)..."
git clone "$ROCM_SYSTEMS_REPO" "$ROCM_SYSTEMS_DIR" || {
echo "ERROR: Failed to clone rocm-systems from $ROCM_SYSTEMS_REPO"
exit 1
}
else
echo "rocm-systems directory already exists at $ROCM_SYSTEMS_DIR; skipping clone."
fi
fi

if [ -z "$MPICH_DIR" ]; then
echo "Note: MPICH_DIR not set; rccl-tests and MPI builds may need MPI_HOME provided via environment."
if [ ! -d "$RCCL_SRC" ]; then
echo "ERROR: RCCL source not found at $RCCL_SRC"
exit 1
fi
if [ ! -d "$RCCL_TESTS_SRC" ] && [ "$SKIP_TESTS" = false ]; then
echo "ERROR: rccl-tests source not found at $RCCL_TESTS_SRC"
exit 1
fi

# Clone and build hwloc (replay_hwloc_commands.sh logic)
# ──────────────────────────────────────────────
# Clone and build hwloc (needed by aws-ofi-nccl)
# ──────────────────────────────────────────────
if [ "$SKIP_CLONE" = false ]; then
if [ ! -d "$BASE_DIR/hwloc" ]; then
echo "Cloning hwloc..."
@@ -123,40 +172,48 @@ if [ -d "$BASE_DIR/aws-ofi-nccl" ]; then
popd
fi

# Build RCCL
if [ "$SKIP_CLONE" = false ]; then
if [ ! -d "$BASE_DIR/rccl" ]; then
echo "Cloning RCCL..."
git clone --recursive https://github.com/ROCm/rccl.git "$BASE_DIR/rccl" || { echo "Failed to clone RCCL"; exit 1; }
fi
fi
if [ -d "$BASE_DIR/rccl" ]; then
pushd "$BASE_DIR/rccl"
# If RCCL provides an install script, use hipcc as CXX similar to original script
if [ -x ./install.sh ]; then
CXX=hipcc srun ./install.sh --disable-msccl-kernel --fast || true
else
echo "No install.sh; attempting make"
make -j"$PARALLELISM" || true
fi
# ──────────────────────────────────────────────
# Build RCCL from rocm-systems/projects/rccl
# ──────────────────────────────────────────────
echo "Building RCCL from $RCCL_SRC ..."
pushd "$RCCL_SRC"
# --fast: local GPU arch only, no collective trace, no MSCCL kernels (fastest build)
# -j: parallel jobs
./install.sh --fast -j "$PARALLELISM" || {
echo "ERROR: RCCL install.sh failed"
popd
fi
exit 1
}
popd
echo "RCCL build complete. Artifacts in $RCCL_HOME"

# Clone and build rccl-tests (adapted from reproduce_rccl_tests.sh)
# ──────────────────────────────────────────────
# Build rccl-tests from rocm-systems/projects/rccl-tests
# ──────────────────────────────────────────────
if [ "$SKIP_TESTS" = false ]; then
if [ "$SKIP_CLONE" = false ] && [ ! -d "$BASE_DIR/rccl-tests" ]; then
git clone https://github.com/ROCm/rccl-tests.git "$BASE_DIR/rccl-tests" || { echo "Failed to clone rccl-tests"; exit 1; }
fi
if [ -d "$BASE_DIR/rccl-tests" ]; then
pushd "$BASE_DIR/rccl-tests"
echo "Listing rccl-tests directory"
pwd
ls -la || true
MPICC_PATH=${CRAY_MPICH_PREFIX}/bin/mpicc"
echo "Using MPICC at $MPICC_PATH"
make MPI=1 MPI_HOME="${CRAY_MPICH_PREFIX}" CXX=hipcc -j"$PARALLELISM" || true
popd
echo "Building rccl-tests from $RCCL_TESTS_SRC ..."
pushd "$RCCL_TESTS_SRC"

# rccl-tests Makefile uses NCCL_HOME to find rccl headers/library
if [ -n "$MPI_PREFIX" ]; then
make MPI=1 \
MPI_HOME="$MPI_PREFIX" \
NCCL_HOME="$RCCL_HOME" \
CUSTOM_RCCL_LIB="$RCCL_HOME/librccl.so" \
HIPCC="$ROCM_PATH/bin/hipcc" \
-j"$PARALLELISM" || {
echo "ERROR: rccl-tests build failed"; popd; exit 1
}
else
make NCCL_HOME="$RCCL_HOME" \
CUSTOM_RCCL_LIB="$RCCL_HOME/librccl.so" \
HIPCC="$ROCM_PATH/bin/hipcc" \
-j"$PARALLELISM" || {
echo "ERROR: rccl-tests build failed (no MPI)"; popd; exit 1
}
fi
popd
echo "rccl-tests build complete. Artifacts in $RCCL_TESTS_HOME"
fi

echo "============================="
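The cmake preflight in this script compares the major and minor version fields as integers. The same comparison can be sketched as a reusable function. This is an illustration only: `version_at_least` is a hypothetical name that does not appear in the script, and it assumes both arguments begin with numeric `MAJOR.MINOR` fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in build_rccl_environment.sh): succeed when the
# found version is at least the required MAJOR.MINOR version.
#   $1 = found version, e.g. "3.22.1" (may be empty if the tool is absent)
#   $2 = required version, e.g. "3.22"
version_at_least() {
    local found_major found_minor req_major req_minor
    found_major=$(echo "$1" | cut -d. -f1)
    found_minor=$(echo "$1" | cut -d. -f2)
    req_major=$(echo "$2" | cut -d. -f1)
    req_minor=$(echo "$2" | cut -d. -f2)
    # An empty field falls back to 0, so a missing tool counts as too old.
    [ "${found_major:-0}" -gt "$req_major" ] && return 0
    [ "${found_major:-0}" -eq "$req_major" ] && [ "${found_minor:-0}" -ge "$req_minor" ]
}
```

With this helper, the check could read `version_at_least "$CMAKE_VERSION" 3.22 || exit 1`. Comparing integer fields rather than strings matters here, since a plain string comparison would rank `3.9` above `3.22`.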
2 changes: 1 addition & 1 deletion rccl/rccl_tuning_guide.md
@@ -54,7 +54,7 @@ For the best performance and stability, always use the most recent version of th

### OFI (Libfabric) Backend for RCCL

To enable high-performance RDMA, you must use the OFI Plug-In for Libfabric-to-RCCL. This open-source backend can be downloaded from GitHub at `https://github.com/ROCm/aws-ofi-rccl`. HPE and AMD have collaborated to ensure this plugin works with Slingshot NICs. Currently, users must build the code from the repository, as HPE does not provide pre-packaged RPMs.
To enable high-performance RDMA, you must use the OFI Plug-In for Libfabric-to-RCCL. This open-source backend can be downloaded from GitHub at `https://github.com/aws/aws-ofi-nccl`. HPE and AMD have collaborated to ensure this plugin works with Slingshot NICs. Currently, users must build the code from the repository, as HPE does not provide pre-packaged RPMs.

### GPU Driver and User Stack Compatibility
