This repository implements a general process for building recent versions of `pytorch` (circa 2024) from source on Derecho.
The purpose is to build a version of `pytorch` for use in distributed ML-training workflows that makes optimal use of the Cray-EX Slingshot 11 (SS11) interconnect.
Distributed ML in general, and SS11 in particular, pose some challenges that drive us to build from source rather than choose any of the pre-built `pytorch` versions
available from e.g. `conda-forge`. Specifically:
- We want to enable a CUDA-Aware MPI backend using `cray-mpich`. (Currently, any level of MPI support in `pytorch` requires building from source.)
- We want to use an SS11-optimized NCCL. As of this writing, this requires compiling NCCL from source, along with using the AWS OFI NCCL Plugin at specific versions and with specific runtime environment variable settings.
  - Note that when installing `pytorch` from `conda-forge`, a non-optimal NCCL will generally be installed. The application may appear functional, but performance will be much degraded for distributed training.
  - Therefore the approach taken here is to install the desired NCCL + plugin, and point `pytorch` to this version at build time to minimize the likelihood of using a non-optimal version.
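As a rough sketch of what "pointing `pytorch` at an external NCCL" means at build time (the install prefix below is a placeholder, and this repo's `config_env.sh` sets the authoritative values; `USE_SYSTEM_NCCL` and `NCCL_ROOT` are standard `pytorch` build knobs):

```shell
# Illustrative sketch only -- the prefix is a placeholder; config_env.sh in
# this repo sets the real values for Derecho.
export USE_SYSTEM_NCCL=1                     # build against an external NCCL, not the bundled copy
export NCCL_ROOT=${HOME}/nccl-ofi/install    # prefix of the source-built, SS11-capable NCCL

# At run time the AWS OFI plugin (libnccl-net.so) must also be found first
# on the library search path:
export LD_LIBRARY_PATH=${NCCL_ROOT}/lib:${LD_LIBRARY_PATH:-}
```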
- Clone this repo:

  ```shell
  git clone https://github.com/benkirk/derecho-pytorch-mpi.git
  cd derecho-pytorch-mpi
  ```

- On a Derecho login node:

  ```shell
  export PBS_ACCOUNT=<my_project_ID>

  # build default version of pytorch (currently v2.3.1):
  make build-pytorch-v2.3.1-pbs

  # build pytorch-v2.4.0, also supported:
  export PYTORCH_VERSION=v2.4.0
  make build-pytorch-v2.4.0-pbs
  ```

- Run a sample `torch.distributed` + MPI backend test on 2 GPU nodes:

  ```shell
  # (from a login node)
  # (1) request an interactive PBS session with 2 GPU nodes:
  qsub -I -l select=2:ncpus=64:mpiprocs=4:ngpus=4 -A ${PBS_ACCOUNT} -q main -l walltime=00:30:00

  # (inside PBS)
  # (2) activate the conda environment:
  module load conda
  conda activate ./env-pytorch-v2.4.0-derecho-gcc-12.2.0-cray-mpich-8.1.27

  # (3) run a minimal torch.dist program with the MPI backend:
  mpiexec -n 8 -ppn 4 --cpu-bind numa ./tests/all_reduce_test.py
  ```
The process outlined above will create a minimal `conda` environment in the current directory, containing the `pytorch` build dependencies and the installed version of `pytorch` itself. The package list is defined in `config_env.sh`; users may elect to add packages to the embedded `conda.yaml` file, or later through the typical `conda install` command from within the environment.
- `config_env.sh`: Must be sourced to properly build `pytorch`.
  - Sourcing this file will activate the appropriate `env-pytorch-${PYTORCH_VERSION}-[...]` `conda` environment from the same directory.
    - If the `conda` environment does not exist, sourcing the script will create it. This in turn requires checking out the `pytorch` source tree, as it is needed to properly define the required `conda` build environment.
    - Therefore this script controls the packages added initially to the `conda` environment.
  - Defines environment variables required to build `pytorch`.
  - Creates an activation script `env-pytorch-${PYTORCH_VERSION}-[...]/etc/conda/activate.d/derecho-env_vars.sh` with preferred runtime settings.
  - After installation, the resulting `conda` environments can be activated directly without the need for `config_env.sh`, and should be compatible with the default module environment on Derecho.
    - Re-sourcing this script is not a problem if desired, and will result in the same module environment used to build `pytorch`.
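The `activate.d` mechanism referenced above is standard `conda` behavior: any `*.sh` script placed under `<env>/etc/conda/activate.d/` is sourced automatically on `conda activate`. A minimal sketch of that mechanism (the directory and variable names below are throwaway placeholders, not this repo's actual settings):

```shell
# Throwaway demo of conda's activate.d hook mechanism; the real script here
# is derecho-env_vars.sh and sets Derecho/SS11 runtime tuning variables.
demo_env=$(mktemp -d)                       # stand-in for env-pytorch-${PYTORCH_VERSION}-[...]
mkdir -p "$demo_env/etc/conda/activate.d"
cat > "$demo_env/etc/conda/activate.d/demo-env_vars.sh" <<'EOF'
# placeholder variable; the real file exports preferred runtime settings
export EXAMPLE_RUNTIME_SETTING=1
EOF

# `conda activate` sources every *.sh in activate.d; emulate that by hand:
source "$demo_env/etc/conda/activate.d/demo-env_vars.sh"
echo "EXAMPLE_RUNTIME_SETTING=$EXAMPLE_RUNTIME_SETTING"   # prints: EXAMPLE_RUNTIME_SETTING=1
rm -r "$demo_env"
```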
- `Makefile`: Contains convenient rules for automation and a reproducible process. Uses the environment variables `PYTORCH_VERSION` and `PBS_ACCOUNT`, with sensible defaults for each.
- `patches/${PYTORCH_VERSION}/*`: Any required version-specific patches are located in this directory tree, and are applied in `*`-wildcard order.
- `utils/build_nccl-ofi-plugin.sh`: Builds a compatible NCCL + AWS OFI plugin for use on Derecho with Cray's `libfabric`. Must be updated periodically as the underlying `libfabric` version changes.
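"`*`-wildcard order" means shell glob expansion order, i.e. lexicographic, which is why patch files carry numeric prefixes (`01-`, `02-`, ...). A throwaway demonstration (the directory and file names are invented for illustration; the real patches live under `patches/${PYTORCH_VERSION}/`):

```shell
# Demonstrate that glob expansion is lexicographic, so numeric prefixes
# control patch sequencing. Uses a temporary demo directory, not repo layout.
demo=$(mktemp -d)
touch "$demo/02-second.patch" "$demo/01-first.patch"   # created out of order on purpose

applied=""
for p in "$demo"/*; do                      # glob expands in sorted order
  applied="$applied $(basename "$p")"       # a real loop would run: patch -p1 < "$p"
done
echo "apply order:$applied"                 # prints: apply order: 01-first.patch 02-second.patch
rm -r "$demo"
```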
The `pytorch`-v2 sources only support a CUDA-Aware MPI backend when running under OpenMPI. This is due to some overzealous configuration checks that probe for CUDA support using `MPIX_...` extensions not available with Cray-MPICH, and which are implemented inside `#ifdef OPEN_MPI ...` guards anyway. Where these checks occur, when they fail the code falls back to assuming the MPI is not CUDA-Aware.

Fortunately the fix is fairly straightforward: find all the places these checks occur and instead fall back to assuming MPI is CUDA-Aware. For example, see `patches/v2.3.1/01-cuda-aware-mpi`.