This repository implements a general process for building recent versions of `pytorch` (circa 2024) from source on Derecho.
The purpose is to build a version of `pytorch` for use in distributed ML-training workflows that makes optimal use of the Cray-EX Slingshot 11 (SS11) interconnect.
Distributed ML in general, and SS11 in particular, pose some challenges that drive us to build from source rather than choose any of the pre-built `pytorch` versions
available from e.g. `conda-forge`. Specifically:
- We want to enable a CUDA-Aware MPI backend using `cray-mpich`. (Currently, any level of MPI support in `pytorch` requires building from source.)
- We want to use an SS11-optimized NCCL. As of this writing, this requires compiling NCCL from source, along with using the AWS OFI NCCL Plugin at specific versions and with specific runtime environment variable settings.
  - Note that when installing `pytorch` from `conda-forge`, a non-optimal NCCL will generally be installed. The application may appear functional, but performance will be much degraded for distributed training.
  - Therefore the approach taken here is to install the desired NCCL + plugin, and point `pytorch` to this version at build time to minimize the likelihood of using a non-optimal version.
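As a rough sketch of what "pointing `pytorch` at an external NCCL" means at build time (the install prefix below is a placeholder, and this repo's `config_env.sh` sets the authoritative values; `USE_SYSTEM_NCCL` and `NCCL_ROOT` are standard `pytorch` build knobs):

```shell
# Illustrative sketch only -- the prefix is a placeholder; config_env.sh in
# this repo sets the real values for Derecho.
export USE_SYSTEM_NCCL=1                     # build against an external NCCL, not the bundled copy
export NCCL_ROOT=${HOME}/nccl-ofi/install    # prefix of the source-built, SS11-capable NCCL

# At run time the AWS OFI plugin (libnccl-net.so) must also be found first
# on the library search path:
export LD_LIBRARY_PATH=${NCCL_ROOT}/lib:${LD_LIBRARY_PATH:-}
```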
- Clone this repo:

  ```shell
  git clone https://github.com/benkirk/derecho-pytorch-mpi.git
  cd derecho-pytorch-mpi
  ```

- On a Derecho login node:

  ```shell
  export PBS_ACCOUNT=<my_project_ID>

  # build default version of pytorch (currently v2.3.1):
  make build-pytorch-v2.3.1-pbs

  # build pytorch-v2.4.0, also supported:
  export PYTORCH_VERSION=v2.4.0
  make build-pytorch-v2.4.0-pbs
  ```

- Run a sample `torch.distributed` + MPI backend test on 2 GPU nodes:

  ```shell
  # (from a login node)
  # (1) request an interactive PBS session with 2 GPU nodes:
  qsub -I -l select=2:ncpus=64:mpiprocs=4:ngpus=4 -A ${PBS_ACCOUNT} -q main -l walltime=00:30:00

  # (inside PBS)
  # (2) activate the conda environment:
  module load conda
  conda activate ./env-pytorch-v2.4.0-derecho-gcc-12.2.0-cray-mpich-8.1.27

  # (3) run a minimal torch.dist program with the MPI backend:
  mpiexec -n 8 -ppn 4 --cpu-bind numa ./tests/all_reduce_test.py
  ```
The process outlined above will create a minimal `conda` environment in the current directory, containing the `pytorch` build dependencies and the installed version of `pytorch` itself. The package list is defined in `config_env.sh`; users may elect to add packages to the embedded `conda.yaml` file, or later through the typical `conda install` command from within the environment.
- `config_env.sh`: Must be sourced to properly build `pytorch`.
  - Sourcing this file will activate the appropriate `env-pytorch-${PYTORCH_VERSION}-[...]` `conda` environment from the same directory.
    - If the `conda` environment does not exist, sourcing the script will create it. This in turn requires checking out the `pytorch` source tree, as it is needed to properly define the required `conda` build environment.
    - Therefore this script controls the packages added initially to the `conda` environment.
  - Defines environment variables required to build `pytorch`.
  - Creates an activation script `env-pytorch-${PYTORCH_VERSION}-[...]/etc/conda/activate.d/derecho-env_vars.sh` with preferred runtime settings.
  - After installation, the resulting `conda` environments can be activated directly without the need for `config_env.sh`, and should be compatible with the default module environment on Derecho.
    - Re-sourcing this script is not a problem if desired, and will result in the same module environment used to build `pytorch`.
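The `activate.d` mechanism referenced above is standard `conda` behavior: any `*.sh` script placed under `<env>/etc/conda/activate.d/` is sourced automatically on `conda activate`. A minimal sketch of that mechanism (the directory and variable names below are throwaway placeholders, not this repo's actual settings):

```shell
# Throwaway demo of conda's activate.d hook mechanism; the real script here
# is derecho-env_vars.sh and sets Derecho/SS11 runtime tuning variables.
demo_env=$(mktemp -d)                       # stand-in for env-pytorch-${PYTORCH_VERSION}-[...]
mkdir -p "$demo_env/etc/conda/activate.d"
cat > "$demo_env/etc/conda/activate.d/demo-env_vars.sh" <<'EOF'
# placeholder variable; the real file exports preferred runtime settings
export EXAMPLE_RUNTIME_SETTING=1
EOF

# `conda activate` sources every *.sh in activate.d; emulate that by hand:
source "$demo_env/etc/conda/activate.d/demo-env_vars.sh"
echo "EXAMPLE_RUNTIME_SETTING=$EXAMPLE_RUNTIME_SETTING"   # prints: EXAMPLE_RUNTIME_SETTING=1
rm -r "$demo_env"
```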
- `Makefile`: Contains convenient rules for automation and a reproducible process. Uses the environment variables `PYTORCH_VERSION` and `PBS_ACCOUNT`, with sensible defaults for each.
- `patches/${PYTORCH_VERSION}/*`: Any required version-specific patches are located in this directory tree, and are applied in `*`-wildcard order.
- `utils/build_nccl-ofi-plugin.sh`: Builds a compatible NCCL + AWS OFI plugin for use on Derecho with Cray's `libfabric`. Must be updated periodically as the underlying `libfabric` version changes.
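"`*`-wildcard order" means shell glob expansion order, i.e. lexicographic, which is why patch files carry numeric prefixes (`01-`, `02-`, ...). A throwaway demonstration (the directory and file names are invented for illustration; the real patches live under `patches/${PYTORCH_VERSION}/`):

```shell
# Demonstrate that glob expansion is lexicographic, so numeric prefixes
# control patch sequencing. Uses a temporary demo directory, not repo layout.
demo=$(mktemp -d)
touch "$demo/02-second.patch" "$demo/01-first.patch"   # created out of order on purpose

applied=""
for p in "$demo"/*; do                      # glob expands in sorted order
  applied="$applied $(basename "$p")"       # a real loop would run: patch -p1 < "$p"
done
echo "apply order:$applied"                 # prints: apply order: 01-first.patch 02-second.patch
rm -r "$demo"
```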
The `pytorch`-v2 sources only support a CUDA-Aware MPI backend when running under OpenMPI. This is due to some overzealous configuration checks that probe for CUDA support using `MPIX_...` extensions not available with Cray-MPICH, and which are implemented inside `#ifdef OPEN_MPI ...` guards anyway. Where these checks occur, when they fail the code falls back to assuming the MPI is not CUDA-Aware.

Fortunately the fix is fairly straightforward: find all the places these checks occur and instead fall back to assuming MPI is CUDA-Aware. For example, see `patches/v2.3.1/01-cuda-aware-mpi`.