Open
Description
What happened?
We check availability of CUDA-aware MPI as follows:
CUDA_AWARE_MPI = False
# check whether OpenMPI support CUDA-aware MPI
if "openmpi" in os.environ.get("MPI_SUFFIX", "").lower():
    buffer = subprocess.check_output(["ompi_info", "--parsable", "--all"])
    CUDA_AWARE_MPI = b"mpi_built_with_cuda_support:value:true" in buffer
# MVAPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MV2_USE_CUDA") == "1"
# MPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
# ParaStationMPI
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"
On some of the systems I use, MPI_SUFFIX is empty although OpenMPI is installed (and used by Heat). In those cases the automatic check does not work, because the ompi_info query is skipped, and one has to set heat.CUDA_AWARE_MPI = True manually.
Questions
- Is this a bug in our code, or a bug in the systems that have an empty MPI_SUFFIX?
- If the former, how can we find a catch-all version of our check?
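One possible direction for the second question, as a rough sketch rather than a tested fix: stop gating the OpenMPI query on MPI_SUFFIX and instead run ompi_info whenever it is found on PATH. The helper names below (ompi_info_reports_cuda, query_ompi_info, detect_cuda_aware_mpi) are hypothetical and not part of Heat. The sketch also assumes that the ompi_info found on PATH belongs to the MPI installation Heat is actually linked against, which may not hold on systems with several MPI modules loaded.

```python
import os
import shutil
import subprocess


def ompi_info_reports_cuda(output: bytes) -> bool:
    """Return True if `ompi_info --parsable --all` output reports CUDA support."""
    return b"mpi_built_with_cuda_support:value:true" in output


def query_ompi_info() -> bytes:
    """Run ompi_info if it is on PATH; return b"" if it is missing or fails."""
    exe = shutil.which("ompi_info")
    if exe is None:
        return b""
    try:
        return subprocess.check_output(
            [exe, "--parsable", "--all"], stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        return b""


def detect_cuda_aware_mpi(environ=os.environ, ompi_output=None) -> bool:
    """Hypothetical replacement check that does not rely on MPI_SUFFIX.

    `ompi_output` can be passed in explicitly (e.g. in tests); if it is
    None, ompi_info is queried directly.
    """
    if ompi_output is None:
        ompi_output = query_ompi_info()
    # OpenMPI: trust ompi_info regardless of MPI_SUFFIX
    if ompi_info_reports_cuda(ompi_output):
        return True
    # MVAPICH, MPICH, ParaStationMPI: same environment variables as before
    env_flags = ("MV2_USE_CUDA", "MPIR_CVAR_ENABLE_HCOLL", "PSP_CUDA")
    return any(environ.get(var) == "1" for var in env_flags)
```

With this, CUDA_AWARE_MPI = detect_cuda_aware_mpi() would cover systems where MPI_SUFFIX is unset, and a missing ompi_info binary no longer raises. Whether the PATH lookup is reliable enough on all target clusters would still need to be checked.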
Code snippet triggering the error
Error message or erroneous outcome
Version
main (development branch)
Python version
None
PyTorch version
None