
[Bug]: Check for CUDA-aware MPI might fail #1787

Open
@mrfh92

Description

What happened?

We check the availability of CUDA-aware MPI as follows:

import os
import subprocess

CUDA_AWARE_MPI = False
# check whether Open MPI was built with CUDA support
if "openmpi" in os.environ.get("MPI_SUFFIX", "").lower():
    buffer = subprocess.check_output(["ompi_info", "--parsable", "--all"])
    CUDA_AWARE_MPI = b"mpi_built_with_cuda_support:value:true" in buffer
# MVAPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MV2_USE_CUDA") == "1"
# MPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
# ParaStationMPI
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"

On some systems I am using, MPI_SUFFIX is empty although Open MPI is installed (and used by Heat). In such cases the automatic check fails and one has to set heat.CUDA_AWARE_MPI = True manually.
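
For reference, the manual workaround on such a system is a one-liner (assuming the flag lives at the top-level heat namespace, as written above):

import heat

# force-enable the flag when the automatic detection misses a CUDA-aware build
heat.CUDA_AWARE_MPI = True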

Questions

  • Is that a bug in our code, or a bug in the systems that leave MPI_SUFFIX empty?
  • If the first applies, how do we find a catch-all version of our check? See the sketch after this list.
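
One possible direction for a catch-all check, sketched below: classify the MPI implementation from MPI.Get_library_version() (exposed by mpi4py, which Heat already uses) instead of the MPI_SUFFIX environment variable, which not every system sets. The helper name _detect_cuda_aware_mpi and the version-string substrings are assumptions that would need to be verified against the implementations we support:

import os
import subprocess
from mpi4py import MPI


def _detect_cuda_aware_mpi() -> bool:
    # MPI_Get_library_version is reported by the MPI library itself,
    # so it works even when MPI_SUFFIX is empty
    lib = MPI.Get_library_version()
    if "Open MPI" in lib:
        try:
            info = subprocess.check_output(["ompi_info", "--parsable", "--all"])
            return b"mpi_built_with_cuda_support:value:true" in info
        except (OSError, subprocess.CalledProcessError):
            return False
    # keep the environment-variable hints as fallbacks for the others;
    # ParaStation MPI is MPICH-based, so test it before plain MPICH
    if "MVAPICH" in lib:
        return os.environ.get("MV2_USE_CUDA") == "1"
    if "ParaStation" in lib:
        return os.environ.get("PSP_CUDA") == "1"
    if "MPICH" in lib:
        return os.environ.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
    return False

This keeps the existing per-implementation heuristics intact and only replaces the detection step; whether every target system reports a usable version string is exactly what would need testing.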

Version

main (development branch)

Labels

MPI (Anything related to MPI communication), bug (Something isn't working), communication
