
Conversation

@andrewdnolan
Collaborator

@andrewdnolan andrewdnolan commented Nov 4, 2025

With #303, setting the mod_env_commands variable returned by get_modules_env_vars_and_mpi_compilers was broken.

None of the module or environment variable commands were parsed, which was causing an MPI linking issue with mpi4py on (at least) pm-cpu and compy with the deployment of e3sm-unified 1.12.0rc3.
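
For context, here is a minimal sketch (not mache's actual deployment code; the script contents, file name, and pip invocation are illustrative, and the middle two return values are assumed to be the C++ and Fortran wrappers) of how a build script consumes what this function returns, and why an empty mod_env_commands breaks the mpi4py link step:

from mache.spack.env import get_modules_env_vars_and_mpi_compilers

mpicc, mpicxx, mpifort, mod_env_commands = \
    get_modules_env_vars_and_mpi_compilers('pm-cpu', 'gnu', 'mpich', 'sh')

# the module loads and exports have to run before mpi4py is compiled so that
# its extension links against the system (Cray) MPICH
script = '\n'.join([
    '#!/bin/sh',
    mod_env_commands,
    f'MPICC={mpicc} python -m pip install --no-binary=mpi4py mpi4py',
])
with open('build_mpi4py.sh', 'w') as handle:  # hypothetical file name
    handle.write(script)

With mod_env_commands empty, the module loads never happen, so the build picks up whatever MPI is on the default path instead of the one the modules would provide.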

Checklist

  • PR description includes a summary and any relevant issue references
  • Testing comment, if appropriate, in the PR documents testing used to verify the changes

to set the module and environment variable commands variable
@andrewdnolan
Collaborator Author

Demonstration of the issue:

Here's a printout of what get_modules_env_vars_and_mpi_compilers returns for various versions of mache. The same command was run for all versions:

from mache.spack.env import get_modules_env_vars_and_mpi_compilers
mpicc, _, _, modules = get_modules_env_vars_and_mpi_compilers("pm-cpu", "gnu", "mpich", "sh")
print(modules)

2.0.0rc3

module purge

if [ -z "${NERSC_HOST:-}" ]; then
  # happens when building spack environment
  export NERSC_HOST="perlmutter"
fi

1.32.0

module rm cpe \
          cray-hdf5-parallel \
          cray-netcdf-hdf5parallel \
          cray-parallel-netcdf \
          PrgEnv-gnu \
          PrgEnv-intel \
          PrgEnv-nvidia \
          PrgEnv-cray \
          PrgEnv-aocc \
          gcc-native \
          intel \
          intel-oneapi \
          cudatoolkit \
          climate-utils \
          cray-libsci \
          matlab \
          craype-accel-nvidia80 \
          craype-accel-host \
          perftools-base \
          perftools \
          darshan \
          cray-mpich &> /dev/null

module load PrgEnv-gnu/8.5.0 \
            gcc-native/12.3 \
            cray-libsci/23.12.5 \
            craype-accel-host \
            craype/2.7.30 \
            libfabric/1.22.0 \
            cray-mpich/8.1.28 \
            cmake/3.24.3
export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1
export MPICH_MPIIO_DVS_MAXNODES=1
## purposefully omitting OMP variables that cause trouble in ESMF
# export OMP_STACKSIZE=128M
# export OMP_PROC_BIND=spread
# export OMP_PLACES=threads
export HDF5_USE_FILE_LOCKING=FALSE
## Not needed
# export PERL5LIB=/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch
export FI_MR_CACHE_MONITOR=kdreg2

if [ -z "${NERSC_HOST:-}" ]; then
  # happens when building spack environment
  export NERSC_HOST="perlmutter"
fi
export MPICH_COLL_SYNC=MPI_Bcast
export GATOR_INITIAL_MB=4000MB
export BLA_VENDOR=Generic

@andrewdnolan
Collaborator Author

(Initial) Testing

Running the same command as above with this branch produces:

source /usr/share/lmod/8.3.1/init/sh

module rm \
    cpe \
    PrgEnv-gnu \
    PrgEnv-intel \
    PrgEnv-nvidia \
    PrgEnv-cray \
    PrgEnv-aocc \
    gcc-native \
    intel \
    intel-oneapi \
    nvidia \
    aocc \
    cudatoolkit \
    climate-utils \
    cray-libsci \
    matlab \
    craype-accel-nvidia80 \
    craype-accel-host \
    perftools-base \
    perftools \
    darshan &> /dev/null

module load \
    PrgEnv-gnu/8.5.0 \
    gcc-native/13.2 \
    cray-libsci/24.07.0 \
    craype-accel-host \
    craype/2.7.32 \
    cray-mpich/8.1.30 \
    cmake/3.30.2

export MPICH_ENV_DISPLAY="1"
export MPICH_VERSION_DISPLAY="1"
export MPICH_MPIIO_DVS_MAXNODES="1"
export HDF5_USE_FILE_LOCKING="FALSE"
export FI_MR_CACHE_MONITOR="kdreg2"
export MPICH_COLL_SYNC="MPI_Bcast"
export GATOR_INITIAL_MB="4000MB"
export LD_LIBRARY_PATH="${CRAY_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}"
export MPICH_SMP_SINGLE_COPY_MODE="CMA"
export PKG_CONFIG_PATH="/global/cfs/cdirs/e3sm/3rdparty/protobuf/21.6/gcc-native-12.3/lib/pkgconfig:${PKG_CONFIG_PATH}"
export BLA_VENDOR="Generic"

if [ -z "${NERSC_HOST:-}" ]; then
  # happens when building spack environment
  export NERSC_HOST="perlmutter"
fi

I'll work on deploying with this to see if it fixes our mpi4py linking issue.
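
(As a quick check once the environment is re-deployed, something like the following, with illustrative commands, should show whether mpi4py's compiled extension actually resolves against Cray MPICH rather than a conda-provided MPI:)

import subprocess
from mpi4py import MPI

# hedged verification sketch: exact library names are system-dependent
print(MPI.Get_library_version())   # expect "CRAY MPICH ..." on pm-cpu
extension = MPI.__file__           # path to mpi4py's compiled extension
print(subprocess.run(['ldd', extension], capture_output=True, text=True).stdout)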

@andrewdnolan
Collaborator Author

andrewdnolan commented Nov 4, 2025

Testing (Perlmutter)

This appears to fix the issue related to linking MPI and mpi4py on Perlmutter. Using the commit from this branch cherry-picked onto update-to-2.0.0, I re-deployed 1.12.0rc3 and I'm now able to run:

salloc --nodes 1 --qos interactive --time 00:15:00 --constraint cpu --account e3sm

source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.12.0rc3_pm-cpu.sh

python -c "from mpi4py import MPI"

Output:

PE 0: MPICH processor detected:
PE 0:   AMD Milan (25:1:1) (family:model:stepping)
MPI VERSION    : CRAY MPICH version 8.1.30.8 (ANL base 3.4a2)
MPI BUILD INFO : Sat Jun 01  4:44 2024 (git hash 69863f7) (CH4)
PE 0: MPICH environment settings =====================================
PE 0:   MPICH_ENV_DISPLAY                              = 1
PE 0:   MPICH_VERSION_DISPLAY                          = 1
PE 0:   MPICH_ABORT_ON_ERROR                           = 0
PE 0:   MPICH_CPUMASK_DISPLAY                          = 0
PE 0:   MPICH_STATS_DISPLAY                            = 0
PE 0:   MPICH_RANK_REORDER_METHOD                      = 1
PE 0:   MPICH_RANK_REORDER_DISPLAY                     = 0
PE 0:   MPICH_MEMCPY_MEM_CHECK                         = 0
PE 0:   MPICH_USE_SYSTEM_MEMCPY                        = 0
PE 0:   MPICH_OPTIMIZED_MEMCPY                         = 1
PE 0:   MPICH_ALLOC_MEM_PG_SZ                          = 4096
PE 0:   MPICH_ALLOC_MEM_POLICY                         = PREFERRED
PE 0:   MPICH_ALLOC_MEM_AFFINITY                       = SYS_DEFAULT
PE 0:   MPICH_MALLOC_FALLBACK                          = 0
PE 0:   MPICH_MEM_DEBUG_FNAME                          =
PE 0:   MPICH_INTERNAL_MEM_AFFINITY                    = SYS_DEFAULT
PE 0:   MPICH_NO_BUFFER_ALIAS_CHECK                    = 0
PE 0:   MPICH_COLL_SYNC                                = MPI_Bcast
PE 0:   MPICH_SINGLE_HOST_ENABLED                      = 1
PE 0:   MPICH_USE_PERSISTENT_TOPS                      = 0
PE 0:   MPICH_DISABLE_PERSISTENT_RECV_TOPS             = 0
PE 0:   MPICH_MAX_TOPS_COUNTERS                        = 0
PE 0:   MPICH_ENABLE_ACTIVE_WAIT                       = 0
....

Collaborator

@xylar xylar left a comment

@andrewdnolan, this makes a ton of sense. I clearly didn't think about this and I'm sorry I missed it.

@andrewdnolan
Copy link
Collaborator Author

@xylar I was curious how I was able to successfully deploy (and test) e3smu_1_12_0rc3 on any machine with this functionality missing. I know I had tested successfully on Aurora.

Starting a fresh shell on Aurora:

$ which mpicc
/opt/aurora/25.190.0/spack/unified/0.10.1/install/linux-sles15-x86_64/oneapi-2025.2.0/mpich-develop-git.6037a7a-cym6jg6/bin/mpicc

Within e3smu_1_12_0rc3:

$ source /lus/flare/projects/E3SMinput/soft/e3sm-unified/test_e3sm_unified_1.12.0rc3_aurora.sh
$ which mpicc
/opt/aurora/25.190.0/spack/unified/0.10.1/install/linux-sles15-x86_64/oneapi-2025.2.0/mpich-develop-git.6037a7a-cym6jg6/bin/mpicc

So, I was able to test successfully on Aurora because the default modules are the same as what's in unified. Even without loading them as part of the build script, we were able to properly link MPI and mpi4py. I'm going to assume it's a similar situation on the other machines where testing succeeded without this functionality.

@andrewdnolan andrewdnolan added the bug Something isn't working label Nov 4, 2025
@andrewdnolan andrewdnolan merged commit 0a08161 into E3SM-Project:main Nov 4, 2025
5 checks passed
@andrewdnolan andrewdnolan deleted the fix-setting-mod-env_commands branch November 4, 2025 16:14