78 changes: 43 additions & 35 deletions cime_config/machines/config_machines.xml
@@ -3603,66 +3603,68 @@
<MAX_MPITASKS_PER_NODE compiler="oneapi-ifxgpu">12</MAX_MPITASKS_PER_NODE>
<PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
<mpirun mpilib="default">
<executable>mpiexec</executable>
<!--executable>numactl -m 2-3 mpiexec</executable--><!--for HBM runs-->
<arguments>
<arg name="total_num_tasks">-np {{ total_tasks }} --label</arg>
<arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
<arg name="ranks_bind">--cpu-bind $ENV{RANKS_BIND}</arg>
<arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS}</arg>
<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>
</arguments>
<executable>mpiexec</executable>
<!--executable>numactl -m 2-3 mpiexec</executable--><!--for HBM runs-->
<arguments>
<arg name="total_num_tasks">-np {{ total_tasks }} --label</arg>
<arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
<arg name="ranks_bind">--cpu-bind $ENV{RANKS_BIND}</arg>
<arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS} $ENV{RLIMITS}</arg>
<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>
Contributor

Shouldn't we remove this since we are defining

--gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1

Member Author

@abagusetty what's the equivalent of OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK on Aurora-mpich?
Kokkos raises

`Warning: unable to detect local MPI rank. Falling back to the first GPU available for execution. Raised by Kokkos::initialize()`

if $MPI_LOCALRANKID env-var is undefined. $PALS_LOCAL_RANKID appears to be empty also.
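For reference, a quick way to see which local-rank variables actually reach each rank is to print them from a trivial command launched under mpiexec. This is only a diagnostic sketch (the -np/-ppn values are placeholders), not part of the change under review:

```sh
# Print the per-rank environment provided by PALS/MPICH; run inside an
# interactive job on the target machine.
mpiexec -np 2 -ppn 2 bash -c \
  'echo "host=$(hostname) PALS_LOCAL_RANKID=$PALS_LOCAL_RANKID MPI_LOCALRANKID=$MPI_LOCALRANKID"'
```

If PALS_LOCAL_RANKID prints but MPI_LOCALRANKID does not, the Kokkos warning above is expected, since the Kokkos version discussed here does not recognize the PALS name.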

Contributor

> Shouldn't we remove this since we are defining
>
> --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1

I am assuming you are asking about the last line. Yes, `<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>` can be removed, since --gpu-bind takes care of binding tiles to MPI processes and doesn't need a script.
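For context, the script being replaced does roughly the following (a simplified sketch; the actual gpu_tile_compact.sh on Aurora may differ in details): it exports ZE_AFFINITY_MASK per local rank, which is the same rank-to-tile mapping the --gpu-bind list above expresses natively.

```sh
#!/bin/bash
# Simplified sketch of a gpu_tile_compact.sh-style wrapper: local rank i
# is pinned to tile (i % 2) of GPU (i / 2), i.e. 0.0, 0.1, 1.0, 1.1, ..., 5.1.
tiles_per_gpu=2
num_gpus=6
rank=${PALS_LOCAL_RANKID:-0}
gpu=$(( (rank / tiles_per_gpu) % num_gpus ))
tile=$(( rank % tiles_per_gpu ))
export ZE_AFFINITY_MASK=${gpu}.${tile}
exec "$@"
```

With --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 the launcher applies the same assignment itself, which is why the wrapper becomes redundant.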

Contributor

@amametjanov Aurora MPICH doesn't have an equivalent, but PALS should work the same: PALS_LOCAL_RANKID. It depends on where you would be using this variable. Not sure why it is empty; I believe these warnings were fixed a while back.

As long as PALS_LOCAL_RANKID is referenced after the mpiexec launch command, it should be defined by PALS. For instance: mpiexec ... $PALS_LOCAL_RANKID $EXE ...
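One way to use this, with a hypothetical file name: a tiny per-rank wrapper placed between mpiexec and the executable can forward PALS_LOCAL_RANKID to the name Kokkos looks for, because the wrapper runs after PALS has populated each rank's environment.

```sh
#!/bin/bash
# local_rank_env.sh (hypothetical name): forward the PALS local rank id to
# the MPICH-style variable that Kokkos recognizes, then exec the real program.
export MPI_LOCALRANKID=${PALS_LOCAL_RANKID}
exec "$@"
```

Launched as `mpiexec -np ... -ppn ... ./local_rank_env.sh ./e3sm.exe`, each rank then sees its own value.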

Member Author

Ok, thanks. Straight removal of the gpu_tile_compact.sh script was raising those warnings. I'll push a commit to export the correct $MPI_LOCALRANKID.

Member

Is there a reason to prefer the gpu-bind argument over the script?

Contributor

--gpu-bind is more of an official MPICH option that handles topology-aware bindings internally, compared to the script. The script was a temporary workaround from the earlier days, when we didn't have a GPU-binding mechanism for Aurora.
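Put side by side, the two approaches look like this (task counts are illustrative; the script path is the one used in this config):

```sh
# Script-based workaround: a wrapper decides the GPU tile for each rank.
mpiexec -np 12 -ppn 12 \
  /lus/flare/projects/E3SM_Dec/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./e3sm.exe

# Native MPICH binding: the launcher assigns tiles itself, topology-aware.
mpiexec -np 12 -ppn 12 \
  --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 ./e3sm.exe
```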

Member Author

Passing `mpiexec ... --genv MPI_LOCALRANKID=${PALS_LOCAL_RANKID} ...` isn't helping: MPI_LOCALRANKID comes out empty.
I added a Kokkos mod that adds PALS_LOCAL_RANKID to the list of recognized env-vars at E3SM-Project/EKAT#372. When that PR makes it into E3SM master, I can remove the call to gpu_tile_compact.sh in a separate PR. If that's okay, then maybe this PR can go without that mod.
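A likely reason the --genv form stays empty: ${PALS_LOCAL_RANKID} is expanded by the shell that builds the mpiexec command, before any rank exists, so every rank inherits the same empty value; reading the variable per rank after launch avoids this.

```sh
# Expanded at launch time, before PALS has set anything -> empty for all ranks:
mpiexec --genv MPI_LOCALRANKID=${PALS_LOCAL_RANKID} ... ./e3sm.exe
# Read per rank after launch (wrapper sketched earlier) -> correct value:
mpiexec ... ./local_rank_env.sh ./e3sm.exe
```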

</arguments>
</mpirun>
<module_system type="module" allow_error="true">
<init_path lang="sh">/usr/share/lmod/lmod/init/sh</init_path>
<init_path lang="csh">/usr/share/lmod/lmod/init/csh</init_path>
<init_path lang="python">/usr/share/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="sh">module</cmd_path>
<cmd_path lang="csh">module</cmd_path>
<cmd_path lang="python">/usr/share/lmod/lmod/libexec/lmod python</cmd_path>
<modules>
<command name="load">cmake/3.30.5</command>
<command name="load">oneapi/release/2025.0.5</command>
</modules>
</module_system>
<RUNDIR>$CIME_OUTPUT_ROOT/$CASE/run</RUNDIR>
<EXEROOT>$CIME_OUTPUT_ROOT/$CASE/bld</EXEROOT>
<MAX_GB_OLD_TEST_DATA>0</MAX_GB_OLD_TEST_DATA>
<environment_variables>
<init_path lang="sh">/usr/share/lmod/lmod/init/sh</init_path>
<init_path lang="csh">/usr/share/lmod/lmod/init/csh</init_path>
<init_path lang="python">/usr/share/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="sh">module</cmd_path>
<cmd_path lang="csh">module</cmd_path>
<cmd_path lang="python">/usr/share/lmod/lmod/libexec/lmod python</cmd_path>
<modules>
<command name="load">cmake/3.30.5</command>
<command name="load">oneapi/release/2025.0.5</command>
<command name="load">mpich-config/collective-tuning/1024</command>
</modules>
</module_system>
<RUNDIR>$CIME_OUTPUT_ROOT/$CASE/run</RUNDIR>
<EXEROOT>$CIME_OUTPUT_ROOT/$CASE/bld</EXEROOT>
<MAX_GB_OLD_TEST_DATA>0</MAX_GB_OLD_TEST_DATA>
<environment_variables>
<env name="NETCDF_PATH">/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002</env>
<env name="PNETCDF_PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002</env>
<env name="LD_LIBRARY_PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002/lib:/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002/lib:$ENV{LD_LIBRARY_PATH}</env>
<env name="PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002/bin:/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002/bin:$ENV{PATH}</env>
<env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
<env name="FI_CXI_CQ_FILL_PERCENT">20</env>
<env name="RLIMITS"> </env>
</environment_variables>
<environment_variables compiler="oneapi-ifxgpu">
<env name="ONEAPI_DEVICE_SELECTOR">level_zero:gpu</env>
<env name="MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE"></env>
<env name="MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE"></env>
<env name="MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE"></env>
<env name="UR_L0_USE_DRIVER_INORDER_LISTS">1</env>
<env name="UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS">1</env>
<env name="UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE">1</env>
<!--<env name="FI_PROVIDER">cxi</env>-->
<env name="FI_MR_CACHE_MONITOR">disabled</env>
<env name="FI_MR_CACHE_MONITOR">disabled</env>
<env name="FI_CXI_OVFLOW_BUF_SIZE">8388608</env>
<env name="PALS_PING_PERIOD">240</env>
<env name="PALS_RPC_TIMEOUT">240</env>
<env name="SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE">1</env>
<env name="SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR">1</env>
<env name="SYCL_PI_LEVEL_ZERO_USM_RESIDENT">0x001</env>
<env name="UR_L0_USE_DRIVER_INORDER_LISTS">1</env>
<env name="UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE">1</env>

<env name="MPIR_CVAR_ENABLE_GPU">1</env>
<env name="MPIR_CVAR_ENABLE_GPU">1</env>
<env name="romio_cb_read">disable</env>
<env name="romio_cb_write">disable</env>
<env name="SYCL_CACHE_PERSISTENT">1</env>
<env name="GATOR_INITIAL_MB">4000MB</env>
<env name="GATOR_DISABLE">0</env>
<env name="GPU_TILE_COMPACT">/lus/flare/projects/E3SM_Dec/tools/mpi_wrapper_utils/gpu_tile_compact.sh</env>
<env name="RANKS_BIND">list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1</env>
<env name="ZES_ENABLE_SYSMAN">1</env>
<!-- default is ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE: enable this to run 4 MPI/tile or 48 MPI/node
<env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4:4:4,5:4</env>-->
<!-- <env name="ZE_FLAT_DEVICE_HIERARCHY">FLAT</env>
<env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4:4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4</env>-->
<!-- default is ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE: enable this to run 4 MPI/tile or 48 MPI/node
<env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4:4:4,5:4</env>-->
<!-- <env name="ZE_FLAT_DEVICE_HIERARCHY">FLAT</env>
<env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4:4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4</env>-->
</environment_variables>
<environment_variables compiler="oneapi-ifx">
<env name="LIBOMPTARGET_DEBUG">0</env><!--default 0, max 5 -->
Expand All @@ -3675,6 +3677,12 @@
<env name="KMP_AFFINITY">granularity=core,balanced</env>
<env name="OMP_STACKSIZE">128M</env>
</environment_variables>
<environment_variables DEBUG="TRUE">
<env name="RLIMITS">--rlimits CORE</env>
</environment_variables>
<resource_limits DEBUG="TRUE">
<resource name="RLIMIT_CORE">-1</resource>
</resource_limits>
<resource_limits>
<resource name="RLIMIT_STACK">-1</resource>
</resource_limits>
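For reference, the DEBUG=TRUE additions above amount to roughly the following (the expansion into the launch command is inferred from the threads_per_rank argument and should be treated as a sketch): RLIMITS injects --rlimits CORE into the mpiexec line so the core-file limit is propagated to the ranks, and RLIMIT_CORE=-1 lifts that limit for the job, the shell equivalent of ulimit -c unlimited, so a crashing debug run can write core files.

```sh
# With DEBUG=TRUE the launch line effectively gains the extra flag:
#   mpiexec -np ... -ppn ... --cpu-bind ... -d $OMP_NUM_THREADS --rlimits CORE ...
# and the job runs with an unlimited core size:
ulimit -c unlimited   # shell equivalent of <resource name="RLIMIT_CORE">-1</resource>
```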