Load recommended mpich-config modules and env-vars on Aurora #7399
Changes from 5 commits: 1145d45, 5b276d8, e75d3fe, b0a2a02, 429b10f, e072fff
@@ -3603,66 +3603,68 @@
 <MAX_MPITASKS_PER_NODE compiler="oneapi-ifxgpu">12</MAX_MPITASKS_PER_NODE>
 <PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
 <mpirun mpilib="default">
   <executable>mpiexec</executable>
   <!--executable>numactl -m 2-3 mpiexec</executable--><!--for HBM runs-->
   <arguments>
     <arg name="total_num_tasks">-np {{ total_tasks }} --label</arg>
     <arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
     <arg name="ranks_bind">--cpu-bind $ENV{RANKS_BIND}</arg>
-    <arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS}</arg>
+    <arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS} $ENV{RLIMITS}</arg>
     <arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>
Contributor: Shouldn't we remove this since we are defining
Author (Member): @abagusetty what's the equivalent of OpenMPI's if
Contributor: I am assuming you are asking for the last line. Yes, for
Contributor: @amametjanov Aurora MPICH doesn't have an equivalent. But PALS should work the same: as long as
Author (Member): Ok, thanks. Straight removal of
Member: Is there a reason to prefer the gpu-bind argument over the script?
Author (Member): Passing
     </arguments>
   </mpirun>
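To make the `<arguments>` template above concrete, here is a sketch of how it could expand into a full `mpiexec` command line. All values below (task counts, the binding list, the wrapper path) are hypothetical examples, not Aurora defaults; `RLIMITS` expands to an empty string unless the `DEBUG="TRUE"` block later in this diff sets it.

```python
# Sketch: render the mpirun <arguments> template with example values.
# Every value here is a hypothetical placeholder, not an Aurora default.
template = (
    "mpiexec -np {total_tasks} --label -ppn {tasks_per_node} "
    "--cpu-bind {RANKS_BIND} -d {OMP_NUM_THREADS} {RLIMITS} {GPU_TILE_COMPACT}"
)

cmd = template.format(
    total_tasks=24,
    tasks_per_node=12,
    RANKS_BIND="list:1-8:9-16",        # truncated example of the binding list
    OMP_NUM_THREADS=8,
    RLIMITS="--rlimits CORE",          # empty unless DEBUG="TRUE"
    GPU_TILE_COMPACT="/path/to/gpu_tile_compact.sh",
)
print(cmd)
```

Because `RLIMITS` is spliced into the `threads_per_rank` argument, leaving it set to a single space (as the non-debug default below does) keeps the rendered command line valid.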
   <module_system type="module" allow_error="true">
     <init_path lang="sh">/usr/share/lmod/lmod/init/sh</init_path>
     <init_path lang="csh">/usr/share/lmod/lmod/init/csh</init_path>
     <init_path lang="python">/usr/share/lmod/lmod/init/env_modules_python.py</init_path>
     <cmd_path lang="sh">module</cmd_path>
     <cmd_path lang="csh">module</cmd_path>
     <cmd_path lang="python">/usr/share/lmod/lmod/libexec/lmod python</cmd_path>
     <modules>
       <command name="load">cmake/3.30.5</command>
       <command name="load">oneapi/release/2025.0.5</command>
+      <command name="load">mpich-config/collective-tuning/1024</command>
     </modules>
   </module_system>
   <RUNDIR>$CIME_OUTPUT_ROOT/$CASE/run</RUNDIR>
   <EXEROOT>$CIME_OUTPUT_ROOT/$CASE/bld</EXEROOT>
   <MAX_GB_OLD_TEST_DATA>0</MAX_GB_OLD_TEST_DATA>
   <environment_variables>
     <env name="NETCDF_PATH">/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002</env>
     <env name="PNETCDF_PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002</env>
     <env name="LD_LIBRARY_PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002/lib:/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002/lib:$ENV{LD_LIBRARY_PATH}</env>
     <env name="PATH">/lus/flare/projects/E3SM_Dec/soft/pnetcdf/1.14.0/oneapi.eng.2024.07.30.002/bin:/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002/bin:$ENV{PATH}</env>
     <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
     <env name="FI_CXI_CQ_FILL_PERCENT">20</env>
+    <env name="RLIMITS"> </env>
   </environment_variables>
   <environment_variables compiler="oneapi-ifxgpu">
     <env name="ONEAPI_DEVICE_SELECTOR">level_zero:gpu</env>
     <env name="MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE"></env>
     <env name="MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE"></env>
     <env name="MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE"></env>
     <env name="UR_L0_USE_DRIVER_INORDER_LISTS">1</env>
     <env name="UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS">1</env>
     <env name="UR_L0_USE_COPY_ENGINE_FOR_IN_ORDER_QUEUE">1</env>
     <!--<env name="FI_PROVIDER">cxi</env>-->
     <env name="FI_MR_CACHE_MONITOR">disabled</env>
     <env name="FI_CXI_OVFLOW_BUF_SIZE">8388608</env>
     <env name="PALS_PING_PERIOD">240</env>
     <env name="PALS_RPC_TIMEOUT">240</env>
     <env name="SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE">1</env>
     <env name="SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR">1</env>
     <env name="SYCL_PI_LEVEL_ZERO_USM_RESIDENT">0x001</env>
     <env name="MPIR_CVAR_ENABLE_GPU">1</env>
     <env name="romio_cb_read">disable</env>
     <env name="romio_cb_write">disable</env>
     <env name="SYCL_CACHE_PERSISTENT">1</env>
     <env name="GATOR_INITIAL_MB">4000MB</env>
     <env name="GATOR_DISABLE">0</env>
     <env name="GPU_TILE_COMPACT">/lus/flare/projects/E3SM_Dec/tools/mpi_wrapper_utils/gpu_tile_compact.sh</env>
     <env name="RANKS_BIND">list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1</env>
     <env name="ZES_ENABLE_SYSMAN">1</env>
     <!-- default is ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE: enable this to run 4 MPI/tile or 48 MPI/node
     <env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4,4:4,5:4</env>-->
     <!-- <env name="ZE_FLAT_DEVICE_HIERARCHY">FLAT</env>
     <env name="ZEX_NUMBER_OF_CCS">0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4</env>-->
   </environment_variables>
   <environment_variables compiler="oneapi-ifx">
     <env name="LIBOMPTARGET_DEBUG">0</env><!--default 0, max 5 -->

@@ -3675,6 +3677,12 @@
     <env name="KMP_AFFINITY">granularity=core,balanced</env>
     <env name="OMP_STACKSIZE">128M</env>
   </environment_variables>
+  <environment_variables DEBUG="TRUE">
+    <env name="RLIMITS">--rlimits CORE</env>
+  </environment_variables>
+  <resource_limits DEBUG="TRUE">
+    <resource name="RLIMIT_CORE">-1</resource>
+  </resource_limits>
   <resource_limits>
     <resource name="RLIMIT_STACK">-1</resource>
   </resource_limits>
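The `<resource_limits>` entries are applied through Python's `resource` module, where a value of `-1` corresponds to "unlimited" (`resource.RLIM_INFINITY`). A minimal sketch of the idea (not CIME's actual code), assuming an unprivileged process that can only raise its soft limit up to the current hard limit:

```python
import resource

def apply_core_limit(value: int = -1):
    """Apply an RLIMIT_CORE value as a <resource_limits> entry would:
    -1 means unlimited; otherwise clamp to the current hard limit."""
    want = resource.RLIM_INFINITY if value == -1 else value
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may only raise the soft limit up to the hard limit.
    if hard != resource.RLIM_INFINITY and (
        want == resource.RLIM_INFINITY or want > hard
    ):
        want = hard
    resource.setrlimit(resource.RLIMIT_CORE, (want, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

# Mirror the DEBUG="TRUE" entry above: request unlimited core dumps.
soft, hard = apply_core_limit(-1)
```

Enabling core dumps only in debug builds, together with `--rlimits CORE` so PALS propagates the limit to the launched ranks, keeps production runs from scattering large core files across the filesystem.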
Occasionally, a job submission's stdout returns more output beyond the job's id. This regex update extracts the job id from such longer strings so that CIME can continue with its job stages.
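A hypothetical sketch of the job-id extraction this comment describes (the real pattern lives in CIME's batch configuration; the server names below are made up): instead of assuming `qsub`'s stdout is only the job id, search for the `<number>.<server>` token anywhere in the output.

```python
import re

# Match a PBS-style job id ("<number>.<server>") anywhere in the output,
# rather than anchoring to the whole line. Illustrative only.
JOBID_RE = re.compile(r"(\d+\.\S+)")

def parse_jobid(stdout: str):
    """Return the first PBS-style job id found in stdout, or None."""
    m = JOBID_RE.search(stdout)
    return m.group(1) if m else None

# Bare output: stdout is exactly the job id.
plain = "12345.pbs-server-01.example"
# Longer output: extra banner lines surround the job id.
noisy = "submission banner text\n12345.pbs-server-01.example\ntrailing notice"
```

With the anchored form, the `noisy` case would fail to match and CIME would abort the submit; the search form recovers the same id from both.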