
Conversation

@amametjanov (Member) commented May 30, 2025

Also,

  • set env-vars and export an unlimited core-file-size limit to mpiexec in debug runs (see the sketch below)
  • update jobid_pattern to refine job-id extraction from non-standard PBS output
  • clean up tabs

[BFB]
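A minimal sketch of the first bullet's intent, assuming a POSIX shell in the batch script; this only illustrates the idea, while the PR's actual mechanism threads an $RLIMITS env-var into the mpiexec arguments (see the <arg> entries quoted later in this thread):

  # Raise the core-file-size limit in the shell that runs mpiexec so that
  # crashing ranks can write core files for post-mortem debugging.
  ulimit -c unlimited
  # Whether remote ranks inherit the raised limit depends on the launcher,
  # hence the env-var/export plumbing this PR adds for debug runs.
  mpiexec -np 8 -ppn 8 ./e3sm.exe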

@amametjanov self-assigned this May 30, 2025
@amametjanov added the Machine Files, BFB (PR leaves answers BFB), and Aurora (ALCF machine) labels May 30, 2025
@amametjanov (Member Author) commented May 30, 2025

Testing:

  • SYPD throughput of base on master and test on this branch:
nodes   base SYPD  scaling   test SYPD  scaling   test/base
  256     0.123     1.00       0.128     1.00       1.04
  512     0.193     1.57       0.238     1.86       1.23
 1024     0.264     2.15       0.376     2.94       1.42
 2048     0.213     1.73       0.399     3.12       1.87

(scaling is SYPD relative to the same branch's 256-node run; test/base compares test SYPD to base SYPD at the same node count)
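A quick (hypothetical) recomputation of two of the ratio columns, as a sanity check of the table's definitions:

  awk 'BEGIN { printf "%.2f %.2f\n", 0.376/0.128, 0.399/0.213 }'
  # prints: 2.94 1.87  (test scaling at 1024 nodes; test/base at 2048 nodes)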

Case-dirs:

/lus/flare/projects/E3SM_Dec/azamatm/scratch/benchPR7399/

@amametjanov changed the title from "Load recommended mpich-config modules and env-vars" to "Load recommended mpich-config modules and env-vars on Aurora" May 30, 2025
@amametjanov marked this pull request as ready for review June 3, 2025 01:48
<arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
<arg name="ranks_bind">--cpu-bind $ENV{RANKS_BIND}</arg>
<arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS} $ENV{RLIMITS}</arg>
<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>
Contributor:

Shouldn't we remove this since we are defining

--gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1

@amametjanov (Member Author):

@abagusetty what's the equivalent of OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK on Aurora-mpich?
Kokkos raises

`Warning: unable to detect local MPI rank. Falling back to the first GPU available for execution. Raised by Kokkos::initialize()`

if the $MPI_LOCALRANKID env-var is undefined. $PALS_LOCAL_RANKID appears to be empty as well.

Contributor:

> Shouldn't we remove this since we are defining
> --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1

I am assuming you are asking about the last line. Yes, <arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg> can be removed, since --gpu-bind takes care of binding tiles to MPI processes and doesn't need a script.

Contributor:

@amametjanov Aurora MPICH doesn't have an equivalent, but PALS should work the same: PALS_LOCAL_RANKID. It depends on where you would be using this variable; not sure why it is empty. I believe these warnings were fixed a while back.

As long as PALS_LOCAL_RANKID is referenced after the mpiexec launch command, PALS should have it defined. For instance: mpiexec ... $PALS_LOCAL_RANKID $EXE ...

@amametjanov (Member Author):

Ok, thanks. Straight removal of the gpu_tile_compact.sh script was raising those warnings. I'll push a commit to export the correct $MPI_LOCALRANKID appropriately.
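A hedged sketch of what that export could look like, following the mechanism described above (the wrapper name and placement are assumptions in the spirit of gpu_tile_compact.sh, not the actual commit):

  #!/bin/sh
  # set_local_rank.sh (hypothetical): PALS sets PALS_LOCAL_RANKID in each
  # rank's environment only after launch, so a per-rank wrapper placed
  # between mpiexec and the executable can re-export it under the name
  # Kokkos recognizes, then exec the real binary.
  export MPI_LOCALRANKID=${PALS_LOCAL_RANKID}
  exec "$@"

invoked as mpiexec ... ./set_local_rank.sh ./e3sm.exe.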

Member:

Is there a reason to prefer the gpu-bind argument over the script?

Contributor:

--gpu-bind is more of an official MPICH option: it handles topology-aware bindings internally rather than via the script. The script was a temporary workaround from the early days, when we didn't have a GPU-binding mechanism for Aurora.
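For reference, a hedged reading of the list syntax quoted earlier: each colon-separated entry is gpu.tile, and local rank i is bound to entry i. The rank counts and executable below are illustrative:

  # 12 ranks per node across 6 two-tile GPUs:
  mpiexec -np 12 -ppn 12 \
          --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 \
          ./e3sm.exe
  # local rank 0 -> GPU 0 tile 0, rank 1 -> GPU 0 tile 1, ... rank 11 -> GPU 5 tile 1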

@amametjanov (Member Author):

Passing mpiexec ... --genv MPI_LOCALRANKID=${PALS_LOCAL_RANKID} ... isn't helping: MPI_LOCALRANKID comes out empty, presumably because the shell expands ${PALS_LOCAL_RANKID} at submit time, before PALS has set it for each rank.
I added a Kokkos mod adding PALS_LOCAL_RANKID to the list of recognized env-vars at E3SM-Project/EKAT#372. When that PR makes it into E3SM master, I can remove the call to gpu_tile_compact.sh in a separate PR. If that's okay, then maybe this PR can go in without that mod.


<batch_system MACH="aurora" type="pbspro">
<batch_submit>/lus/flare/projects/E3SM_Dec/tools/qsub/throttle</batch_submit>
<jobid_pattern>(\d+)\.aurora-pbs</jobid_pattern>
@amametjanov (Member Author):

Occasionally, a job submission's stdout returns more output than just the job id, e.g.:

ERROR: Couldn't match jobid_pattern '^(\d+)' within submit output:
 'auth: error returned: 15007
auth: Failed to receive auth token
No Permission.
qstat: cannot connect to server aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov (errno=15007)
5435693.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov'

This regex update extracts the job id from such longer strings, letting CIME continue with its job stages.
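A quick sanity check of the new pattern (a hypothetical one-liner, not part of the PR) against the noisy output above:

  printf '%s\n' 'auth: error returned: 15007' \
      '5435693.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov' \
    | sed -nE 's/^([0-9]+)\.aurora-pbs.*/\1/p'
  # prints: 5435693 (the digits that jobid_pattern's (\d+) group captures)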

@rljacob (Member) commented Jun 10, 2025

Is this ready?

@amametjanov (Member Author):

Yes, just waiting for somebody to approve before starting the merge. :)

amametjanov added a commit that referenced this pull request Jun 11, 2025
Load recommended mpich-config modules and env-vars on Aurora

Also,
- set env-vars and export to mpiexec unlimited core file size limit in
  debug runs
- update `jobid_pattern` to refine job-id extraction from non-standard
  PBS output
- cleanup tabs

[BFB]
@amametjanov (Member Author) left a comment:

This is a temporary workaround to avoid low node-count cases running into a seg-fault, which can be reproduced on 4 nodes with

./cime/scripts/create_test SMS.ne30pg2_EC30to60E2r2.WCYCLXX2010 --mpilib mpich1024

The module is still needed for scaling on 256+ nodes, and such runs will need to add the extra argument --mpilib mpich1024:

./cime/scripts/create_[newcase,test] --mpilib mpich1024 ...

@rljacob (Member) commented Jun 11, 2025

mpich-collectives can only be used on jobs with 256 or more nodes?

@amametjanov (Member Author):

Yes, for now. @abagusetty is looking to see if the tuned collectives config can be used at low node-counts too (e.g. 4 nodes).

@rljacob (Member) commented Jun 11, 2025

Another Q: do our other libraries like pnetcdf, adios need to be built against mpich-collectives?

@amametjanov (Member Author):

No, the collectives module only sets run-time env-vars:

MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE=/opt/aurora/24.347.0/updates/mpich/tuning/20230818-1024/CH4_coll_tuning.json
MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE=/opt/aurora/24.347.0/updates/mpich/tuning/20230818-1024/MPIR_Coll_tuning.json

and the JSON files select various run-time options depending on communicator and message size, ranks per node (ppn), and other factors.

amametjanov added a commit that referenced this pull request Jun 12, 2025
@amametjanov merged commit 32bb582 into master Jun 12, 2025 (7 checks passed)
@amametjanov deleted the azamat/aurora/tiles-cores-modules branch June 12, 2025 22:27