Load recommended mpich-config modules and env-vars on Aurora #7399
Conversation
Testing:
Case-dirs:
```xml
<arg name="ranks_per_node">-ppn {{ tasks_per_node }}</arg>
<arg name="ranks_bind">--cpu-bind $ENV{RANKS_BIND}</arg>
<arg name="threads_per_rank">-d $ENV{OMP_NUM_THREADS} $ENV{RLIMITS}</arg>
<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>
```
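For illustration, these args roughly assemble into a launch line like the one below. The variable values here are made-up examples, not the actual Aurora defaults, which come from CIME's env settings:

```shell
# Hypothetical example values; the real ones are filled in by CIME at run time.
tasks_per_node=12
RANKS_BIND=depth
OMP_NUM_THREADS=1
# Assemble the mpiexec fragment that the <arg> entries above expand to.
echo "mpiexec -ppn ${tasks_per_node} --cpu-bind ${RANKS_BIND} -d ${OMP_NUM_THREADS}"
```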
Shouldn't we remove this, since we are defining
`--gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1`?
@abagusetty what's the Aurora-MPICH equivalent of OpenMPI's `OMPI_COMM_WORLD_LOCAL_RANK`?
Kokkos raises
`Warning: unable to detect local MPI rank. Falling back to the first GPU available for execution. Raised by Kokkos::initialize()`
if the `$MPI_LOCALRANKID` env-var is undefined. `$PALS_LOCAL_RANKID` also appears to be empty.
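For context, here is a minimal sketch of the kind of fallback chain a launch wrapper (or a runtime like Kokkos, internally) can use to discover the local rank. The helper name is made up; the env-var names are the ones discussed in this thread:

```shell
# Return the local rank from the first defined env-var; default to 0,
# which is the situation where Kokkos emits the warning quoted above.
detect_local_rank() {
  echo "${MPI_LOCALRANKID:-${PALS_LOCAL_RANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}}"
}
```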
> Shouldn't we remove this since we are defining `--gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1`

I am assuming you are asking about the last line. Yes, `<arg name="gpu_maps">$ENV{GPU_TILE_COMPACT}</arg>` can be removed: `--gpu-bind` takes care of binding tiles to MPI processes, so the script isn't needed.
@amametjanov Aurora MPICH doesn't have an equivalent, but PALS should work the same way: `PALS_LOCAL_RANKID`. It depends on where you are using this variable; not sure why it is empty. I believe these warnings were fixed a while back.
As long as `PALS_LOCAL_RANKID` is referenced after the `mpiexec` launch command, it should be defined. For instance: `mpiexec ... $PALS_LOCAL_RANKID $EXE ...`
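The timing point above can be demonstrated with a throwaway variable and no real `mpiexec` (this is an illustrative sketch; `DEMO_RANK` is invented): a variable in double quotes is expanded by the submitting shell, where it is empty, while single quotes defer expansion to the launched process, where PALS would have defined it.

```shell
unset DEMO_RANK                                    # not set in the submitting shell
early=$(sh -c "echo early=$DEMO_RANK")             # expanded NOW, before the child runs
late=$(DEMO_RANK=4 sh -c 'echo late=$DEMO_RANK')   # expanded inside the child process
echo "$early"   # early=
echo "$late"    # late=4
```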
Ok, thanks. Straight removal of the `gpu_tile_compact.sh` script was raising those warnings. I'll push a commit to export the correct `$MPI_LOCALRANKID`.
Is there a reason to prefer the `--gpu-bind` argument over the script?
`--gpu-bind` is an official MPICH option that does topology-aware bindings internally, unlike the script. The script was a temporary workaround from the early days, when we didn't yet have a GPU-binding mechanism for Aurora.
Passing `mpiexec ... --genv MPI_LOCALRANKID=${PALS_LOCAL_RANKID} ...` isn't helping: `MPI_LOCALRANKID` is empty.
I added a Kokkos mod adding `PALS_LOCAL_RANKID` to the list of recognized env-vars at E3SM-Project/EKAT#372. When that PR makes it into E3SM master, I can remove the call to `gpu_tile_compact.sh` in a separate PR. If that's okay, then maybe this PR can go in without that mod.
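As an alternative to `--genv` (which, as noted above, expands `PALS_LOCAL_RANKID` before the ranks exist), a tiny per-rank wrapper can do the re-export at run time. This is a hypothetical sketch, not what the PR ships; the function name is invented:

```shell
# run_with_local_rank CMD [ARGS...]: re-export the PALS local-rank variable
# under the name Kokkos looks for, then run the real command with it set.
run_with_local_rank() {
  MPI_LOCALRANKID="${PALS_LOCAL_RANKID}" "$@"
}
```

Launched as `mpiexec ... wrapper.sh $EXE`, each rank would then see its own value, because the expansion happens inside the already-launched process.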
```xml
<batch_system MACH="aurora" type="pbspro">
  <batch_submit>/lus/flare/projects/E3SM_Dec/tools/qsub/throttle</batch_submit>
  <jobid_pattern>(\d+)\.aurora-pbs</jobid_pattern>
```
Occasionally, a job submission's stdout returns more output beyond the job's id, e.g.:
```
ERROR: Couldn't match jobid_pattern '^(\d+)' within submit output:
'auth: error returned: 15007
auth: Failed to receive auth token
No Permission.
qstat: cannot connect to server aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov (errno=15007)
5435693.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov'
```
This regex update extracts the job-id from such longer strings, letting CIME continue with its job stages.
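A quick way to check the relaxed matching against the noisy output above, using `sed` as a stand-in for CIME's Python regex handling (the anchoring here is for the demo, not a copy of CIME's internals):

```shell
# Sample submit output copied from the error message above.
submit_out='auth: error returned: 15007
auth: Failed to receive auth token
No Permission.
5435693.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov'
# Pull the first "<digits>.aurora-pbs" job id, whichever line it lands on.
jobid=$(printf '%s\n' "$submit_out" | sed -n 's/^\([0-9]\{1,\}\)\.aurora-pbs.*/\1/p')
echo "$jobid"   # 5435693
```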
Is this ready?

Yes, just waiting for somebody to approve before starting the merge. :)
Load recommended mpich-config modules and env-vars on Aurora

Also,
- set env-vars and export to mpiexec an unlimited core file size limit in debug runs
- update `jobid_pattern` to refine job-id extraction from non-standard PBS output
- clean up tabs

[BFB]
amametjanov left a comment
This is a temporary workaround to avoid low node-count cases running into a seg-fault, which can be reproduced on 4 nodes with
`./cime/scripts/create_test SMS.ne30pg2_EC30to60E2r2.WCYCLXX2010 --mpilib mpich1024`
The module is still needed for scaling on 256+ nodes, and such runs will need to add the extra argument `--mpilib mpich1024`:
`./cime/scripts/create_[newcase,test] --mpilib mpich1024 ...`
mpich-collectives can only be used on jobs with 256 or more nodes?

Yes, for now. @abagusetty is looking into whether the tuned collectives config can be used at low node-counts too (e.g. 4 nodes).
Another question: do our other libraries, like pnetcdf and adios, need to be built against mpich-collectives?

No, the collectives module sets run-time env-vars, and the JSON files set various run-time options depending on communicator and message size, ppn, and others.
Re-merge to next to bring in a new commit.
Also, update `jobid_pattern` to refine job-id extraction from non-standard PBS output. [BFB]