Conversation

@amametjanov (Member) commented on Oct 21, 2025

Also,

  • remove --gpu-bind options
  • set max mpi+omp to 128
  • fix spelling of --cpu-bind
  • clean up omp env-vars from mpi-only runs
  • add S/M/L pe-layouts for ne256-wcyclxx on pm-gpu

[BFB]


To get the XML settings for EXCL_STRIDE, you need to check out a pending cime branch:

$ cd cime
$ git fetch && git checkout azamat/pes/add-xstrid-to-xml
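
Once the branch is checked out, a hypothetical usage sketch (the variable name and the xmlchange form are assumptions inferred from the description, not confirmed by this PR):

$ cd $CASEROOT                 # hypothetical case directory
$ ./xmlquery EXCL_STRIDE       # inspect the setting the branch is expected to expose
$ ./xmlchange EXCL_STRIDE=16   # illustrative value; compare the "(stride)" column in the PE layouts below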

Also,
- remove --gpu-bind options
- set max mpi+omp to 128
- fix spelling of --cpu-bind
- clean up omp env-vars from mpi-only runs
@amametjanov added the Machine Files and pm-gpu (Perlmutter machine at NERSC, GPU nodes) labels on Oct 21, 2025
@amametjanov (Member, Author) commented on Oct 21, 2025

Testing:

  • 64 nodes: base 0.18 sypd, this-branch 0.715 sypd -- 3.93x speedup
  • 128 nodes: base 0.34 sypd, this-branch 1.18 sypd -- 3.47x speedup

Example on 128 nodes:

  • base:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        512         0        512    x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (1     )
  lnd = elm        512         0        512    x 1       1      (1     )
  ice = mpassi     512         0        512    x 1       1      (1     )
  ocn = mpaso      512         0        512    x 1       1      (1     )
  rof = mosart     512         0        512    x 1       1      (1     )
  glc = sglc       512         0        512    x 1       1      (1     )
  wav = swav       512         0        512    x 1       1      (1     )
  iac = siac       512         0        512    x 1       1      (1     )
  esp = sesp       512         0        512    x 1       1      (1     )

    TOT Run Time:     694.880 seconds      694.880 seconds/mday         0.34 myears/wday
    CPL Run Time:      21.708 seconds       21.708 seconds/mday        10.90 myears/wday
    ATM Run Time:     120.832 seconds      120.832 seconds/mday         1.96 myears/wday
    LND Run Time:       9.515 seconds        9.515 seconds/mday        24.88 myears/wday
    ICE Run Time:     185.903 seconds      185.903 seconds/mday         1.27 myears/wday
    OCN Run Time:     365.385 seconds      365.385 seconds/mday         0.65 myears/wday
    ROF Run Time:       0.628 seconds        0.628 seconds/mday       376.93 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     17.182 seconds       17.182 seconds/mday        13.78 myears/wday
  • test:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        8192        0        8192   x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (16    )
  lnd = elm        8192        0        8192   x 1       1      (1     )
  ice = mpassi     8192        0        8192   x 1       1      (1     )
  ocn = mpaso      8192        0        8192   x 1       1      (1     )
  rof = mosart     8192        0        8192   x 1       1      (1     )
  glc = sglc       1           1        1      x 1       1      (1     )
  wav = swav       1           1        1      x 1       1      (1     )
  iac = siac       1           1        1      x 1       1      (1     )
  esp = sesp       1           1        1      x 1       1      (1     )

    TOT Run Time:     200.377 seconds      200.377 seconds/mday         1.18 myears/wday
    CPL Run Time:      12.038 seconds       12.038 seconds/mday        19.66 myears/wday
    ATM Run Time:     125.552 seconds      125.552 seconds/mday         1.89 myears/wday
    LND Run Time:      12.020 seconds       12.020 seconds/mday        19.69 myears/wday
    ICE Run Time:      39.554 seconds       39.554 seconds/mday         5.98 myears/wday
    OCN Run Time:      99.807 seconds       99.807 seconds/mday         2.37 myears/wday
    ROF Run Time:       1.231 seconds        1.231 seconds/mday       192.29 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     70.586 seconds       70.586 seconds/mday         3.35 myears/wday
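
As a quick cross-check of the 3.47x figure, the ratio of the two TOT Run Time lines above gives the same speedup (a throwaway bc one-liner, not part of the PR):

$ echo "scale=3; 694.880 / 200.377" | bc
3.467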

@ndkeen (Contributor) commented on Oct 21, 2025

I think we want to keep those settings. You may need to find a conditional way to build if you are finding other settings are better for certain cases.

I do notice the correct syntax for srun is --cpu-bind, not --cpu_bind as I had it; it may be that srun simply ignores the misspelled option rather than erroring. Testing this change now.
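
For reference, a minimal srun sketch with the corrected spelling (task counts and executable name are illustrative only):

$ srun -N 1 -n 64  --cpu-bind=cores   ./e3sm.exe   # <=64 tasks per node: bind to cores
$ srun -N 1 -n 128 --cpu-bind=threads ./e3sm.exe   # >64 tasks per node: bind to hardware threads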

@amametjanov added the BFB (PR leaves answers BFB) and Performance labels on Oct 21, 2025
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu_bind=cores"; else echo "--cpu_bind=threads";fi;} </arg>
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu-bind=cores"; else echo "--cpu-bind=threads";fi;} </arg>
<arg name="placement"> -m plane=$SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
<arg name="gpu-bind"> /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh $SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
A reviewer (Member) commented on these lines:

What does this set_affinity_npergpu.sh file do?

@amametjanov (Member, Author) replied:

It exports CUDA_VISIBLE_DEVICES=[0|1|2|3] at run-time, depending on the node-local MPI task id:

  • if 4 tasks per node, then 1 task per gpu: each task sees only 1 gpu (one of 0,1,2,3), like before
  • if 64 tpn, then 16 tasks per gpu: the first 16 tasks go to gpu 0, the next 16 to gpu 1, etc.
  • without this, the prior behavior is round-robin (task 0 on gpu 0, task 1 on gpu 1, ...): e.g. with pstrid=16, tasks 0 and 16 both land on gpu 0, which leads to out-of-memory errors.
> cat /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh
#!/bin/bash
#num_gpus=$(nvidia-smi -L | wc -l)
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu
echo “RANK= ${SLURM_PROCID} LOCAL_RANK= ${SLURM_LOCALID} gpu= ${gpu}”
shift
"$@"

e.g. with 64 tpn:

   0: “RANK= 0 LOCAL_RANK= 0 gpu= 0”
   1: “RANK= 1 LOCAL_RANK= 1 gpu= 0”
  15: “RANK= 15 LOCAL_RANK= 15 gpu= 0”
  16: “RANK= 16 LOCAL_RANK= 16 gpu= 1”
  63: “RANK= 63 LOCAL_RANK= 63 gpu= 3”
  64: “RANK= 64 LOCAL_RANK= 0 gpu= 0”
...

More info at https://docs.nersc.gov/jobs/affinity/. But the --gpu-bind option described there doesn't work for us, because of direct gpu-to-gpu comms with MPICH_GPU_SUPPORT_ENABLED=1: --gpu-bind leads to IPC cuIpcOpenMemHandle errors like

118: gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
118: (GTL DEBUG: 118) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360
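
A minimal sketch of the combination that triggers this, per the description above (the --gpu-bind value and task counts are illustrative; this PR drops --gpu-bind and relies on the wrapper instead):

$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -n 512 --ntasks-per-node=64 --gpu-bind=map_gpu:0,1,2,3 ./e3sm.exe   # hits cuIpcOpenMemHandle errors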
