Conversation

@amametjanov (Member) commented on Oct 21, 2025

Also,

  • remove --gpu-bind options
  • set max mpi+omp to 128
  • fix spelling of --cpu-bind
  • clean up omp env-vars from mpi-only runs
  • add S/M/L pe-layouts for ne256-wcyclxx on pm-gpu

[BFB]


To get the XML settings for EXCL_STRIDE, you need to check out a pending cime branch:

$ cd cime
$ git fetch && git checkout azamat/pes/add-xstrid-to-xml
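
Once the branch is checked out, a hypothetical usage sketch (the variable name and the xmlchange form are assumptions inferred from the description, not confirmed by this PR):

$ cd $CASEROOT                 # hypothetical case directory
$ ./xmlquery EXCL_STRIDE       # inspect the setting the branch is expected to expose
$ ./xmlchange EXCL_STRIDE=16   # illustrative value; compare the "(stride)" column in the PE layouts below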

Also,
- remove --gpu-bind options
- set max mpi+omp to 128
- fix spelling of --cpu-bind
- clean up omp env-vars from mpi-only runs
@amametjanov added the Machine Files and pm-gpu (Perlmutter machine at NERSC, GPU nodes) labels on Oct 21, 2025
@amametjanov (Member, Author) commented on Oct 21, 2025

Testing:

  • 64 nodes: base 0.18 sypd, this-branch 0.715 sypd -- 3.93x speedup
  • 128 nodes: base 0.34 sypd, this-branch 1.18 sypd -- 3.47x speedup

Example on 128 nodes:

  • base:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        512         0        512    x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (1     )
  lnd = elm        512         0        512    x 1       1      (1     )
  ice = mpassi     512         0        512    x 1       1      (1     )
  ocn = mpaso      512         0        512    x 1       1      (1     )
  rof = mosart     512         0        512    x 1       1      (1     )
  glc = sglc       512         0        512    x 1       1      (1     )
  wav = swav       512         0        512    x 1       1      (1     )
  iac = siac       512         0        512    x 1       1      (1     )
  esp = sesp       512         0        512    x 1       1      (1     )

    TOT Run Time:     694.880 seconds      694.880 seconds/mday         0.34 myears/wday
    CPL Run Time:      21.708 seconds       21.708 seconds/mday        10.90 myears/wday
    ATM Run Time:     120.832 seconds      120.832 seconds/mday         1.96 myears/wday
    LND Run Time:       9.515 seconds        9.515 seconds/mday        24.88 myears/wday
    ICE Run Time:     185.903 seconds      185.903 seconds/mday         1.27 myears/wday
    OCN Run Time:     365.385 seconds      365.385 seconds/mday         0.65 myears/wday
    ROF Run Time:       0.628 seconds        0.628 seconds/mday       376.93 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     17.182 seconds       17.182 seconds/mday        13.78 myears/wday
  • test:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        8192        0        8192   x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (16    )
  lnd = elm        8192        0        8192   x 1       1      (1     )
  ice = mpassi     8192        0        8192   x 1       1      (1     )
  ocn = mpaso      8192        0        8192   x 1       1      (1     )
  rof = mosart     8192        0        8192   x 1       1      (1     )
  glc = sglc       1           1        1      x 1       1      (1     )
  wav = swav       1           1        1      x 1       1      (1     )
  iac = siac       1           1        1      x 1       1      (1     )
  esp = sesp       1           1        1      x 1       1      (1     )

    TOT Run Time:     200.377 seconds      200.377 seconds/mday         1.18 myears/wday
    CPL Run Time:      12.038 seconds       12.038 seconds/mday        19.66 myears/wday
    ATM Run Time:     125.552 seconds      125.552 seconds/mday         1.89 myears/wday
    LND Run Time:      12.020 seconds       12.020 seconds/mday        19.69 myears/wday
    ICE Run Time:      39.554 seconds       39.554 seconds/mday         5.98 myears/wday
    OCN Run Time:      99.807 seconds       99.807 seconds/mday         2.37 myears/wday
    ROF Run Time:       1.231 seconds        1.231 seconds/mday       192.29 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     70.586 seconds       70.586 seconds/mday         3.35 myears/wday
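
As a quick cross-check of the 3.47x figure, the ratio of the two TOT Run Time lines above gives the same speedup (a throwaway bc one-liner, not part of the PR):

$ echo "scale=3; 694.880 / 200.377" | bc
3.467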

@ndkeen (Contributor) commented on Oct 21, 2025

I think we want to keep those settings. You may need to find a conditional way to build if you are finding other settings are better for certain cases.

I do notice the correct syntax for srun is --cpu-bind, not --cpu_bind as I had it; it may be that srun simply ignores the misspelled option rather than erroring. Testing this change now.
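
For reference, a minimal srun sketch with the corrected spelling (task counts and executable name are illustrative only):

$ srun -N 1 -n 64  --cpu-bind=cores   ./e3sm.exe   # <=64 tasks per node: bind to cores
$ srun -N 1 -n 128 --cpu-bind=threads ./e3sm.exe   # >64 tasks per node: bind to hardware threads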

@amametjanov added the BFB (PR leaves answers BFB) and Performance labels on Oct 21, 2025
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu_bind=cores"; else echo "--cpu_bind=threads";fi;} </arg>
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu-bind=cores"; else echo "--cpu-bind=threads";fi;} </arg>
<arg name="placement"> -m plane=$SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
<arg name="gpu-bind"> /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh $SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
A reviewer (Member) commented on these lines:

What does this set_affinity_npergpu.sh file do?

@amametjanov (Member, Author) replied:

It exports CUDA_VISIBLE_DEVICES=[0|1|2|3] at run-time, depending on the node-local MPI task id:

  • if 4 tasks per node, then 1 task per gpu: each task sees only 1 gpu (one of 0,1,2,3), like before
  • if 64 tpn, then 16 tasks per gpu: the first 16 tasks go to gpu 0, the next 16 to gpu 1, etc.
  • without this, the prior behavior is round-robin (task 0 on gpu 0, task 1 on gpu 1, ...): e.g. with pstrid=16, tasks 0 and 16 both land on gpu 0, which leads to out-of-memory errors.
> cat /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh
#!/bin/bash
#num_gpus=$(nvidia-smi -L | wc -l)
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu
echo “RANK= ${SLURM_PROCID} LOCAL_RANK= ${SLURM_LOCALID} gpu= ${gpu}”
shift
"$@"

e.g. with 64 tpn:

   0: “RANK= 0 LOCAL_RANK= 0 gpu= 0”
   1: “RANK= 1 LOCAL_RANK= 1 gpu= 0”
  15: “RANK= 15 LOCAL_RANK= 15 gpu= 0”
  16: “RANK= 16 LOCAL_RANK= 16 gpu= 1”
  63: “RANK= 63 LOCAL_RANK= 63 gpu= 3”
  64: “RANK= 64 LOCAL_RANK= 0 gpu= 0”
...

More info at https://docs.nersc.gov/jobs/affinity/. But the --gpu-bind option described there doesn't work for us, because of direct gpu-to-gpu comms with MPICH_GPU_SUPPORT_ENABLED=1: --gpu-bind leads to IPC cuIpcOpenMemHandle errors like

118: gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
118: (GTL DEBUG: 118) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360
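
A minimal sketch of the combination that triggers this, per the description above (the --gpu-bind value and task counts are illustrative; this PR drops --gpu-bind and relies on the wrapper instead):

$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -n 512 --ntasks-per-node=64 --gpu-bind=map_gpu:0,1,2,3 ./e3sm.exe   # hits cuIpcOpenMemHandle errors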
