Open
Labels: EAMxx (C++ based E3SM atmosphere model, aka SCREAM)
Description
In a 6-month run with --compset F2010-SCREAMv1 --res ne256pg2_ne256pg2 --machine pm-gpu --compiler gnugpu (script below), jobs fail with errors like:
$ tail e3sm.log*
219: Bad dphi, dp3d, or vtheta_dp; label: 'DIRK Newton loop nm1'; see hommexx.errlog.256.219
...
$ head hommexx.errlog.256.219
label: DIRK Newton loop nm1
time-level 0
lat -2.673614897152932e-01 lon 2.998770420815903e+00
ie 936 igll 3 jgll 2 lev 0: bad dphi
level dphi dp3d vtheta_dp
0 -nan 6.439032159248143e+01 8.787763796686809e+04
...
The failure occurs ~2 months into the run, at model date 20180214 (YYYYMMDD).
Run-dir:
/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/bench/ppe/ne256pg2_ne256pg2.F2010-SCREAMv1.pm-gpu_gnugpu.20251027.ppe.n64.t2/run/
Run-script: run.ne256pg2_ne256pg2.F2010-SCREAMv1.sh
Yaml inputs:
A similar error occurs on --machine aurora --compiler oneapi-ifxgpu:
$ tail e3sm.log*
x4315c4s2b0n0.hsn.cm.aurora.alcf.anl.gov 669: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
x4314c4s3b0n0.hsn.cm.aurora.alcf.anl.gov 0: bfbhash> 8172 e2675347aabc7a9e (Hommexx)
x4315c4s2b0n0.hsn.cm.aurora.alcf.anl.gov 669: Bad dphi, dp3d, or vtheta_dp; label: 'CaarFunctorImpl::run TagPreExchange'; see hommexx.errlog.768.669
Exiting...
$ head hommexx.errlog.768.669
label: CaarFunctorImpl::run TagPreExchange
time-level 1
lat -2.469408023496295e-01 lon 2.399145952253143e+00
ie 166 igll 1 jgll 3 lev 121: bad dphi
level dphi dp3d vtheta_dp
0 -1.750491872851586e+04 6.500425338745119e+01 8.627967553816583e+04
...
120 -9.885440405719305e+01 3.290536127386427e+02 8.899586850071557e+04
121 4.141349216841650e+02 3.199318230615362e+02 7.667957395378580e+04
122 -2.574021162628712e+02 3.177201924738347e+02 6.866978494951430e+04
...
Run-dir:
/lus/flare/projects/E3SM_Dec/azamatm/scratch/profiling/ppe/20251028/ne256pg2_ne256pg2.F2010-SCREAMv1.aurora_oneapi-ifxgpu.20251028.ppe.n64/run/
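For triaging these files across many ranks, a small parser can pull out the label, the column location, and the levels that look suspicious. The following is only a sketch and not part of Hommexx or E3SM: `parse_errlog` and `suspect_levels` are hypothetical helper names, the lat/lon values are assumed to be in radians, and the "valid dphi is negative" rule is an assumption inferred from the excerpts above (the flagged level is either NaN or the only positive entry shown). The sample text is copied from the Aurora excerpt.

```python
import math

def parse_errlog(text):
    """Parse text laid out like the hommexx.errlog.* excerpts above (hypothetical helper)."""
    info = {"rows": []}
    for line in text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "label:":
            info["label"] = line.split(":", 1)[1].strip()
        elif tok[0] == "lat" and len(tok) >= 4 and tok[2] == "lon":
            # Assumption: lat/lon are printed in radians.
            info["lat_deg"] = math.degrees(float(tok[1]))
            info["lon_deg"] = math.degrees(float(tok[3]))
        elif len(tok) == 4 and tok[0].isdigit():
            # Table row: level dphi dp3d vtheta_dp (float() also accepts "-nan").
            info["rows"].append((int(tok[0]), float(tok[1]), float(tok[2]), float(tok[3])))
    return info

def suspect_levels(rows):
    """Return levels with a non-finite value or (assumed invalid) non-negative dphi."""
    bad = []
    for lev, dphi, dp3d, vtheta_dp in rows:
        finite = all(math.isfinite(v) for v in (dphi, dp3d, vtheta_dp))
        if not finite or dphi >= 0:
            bad.append(lev)
    return bad

SAMPLE = """\
label: CaarFunctorImpl::run TagPreExchange
time-level 1
lat -2.469408023496295e-01 lon 2.399145952253143e+00
ie 166 igll 1 jgll 3 lev 121: bad dphi
level dphi dp3d vtheta_dp
0 -1.750491872851586e+04 6.500425338745119e+01 8.627967553816583e+04
...
120 -9.885440405719305e+01 3.290536127386427e+02 8.899586850071557e+04
121 4.141349216841650e+02 3.199318230615362e+02 7.667957395378580e+04
122 -2.574021162628712e+02 3.177201924738347e+02 6.866978494951430e+04
"""

info = parse_errlog(SAMPLE)
print(info["label"], info["lat_deg"], info["lon_deg"], suspect_levels(info["rows"]))
```

On the Aurora excerpt this flags level 121, matching the `lev 121: bad dphi` line; on the pm-gpu excerpt the `-nan` at level 0 would be flagged by the finiteness check.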