Failing compare ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72 #7843

@ndkeen

At some point, this test started failing. On CDash, it is reported as a test named
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72

but I verified we don't need the eamxx-bfbhash modifier. Note that eamxx-L72 is still needed: the test has always used 72 levels, but we recently changed the default to 128 levels, so the modifier is now required.

This test fails compare:

ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/c29-oct25/ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72.gh7843

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/c29-oct25/ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72.gh7843

However, interestingly, these pass:

PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERS_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
PET_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72

suggesting that it may not simply be an issue with changing the MPI count, or with restarts, but some combination.
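The reasoning above can be sketched as a small table lookup. This is only an illustration: the `VARIES` mapping is my paraphrase of what each CIME test type changes (ERS restarts, PEM changes the MPI task count, PET changes threading, ERP restarts with a changed task/thread count), and the helper names `type_prefix` and `unexplained_aspects` are hypothetical, not part of CIME.

```python
# Hypothetical mapping: what each test-type prefix varies.
# Paraphrased from the CIME system-test descriptions; treat as an assumption.
VARIES = {
    "ERS": {"restart"},
    "PEM": {"mpi_tasks"},
    "PET": {"threads"},
    "ERP": {"restart", "mpi_tasks", "threads"},
}

def type_prefix(test_name):
    """Return the test-type prefix, e.g. 'ERP' from 'ERP_Ln22.grid.compset...'."""
    return test_name.split(".", 1)[0].split("_", 1)[0]

# The gnu results reported in this issue.
SUFFIX = ".conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu."
RESULTS = {
    "ERP_Ln22" + SUFFIX + "eamxx-L72": "FAIL",
    "PEM_Ln22" + SUFFIX + "eamxx-L72": "PASS",
    "ERS_Ln22" + SUFFIX + "eamxx-L72": "PASS",
    "PET_Ln22" + SUFFIX + "eamxx-bfbhash--eamxx-L72": "PASS",
}

def unexplained_aspects(results):
    """For each failing test, list aspects not already exercised by a passing test."""
    covered = set()
    for name, status in results.items():
        if status == "PASS":
            covered |= VARIES[type_prefix(name)]
    return {
        name: VARIES[type_prefix(name)] - covered
        for name, status in results.items()
        if status == "FAIL"
    }

for name, leftover in unexplained_aspects(RESULTS).items():
    if leftover:
        verdict = "not tested alone: " + ", ".join(sorted(leftover))
    else:
        verdict = "every aspect passes alone, so the combination is suspect"
    print(type_prefix(name), "->", verdict)
```

An empty leftover set for ERP is what points toward an interaction rather than a single cause.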

I was also trying a DEBUG variety:

ERP_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERS_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72

which both pass (note that they need more than the debug QOS walltime: 48 and 40 minutes, respectively).

I also tried with the intel compiler. We don't really test eamxx much with the intel compiler on pm-cpu, and there is a known issue with tests not being BFB when varying MPI tasks (see older issues #6834 and #7746). So I didn't expect it to pass here, and it does not. These 2 fail compare:

ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72

Some other tests I tried:

fail compare  ERP_P1024x1_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72

The x1 in P1024x1 is not needed, as the test defaults to 1 thread. But it does show that we still get the failure with more nodes: default cases use 2 nodes, and 1024 MPI tasks would use 8.
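The node counts above follow from simple arithmetic. Here is a minimal sketch, assuming one MPI task per core and 128 cores per pm-cpu node (two 64-core CPUs); `nodes_needed` is a hypothetical helper, not a CIME function.

```python
import math

# Assumption: a pm-cpu node has 128 cores and each MPI task (times its
# thread count) occupies that many cores, with no cores left idle.
CORES_PER_NODE = 128

def nodes_needed(ntasks, nthreads=1, cores_per_node=CORES_PER_NODE):
    """Smallest number of nodes that fits ntasks * nthreads cores."""
    return math.ceil(ntasks * nthreads / cores_per_node)

# P1024x1: 1024 MPI tasks with 1 thread each.
print(nodes_needed(1024, 1))

# Under the same assumption, a 2-node default case holds at most this many tasks.
print(2 * CORES_PER_NODE)
```

With these assumptions, 1024 tasks map to 8 nodes, matching the count quoted above.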

Note from TestStatus.log of one of the cases:

SUMMARY of cprnc:
 A total number of    413 fields were compared
          of which    149 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of      0 fields could not be analyzed
 A total number of      0 time-varying fields on file 1 were not found on file 2.
 A total number of      0 time-constant fields on file 1 were not found on file 2.
 A total number of      0 time-varying fields on file 2 were not found on file 1.
 A total number of      0 time-constant fields on file 2 were not found on file 1.
  diff_test: the two files seem to be DIFFERENT 
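For comparing several cases, the summary above can be reduced to a few numbers. The following is a rough sketch that only matches the lines shown in this log; `parse_cprnc_summary` is a hypothetical helper, and other cprnc versions may format their output differently.

```python
import re

# Excerpt of the cprnc summary quoted above.
SUMMARY = """\
SUMMARY of cprnc:
 A total number of    413 fields were compared
          of which    149 had non-zero differences
               and      0 had differences in fill patterns
  diff_test: the two files seem to be DIFFERENT
"""

def parse_cprnc_summary(text):
    """Extract field counts and the overall verdict from a cprnc summary."""
    compared = re.search(r"A total number of\s+(\d+)\s+fields were compared", text)
    differing = re.search(r"of which\s+(\d+)\s+had non-zero differences", text)
    if compared is None or differing is None:
        raise ValueError("summary lines not found; format may differ")
    return {
        "compared": int(compared.group(1)),
        "differing": int(differing.group(1)),
        "files_differ": "seem to be DIFFERENT" in text,
    }

info = parse_cprnc_summary(SUMMARY)
share = info["differing"] / info["compared"]
print(f"{info['differing']} of {info['compared']} fields differ ({share:.1%})")
```

Here roughly a third of the compared fields differ, so the mismatch is not confined to one or two variables.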

Labels

EAMxx: C++ based E3SM atmosphere model (aka SCREAM)
Testing: Anything related to unit/system tests
pm-cpu: Perlmutter at NERSC (CPU-only nodes)
