Description
At some point, this test started failing. On CDash, the failing test is reported as
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72
but I verified that we don't need the eamxx-bfbhash modifier. Note that eamxx-L72 is still needed: the test has always used 72 levels, but we recently changed the default to 128 levels, so the modifier is now required.
This test fails compare:
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/c29-oct25/ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72.gh7843
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/c29-oct25/ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72.gh7843
However, interestingly, these pass:
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERS_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
PET_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-bfbhash--eamxx-L72
suggesting that it may not simply be an issue with a change in MPI count, or with restarts, but with some combination of the two.
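For reference, here is a minimal Python sketch of how the test names above break down and how the GNU results line up. The helper `parse_test_name` and the `ParsedName` type are hypothetical, not part of CIME, and the parser assumes exactly five dot-separated fields, which holds for the names in this issue but not for CIME test names in general. The one-line descriptions of the test types are my understanding of CIME's conventions and should be checked against the CIME documentation.

```python
from typing import NamedTuple, Tuple


class ParsedName(NamedTuple):
    """Components of a test name (hypothetical helper type, not a CIME class)."""
    test_type: str
    modifiers: Tuple[str, ...]
    grid: str
    compset: str
    machine_compiler: str
    testmods: Tuple[str, ...]


def parse_test_name(name: str) -> ParsedName:
    """Split TYPE_MODS.GRID.COMPSET.MACHINE_COMPILER.TESTMODS into its parts.

    Assumes exactly five dot-separated fields and "--" between testmods.
    """
    fields = name.split(".")
    if len(fields) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(fields)}: {name}")
    case, grid, compset, machine_compiler, mods = fields
    test_type, *modifiers = case.split("_")
    return ParsedName(test_type, tuple(modifiers), grid, compset,
                      machine_compiler, tuple(mods.split("--")))


# What each test type varies (my understanding of CIME, stated as an assumption).
VARIES = {
    "ERS": "restart only",
    "PEM": "MPI task count only",
    "PET": "thread count only",
    "ERP": "restart plus PE layout",
}

BASE = "conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu"

# Outcomes of the non-DEBUG GNU tests exactly as reported in this issue.
RESULTS = [
    (f"ERP_Ln22.{BASE}.eamxx-L72", "fail"),
    (f"PEM_Ln22.{BASE}.eamxx-L72", "pass"),
    (f"ERS_Ln22.{BASE}.eamxx-L72", "pass"),
    (f"PET_Ln22.{BASE}.eamxx-bfbhash--eamxx-L72", "pass"),
]

for name, outcome in RESULTS:
    parsed = parse_test_name(name)
    print(f"{parsed.test_type:4s} {VARIES[parsed.test_type]:24s} {outcome}")
```

Only the test type that combines a restart with a layout change fails here, which is what points at a combination rather than either change on its own.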
I also tried DEBUG variants:
ERP_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERS_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
which both pass (note that they need more than the debug QOS walltime: 48 and 40 minutes, respectively)
I also tried with the Intel compiler. We don't really test eamxx much with the Intel compiler on pm-cpu, and there is a known issue with tests not being BFB when varying MPI tasks (see older issues #6834 and #7746). So I didn't expect these to pass, and they do not. These 2 fail compare:
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
Some other tests I tried:
fail compare ERP_P1024x1_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
The x1 in P1024x1 is not needed, as the default is 1 thread. But it does show that we still get the failure with more nodes: default cases use 2 nodes, and 1024 MPI tasks would use 8.
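The node counts can be checked with a little arithmetic. This is a rough sketch, assuming 128 MPI tasks per node, which is what 1024 tasks on 8 nodes implies. The function `nodes_needed` is hypothetical, and the 256-task figure for the default 2-node layout is inferred from that assumption, not taken from the case settings.

```python
import math

# Assumption: 1024 tasks on 8 nodes implies 128 tasks per node on pm-cpu.
TASKS_PER_NODE = 128


def nodes_needed(ntasks: int, nthreads: int = 1,
                 cores_per_node: int = TASKS_PER_NODE) -> int:
    """Nodes required when each MPI task uses nthreads cores."""
    return math.ceil(ntasks * nthreads / cores_per_node)


print(nodes_needed(1024))  # the P1024x1 case
print(nodes_needed(256))   # inferred default layout (2 nodes), an assumption
```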
Note from TestStatus.log of one of the cases:
SUMMARY of cprnc:
A total number of 413 fields were compared
of which 149 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
and 0 had different data types
A total number of 0 fields could not be analyzed
A total number of 0 time-varying fields on file 1 were not found on file 2.
A total number of 0 time-constant fields on file 1 were not found on file 2.
A total number of 0 time-varying fields on file 2 were not found on file 1.
A total number of 0 time-constant fields on file 2 were not found on file 1.
diff_test: the two files seem to be DIFFERENT
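When comparing several cases, it can help to pull the counts out of this summary programmatically. Below is a minimal sketch that parses the text shown above. The function `parse_cprnc_summary` is hypothetical, and the regular expressions are matched only against the wording in this log, so other cprnc versions may need adjusting.

```python
import re

# Patterns match the wording of the summary pasted in this issue only.
PATTERNS = {
    "compared": r"A total number of\s+(\d+)\s+fields were compared",
    "nonzero_diffs": r"of which\s+(\d+)\s+had non-zero differences",
    "fill_diffs": r"and\s+(\d+)\s+had differences in fill patterns",
    "not_analyzed": r"A total number of\s+(\d+)\s+fields could not be analyzed",
}


def parse_cprnc_summary(text: str) -> dict:
    """Extract field counts and the final verdict from a cprnc summary."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[key] = int(match.group(1)) if match else None
    # The last line of the summary above ends in DIFFERENT for a failed compare.
    result["different"] = bool(re.search(r"diff_test:.*DIFFERENT", text))
    return result


SUMMARY = """\
SUMMARY of cprnc:
A total number of 413 fields were compared
of which 149 had non-zero differences
and 0 had differences in fill patterns
A total number of 0 fields could not be analyzed
diff_test: the two files seem to be DIFFERENT
"""

print(parse_cprnc_summary(SUMMARY))
```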