Description
In early October, I bumped several module versions on pm-cpu/pm-gpu, including moving the GNU compiler to the current machine default, 13.2 (#7740). All of the vanilla e3sm tests passed and were BFB. It turns out there was a failing eamxx conus test on pm-gpu that I must have missed (#7843), and we now also see DEBUG eamxx tests on pm-cpu that fail sporadically (#7842). From debugging these, one theory is that the failures are related to issues with OpenMP threads (I don't see the issues without threads), so the two may share a cause. After reverting to GNU 12.3 (and leaving all other modules the same), these tests pass, as do the other tests I ran. The vanilla e3sm cases are still BFB, but the eamxx cases are not. They were also not BFB when moving to 13.2, so that is expected. I propose we make this change (revert to 12.3) to get the tests passing, and then investigate why we see these failures with version 13.2.
Looking at performance, the compiler change does not seem to have much impact, if any. I ran ne256 without IO for 5 days using 32, 64, and 128 nodes to directly compare this branch with GNU version 12.3 against GNU 13.2. The performance looks the same to within timing noise.
Tested so far with gnu on pm-cpu: e3sm_developer, e3sm_eamxx_v1, a test with netcdf-4 input
Tested so far with gnugpu on pm-gpu: e3sm_eamxx_v1, e3sm_eamxx_large, and several ne256 performance tests, a test with netcdf-4 input