Long runtimes and error messages in some gfs_gempak_f jobs on WCOSS2 #3630
Description
What is wrong?
Several gfs_gempak_f* jobs in the extended CI test case on WCOSS2 are consistently hitting their 30-minute walltime. The automated second attempt also hits the walltime, while a third attempt hours later always completes successfully. My first thought was that this was a machine issue, but CI runs since the first instance continue to show the same problem in the same jobs.
The logs for those jobs show the following message repeated many times: Error in message send = 22. With those error messages printed, the resulting job logs are about 15-20 GB in size (!), whereas gempak logs are normally megabytes, not gigabytes.
The affected jobs are gfs_gempak_f123-f144 and gfs_gempak_f147-f168. I do not see the error messages in any of the other gfs_gempak_f* jobs in the same CI tests.
Snippet from log where the error first appears:
+ cpfs[13]cpdstfile=/lfs/h2/emc/ptmp/emc.global/PR/PR_3626/RUNTESTS/COMROOT/C96_atm3DVar_extended_3626/gfs.20211221/06//products/atmos/gempak/35km_pac/gfs_35km_pac_2021122106f168
Error in message send = 22
itype, ichan, nwords,2,22216705,2
Error in message send = 22
itype, ichan, nwords,2,22216705,2
Error in message send = 22
itype, ichan, nwords,2,22216705,2
...
See saved logs on Cactus: /lfs/h2/emc/global/noscrub/emc.global/ci/SAVE_LOGS_3626
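Because the affected logs are 15-20 GB, opening them in an editor is impractical. The following is a minimal Python sketch, not part of global-workflow, for finding where the message first appears and how often it repeats. The function names `summarize_errors` and `summarize_log` are hypothetical, and the only string taken from this issue is the error message itself.

```python
from typing import Iterable, Optional, Tuple

# Message text quoted from the job logs in this issue.
ERROR_TEXT = "Error in message send = 22"


def summarize_errors(
    lines: Iterable[str], needle: str = ERROR_TEXT
) -> Tuple[int, Optional[int]]:
    """Return (count, first_line_number) for lines containing `needle`.

    Line numbers are 1-based; first_line_number is None if no line matches.
    """
    count = 0
    first = None
    for lineno, line in enumerate(lines, start=1):
        if needle in line:
            count += 1
            if first is None:
                first = lineno
    return count, first


def summarize_log(path: str) -> Tuple[int, Optional[int]]:
    """Scan a log file line by line so a multi-GB log is never held in memory."""
    # errors="replace" avoids a crash on any undecodable bytes in the log.
    with open(path, "r", errors="replace") as fh:
        return summarize_errors(fh)
```

Calling `summarize_log` on the path of one of the saved job logs would return the number of matching lines and the line number of the first match, which can then be used to inspect the commands that ran just before the errors started.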
What should have happened?
No error messages, and the jobs complete within their walltime.
What machines are impacted?
WCOSS2
What global-workflow hash are you using?
develop and recent PR hashes
Steps to reproduce
Run the extended CI test case from develop on WCOSS2 with DO_GEMPAK=YES.
Additional information
No response
Do you have a proposed solution?
No response