
Long runtimes and error messages in some gfs_gempak_f jobs on WCOSS2 #3630

@KateFriedman-NOAA

What is wrong?

Several gfs_gempak_f* jobs in the extended CI test case on WCOSS2 consistently hit their 30-minute walltime. The automated second attempt also hits the walltime, while a third attempt hours later always completes successfully. My first thought was that this was a machine issue, but CI tests run since the first instance continue to show the same problem in the same jobs.

The logs for those jobs show the following message repeated many times: Error in message send = 22. With those error messages printed, the resulting job logs are about 15-20 GB in size (!), whereas gempak logs are usually megabytes, not gigabytes.
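Since the affected logs are far too large to open in an editor, a streaming scan may help when triaging them. The following is only a minimal sketch, not part of global-workflow: `count_error_lines` and `scan_log` are hypothetical helper names, and the only string taken from the logs is the error text quoted above. It reads one line at a time, so memory use stays flat even on a multi-GB file.

```python
from typing import Iterable, Optional, Tuple

# Error text exactly as quoted in this issue.
ERROR_TEXT = "Error in message send = 22"


def count_error_lines(
    lines: Iterable[str], needle: str = ERROR_TEXT
) -> Tuple[int, Optional[int]]:
    """Return (number of lines containing needle, 1-based line number of the first hit).

    The second element is None when the needle never appears.
    """
    count = 0
    first = None
    for lineno, line in enumerate(lines, start=1):
        if needle in line:
            count += 1
            if first is None:
                first = lineno
    return count, first


def scan_log(path: str, needle: str = ERROR_TEXT) -> Tuple[int, Optional[int]]:
    """Stream a log file from disk without loading it into memory."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, "r", errors="replace") as handle:
        return count_error_lines(handle, needle)
```

Calling `scan_log` on a job log (the path is whatever log you want to inspect) returns the repeat count and the line where the error first appears, which makes it easier to see what the job was doing just before the messages start.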

The affected jobs are gfs_gempak_f123-f144 and gfs_gempak_f147-f168. I do not see the error messages in any of the other gfs_gempak_f* jobs in the same CI tests.

Snippet from log where the error first appears:

+ cpfs[13]cpdstfile=/lfs/h2/emc/ptmp/emc.global/PR/PR_3626/RUNTESTS/COMROOT/C96_atm3DVar_extended_3626/gfs.20211221/06//products/atmos/gempak/35km_pac/gfs_35km_pac_2021122106f168
Error in message send = 22
itype, ichan, nwords,2,22216705,2
Error in message send = 22
itype, ichan, nwords,2,22216705,2
Error in message send = 22
itype, ichan, nwords,2,22216705,2
...

See saved logs on Cactus: /lfs/h2/emc/global/noscrub/emc.global/ci/SAVE_LOGS_3626
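Because the oversized log is the most visible symptom, one quick way to check other CI runs for the same problem is to look for unusually large files in a log directory. The sketch below assumes nothing about how the saved logs are laid out: `find_large_files` is a hypothetical helper that walks any directory tree, and the size threshold is left to the caller.

```python
import os
from typing import List, Tuple


def find_large_files(root: str, min_bytes: int) -> List[Tuple[str, int]]:
    """Return (path, size_in_bytes) for every file under root at or above min_bytes.

    Results are sorted from largest to smallest.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read while scanning.
                continue
            if size >= min_bytes:
                hits.append((path, size))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```

For example, `find_large_files(log_dir, 1024**3)` would list files of 1 GiB or more under a directory of your choosing. Given that normal gempak logs are in the MB range, that threshold is an illustrative guess rather than a tested cutoff.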

What should have happened?

No error messages, and the jobs complete within their walltime.

What machines are impacted?

WCOSS2

What global-workflow hash are you using?

develop and recent PR hashes

Steps to reproduce

Run the extended CI test case from develop on WCOSS2 with DO_GEMPAK=YES.

Additional information

No response

Do you have a proposed solution?

No response

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions