-
Notifications
You must be signed in to change notification settings - Fork 208
Description
What is wrong?
Running C768 S2SW on gaea had the following messages:
/gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/c768t01/COMROOT/c768t01/logs/2019120300/
ariable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
2144: PE 2144: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
404: PE 404: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3076: PE 3076: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3772: PE 3772: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
1727: PE 1727: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
1727:
1918: PE 1918: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
1918:
2686: PE 2686: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
2686:
In ufs-weather-model the GaeaC6 forecast job has the following environment variables:
export FI_VERBS_PREFER_XRC=0
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=65536
export FI_CXI_RDZV_THRESHOLD=65536
Adding these to
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/GAEAC6.env#L203-L212
has gotten past this hang issue.
What should have happened?
C768 S2SW should work on Gaea C6
What machines are impacted?
All or N/A
What global-workflow hash are you using?
Steps to reproduce
Use the CI test here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ci/cases/hires/C768_S2SW.yaml
Additional information
There are additional variables in the ufs-weather-model https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L26C1-L28C19
export OMP_NUM_THREADS=@[THRD]
export OMP_STACKSIZE=1024M
export NC_BLKSZ=1M
That should be potentially also added.
Do you have a proposed solution?
Add variables from ufs-weather-model to gaeac6 environment file for forecast.