Gaea: Issues running C768 S2SW #3324

@JessicaMeixner-NOAA

Description

What is wrong?

Running C768 S2SW on Gaea produced the following warnings in the forecast log (log directory listed first):

/gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/c768t01/COMROOT/c768t01/logs/2019120300/

ariable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
2144: PE 2144: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
 404: PE 404: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3076: PE 3076: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3772: PE 3772: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
1727: PE 1727: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
1727:
1918: PE 1918: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
1918:
2686: PE 2686: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
2686:

In ufs-weather-model the GaeaC6 forecast job has the following environment variables:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L31-L34:

export FI_VERBS_PREFER_XRC=0
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=65536
export FI_CXI_RDZV_THRESHOLD=65536

Adding these to
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/GAEAC6.env#L203-L212

got the run past this hang.
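
For reference, a minimal sketch of what that addition might look like. The function name `set_gaeac6_fcst_network_env` and the `fcst` step label are assumptions made for illustration; the real GAEAC6.env may structure its forecast block differently. Only the four exported values come from the ufs-weather-model file linked above.

```shell
# Hypothetical helper: the name and the "fcst" step label are assumptions,
# not taken from GAEAC6.env. The exported values are copied from
# ufs-weather-model tests/fv3_conf/fv3_slurm.IN_gaeac6.
set_gaeac6_fcst_network_env() {
    step="${1:-}"
    if [ "${step}" = "fcst" ]; then
        export FI_VERBS_PREFER_XRC=0
        export FI_CXI_RX_MATCH_MODE=hybrid
        export COMEX_EAGER_THRESHOLD=65536
        export FI_CXI_RDZV_THRESHOLD=65536
    fi
}

# Apply the settings for the forecast step only.
set_gaeac6_fcst_network_env "fcst"
```

Limiting the exports to the forecast step keeps other jobs on their current settings, since only the forecast has been shown to hit the hang.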

What should have happened?

C768 S2SW should work on Gaea C6

What machines are impacted?

All or N/A

What global-workflow hash are you using?

#3289

Steps to reproduce

Use the CI test here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ci/cases/hires/C768_S2SW.yaml

Additional information

There are additional variables in the ufs-weather-model https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L26C1-L28C19

export OMP_NUM_THREADS=@[THRD]
export OMP_STACKSIZE=1024M
export NC_BLKSZ=1M

These should probably be added as well.
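
A rough sketch of how these could be set in an environment file. In the ufs-weather-model template, `@[THRD]` is a placeholder filled in at template time, so here it is replaced by a hypothetical variable `FCST_THREADS`; the workflow's actual thread-count variable is not confirmed here and would need to be substituted.

```shell
# FCST_THREADS is a hypothetical stand-in for the workflow's forecast
# thread count (the ufs-weather-model template uses @[THRD] instead).
FCST_THREADS="${FCST_THREADS:-1}"

export OMP_NUM_THREADS="${FCST_THREADS}"
export OMP_STACKSIZE=1024M   # value copied from fv3_slurm.IN_gaeac6
export NC_BLKSZ=1M           # value copied from fv3_slurm.IN_gaeac6
```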

Do you have a proposed solution?

Add the variables from ufs-weather-model to the GaeaC6 environment file (env/GAEAC6.env) for the forecast job.
