Iterative coupling jobs fail after a few years – problem with PMIx authentication mechanism and Munge? #1367

@jannitzbon

Describe the problem you are facing
I am running an iteratively coupled simulation (awiesm-2.6 + cryogrid-1.0.0) on albedo, in which each model runs in chunks of one year. After several years (usually 4), the compute phase of awiesm fails with the following error:

600: slurmstepd: error: Munge decode failed: Unauthorized credential for client UID=5948 GID=500
600: slurmstepd: error: mpi/pmix_v3: _auth_cred_verify: prod-137 [5]: pmixp_server.c:540: Verifying authentication credential: Invalid authentication credential
600: slurmstepd: error: mpi/pmix_v3: _direct_conn_establish: prod-137 [5]: pmixp_server.c:1285: Connection reject from 6(prod-141)
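
To see whether the same nodes are involved in every failed run, the error lines can be tallied per node pair. The following is only a minimal sketch, assuming that the log lines always follow the format of the three lines above; the regular expressions and the `summarize` helper are hypothetical and not part of esm_tools or Slurm.

```python
import re
from collections import Counter

# Assumed format of the "Connection reject" line, taken from the excerpt above:
# "... _direct_conn_establish: <local node> [<id>]: <file>:<line>: Connection reject from <id>(<peer node>)"
REJECT_RE = re.compile(
    r"_direct_conn_establish: (?P<local>[\w-]+) \[(?P<local_id>\d+)\]: .*"
    r"Connection reject from (?P<peer_id>\d+)\((?P<peer>[\w-]+)\)"
)

# Assumed format of the Munge line, also taken from the excerpt above.
MUNGE_RE = re.compile(
    r"Munge decode failed: (?P<reason>.+?) for client UID=(?P<uid>\d+) GID=(?P<gid>\d+)"
)


def summarize(lines):
    """Count rejected connections per (logging node, rejected peer) and Munge failure reasons."""
    rejects = Counter()
    munge_reasons = Counter()
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            rejects[(m["local"], m["peer"])] += 1
        m = MUNGE_RE.search(line)
        if m:
            munge_reasons[m["reason"]] += 1
    return rejects, munge_reasons


# The three log lines quoted in this issue, used as sample input.
SAMPLE = [
    "600: slurmstepd: error: Munge decode failed: Unauthorized credential for client UID=5948 GID=500",
    "600: slurmstepd: error: mpi/pmix_v3: _auth_cred_verify: prod-137 [5]: pmixp_server.c:540: "
    "Verifying authentication credential: Invalid authentication credential",
    "600: slurmstepd: error: mpi/pmix_v3: _direct_conn_establish: prod-137 [5]: pmixp_server.c:1285: "
    "Connection reject from 6(prod-141)",
]

if __name__ == "__main__":
    rejects, munge_reasons = summarize(SAMPLE)
    for (local, peer), count in rejects.items():
        print(f"{local} rejected a connection from {peer}: {count}x")
    for reason, count in munge_reasons.items():
        print(f"Munge decode failure '{reason}': {count}x")
```

To apply this to a real run, the lines of the log file listed further down could be passed to `summarize` instead of `SAMPLE`. If the same node pair shows up in every failed run, that would be useful information for the cluster administrators.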

ChatGPT suggested that this is related to failing communication or authorization between the nodes. However, diagnosing the causes it suggested is beyond my skills and permissions.

Is this a known issue, and does anybody have an idea how I could fix it on my side?

Runscripts and other relevant files
Runscripts are here:
/albedo/albedo/work/projects/paleo_work/jnitzbon/PermafrostCouplingComparison/runscripts
Experiment and log files here:
/albedo/work/projects/p_paleo_ollie/jnitzbon/PermafrostCouplingComparison/experiments/awiesm-2.6/PIctrl_adaptedThermal_adaptedHydro_cglite
Logfile with error(s):
PIctrl_adaptedThermal_adaptedHydro_cglite_awiesm_compute_52160101-52161231_35992984.log

System (please complete the following information):

  • Supercomputer: albedo
  • Version: 6.53.2
