Description
Describe the problem you are facing
I am running an iteratively coupled simulation (awiesm-2.6 + cryogrid-1.0.0) on albedo, in which each model runs in chunks of one year. After several simulated years (usually 4), the compute phase of awiesm fails with the following error:
600: slurmstepd: error: Munge decode failed: Unauthorized credential for client UID=5948 GID=500
600: slurmstepd: error: mpi/pmix_v3: _auth_cred_verify: prod-137 [5]: pmixp_server.c:540: Verifying authentication credential: Invalid authentication credential
600: slurmstepd: error: mpi/pmix_v3: _direct_conn_establish: prod-137 [5]: pmixp_server.c:1285: Connection reject from 6(prod-141)
ChatGPT suggested that this is related to failing communication / authorization between the nodes. However, diagnosing the causes it suggested is beyond my skills / permissions.
Is this a known issue, and does anybody have an idea how I can fix it on my side?
Runscripts and other relevant files
Runscripts are here:
/albedo/albedo/work/projects/paleo_work/jnitzbon/PermafrostCouplingComparison/runscripts
Experiment and log files here:
/albedo/work/projects/p_paleo_ollie/jnitzbon/PermafrostCouplingComparison/experiments/awiesm-2.6/PIctrl_adaptedThermal_adaptedHydro_cglite
Logfile with error(s):
PIctrl_adaptedThermal_adaptedHydro_cglite_awiesm_compute_52160101-52161231_35992984.log
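To check whether the same nodes are involved every time the run fails, the log files could be scanned for the messages quoted above. The following is only a minimal sketch, assuming the log lines always follow the format shown in this report. The function name `summarise_auth_errors` and the regular expressions are hypothetical and not part of any existing tool. It does not fix the problem; it only collects the node names that may be useful to the cluster admins.

```python
import re
from collections import Counter

# Patterns derived from the three log lines quoted in this report.
# Assumption: other Slurm/PMIx versions may format these messages differently.
REPORTING = re.compile(r"mpi/pmix_v3: \w+: (\S+) \[(\d+)\]:")
REJECTED = re.compile(r"Connection reject from (\d+)\((\S+?)\)")


def summarise_auth_errors(lines):
    """Count Munge failures and the nodes named in PMIx error lines."""
    munge_failures = 0
    reporting_nodes = Counter()  # nodes that logged a PMIx error
    rejected_peers = Counter()   # peers whose connection was rejected
    for line in lines:
        if "Munge decode failed" in line:
            munge_failures += 1
        match = REPORTING.search(line)
        if match:
            reporting_nodes[match.group(1)] += 1
        match = REJECTED.search(line)
        if match:
            rejected_peers[match.group(2)] += 1
    return {
        "munge_failures": munge_failures,
        "reporting_nodes": dict(reporting_nodes),
        "rejected_peers": dict(rejected_peers),
    }


# Usage with a log file (path is whatever compute log should be checked):
# with open("path/to/compute.log") as handle:
#     print(summarise_auth_errors(handle))
```

If the same node names show up across several failed years, that would point to a node-specific problem rather than something in the runscript.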
System (please complete the following information):
- Supercomputer: albedo
- Version: 6.53.2