GCHP abort with status -124 when writing checkpoints (pnc4) in parallel with multiple writers #519

@Daisy0419

Your name

Daisy Wang

Your affiliation

Washington University in Saint Louis

Please provide a clear and concise description of your question or discussion topic.

When running GCHP with Midrun_Checkpoint = ON and multiple writers (6), the model writes the first checkpoint file and then aborts with MAPL status = -124.
That first checkpoint file has the “correct” size but appears to be all zeros, whereas a single-writer run on the same setup produces a valid netCDF-4 checkpoint.

This looks like an invalid parallel write via Parallel-NetCDF (or a format mismatch), followed by a read that returns -124.
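
One way to confirm the all-zeros observation and the suspected format mismatch is to inspect the raw bytes of the checkpoint. The sketch below is only an illustration, not part of GCHP or MAPL: the function name `inspect_checkpoint` is hypothetical, and it assumes a netCDF-4 file carries the 8-byte HDF5 signature at offset 0, while classic and CDF-5 files start with `CDF` plus a version byte.

```python
# Signature at offset 0 of an HDF5-based (netCDF-4) file.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
# Classic, 64-bit offset, and CDF-5 (PnetCDF) headers.
CDF_MAGICS = (b"CDF\x01", b"CDF\x02", b"CDF\x05")


def inspect_checkpoint(path, chunk_size=1 << 20):
    """Return (format_guess, all_zero) for the file at `path`.

    `inspect_checkpoint` is a hypothetical helper written for this issue.
    An empty file is reported as ("unknown", True).
    """
    with open(path, "rb") as f:
        head = f.read(8)
        if head.startswith(HDF5_MAGIC):
            fmt = "netCDF-4/HDF5"
        elif head[:4] in CDF_MAGICS:
            fmt = "netCDF classic/CDF"
        else:
            fmt = "unknown"

        # Scan chunk by chunk and stop at the first non-zero byte.
        all_zero = head.count(0) == len(head)
        while all_zero:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            all_zero = chunk.count(0) == len(chunk)
    return fmt, all_zero
```

Calling `inspect_checkpoint("Restarts/gcchem_internal_checkpoint.20190701_0000z.nc4")` on the multi-writer output and on the single-writer output would show whether the two differ in header type or only in payload. A result of `("unknown", True)` would mean no header was ever written.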

The same problem appears with versions 14.5.2 and 14.6.3, on both AWS + Lustre (FSx) and compute1.

Question:

  • Has anyone encountered the same issue before? What are the possible causes?
  • Could you please suggest how I can further debug or fix the issue?

Environment (built from source):
MPI: Intel MPI 2021.13
Compilers: GCC/GFortran
HDF5 (parallel): 1.14.6
PnetCDF: 1.14.1
netCDF-C: 4.9.3
netCDF-Fortran: 4.6.2
PIO: 2.6.6
ESMF: 8.4.2

Error Log:

                                                             Mem/Swap Used (MB) at HISTMAPL_GenericInitialize=  5.277E+04  0.000E+00
                                                          Mem/Swap Used (MB) at EXTDATAMAPL_GenericInitialize=  5.246E+04  0.000E+00
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  5.128E+04  0.000E+00
 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20190701_0000z.nc4
pe=00072 FAIL at line=00325    NetCDF4_FileFormatter.F90                <status=-124>
pe=00072 FAIL at line=03841    NCIO.F90                                 <status=-124>
pe=00096 FAIL at line=00325    NetCDF4_FileFormatter.F90                <status=-124>
pe=00096 FAIL at line=03841    NCIO.F90                                 <status=-124>
pe=00072 FAIL at line=04081    NCIO.F90                                 <status=-124>
pe=00072 FAIL at line=05807    MAPL_Generic.F90                         <status=-124>
pe=00072 FAIL at line=02472    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=04081    NCIO.F90                                 <status=-124>
pe=00096 FAIL at line=05807    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=02472    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=02387    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=01807    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=02319    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=01807    MAPL_Generic.F90                         <status=-124>
pe=00096 FAIL at line=01343    MAPL_CapGridComp.F90                     <status=-124>
pe=00096 FAIL at line=01300    MAPL_CapGridComp.F90                     <status=-124>
pe=00096 FAIL at line=01260    MAPL_CapGridComp.F90                     <status=-124>
pe=00096 FAIL at line=00837    MAPL_CapGridComp.F90                     <status=-124>
pe=00096 FAIL at line=00977    MAPL_CapGridComp.F90                     <status=-124>
pe=00096 FAIL at line=00313    MAPL_Cap.F90                             <status=-124>
pe=00096 FAIL at line=00258    MAPL_Cap.F90                             <status=-124>
pe=00096 FAIL at line=00192    MAPL_Cap.F90                             <status=-124>
pe=00096 FAIL at line=00169    MAPL_Cap.F90                             <status=-124>
pe=00096 FAIL at line=00029    GCHPctm.F90                              <status=-124>
Abort(-1602421760) on node 96 (rank 96 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1602421760) - process 96
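
The traceback lines from different PEs are interleaved, which makes it hard to see where each rank first failed. The sketch below groups the `FAIL` lines by PE and keeps the first frame each one reported. It assumes the first line printed for a PE is its innermost frame, as the ordering in this log suggests; the function name `innermost_failures` is hypothetical and not part of MAPL.

```python
import re

# Matches lines such as:
# pe=00072 FAIL at line=00325    NetCDF4_FileFormatter.F90    <status=-124>
FAIL_RE = re.compile(
    r"pe=(\d+) FAIL at line=(\d+)\s+(\S+)\s+<status=(-?\d+)>"
)


def innermost_failures(log_text):
    """Map each PE number to the first (file, line, status) it reported."""
    first = {}
    for m in FAIL_RE.finditer(log_text):
        pe = int(m.group(1))
        if pe not in first:  # keep only the earliest frame per PE
            first[pe] = (m.group(3), int(m.group(2)), int(m.group(4)))
    return first
```

Applied to the log above, this reports `NetCDF4_FileFormatter.F90` line 325 with status -124 for both PE 72 and PE 96, so both ranks fail at the same call in the file formatter before the error propagates up through `NCIO.F90` and `MAPL_Generic.F90`.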

NetCDF Config:

nc-config --all

This netCDF 4.9.3 has been built with the following features: 

  --cc                -> mpicc
  --cflags            -> -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include
  --libs              -> -L/usr/local/lib -lnetcdf
  --static            -> -lpnetcdf -lhdf5_hl -lhdf5 -lm -lz -lbz2 -lzstd -lxml2 -lcurl 
  --has-dap           -> yes
  --has-dap2          -> yes
  --has-dap4          -> yes
  --has-nc2           -> yes
  --has-nc4           -> yes
  --has-hdf5          -> yes
  --has-hdf4          -> no
  --has-logging       -> no
  --has-pnetcdf       -> yes
  --has-szlib         -> no
  --has-cdf5          -> yes
  --has-parallel4     -> yes
  --has-parallel      -> yes
  --has-nczarr        -> yes
  --has-zstd          -> yes
  --has-benchmarks    -> no
  --has-multifilters  -> yes
  --has-stdfilters    -> bz2 deflate zstd
  --has-quantize      -> yes

  --prefix            -> /usr/local
  --includedir        -> /usr/local/include
  --libdir            -> /usr/local/lib
  --plugindir         -> /usr/local/hdf5/lib/plugin
  --plugin-searchpath -> /usr/local/hdf5/lib/plugin:/usr/local/hdf5/lib/plugin
  --version           -> netCDF 4.9.3
  --build-system      -> autotools

ESMF log:

20251023 210156.405 INFO             PET026 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.405 INFO             PET026 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20251023 210156.405 INFO             PET026 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20251023 210156.405 INFO             PET026 !!! FOR PRODUCTION RUNS, USE:                      !!!
20251023 210156.405 INFO             PET026 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20251023 210156.405 INFO             PET026 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.405 INFO             PET026 Running with ESMF Version   : 8.4.2
20251023 210156.405 INFO             PET026 ESMF library build date/time: "Oct 23 2025" "17:14:39"
20251023 210156.405 INFO             PET026 ESMF library build location : /tmp/esmf-8.4.2
20251023 210156.405 INFO             PET026 ESMF_COMM                   : intelmpi
20251023 210156.478 INFO             PET026 ESMF_MOAB                   : enabled
20251023 210156.478 INFO             PET026 ESMF_LAPACK                 : enabled
20251023 210156.478 INFO             PET026 ESMF_NETCDF                 : enabled
20251023 210156.478 INFO             PET026 ESMF_PNETCDF                : enabled
20251023 210156.478 INFO             PET026 ESMF_PIO                    : enabled
20251023 210156.478 INFO             PET026 ESMF_YAMLCPP                : enabled
20251023 210156.678 INFO             PET008 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.678 INFO             PET008 !!!        MOAB turned OFF            !!!
20251023 210156.678 INFO             PET008 !!! Meshes now created using native   !!!
20251023 210156.678 INFO             PET008 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210157.478 ERROR            PET008 ESMF_Clock.F90:887 ESMF_ClockGetAlarm() Failure  - Internal subroutine call returned Error
            PET051 ESMF_MOAB                   : enabled
20251023 210156.531 INFO             PET051 ESMF_LAPACK                 : enabled
20251023 210156.531 INFO             PET051 ESMF_NETCDF                 : enabled
20251023 210156.531 INFO             PET051 ESMF_PNETCDF                : enabled
20251023 210156.531 INFO             PET051 ESMF_PIO                    : enabled
20251023 210156.531 INFO             PET051 ESMF_YAMLCPP                : enabled
20251023 210156.679 INFO             PET065 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.679 INFO             PET065 !!!        MOAB turned OFF            !!!
20251023 210156.679 INFO             PET065 !!! Meshes now created using native   !!!
20251023 210156.679 INFO             PET065 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210157.479 ERROR            PET065 ESMF_Clock.F90:887 ESMF_ClockGetAlarm() Failure  - Internal subroutine call returned Error

Labels: category: Question, never stale