Description
Your name
Daisy Wang
Your affiliation
Washington University in Saint Louis
Please provide a clear and concise description of your question or discussion topic.
When running GCHP with Midrun_Checkpoint = ON and multiple writers (6), the model writes the first checkpoint file and then aborts with MAPL status = -124.
That first checkpoint file has the expected size but appears to contain only zeros, whereas a single-writer run on the same setup produces a valid netCDF-4 checkpoint.
This looks like an invalid parallel write via Parallel-NetCDF (or a format mismatch), followed by a read that returns -124.
The same problem occurs with both versions 14.5.2 and 14.6.3, on both AWS + Lustre (FSx) and compute1.
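For anyone who wants to reproduce the "all zeros" observation, here is a minimal sketch of one way to check a checkpoint file without going through the netCDF library. It is only an illustration: the function name `inspect_checkpoint` is hypothetical and not part of GCHP or MAPL, and it assumes the HDF5 signature sits at byte offset 0, which is the usual case for netCDF-4 files.

```python
# Magic numbers: the HDF5 signature (used by netCDF-4 files) and the
# netCDF classic family (CDF-1, CDF-2, CDF-5).
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
CLASSIC_MAGIC = {
    b"CDF\x01": "netCDF classic (CDF-1)",
    b"CDF\x02": "netCDF 64-bit offset (CDF-2)",
    b"CDF\x05": "netCDF 64-bit data (CDF-5)",
}


def inspect_checkpoint(path, chunk_size=1 << 20):
    """Return (format_guess, all_zero) for the file at `path`.

    Hypothetical helper for illustration only. The format guess looks
    at the first bytes of the file (assumes the signature is at offset 0);
    all_zero is True only if every byte in the file is 0.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        if header.startswith(HDF5_MAGIC):
            fmt = "HDF5 / netCDF-4"
        else:
            fmt = CLASSIC_MAGIC.get(header[:4], "unrecognized")

        # Rewind and scan the file in chunks so that large checkpoints
        # do not have to fit in memory.
        f.seek(0)
        all_zero = True
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if chunk.count(0) != len(chunk):
                all_zero = False
                break
    return fmt, all_zero
```

Calling `inspect_checkpoint("Restarts/gcchem_internal_checkpoint.20190701_0000z.nc4")` on the multi-writer output and on the single-writer output should make the difference explicit: a valid netCDF-4 file is expected to report the HDF5 signature and `all_zero=False`, while a zero-filled file reports `("unrecognized", True)`.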
Questions:
- Has anyone encountered the same issue before? What are the possible causes?
- Could you please suggest how I can further debug or fix the issue?
Environment (built from source):
MPI: Intel MPI 2021.13
Compilers: GCC/GFortran
HDF5 (parallel): 1.14.6
PnetCDF: 1.14.1
netCDF-C: 4.9.3
netCDF-Fortran: 4.6.2
PIO: 2.6.6
ESMF: 8.4.2
Error Log:
Mem/Swap Used (MB) at HISTMAPL_GenericInitialize= 5.277E+04 0.000E+00
Mem/Swap Used (MB) at EXTDATAMAPL_GenericInitialize= 5.246E+04 0.000E+00
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 5.128E+04 0.000E+00
Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20190701_0000z.nc4
pe=00072 FAIL at line=00325 NetCDF4_FileFormatter.F90 <status=-124>
pe=00072 FAIL at line=03841 NCIO.F90 <status=-124>
pe=00096 FAIL at line=00325 NetCDF4_FileFormatter.F90 <status=-124>
pe=00096 FAIL at line=03841 NCIO.F90 <status=-124>
pe=00072 FAIL at line=04081 NCIO.F90 <status=-124>
pe=00072 FAIL at line=05807 MAPL_Generic.F90 <status=-124>
pe=00072 FAIL at line=02472 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=04081 NCIO.F90 <status=-124>
pe=00096 FAIL at line=05807 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=02472 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=02387 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=01807 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=02319 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=01807 MAPL_Generic.F90 <status=-124>
pe=00096 FAIL at line=01343 MAPL_CapGridComp.F90 <status=-124>
pe=00096 FAIL at line=01300 MAPL_CapGridComp.F90 <status=-124>
pe=00096 FAIL at line=01260 MAPL_CapGridComp.F90 <status=-124>
pe=00096 FAIL at line=00837 MAPL_CapGridComp.F90 <status=-124>
pe=00096 FAIL at line=00977 MAPL_CapGridComp.F90 <status=-124>
pe=00096 FAIL at line=00313 MAPL_Cap.F90 <status=-124>
pe=00096 FAIL at line=00258 MAPL_Cap.F90 <status=-124>
pe=00096 FAIL at line=00192 MAPL_Cap.F90 <status=-124>
pe=00096 FAIL at line=00169 MAPL_Cap.F90 <status=-124>
pe=00096 FAIL at line=00029 GCHPctm.F90 <status=-124>
Abort(-1602421760) on node 96 (rank 96 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1602421760) - process 96
NetCDF Config:
nc-config --all
This netCDF 4.9.3 has been built with the following features:
--cc -> mpicc
--cflags -> -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include
--libs -> -L/usr/local/lib -lnetcdf
--static -> -lpnetcdf -lhdf5_hl -lhdf5 -lm -lz -lbz2 -lzstd -lxml2 -lcurl
--has-dap -> yes
--has-dap2 -> yes
--has-dap4 -> yes
--has-nc2 -> yes
--has-nc4 -> yes
--has-hdf5 -> yes
--has-hdf4 -> no
--has-logging -> no
--has-pnetcdf -> yes
--has-szlib -> no
--has-cdf5 -> yes
--has-parallel4 -> yes
--has-parallel -> yes
--has-nczarr -> yes
--has-zstd -> yes
--has-benchmarks -> no
--has-multifilters -> yes
--has-stdfilters -> bz2 deflate zstd
--has-quantize -> yes
--prefix -> /usr/local
--includedir -> /usr/local/include
--libdir -> /usr/local/lib
--plugindir -> /usr/local/hdf5/lib/plugin
--plugin-searchpath -> /usr/local/hdf5/lib/plugin:/usr/local/hdf5/lib/plugin
--version -> netCDF 4.9.3
--build-system -> autotools
ESMF log:
20251023 210156.405 INFO PET026 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.405 INFO PET026 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20251023 210156.405 INFO PET026 !!! THIS MAY CAUSE SLOWDOWN IN PERFORMANCE !!!
20251023 210156.405 INFO PET026 !!! FOR PRODUCTION RUNS, USE: !!!
20251023 210156.405 INFO PET026 !!! ESMF_LOGKIND_Multi_On_Error !!!
20251023 210156.405 INFO PET026 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.405 INFO PET026 Running with ESMF Version : 8.4.2
20251023 210156.405 INFO PET026 ESMF library build date/time: "Oct 23 2025" "17:14:39"
20251023 210156.405 INFO PET026 ESMF library build location : /tmp/esmf-8.4.2
20251023 210156.405 INFO PET026 ESMF_COMM : intelmpi
20251023 210156.478 INFO PET026 ESMF_MOAB : enabled
20251023 210156.478 INFO PET026 ESMF_LAPACK : enabled
20251023 210156.478 INFO PET026 ESMF_NETCDF : enabled
20251023 210156.478 INFO PET026 ESMF_PNETCDF : enabled
20251023 210156.478 INFO PET026 ESMF_PIO : enabled
20251023 210156.478 INFO PET026 ESMF_YAMLCPP : enabled
20251023 210156.678 INFO PET008 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.678 INFO PET008 !!! MOAB turned OFF !!!
20251023 210156.678 INFO PET008 !!! Meshes now created using native !!!
20251023 210156.678 INFO PET008 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210157.478 ERROR PET008 ESMF_Clock.F90:887 ESMF_ClockGetAlarm() Failure - Internal subroutine call returned Error
PET051 ESMF_MOAB : enabled
20251023 210156.531 INFO PET051 ESMF_LAPACK : enabled
20251023 210156.531 INFO PET051 ESMF_NETCDF : enabled
20251023 210156.531 INFO PET051 ESMF_PNETCDF : enabled
20251023 210156.531 INFO PET051 ESMF_PIO : enabled
20251023 210156.531 INFO PET051 ESMF_YAMLCPP : enabled
20251023 210156.679 INFO PET065 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210156.679 INFO PET065 !!! MOAB turned OFF !!!
20251023 210156.679 INFO PET065 !!! Meshes now created using native !!!
20251023 210156.679 INFO PET065 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20251023 210157.479 ERROR PET065 ESMF_Clock.F90:887 ESMF_ClockGetAlarm() Failure - Internal subroutine call returned Error