Hi,
I am getting crashes with large 3D simulations on Frontier. The crashes occur in an MPI_Waitall call:
MPICH ERROR [Rank 8281] [job id 1410467.0] [Wed Aug 23 06:05:55 2023] [frontier01982] - Abort(136982671) (rank 8281 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(339)..............: MPI_Waitall(count=26, req_array=0x1006e9b0, status_array=0xfefafb0) failed
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(193)........:
MPIDI_Progress_test(89)........:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - UNDELIVERABLE)
The crashes do not all occur at the same time, and they do not produce backtraces.
Attached are the error output of the simulations, plus the input file and submit script for one of these cases, so the crash can be reproduced: input.txt, out.txt, batch.txt
Here are the modules used for compilation:
warpx_profile.txt