Crash with large 3D simulations on Frontier #4234

Open
@tmsclark2

Hi,
I am getting crashes with large 3D simulations on Frontier. The crash occurs in an MPI_Waitall call:

MPICH ERROR [Rank 8281] [job id 1410467.0] [Wed Aug 23 06:05:55 2023] [frontier01982] - Abort(136982671) (rank 8281 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(339)..............: MPI_Waitall(count=26, req_array=0x1006e9b0, status_array=0xfefafb0) failed
MPIR_Waitall(167)..............: 
MPIR_Waitall_impl(51)..........: 
MPID_Progress_wait(193)........: 
MPIDI_Progress_test(89)........: 
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - UNDELIVERABLE)

The crashes do not occur at the same point in each run, and they do not produce backtraces.
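
Since the failures happen at different times, it may help to check whether the aborting ranks cluster on particular nodes. Below is a minimal sketch that pulls the rank, job id, timestamp, and host name out of abort headers like the one above. It assumes every abort header in the output follows the same layout as that line; `parse_mpich_abort` and `hosts_by_failure_count` are hypothetical helper names, not part of WarpX or MPICH.

```python
import re
from collections import Counter

# Pattern for the MPICH abort header as it appears in the log above.
# Assumption: other abort headers in the output use the same layout.
ABORT_RE = re.compile(
    r"MPICH ERROR \[Rank (?P<rank>\d+)\] "
    r"\[job id (?P<job>[\d.]+)\] "
    r"\[(?P<time>[^\]]+)\] "
    r"\[(?P<host>[^\]]+)\] - "
    r"Abort\((?P<code>\d+)\)"
)


def parse_mpich_abort(line):
    """Return a dict with rank, job, time, host and code, or None if no match."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["rank"] = int(fields["rank"])
    fields["code"] = int(fields["code"])
    return fields


def hosts_by_failure_count(lines):
    """Count abort headers per host name across an iterable of log lines."""
    parsed = (parse_mpich_abort(line) for line in lines)
    return Counter(item["host"] for item in parsed if item is not None)


# The header line from this report, used as sample input.
sample = (
    "MPICH ERROR [Rank 8281] [job id 1410467.0] "
    "[Wed Aug 23 06:05:55 2023] [frontier01982] - "
    "Abort(136982671) (rank 8281 in comm 0): "
    "Fatal error in PMPI_Waitall: Other MPI error, error stack:"
)
info = parse_mpich_abort(sample)
print(info["rank"], info["host"])
```

Running `hosts_by_failure_count` over the lines of the error output from several jobs would show whether the same nodes keep appearing, which would point toward a hardware or network problem rather than the simulation setup.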

Attached are the error output of the simulations, plus the input file and submission script for one of these cases, so that it can be reproduced: input.txt out.txt batch.txt

The modules used for compilation are listed in:
warpx_profile.txt

Metadata

Labels

backend: hip (Specific to ROCm execution (GPUs))
bug (Something isn't working)
machine / system (Machine or system-specific issue)
