Crash with large 3D simulations on LUMI #4236

Open
@tmsclark2

Description

Hi,
I am getting crashes with large 3D simulations on LUMI. The crash involves an Allgather routine (PMI_Allgather, according to the error stack), called during MPI initialization:

MPICH ERROR [Rank 0] [job id 4292261.0] [Thu Aug  3 00:01:58 2023] [nid006593] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(805):
MPIDU_bc_table_create(204)..:  PMI_Allgather failed: -1

The crash happens before WarpX starts and does not produce a backtrace.

Here are the error output of the simulation and the submit file: warpx-4292261.txt, batch.txt

Here are the modules used for compilation: Recipe_warpx.txt

Metadata

Labels

backend: hip - Specific to ROCm execution (GPUs)
bug - Something isn't working
machine / system - Machine or system-specific issue
