MPI error with custom ElementDeletionModifier in parallel: MPI_Allreduce/MPI_Bcast failed during mesh initialization #32840
Unanswered
CJL-CFDCSD asked this question in Q&A · Modules: General
Replies: 2 comments, 1 reply
When running my MOOSE-based application (roshan-opt) in parallel with mpiexec -n 4, I encounter MPI collective communication errors during mesh initialization. The same input file runs successfully with mpiexec -n 1. The failure occurs during ElementDeletionModifier initialization:
Generic Warning: In ../Parallel/MPI/vtkMPICommunicator.cxx, line 72
MPI had an error
Other MPI error, error stack:
PMPI_Allreduce(450)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x59cde1bbfaa0, count=1, datatype=MPI_DOUBLE, op=MPI_MAX, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(436)...........:
MPIR_Allreduce_impl(293)......:
MPIR_Allreduce_intra_auto(178):
MPIR_Allreduce_intra_auto(84).:
MPIR_Bcast_impl(310)..........:
MPIR_Bcast_intra_auto(223)....:
MPIR_Bcast_intra_binomial(182): Failure during collective
application called MPI_Abort(MPI_COMM_WORLD, 542712079) - process 0
Generic Warning: In ../Parallel/MPI/vtkMPICommunicator.cxx, line 72
MPI had an error
Other MPI error, error stack:
PMPI_Allreduce(450)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x568c494447f0, count=1, datatype=MPI_DOUBLE, op=MPI_MAX, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(436)...........:
MPIR_Allreduce_impl(293)......:
MPIR_Allreduce_intra_auto(178):
MPIR_Allreduce_intra_auto(84).:
MPIR_Bcast_impl(310)..........:
MPIR_Bcast_intra_auto(223)....:
MPIR_Bcast_intra_binomial(182): Failure during collective
application called MPI_Abort(MPI_COMM_WORLD, 876159247) - process 1
Generic Warning: In ../Parallel/MPI/vtkMPICommunicator.cxx, line 72
MPI had an error
Message truncated, error stack:
PMPI_Allreduce(450).....................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7ffe309da1fc, count=1, datatype=MPI_UNSIGNED, op=MPI_MAX, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(436).....................:
MPIR_Allreduce_impl(293)................:
MPIR_Allreduce_intra_auto(178)..........:
MPIR_Allreduce_intra_auto(84)...........:
MPIR_Bcast_impl(310)....................:
MPIR_Bcast_intra_auto(223)..............:
MPIR_Bcast_intra_binomial(112)..........:
MPIDI_CH3_
application called MPI_Abort(MPI_COMM_WORLD, 409542926) - process 2
Generic Warning: In ../Parallel/MPI/vtkMPICommunicator.cxx, line 72
MPI had an error
Other MPI error, error stack:
PMPI_Allreduce(450)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x61abb9596b70, count=1, datatype=MPI_DOUBLE, op=MPI_MAX, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(436)...........:
MPIR_Allreduce_impl(293)......:
MPIR_Allreduce_intra_auto(178):
MPIR_Allreduce_intra_auto(84).:
MPIR_Bcast_impl(310)..........:
MPIR_Bcast_intra_auto(223)....:
MPIR_Bcast_intra_binomial(123): message sizes do not match across processes in the collective routine: Received 4 but expected
application called MPI_Abort(MPI_COMM_WORLD, 205070607) - process 3
The error happens during the initialization of my custom ElementDeletionModifier, CCANYAddSide (a MeshModifier that deletes elements and potentially adds sides).
I suspect the issue is related to parallel mesh consistency: the modifier may modify the mesh differently across MPI ranks, causing subsequent collective operations (such as MPI_Allreduce or MPI_Bcast on mesh data) to fail due to mismatched message sizes or inconsistent states.
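To make the suspected failure mode concrete, here is a minimal sketch (hypothetical code, not my actual modifier; it assumes MOOSE's ElementDeleterBase MeshModifier and its shouldDelete() hook): a criterion that depends on rank-local state deletes different global element sets on different ranks, while a purely geometric criterion stays consistent.

```cpp
#include "ElementDeleterBase.h"
#include "libmesh/elem.h"

// Hypothetical sketch of a deletion criterion. The decision must evaluate
// identically on every rank that can see a given element; anything
// rank-dependent (local counters, processor ids, unsynchronized data) makes
// the per-rank meshes diverge, after which collective calls fail with
// mismatched message sizes like the ones above.
class ElementDeleterSketch : public ElementDeleterBase
{
public:
  ElementDeleterSketch(const InputParameters & parameters)
    : ElementDeleterBase(parameters)
  {
  }

protected:
  virtual bool shouldDelete(const Elem * elem) override
  {
    // BAD (illustration only): depends on how many elements this rank has
    // already deleted, so ranks disagree about the global deletion set.
    //   return _local_deletion_count < _local_budget;

    // OK: a purely geometric test evaluates identically on every rank.
    return elem->centroid()(0) > 0.5; // assumption: cull elements with x > 0.5
  }
};
```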
1. Are there specific requirements for ensuring that MeshModifier operations are consistent across all MPI ranks in MOOSE?
2. Should ElementDeletionModifier (or similar modifiers) run only on rank 0, with the result then broadcast, or does MOOSE handle synchronization automatically?
3. What is the best practice for debugging MPI mesh-synchronization issues in MOOSE?
Environment:
- OS: WSL2 (Ubuntu 20.04.6 LTS)
- MOOSE: installed via conda environment
- MPI: MPICH (bundled with the conda environment)
- Application: custom XXX-opt with custom mesh modifiers
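Reply:

What's the minimal number of MPI ranks it fails with? Will it fail with 3? 2? To debug across MPI ranks, first build with debug. After that, you can use a few convenient command-line arguments and run with either of them; in either case, you should be able to get a stack trace to help us see what each rank is doing. https://mooseframework.inl.gov/application_development/debugging.html is a useful reference as well.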
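For reference, the debug build and the two run options the reply mentions typically look like the following in a standard MOOSE workflow (a sketch: the directory, the roshan-dbg executable name, and the input file name are assumptions, and --start-in-debugger / --stop-for-debugger are the options documented on the debugging page linked above):

```sh
# Build a debug executable of the application (assumed paths/names)
cd ~/projects/roshan
METHOD=dbg make -j4          # produces ./roshan-dbg

# Option 1: launch every rank inside a debugger
mpiexec -n 4 ./roshan-dbg -i my_input.i --start-in-debugger=gdb

# Option 2: pause each rank at startup so a debugger can be attached manually
mpiexec -n 4 ./roshan-dbg -i my_input.i --stop-for-debugger=30
```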