Segfault on GPU machine #1315

tianninglyu · 2025-09-11T06:56:09Z

tianninglyu
Sep 11, 2025

Hi all,

I am trying to run Quokka on an NVIDIA GPU machine, but it segfaults.

My script is:

mpirun -np 8 \
--mca pml ucx -x UCX_TLS=^cma \
--bind-to core --map-by core \
./run.sh ../../build/src/problems/RadhydroShell/test_radhydro3d_shell ../../inputs/radhydro_shell_512.in > test_RadhydroShell.log 2>&1

where run.sh is a wrapper:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"

And it reports a segfault:

[node074:1371617] SET UCX_TLS=^cma
--------------------------------------------------------------------------
PMIx was unable to find a usable compression library
on the system. We will therefore be unable to compress
large data streams. This may result in longer-than-normal
startup times and larger memory footprints. We will
continue, but strongly recommend installing zlib or
a comparable compression library for better user experience.

You can suppress this warning by adding "pcompress_base_silence_warning=1"
to your PMIx MCA default parameter file, or by adding
"PMIX_MCA_pcompress_base_silence_warning=1" to your environment.
--------------------------------------------------------------------------
Initializing AMReX (25.08-2-ge6dfad939fc5)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 8 devices.
AMReX (25.08-2-ge6dfad939fc5) initialized
Segfault
Segfault
Segfault
Segfault
Segfault
Segfault
Segfault
Segfault
See Backtrace.7 file for details
See Backtrace.6 file for details
See Backtrace.0 file for details
See Backtrace.2 file for details
See Backtrace.1 file for details
See Backtrace.5 file for details
See Backtrace.4 file for details
See Backtrace.3 file for details
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
  Proc: [[42172,1],7]
  Errorcode: 11

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
prterun has exited due to process rank 7 with PID 0 on node node074 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------

I can confirm that I enabled CUDA when building Open MPI. How can I solve this problem?

Thanks!

tianninglyu · 2025-09-11T06:59:10Z

dosubot[bot]
bot Sep 11, 2025

A few things can cause segfaults like this in quokka GPU runs. The most common are out-of-bounds array access, accessing host variables from GPU code, or GPU memory/race condition bugs. These often only show up in Release mode or with multiple MPI ranks. Here’s what you can try:

Set CUDA_LAUNCH_BLOCKING=1 in your environment to force synchronous GPU execution. This often surfaces the real error location and can help catch race conditions or async bugs [debugging tips].
Try running with just 1 MPI rank and see if the segfault still occurs. The RadhydroShell-GPU regression test is configured to use only 1 process, so multi-GPU runs may expose less-tested code paths [test config].
If you can, rebuild quokka without GPU support but with -DCMAKE_BUILD_TYPE=Debug or with AddressSanitizer (-DENABLE_ASAN=ON). Run the same input on CPU to catch out-of-bounds or memory bugs before debugging on GPU [debugging guide].
Recent PRs fixed GPU memory management and race conditions, especially in particle and MHD code. Make sure your code is up to date, as older versions may have bugs that cause segfaults in multi-GPU runs [PR 1273, PR 1263].
If you’re using particles, a recent fix improved MPI communication and device-to-device copying, which could affect stability [PR 1297].
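
The first two suggestions can be combined into one small launcher. The sketch below is only an illustration: `debug_run` and the `DRY_RUN` switch are hypothetical names, not part of Quokka or Open MPI. It assumes Open MPI's `mpirun`, whose `-x` option (already used in the original command) exports an environment variable to every rank.

```shell
#!/bin/bash
# Hypothetical helper: launch a run with synchronous CUDA kernel execution.
# Usage: debug_run <num_ranks> <command> [args...]
# With DRY_RUN=1 the command line is printed instead of executed.
debug_run() {
    local ranks="$1"
    shift
    # Rebuild the argument list as the full mpirun command line.
    set -- mpirun -np "$ranks" -x CUDA_LAUNCH_BLOCKING=1 "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (paths taken from the original post):
# debug_run 1 ./run.sh ../../build/src/problems/RadhydroShell/test_radhydro3d_shell \
#     ../../inputs/radhydro_shell_512.in
```

Starting with one rank and then increasing the count helps separate single-GPU problems from problems that only appear once ranks exchange data.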

If the above doesn’t help, please share the contents of one of the Backtrace.* files—these usually pinpoint the failing function or code line.

Let me know what you find after trying these steps!

2 replies

tianninglyu Sep 11, 2025
Author

Hi @dosu,

With just 1 MPI rank there is no segfault, only an out-of-memory error.
Setting CUDA_LAUNCH_BLOCKING=1 doesn't help.

dosubot[bot] bot Sep 11, 2025

The out-of-memory error with 1 MPI rank means your problem size (likely 512³) exceeds the available GPU memory, which is expected for large domains on a single device. The segfault with multiple ranks is likely a separate issue—recent fixes addressed GPU memory management and race conditions in multi-rank runs, especially around temporary vectors and MPI communication [PR 1273].

To pinpoint the segfault, could you share the contents of one of the Backtrace.* files from the failed multi-rank run? That will show the exact function and code line causing the crash. Also, confirming your GPU model and memory size, and whether your codebase includes the latest fixes (PRs #1273, #1263, #1297), would help narrow things down.

For now, if you want to test stability, try shrinking the problem size so it fits in GPU memory for a single rank, or rebuild Quokka with AddressSanitizer enabled for CPU runs to catch out-of-bounds errors before debugging on GPU [debugging guide].
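
For a rough sense of scale, the arithmetic behind the out-of-memory error can be sketched as follows. The helper names are hypothetical, and the number of fields Quokka allocates for this problem is not stated in this thread, so the `nfields` argument is a placeholder. Ghost cells and temporary buffers are ignored, which means the real footprint is larger.

```shell
#!/bin/bash
# Bytes for ONE double-precision (8-byte) field on an n^3 grid, no ghost cells.
field_bytes() {
    echo $(( $1 * $1 * $1 * 8 ))
}

# Whole GiB (rounded down) for `nfields` such fields on an n^3 grid.
# Usage: fields_gib <n> <nfields>   (nfields is an assumed value)
fields_gib() {
    echo $(( $(field_bytes "$1") * $2 / 1073741824 ))
}

# field_bytes 512    -> 1073741824   (exactly 1 GiB per field)
# fields_gib 512 10  -> 10
```

Every stored field on a 512³ grid costs 1 GiB, so a state with many components plus work arrays can exceed a single GPU's memory, while the same domain split across 8 ranks needs roughly one eighth of that per device.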

BenWibking · 2025-09-11T17:26:12Z

BenWibking
Sep 11, 2025
Maintainer

Something is wrong with the GPU-aware MPI communication in your setup. You should ask your sysadmin about this. This is not a Quokka problem.

3 replies

tianninglyu Sep 12, 2025
Author

Hi Ben, thank you for your reply. The supercomputer I use was delivered as bare metal, so I installed all the dependencies myself. I built openmpi-5.0.8 with UCX and CUDA enabled at configure time and did not set any other options.

BenWibking Sep 12, 2025
Maintainer

Ok, there are several additional configuration steps needed to get GPU-aware MPI working correctly. The key step is to compile and install the gdrcopy kernel module: https://github.com/NVIDIA/gdrcopy and then recompile UCX with gdrcopy support.

Then you should test that it works using NVIDIA's test program here: https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/cuda-aware-mpi-example/src

If that doesn't work, you should ask for help in the OpenMPI and/or UCX GitHub repos.
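
Before running the NVIDIA test program, it may help to confirm that each layer reports the expected support. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and each one simply searches the output of a standard tool (`ompi_info`, `ucx_info`, `lsmod`) piped to it on stdin. It assumes the gdrcopy kernel module is named `gdrdrv` and that UCX lists the transport as `gdr_copy`.

```shell
#!/bin/bash
# Hypothetical helpers; each reads a tool's output on stdin.

# Succeeds if `ompi_info --parsable --all` reports CUDA support.
ompi_has_cuda() {
    grep -qF 'mpi_built_with_cuda_support:value:true'
}

# Succeeds if `ucx_info -d` lists the given transport,
# e.g. cuda_copy, cuda_ipc, or gdr_copy.
ucx_has_transport() {
    grep -qw -- "$1"
}

# Succeeds if `lsmod` lists the given kernel module (e.g. gdrdrv).
kernel_module_listed() {
    grep -qw -- "$1"
}

# Example usage on the GPU node:
# ompi_info --parsable --all | ompi_has_cuda   && echo "Open MPI: CUDA support"
# ucx_info -d | ucx_has_transport gdr_copy     && echo "UCX: gdr_copy transport"
# lsmod | kernel_module_listed gdrdrv          && echo "gdrdrv module loaded"
```

If any of the three checks fails, that layer is the likely place to start rebuilding or reconfiguring.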

Answer selected by tianninglyu

tianninglyu Sep 13, 2025
Author

Hi Ben, thank you for your guidance! The code now runs successfully.
