Segfault on GPU machine #1315
-
Hi all, I am trying to run Quokka on an NVIDIA GPU machine, but I get a segfault. My script is:
where run.sh is a wrapper:
And it reports a segfault:
I can confirm that I enabled CUDA when installing OpenMPI. How can I solve the problem? Thanks!
Replies: 2 comments 5 replies
-
A few things can cause segfaults like this in Quokka GPU runs. The most common are out-of-bounds array accesses, accessing host variables from GPU code, and GPU memory or race-condition bugs. These often show up only in Release mode or with multiple MPI ranks. Here's what you can try:
If the above doesn't help, please share the contents of one of the Backtrace.* files. These usually pinpoint the failing function or line of code. Let me know what you find after trying these steps!
-
Something is wrong with the GPU-aware MPI communication in your setup. You should ask your sysadmin about this. This is not a Quokka problem.
Ok, there are several additional configuration steps needed to get GPU-aware MPI working correctly. The key step is to compile and install the gdrcopy kernel module: https://github.com/NVIDIA/gdrcopy and then recompile UCX with gdrcopy support.
Then you should test that it works using NVIDIA's test program here: https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/cuda-aware-mpi-example/src
If that doesn't work, you should ask for help in the OpenMPI and/or UCX GitHub repos.
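Before building the test program, a quick sanity check of the installed tools may help narrow things down. The following is only a minimal sketch, assuming `ompi_info` and `ucx_info` are on your PATH; the helper names are hypothetical, and the check relies on the `mpi_built_with_cuda_support` parameter reported by `ompi_info --parsable --all` and on `ucx_info -d` listing a `gdr_copy` transport when gdrcopy support is compiled in. A positive result only indicates build-time support, not that GPU-aware communication works at runtime.

```python
import shutil
import subprocess
from typing import List, Optional

# Key printed by `ompi_info --parsable --all` for Open MPI's CUDA build flag.
CUDA_KEY = "mpi_built_with_cuda_support:value:"


def openmpi_built_with_cuda(ompi_info_output: str) -> Optional[bool]:
    """Return True/False if the CUDA build flag is found, None if it is absent."""
    for line in ompi_info_output.splitlines():
        if CUDA_KEY in line:
            # The value is the last colon-separated field, e.g. "...:value:true".
            return line.strip().rsplit(":", 1)[-1].lower() == "true"
    return None


def ucx_lists_gdrcopy(ucx_info_output: str) -> bool:
    """Return True if any line of `ucx_info -d` output mentions gdr_copy."""
    return any("gdr_copy" in line.lower() for line in ucx_info_output.splitlines())


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout, or '' if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    ompi_out = run(["ompi_info", "--parsable", "--all"])
    ucx_out = run(["ucx_info", "-d"])
    print("Open MPI built with CUDA:", openmpi_built_with_cuda(ompi_out))
    print("UCX lists gdr_copy:", ucx_lists_gdrcopy(ucx_out))
```

If the first line prints `None` or `False`, or the second prints `False`, the Open MPI or UCX build is the likely place to look before testing further.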