nvbandwidth: use Ubuntu 24 to bump OpenMPI (try to fix MPI shutdown bugs) #51

Merged
jgehrcke merged 1 commit into NVIDIA:main from jgehrcke:jp/nvb-bump-openmpi
Oct 9, 2025

@jgehrcke
Contributor

@jgehrcke jgehrcke commented Oct 9, 2025

We sometimes see this during MPI launcher shutdown:

mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
...
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
[ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xe2a0e92109d0]
[ 1] /lib/aarch64-linux-gnu/libpmix.so.2(PMIx_Finalize+0x53c)[0xe2a0e6285be0]
[ 2] /usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext3x.so(ext3x_client_finalize+0x37c)[0xe2a0e6407bcc]
[ 3] /usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi3/mca_ess_hnp.so(+0x511c)[0xe2a0e8b3511c]
[ 4] /lib/aarch64-linux-gnu/libevent_core-2.1.so.7(+0x1e4c4)[0xe2a0e901e4c4]
[ 5] /lib/aarch64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x444)[0xe2a0e90202f8]
[ 6] mpirun(+0xf4c)[0xba62c26e0f4c]

This is a segmentation fault in the shutdown procedure, preceded by a hang in an earlier shutdown attempt (cf. "abort is already in progress").

Said hang prevents a failover from happening in the fault injection testing we are doing -- which is how we get to see this issue regularly now, even if it is nominally rare.

Rare means: for example, ~40 out of ~1000 runs affected:

nvidia@gb-nvl-043-compute02:~/jp/k8s-dra-driver-gpu/nvbtests-02$ grep -inR 'segmentation fault' | wc -l
43
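Note that `grep ... | wc -l` counts matching lines, so a run whose log mentions the segfault more than once is counted more than once. A minimal sketch for counting affected runs instead, assuming one log file per run (the thread does not state the log layout; the function names are hypothetical):

```shell
# Count log files that contain the segfault marker at least once.
# Assumption: each run writes exactly one log file under the given directory.
count_affected_runs() {
    # -l prints each matching file name once; -i ignores case; -R recurses.
    grep -ilR 'segmentation fault' "$1" | wc -l | tr -d ' '
}

# Count all regular files under the directory (one per run, by assumption).
count_total_runs() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

Usage, from the directory holding the logs: `echo "$(count_affected_runs .) of $(count_total_runs .) runs affected"`.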

I did research this segfault quite a bit; there is a lot to be said about it, and to be discussed elsewhere.

Notably, the crash is in PMIx_Finalize().

I found open-mpi/ompi#10117, which looks very similar: the stack traces in there look highly related.

Both show a crash occurring in libpmix.so.2, and both involve libevent in a way that makes it obvious that this is in the event-loop-controlled shutdown sequence of OpenMPI.

We are currently using (from nvbandwidth logs): Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2.

See https://www.open-mpi.org/software/ompi/v4.1/ -- that's from Nov 2021.

We use Ubuntu 22.04 as build env: https://github.com/NVIDIA/k8s-samples/blob/6dc12f17159fc101dd5603dbf6a8dc978a2614b3/deployments/container/nvbandwidth/Dockerfile.

That's indeed stuck on 4.1.2: https://launchpad.net/ubuntu/jammy/+source/openmpi

On Ubuntu 24.04 we can at least get 4.1.6: https://launchpad.net/ubuntu/noble/+source/openmpi -- hence this patch. It's worth a try.
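To confirm which Open MPI release actually ends up in a rebuilt image, a small check along these lines could be run inside the container. This is a sketch, not part of the patch: the function names are hypothetical, it relies on GNU `sort -V`, and it assumes the first line of `mpirun --version` ends in the version number (as in `mpirun (Open MPI) 4.1.2`).

```shell
# Succeed if version $1 is at least version $2.
# Assumption: GNU coreutils `sort -V` (version sort) is available.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version from the banner text printed by `mpirun --version`.
# Assumption: the version is the last field of the first line.
ompi_version_from_banner() {
    printf '%s\n' "$1" | head -n 1 | awk '{print $NF}'
}
```

Usage: `version_at_least "$(ompi_version_from_banner "$(mpirun --version)")" 4.1.6 || echo "OpenMPI older than 4.1.6"`.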

This patch might be the actual fix for the problem:
SiPearl/ompi@f94d66e

But apparently it made it only into OpenMPI 5.x.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Contributor

@klueska klueska left a comment

Ideally we would hook this into dependabot like the other components

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

nit: Could you rename the commit/PR to

"""
Bump nvcr.io/nvidia/cuda:13.0.0-base-ubuntu22.04 to 13.0.0-base-ubuntu24.04 at deployments/container/nvbandwidth/Dockerfile
"""

As this will be more in line with Dependabot bump PRs

@jgehrcke
Contributor Author

jgehrcke commented Oct 9, 2025

will be more in line with Dependabot bump PRs

:) I am not dependabot, and this is not a routine bump, but a patch trying to fix a bug.

You can force-push into my branch I think -- I choose not to care, that's ok with me. But for the moment I'll not touch this branch again myself.

@jgehrcke jgehrcke merged commit 70c40e6 into NVIDIA:main Oct 9, 2025
9 checks passed
@jgehrcke
Contributor Author

This broke something. Worker container startup error:

$ kubectl logs test-failover-job-worker-0
/home/mpiuser/.sshd_config: Permission denied

Maybe the user ID changed or something else. Probably need to review:

COPY --chown=mpiuser ./deployments/container/nvbandwidth/sshd_config .sshd_config
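One way to test the user ID hypothesis is to compare the numeric owner of the copied file with the UID the container process runs as. The sketch below uses hypothetical helper names and GNU `stat` syntax. Whether a UID mismatch is the actual cause is not confirmed in this thread. One possibility worth checking: Ubuntu 24.04 base images ship a default `ubuntu` user with UID 1000, so a user created later in the Dockerfile may end up with a different UID than it had on 22.04.

```shell
# Print the numeric owner UID of a file.
# Assumption: GNU coreutils `stat` (the -c format option).
owner_uid() {
    stat -c '%u' "$1"
}

# Succeed if the file is owned by the given numeric UID.
owned_by_uid() {
    [ "$(owner_uid "$1")" = "$2" ]
}
```

Usage, inside the worker image: `owned_by_uid /home/mpiuser/.sshd_config "$(id -u)" || echo "owner UID $(owner_uid /home/mpiuser/.sshd_config) != process UID $(id -u)"`.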
