nvbandwidth: use Ubuntu 24 to bump OpenMPI (try to fix MPI shutdown bugs) #51
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
klueska
left a comment
Ideally we would hook this into dependabot like the other components
ArangoGutierrez
left a comment
nit: Could you rename the commit/PR to
"""
Bump nvcr.io/nvidia/cuda:13.0.0-base-ubuntu22.04 to 13.0.0-base-ubuntu24.04 at deployments/container/nvbandwidth/Dockerfile
"""
As this will be more in line with Dependabot bump PRs
:) I am not dependabot, and this is not a routine bump, but a patch trying to fix a bug. You can force-push into my branch I think -- I choose not to care, that's ok with me. But for the moment I'll not touch this branch again myself.
This broke something. Worker container startup error: Maybe the user ID changed or something else. Probably need to review
We sometimes see this during MPI launcher shutdown:
This is a segmentation fault in the shutdown procedure, preceded by a hang in a previously triggered shutdown (cf. "abort is already in progress").
Said hang prevents a failover from happening in the fault injection testing we are doing -- which is how we get to see this issue regularly now, even if it is nominally rare.
Rare means: for example, ~40 out of ~1000 runs affected:
I did research this segfault quite a bit; there is a lot to be said about it, but that is best discussed elsewhere.
Notably, the crash is in PMIx_Finalize(). I found that open-mpi/ompi#10117 looks very similar; the stack traces in there look highly related.
Both show a crash occurring in libpmix.so.2, and both involve libevent in a way that makes it obvious that this is in the event-loop-controlled shutdown sequence of OpenMPI.
We are currently using (from nvbandwidth logs):
Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2
See https://www.open-mpi.org/software/ompi/v4.1/ -- that's from Nov 2021.
We use Ubuntu 22.04 as build env: https://github.com/NVIDIA/k8s-samples/blob/6dc12f17159fc101dd5603dbf6a8dc978a2614b3/deployments/container/nvbandwidth/Dockerfile.
That's indeed stuck on 4.1.2: https://launchpad.net/ubuntu/jammy/+source/openmpi
On Ubuntu 24.04 we can at least get 4.1.6: https://launchpad.net/ubuntu/noble/+source/openmpi -- hence this patch. It's worth a try.
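To confirm after the image rebuild that the newer Open MPI is actually in use, the version can be read back from the same nvbandwidth log line. This is only a minimal sketch: it assumes the log line keeps the form quoted above, and the helper name parse_ompi_version is hypothetical (it is not part of nvbandwidth or Open MPI).

```python
import re

# Hypothetical helper (not part of nvbandwidth or Open MPI): extract the
# version triple from a log line of the form quoted above, for example
# "Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2".
_OMPI_VERSION_RE = re.compile(r"Open MPI v(\d+)\.(\d+)\.(\d+)")


def parse_ompi_version(log_line):
    """Return (major, minor, patch) as ints, or None if no version is found."""
    match = _OMPI_VERSION_RE.search(log_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


line = "Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2"
version = parse_ompi_version(line)
print(version)               # (4, 1, 2)
print(version >= (4, 1, 6))  # False: still the Ubuntu 22.04 package
```

Comparing tuples of ints avoids the pitfalls of comparing version strings lexicographically (for example "4.1.10" vs "4.1.6").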
The following upstream commit might be the actual fix for the problem:
SiPearl/ompi@f94d66e
But apparently it made it only into OpenMPI 5.x.