-
Notifications
You must be signed in to change notification settings - Fork 10
Fix the segfault when using Open MPI 5 to bootstrap UCX #333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #333 +/- ##
========================================
Coverage 27.15% 27.15%
========================================
Files 190 190
Lines 39174 39174
Branches 14289 14221 -68
========================================
Hits 10638 10638
+ Misses 27681 27436 -245
- Partials 855 1100 +245 ☔ View full report in Codecov by Sentry. |
src/realm/ucx/bootstrap/bootstrap.cc
Outdated
case BOOTSTRAP_PLUGIN: | ||
status = bootstrap_loader_init(config->plugin_name, NULL, handle); | ||
// for safety, use MPI as the bootstrap mode to not unload the plugin | ||
status = bootstrap_loader_init(config->plugin_name, NULL, handle, BootstrapMode::BOOTSTRAP_MPI); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually what I feared. Realm shouldn't be making any assumptions about the plugin. We should have the plugin itself make a reference to itself such that it won't unload when dlclose is called. You should be able to do that with the following snippet:
/// bootstrap_mpi.cc
int realm_ucp_bootstrap_plugin_init()
{
// ...
Dl_info info;
dladdr(&realm_ucp_bootstrap_plugin_init, &info);
void *self = dlopen(info.dli_fname, RTLD_NODELETE);
dlclose(self);
}
This way it's on the plugin itself to acknowledge that it needs to keep itself loaded rather than Realm having to reason if it needs to keep it loaded or not based on some arbitrary context (like the name of the plugin).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a very good suggestion. I did not find a way to let the plugin itself to decide which mode to use. It should be fixed now.
Hi @muraj @apryakhin could you help to review this PR? We need it to get a build for a smooth DGX Spark support desirably before GTC DTC. |
fprintf(stderr, "dlopen failed for %s: %s\n", info.dli_fname, dlerror()); | ||
return -1; | ||
} | ||
dlclose(handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intention here to actually close handle
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is the dlclose is just to decrease the reference count of the handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but maybe the intention is to dlclose the handle we just opened above, not the one that was passed in? That's what I see in @muraj's example.
Please see #331 for the bug.