TEST MPI4PY #12460
Conversation
bot:notacherrypick Signed-off-by: Wenduo Wang <[email protected]>
Saw this in the log
@hppritcha Is this the same issue that you successfully tracked down?
This looks like the singleton spawn problem we saw with ompi main a couple of months ago. I will take a look.
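For anyone wanting to reproduce locally, here is a minimal sketch of the singleton comm_spawn case (illustrative only, not the actual mpi4py test; the script name and the self-re-exec trick are my own, and it assumes a working mpi4py install):

```python
# spawn_singleton.py -- hedged sketch of a singleton MPI_Comm_spawn.
# Run WITHOUT mpiexec ("python spawn_singleton.py") so the parent starts
# as a singleton, which is the code path that is failing here.
import sys
from mpi4py import MPI

if len(sys.argv) > 1 and sys.argv[1] == "child":
    # Child side: sync with the parent over the inter-communicator, then detach.
    parent = MPI.Comm.Get_parent()
    parent.Barrier()
    parent.Disconnect()
else:
    # Parent side: a singleton process spawning one child copy of itself.
    child = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__, "child"], maxprocs=1)
    child.Barrier()
    child.Disconnect()
    print("singleton spawn OK")
```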
@hppritcha Thanks for checking. During yesterday's call @naughtont3 volunteered to investigate the issue. I was poking you to see if that rings any bell.
Just a suggestion: I doubt the fix @hppritcha mentions is in your v5.0 branch yet, unless you have updated the PMIx and PRRTE submodules in the last couple of weeks.
Thanks. I updated openpmix to include Howard's fix. Running the tests again.
Howard's fix does quell the create-intercomm-from-groups error, but the spawn failure remains.
Try pulling PRRTE as well - any singleton comm_spawn issues would be there.
Bumped both PMIx and PRRTE to the head of the master branch. Back to square 1.
Not quite that bad - we know the ULFM tests won't pass (documented elsewhere and being worked on by Lisandro), so they should probably be turned "off" here.
I want to confirm whether this was a regression. Reverted to openpmix v4.2.8 + prrte v3.0.3.
One thing you can do to more directly check is just see how far down the tests you get. If the ULFM tests come after the singleton comm_spawn, then you know that you got further than before. However, if everything passes with the reverted submodule pointers, then why not just stay there for this release branch?
So the test passed with the previous openpmix + prrte combo, which confirms that this is a regression. This is not news. Now that we have already released with the new submodule pointers, I don't think we can simply revert. We have to fix the problem and move forward.
Checking this out.
Hmm... I can't reproduce on my usual aarch64 cluster, and yes, I did use the SHAs from 94e834e.
The ULFM tests should pass. They passed before, so if the intercomm and the singleton issues are fixed, then all ULFM tests should pass.
@hppritcha in my branch I reverted prrte/pmix to 3.0.3/4.2.8. You should be able to reproduce with the master branches of both.
@wenduwan at which np count do you first observe the failures?
Please see #12464 for a pass/fail matrix of PMIx/PRRTE combinations.
I will address the PMIx v5.0.2 problem - @hppritcha I could definitely use some help from @samuelkgutierrez, if available.
bot:notacherrypick Signed-off-by: Wenduo Wang <[email protected]>
@hppritcha I pointed back to pmix 5.0.2 + prrte 3.0.4 and it is failing at np=2.
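If it helps to pin down the first failing np, a rough sweep along these lines could do it (a hedged sketch: `test/main.py` as the mpi4py suite entry point and a plain `mpiexec -n <np>` launch are assumptions about the local checkout, not something defined in this PR):

```python
# np_sweep.py -- hedged sketch: find the smallest np at which the mpi4py
# suite starts failing. "test/main.py" and the plain "mpiexec" invocation
# are assumptions about the local setup.
import subprocess
import sys

for np in range(1, 9):
    result = subprocess.run(
        ["mpiexec", "-n", str(np), sys.executable, "test/main.py"],
        capture_output=True,
        text=True,
    )
    status = "ok" if result.returncode == 0 else f"FAILED (rc={result.returncode})"
    print(f"np={np}: {status}")
    if result.returncode != 0:
        # Stop at the first failing np; its output is in result.stdout/stderr.
        break
```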
In case anyone wonders, previously I did a bisect and found the failure to start with this commit: openpmix/openpmix@6163f21
Yeah, I don't necessarily believe that bisect result. We'll dig into it.
Yes (according to the labeling), it is running all the mpi4py tests. The intent was to (a) identify if there are issues relating to both PMIx and PRRTE, or just one of them, or...? and (b) provide a starting place to investigate one or the other without having to simultaneously deal with multiple issues. I didn't bother recording the specifics of the failures - I figure that can be investigated while looking at the code itself.
Per the matrix, the key is PMIx v5.0.2 - all versions of PRRTE beyond v3.0.3 fail these tests with that version of PMIx.
bot:notacherrypick