support MPI-IO large count APIs #1060
Conversation
carns left a comment
A couple of notes to address, otherwise it looks good to me.
Kind of wild how long the static options list is now. At some point we should think about whether it's even worth maintaining static linking support, but we shouldn't worry about that yet.
carns left a comment
Looks good to me
* when the large-count feature is not available, test only the 28 non-large-count MPI-IO APIs
I added an exhaustive test for all 56 MPI file read and write APIs. @roblatham00, maybe a test like this can also be considered for ROMIO.
This is great, @wkliao. We should see if it would be possible to add it to the GitHub CI actions.
Force-pushed from 60e17d9 to 14e2b45
It has been added to GitHub CI in 14e2b45.
Something is a little weird in the CI setup. IIUC, browsing the GitHub test output, it looks like this is done in the "Install Darshan" step, which is shared by several actions (sorry, I may not have the terminology quite right). It looks like this make check test fails in some of the actions (like the LDMS one and the end-to-end pytest) but succeeds in the end-to-end regression (the error message is about a lack of "slots" in the GitHub environment). Will the latter action fail if the make check fails there? I don't know how hard it would be, but maybe the "make check" part of that step should be split out from the Darshan installation so that it is easier to inspect separately? Sorry to ask for more work here, but this test is really helpful; we should follow this example as best practice in the future.
This kind of error message, i.e. about "slots", only happens when OpenMPI is used.
Force-pushed from 8122dbc to 88ff81a
I am now seeing the following errors and have no clue how to fix them. The actual error code returned from inflate() is Z_DATA_ERROR (darshan/darshan-util/darshan-logutils.c, line 1709 in 0ad3933).
FYI, this error appeared when running darshan-parser.
Full support for the large-count feature in MPICH starts with version 4.2.2.
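For reference, a minimal sketch of what a guarded large-count call might look like (this is not the PR's actual test code; the file name and sizes are made up, and it assumes an MPI 4.0 implementation such as MPICH >= 4.2.2 for the `_c` variants):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Count count = 1024;            /* large-count APIs take MPI_Count */
    char *buf;

    MPI_Init(&argc, &argv);
    buf = malloc((size_t)count);
    memset(buf, 0, (size_t)count);

    MPI_File_open(MPI_COMM_WORLD, "testfile.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

#if MPI_VERSION >= 4
    /* MPI 4.0 "_c" variant: the count argument is an MPI_Count */
    MPI_File_write_all_c(fh, buf, count, MPI_BYTE, MPI_STATUS_IGNORE);
#else
    /* fall back to the classic int-count API */
    MPI_File_write_all(fh, buf, (int)count, MPI_BYTE, MPI_STATUS_IGNORE);
#endif

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```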
Force-pushed from 88ff81a to e6a305d
I notice these error messages are also the ones shown in #1052. After some digging, I found this error only happens when using OpenMPI. In commit e6a305d, I added a yaml file to build both MPICH and OpenMPI. The next step I tried is to set the MPI-IO hint cb_nodes to 1 when configuring Darshan.
That's weird. Can you pull that Darshan log out and attach it to a separate issue about unparseable log files being generated by OpenMPI, so that we can look at that issue separately? For the CI we can stick with MPICH for now until we resolve that issue.
#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e
'set -eu' would have caught the typo below
darshan-runtime/test/tst_runs.sh
Outdated
USERNAME_ENV=$USER
fi

DARSGAN_PARSER=../../darshan-util/darshan-parser
that's clearly a typo, right?
One question about the default MPI-IO hints set at darshan-runtime configure time (darshan/darshan-runtime/configure.ac, line 601 in 0fc3911):
Why does Darshan set these default hints? I expect the log files are small after compression. It appears that OpenMPI (5.0.8 used in my tests) does not take these hints when using its own I/O component. I suspect setting those ROMIO hints is the reason I am seeing the failure of darshan-parser.
Ah, it's because these log files are small that we set the hints. "no_indep_rw" is how we get "deferred open" (https://wordpress.cels.anl.gov/romio/2003/08/05/deferred-open/), so we're asking for only 4 processes to open and write the Darshan log file. OpenMPI must ignore hints it does not understand; this sounds like a bug in OMPIO. A small write from many processes should be easy for OpenMPI's MPI-IO implementation to handle.
Ugh. OK. We should be able to create a standalone reproducer (not using Darshan, just a C program that sets hints and makes similar I/O calls). In the meantime, is there a reliable/safe way to detect OpenMPI in Darshan and alter our default hints?
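A minimal sketch of such a reproducer, assuming the configuration discussed here (4 processes, ROMIO-style hints cb_nodes=4 and romio_no_indep_rw=true, one small collective write per rank; the file name and write size are made up):

```c
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank;
    char buf[4096];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a' + rank, sizeof(buf));

    /* Hints resembling the darshan-runtime configure-time defaults
     * discussed above; the values here are assumptions for illustration. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_no_indep_rw", "true");
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File_open(MPI_COMM_WORLD, "repro.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes a small, contiguous, rank-ordered block collectively,
     * roughly mimicking how a compressed log region might be written. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, (int)sizeof(buf), MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

Run with something like `mpiexec -n 4 ./repro` under OpenMPI and check whether the resulting file contents are intact.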
Sounds like OMPIO doesn't know how to handle "cb_nodes" greater than the number of processes. Should be an easy fix for them, but what a headache for us to deal with. Phil, the reproducer should be pretty simple: take ROMIO's ...
My read of #1060 (comment) is that the problem happens with 4 procs and cb_nodes=4?
I have been trying to write a reproducer on a local machine, but could not do so. When I developed the test program, I simply set the number of MPI processes to 4. (I did add ...
Check the new CI job that tests OpenMPI. When running with NP=4, it failed.
Note the MPI test program ran fine, i.e. it returned normally. I wonder if the problem occurs during MPI_Finalize, when Darshan is writing the log file.
What is the data partitioning pattern when Darshan writes log data into the file?
Rank 0 will write the header/metadata on behalf of everyone. Then all ranks do a scan to determine their offsets and do a collective write (in rank order) of all records simultaneously. There could be quite a bit of variation in how much data each rank is contributing, depending on what the workload looked like, how successful Darshan was at reducing shared records, and the gzip compression ratio at each rank. We could take the test case that's failing in the GitHub action and instrument it to see what the exact sizes are, so that we can replicate it in a standalone reproducer. All of the I/O happens in ...
Oh actually, one correction. The scan/collective write is repeated for each module. So there are actually multiple rounds of collective writes.
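A rough sketch of one such round (this is not Darshan's actual code; the per-rank sizes and file name are placeholders) might look like:

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Each rank contributes a variable-size chunk, an exclusive scan of the
 * sizes yields per-rank file offsets, and everyone writes collectively. */
int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    MPI_Offset my_size, my_off = 0;
    char *chunk;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_size = (rank + 1) * 100;              /* per-rank size varies */
    chunk = malloc((size_t)my_size);
    memset(chunk, 'a' + rank, (size_t)my_size);

    /* Exclusive prefix sum of sizes -> starting offset for each rank. */
    MPI_Exscan(&my_size, &my_off, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        my_off = 0;   /* MPI_Exscan leaves rank 0's output undefined */

    MPI_File_open(MPI_COMM_WORLD, "scan_write.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, my_off, chunk, (int)my_size, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(chunk);
    MPI_Finalize();
    return 0;
}
```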
Like using Darshan to profile Darshan 😄
Inside of darshan-core.c, shouldn't the MPI_File_open call at darshan/darshan-runtime/lib/darshan-core.c, line 1691 (in 0fc3911), use the PMPI version?
Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls. We intercept MPI functions the same as we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_ or PMPI_ name.
In file darshan-core.c, I am seeing only calls to "PMPI_File_xxx", except for "MPI_File_open".
I thought Darshan intercepts only "MPI_File_xxx" calls, not "PMPI_File_xxx", so it can avoid intercepting its own calls.
The wrappers probably trigger during shutdown too; it's just that the macros ...
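For illustration only, the general shape of that symbol-alias trick (this is not Darshan's actual wrapper code or macros; it assumes GCC/ELF and an LD_PRELOAD-style shared library) is something like:

```c
/* Hypothetical sketch: one wrapper body intercepts both the MPI_ and the
 * PMPI_ name of a routine, forwarding to the real implementation found
 * with dlsym(RTLD_NEXT, ...). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>

typedef int (*real_open_fn)(MPI_Comm, const char *, int, MPI_Info, MPI_File *);

int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    static real_open_fn real_open = NULL;
    if (real_open == NULL)
        real_open = (real_open_fn) dlsym(RTLD_NEXT, "PMPI_File_open");

    fprintf(stderr, "intercepted open of %s\n", filename);  /* record stats here */
    return real_open(comm, filename, amode, info, fh);
}

/* Alias the PMPI_ name to the same wrapper, so the call is intercepted
 * regardless of which profiling-interface name the caller used. */
int PMPI_File_open(MPI_Comm comm, const char *filename, int amode,
                   MPI_Info info, MPI_File *fh)
    __attribute__((alias("MPI_File_open")));
```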
A quick test using OpenMPI 5.0.8 shows that OpenMPI recognizes the hint "cb_nodes", ...
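A sketch of such a quick check (the file name is arbitrary and this is not the exact test used here): set the hints, open the file, then ask the implementation which hints it actually kept via MPI_File_get_info:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info set_info, used_info;
    int rank, nkeys, i, flag;
    char key[MPI_MAX_INFO_KEY], value[MPI_MAX_INFO_VAL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&set_info);
    MPI_Info_set(set_info, "cb_nodes", "4");
    MPI_Info_set(set_info, "romio_no_indep_rw", "true");

    MPI_File_open(MPI_COMM_WORLD, "hints.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, set_info, &fh);

    /* Ask the implementation which hints it actually associated with the
     * file; hints it does not understand simply will not show up here. */
    MPI_File_get_info(fh, &used_info);
    MPI_Info_get_nkeys(used_info, &nkeys);
    if (rank == 0) {
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(used_info, i, key);
            MPI_Info_get(used_info, key, MPI_MAX_INFO_VAL - 1, value, &flag);
            printf("%s = %s\n", key, flag ? value : "(unset)");
        }
    }

    MPI_Info_free(&used_info);
    MPI_Info_free(&set_info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```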
To understand Darshan's data partitioning pattern when writing the log file ... In this case, calling an MPI collective write may internally switch to ...
* check file system of log folder
* allow setting env variable NP, number of MPI processes
* run `darshan-config --all` to dump Darshan's configuration
Force-pushed from ad8f532 to beea162
Force-pushed from beea162 to 47a5134
Force-pushed from 47a5134 to 7212fb9
Yeah, as the name implies it is a ROMIO-specific hint. It should be ignored on other implementations, just as ROMIO would ignore IBM- or OMPIO-specific hints. If you set that hint, then ask OMPIO "what hints did we set", it will not return it.
I checked the OpenMPI user guide. There is no hint with similar functionality.
carns left a comment
Looks great
This PR addresses #1020.