
Conversation

@wkliao (Collaborator) commented Aug 4, 2025

This PR addresses #1020

@wkliao requested review from carns and roblatham00 on August 4, 2025 20:21
@carns left a comment

A couple of notes to address, otherwise it looks good to me.

Kind of wild how long the static options list is now. At some point we should think about whether it's even worth maintaining static linking support, but we shouldn't worry about that yet.

@carns previously approved these changes Aug 5, 2025

@carns left a comment

Looks good to me

* when large-count feature is not available, test only the 28
  non-large-count MPI-IO APIs
@wkliao (Collaborator, Author) commented Aug 6, 2025

I added an exhaustive test for all 56 MPI file read and write APIs in
commit 7346629. The command "make check" runs the test.
At the moment, it only checks the field "MPIIO_BYTES_WRITTEN"
(the total amount written) from the Darshan log for the write cases and,
similarly, "MPIIO_BYTES_READ" for the read cases. If necessary, we can
expand this test to check other fields, e.g. file offsets and counts.

@roblatham00, maybe a test like this could also be considered for ROMIO,
if it does not already have one.
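
For readers unfamiliar with the API split: each classic MPI-IO call has an MPI 4.x large-count counterpart with a `_c` suffix, and builds without the large-count feature can only exercise the classic half. A hedged sketch of the kind of guard involved (illustrative only, not the actual test code in this PR):

```c
#include <mpi.h>

/* Illustrative only: call the large-count variant when the MPI library
 * advertises MPI 4.x, where the _c functions were introduced; otherwise
 * fall back to the classic int-count API. */
static int write_block(MPI_File fh, const void *buf, long long nbytes)
{
#if MPI_VERSION >= 4
    return MPI_File_write_at_all_c(fh, 0, buf, (MPI_Count)nbytes, MPI_BYTE,
                                   MPI_STATUS_IGNORE);
#else
    return MPI_File_write_at_all(fh, 0, buf, (int)nbytes, MPI_BYTE,
                                 MPI_STATUS_IGNORE);
#endif
}
```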

@carns commented Aug 6, 2025

This is great, @wkliao. We should see if it would be possible to add it to the GitHub CI actions.

@wkliao force-pushed the mpi_io_large_count branch 2 times, most recently from 60e17d9 to 14e2b45 on August 6, 2025 20:05
@github-actions bot added the CI continuous integration label on Aug 6, 2025
@wkliao (Collaborator, Author) commented Aug 7, 2025

It has been added to GitHub CI in 14e2b45
and looks like it ran fine.

@carns commented Aug 7, 2025

Something is a little weird in the CI setup. IIUC from browsing the GitHub test output, this is done in the "Install Darshan" step, which is shared by several actions (sorry, I may not have the terminology quite right). It looks like this "make check" test fails in some of the actions (like the LDMS one and the end-to-end pytest) but succeeds in the end-to-end regression (the error message is about a lack of "slots" in the GitHub environment). Will the latter action fail if the make check fails there?

I don't know how hard it would be, but maybe the "make check" part of that step should be split out from the Darshan installation so that it is easier to inspect separately?

Sorry to ask for more work here, but this test is really helpful; we should follow this example as best practice in the future.

@wkliao (Collaborator, Author) commented Aug 7, 2025

This kind of error message, i.e. about "slots", only happens when OpenMPI is used
to run more than 1 MPI process. My test runs 4.
I will find a fix soon.

@wkliao force-pushed the mpi_io_large_count branch 3 times, most recently from 8122dbc to 88ff81a on August 7, 2025 23:01
@wkliao (Collaborator, Author) commented Aug 8, 2025

I am now seeing the following errors and have no clue how to fix them.

Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

The actual error code returned from inflate() is Z_DATA_ERROR

ret = inflate(z_strmp, Z_NO_FLUSH);
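
For context, Z_DATA_ERROR from zlib's inflate() indicates that the compressed stream itself is invalid or corrupted, rather than a problem with how the stream object was set up. A minimal, self-contained sketch of the same call pattern (illustrative only, not darshan-util's actual code):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Decompress a zlib/gzip buffer and report Z_DATA_ERROR on corruption. */
int inflate_buffer(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t out_len)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    /* windowBits = 15 + 32 enables automatic zlib/gzip header detection */
    if (inflateInit2(&strm, 15 + 32) != Z_OK)
        return -1;

    strm.next_in   = (unsigned char *)in;
    strm.avail_in  = (uInt)in_len;
    strm.next_out  = out;
    strm.avail_out = (uInt)out_len;

    int ret = inflate(&strm, Z_NO_FLUSH);
    if (ret == Z_DATA_ERROR)
        fprintf(stderr, "compressed data is corrupted: %s\n",
                strm.msg ? strm.msg : "(no detail)");
    inflateEnd(&strm);
    return (ret == Z_STREAM_END || ret == Z_OK) ? 0 : -1;
}
```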

@wkliao (Collaborator, Author) commented Aug 8, 2025

FYI. This error appeared when running darshan-parser

wkliao added 2 commits August 9, 2025 16:42
The full support of large-count feature in MPICH starts from version 4.2.2
@wkliao force-pushed the mpi_io_large_count branch from 88ff81a to e6a305d on August 9, 2025 23:19
@wkliao (Collaborator, Author) commented Aug 11, 2025

I notice these error messages are also the ones shown in #1052.
Not sure whether it is related.

Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

After some digging, I found this error only happens when using OpenMPI
and only on GitHub Actions CI. I could not reproduce it on my local machine
running Red Hat or on compute nodes on cels.anl.gov running Ubuntu.
Failed runs all have the error code Z_DATA_ERROR, which means the
Darshan log file is corrupted.

In commit e6a305d, I added a yaml file to build both MPICH and OpenMPI
and use both to run "make check". Only OpenMPI failed, and the failure only
happened when running 4 MPI processes. It ran OK with 2.

The next step I tried was to set the MPI-IO hint cb_nodes to 1 when configuring
Darshan (Darshan's default sets cb_nodes to 4). In that case,
"make check" running the test program on 4 MPI processes did not fail.
This makes me think maybe Darshan has an issue with OpenMPI's MPI-IO ...?

@carns commented Aug 11, 2025

That's weird. Can you pull that darshan log out and attach it to a separate issue about unparseable log files being generated by openmpi so that we can look at that issue separately?

For the CI we can stick with MPICH for now until we resolve that issue.

#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e
A contributor commented:
'set -eu' would have caught the typo below

USERNAME_ENV=$USER
fi

DARSGAN_PARSER=../../darshan-util/darshan-parser
A contributor commented:
that's clearly a typo, right?

@wkliao (Collaborator, Author) commented Aug 12, 2025

One question about the default MPI-IO hints set at darshan-runtime configure time,
"romio_no_indep_rw=true;cb_nodes=4":

__DARSHAN_LOG_HINTS_DEFAULT="romio_no_indep_rw=true;cb_nodes=4"

Why does Darshan set these default hints? I expect the log files to be small after compression,
so setting hints should not matter in most cases.

It appears that OpenMPI (5.0.8 in my tests) does not take these hints when using its own
I/O component, OMPIO (now the default). OpenMPI's MPI-IO hints are listed in
https://docs.open-mpi.org/en/v5.0.6/man-openmpi/man3/MPI_File_open.3.html#hints
Its mpiexec also takes command-line options through --mca to select/set a component
and hints such as io_ompio_num_aggregators.

I suspect setting those ROMIO hints is the reason I am seeing the "make check" failure.
Consider removing them?
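
As an aside on the mechanics: a "key=value;key=value" string like the default above has to be split and fed into an MPI_Info before the log file is opened. A rough sketch of that translation, assuming this semicolon-separated format (it mirrors the format only, not Darshan's actual parsing code):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Turn "romio_no_indep_rw=true;cb_nodes=4" into MPI_Info key/value pairs. */
static MPI_Info hints_from_string(const char *hint_str)
{
    MPI_Info info;
    MPI_Info_create(&info);

    char *copy = strdup(hint_str);           /* strtok modifies its input */
    for (char *tok = strtok(copy, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        char *eq = strchr(tok, '=');
        if (eq == NULL)
            continue;                         /* skip malformed entries */
        *eq = '\0';
        MPI_Info_set(info, tok, eq + 1);      /* key = tok, value = eq + 1 */
    }
    free(copy);
    return info;
}
```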

@roblatham00 (Contributor):

Ah, it's because these log files are small that we set the hints. "no_indep_rw" is how we get "deferred open" (https://wordpress.cels.anl.gov/romio/2003/08/05/deferred-open/), so we're asking for only 4 processes to open and write the darshan log file.

OpenMPI must ignore hints it does not understand. Sounds like a bug in OpenMPI-IO. A small write from many processes should be easy for OpenMPI's MPI-IO implementation to handle.

@carns commented Aug 12, 2025

Ugh, OK. We should be able to create a standalone reproducer (not using Darshan, just a C program that sets hints and makes similar MPI_File* calls) in #1064. Once we have done so, we should open a bug report with OpenMPI for help.

In the meantime, is there a reliable/safe way to detect OpenMPI in Darshan and alter our default hints?
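
Not presuming what the final reproducer will look like, but a rough outline under the assumption that setting the two default hints and issuing one small rank-ordered collective write is enough to exercise the same code path (the file name and sizes below are made up):

```c
#include <mpi.h>
#include <string.h>

/* Standalone reproducer sketch: set Darshan's two default log hints and do
 * one small collective write from every rank, with no Darshan involved. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_no_indep_rw", "true");
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "reproducer.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    char buf[4096];
    memset(buf, 'a' + (rank % 26), sizeof(buf));

    /* each rank writes a small contiguous block at a rank-ordered offset */
    MPI_Offset off = (MPI_Offset)rank * (MPI_Offset)sizeof(buf);
    MPI_File_write_at_all(fh, off, buf, (int)sizeof(buf), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```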

@roblatham00 (Contributor):

Sounds like OMPIO doesn't know how to handle a "cb_nodes" larger than the number of processes. Should be an easy fix for them, but what a headache for us to deal with. Phil, the reproducer should be pretty simple: take ROMIO's coll_test and set cb_nodes to nprocs*2.

@carns commented Aug 12, 2025

> Sounds like OMPIO doesn't know how to handle a "cb_nodes" larger than the number of processes. Should be an easy fix for them, but what a headache for us to deal with. Phil, the reproducer should be pretty simple: take ROMIO's coll_test and set cb_nodes to nprocs*2.

My read of #1060 (comment) is that the problem happens with 4 procs and cb_nodes=4?

@wkliao (Collaborator, Author) commented Aug 12, 2025

I have been trying to reproduce this on a local machine, but could not.
The errors only happen on GitHub CI.

When I developed the test program, I simply set the number of MPI processes to 4.
Because it failed in GitHub CI, I changed it to 2; see line 18 in darshan-runtime/test/tst_runs.sh.
I will change it back to 4, so we can all see it fail in the CI report.

(I did add --oversubscribe to OpenMPI's mpiexec command line, so I can run more
than 2 processes.)

@wkliao (Collaborator, Author) commented Aug 12, 2025

Check the new CI run that tests OpenMPI:
https://github.com/darshan-hpc/darshan/actions/runs/16921849384/job/47949590879?pr=1060

When running with NP=4, it failed.
When running with NP=4 and export DARSHAN_LOGHINTS="", it ran fine.

@wkliao (Collaborator, Author) commented Aug 13, 2025

> Ugh, OK. We should be able to create a standalone reproducer (not using Darshan, just a C program that sets hints and makes similar MPI_File* calls) in #1064. Once we have done so, we should open a bug report with OpenMPI for help.

Note the MPI test program ran fine, i.e. it returned normally.
It was the Darshan log file that darshan-parser complained about.
Thus, I think creating a reproducer can be tricky.

I wonder if something goes wrong during MPI_Finalize, when Darshan is
writing the log file. For example, all MPI communicators duplicated/created
internally and used by OMPI's MPI-IO module might already have been freed,
causing the parallel writes to the log file to fail (just a wild guess).

@wkliao (Collaborator, Author) commented Aug 14, 2025

What is the data partitioning pattern when Darshan writes log data into the file?
Is it appending one process after another based on the MPI rank order?

@carns commented Aug 14, 2025

> What is the data partitioning pattern when Darshan writes log data into the file? Is it appending one process after another based on the MPI rank order?

Rank 0 will write the header/metadata on behalf of everyone. Then all ranks do a scan to determine their offsets and do a collective write (in rank order) of all records simultaneously. There could be quite a bit of variation in how much data each rank contributes, depending on what the workload looked like, how successful Darshan was at reducing shared records, and the gzip compression ratio at each rank.

We could take the test case that's failing in the GitHub action and instrument it to see what the exact sizes are, so that we can replicate it in a standalone reproducer. All of the I/O happens in darshan_core_shutdown(); there aren't that many actual I/O operations.

@carns commented Aug 14, 2025

> > What is the data partitioning pattern when Darshan writes log data into the file? Is it appending one process after another based on the MPI rank order?
>
> Rank 0 will write the header/metadata on behalf of everyone. Then all ranks do a scan to determine their offsets and do a collective write (in rank order) of all records simultaneously. There could be quite a bit of variation in how much data each rank contributes, depending on what the workload looked like, how successful Darshan was at reducing shared records, and the gzip compression ratio at each rank.
>
> We could take the test case that's failing in the GitHub action and instrument it to see what the exact sizes are, so that we can replicate it in a standalone reproducer. All of the I/O happens in darshan_core_shutdown(); there aren't that many actual I/O operations.

Oh actually, one correction. The scan/collective write is repeated for each module, so there are actually multiple rounds of collective writes.
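
A minimal sketch of the offset-scan plus collective-write pattern described above, with uneven per-rank record sizes. The fixed header size, buffer contents, and the single round shown here are illustrative assumptions, not Darshan's actual shutdown code (which repeats the pattern per module):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "scan_write.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    const MPI_Offset header_bytes = 1024;     /* hypothetical fixed-size header */
    if (rank == 0) {
        char header[1024];
        memset(header, 'H', sizeof(header));
        MPI_File_write_at(fh, 0, header, (int)sizeof(header), MPI_CHAR,
                          MPI_STATUS_IGNORE);
    }

    /* each rank holds a different amount of "record" data */
    int my_len = 512 * (rank + 1);
    char *buf = malloc(my_len);
    memset(buf, 'a' + (rank % 26), my_len);

    /* exclusive prefix sum of sizes gives each rank its starting offset */
    long long len = my_len, my_off = 0;
    MPI_Exscan(&len, &my_off, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) my_off = 0;                /* Exscan leaves rank 0 undefined */

    /* collective write: records land in rank order after the header */
    MPI_File_write_at_all(fh, header_bytes + (MPI_Offset)my_off, buf, my_len,
                          MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```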

@wkliao (Collaborator, Author) commented Aug 14, 2025

Like using Darshan to profile Darshan 😄

@wkliao (Collaborator, Author) commented Aug 14, 2025

Inside darshan-core.c, shouldn't PMPI_File_open be called instead of MPI_File_open?
Or does it not matter?

ret = MPI_File_open(core->mpi_comm, logfile_name,

@carns commented Aug 14, 2025

> Inside darshan-core.c, shouldn't PMPI_File_open be called instead of MPI_File_open? Or does it not matter?
>
> ret = MPI_File_open(core->mpi_comm, logfile_name,

Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls. We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.
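
To make the aliasing idea concrete, here is a generic, much-simplified illustration of dual-entry-point interception in an LD_PRELOADed library. It is not Darshan's actual wrapper code or macros, just a sketch of the technique being described:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <mpi.h>

typedef int (*file_open_fn)(MPI_Comm, const char *, int, MPI_Info, MPI_File *);

/* Wrapper that shadows MPI_File_open when this library is LD_PRELOADed. */
int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    static file_open_fn real_open = NULL;
    if (real_open == NULL)
        /* resolve the underlying MPI library's symbol, not our own */
        real_open = (file_open_fn)dlsym(RTLD_NEXT, "PMPI_File_open");

    /* ... a real tool would record instrumentation here, and skip it
     *     once its own shutdown/log-writing phase has begun ... */

    return real_open(comm, filename, amode, info, fh);
}

/* Alias so callers that invoke PMPI_File_open directly are intercepted too. */
int PMPI_File_open(MPI_Comm comm, const char *filename, int amode,
                   MPI_Info info, MPI_File *fh)
    __attribute__((alias("MPI_File_open")));
```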

@wkliao (Collaborator, Author) commented Aug 14, 2025

> Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls.

In file darshan-core.c, I am seeing only calls to "PMPI_File_xxx", except for "MPI_File_open".

> We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.

I thought Darshan intercepts only the "MPI_File_xxx" calls, not "PMPI_File_xxx", so that
LD_PRELOAD does not loop back into Darshan's interception when writing the log file.
Apparently, my understanding is not right. One question: how does Darshan prevent
profiling of those MPI-IO calls in darshan-core.c?

@carns commented Aug 18, 2025

> > Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls.
>
> In file darshan-core.c, I am seeing only calls to "PMPI_File_xxx", except for "MPI_File_open".
>
> > We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.
>
> I thought Darshan intercepts only the "MPI_File_xxx" calls, not "PMPI_File_xxx", so that LD_PRELOAD does not loop back into Darshan's interception when writing the log file. Apparently, my understanding is not right. One question: how does Darshan prevent profiling of those MPI-IO calls in darshan-core.c?

The wrappers probably trigger during shutdown too; it's just that the macros (MAP_OR_FAIL() etc.) within the wrapper identify that Darshan has been disabled and effectively do nothing except invoke the next symbol.

@wkliao (Collaborator, Author) commented Aug 18, 2025

A quick test using OpenMPI 5.0.8 shows that OpenMPI recognizes hint "cb_nodes",
but not "romio_no_indep_rw".

@wkliao (Collaborator, Author) commented Aug 18, 2025

To understand Darshan's data partitioning pattern when writing the
log file, I ran a few tests, and it appears to me that the pattern is
each rank "appending" one contiguous region, in the order of rank IDs.

In this case, an MPI collective write call may internally switch to
independent write routines in both OpenMPI and MPICH.

* check file system of log folder
* allow setting env variable NP, number of MPI processes
* run `darshan-config --all` to dump Darshan's configuration
@wkliao force-pushed the mpi_io_large_count branch 3 times, most recently from ad8f532 to beea162 on August 19, 2025 04:57
@wkliao force-pushed the mpi_io_large_count branch from beea162 to 47a5134 on August 22, 2025 18:16
@wkliao force-pushed the mpi_io_large_count branch from 47a5134 to 7212fb9 on August 22, 2025 18:55
@roblatham00 (Contributor):

> A quick test using OpenMPI 5.0.8 shows that OpenMPI recognizes hint "cb_nodes", but not "romio_no_indep_rw".

Yeah, as the name implies, it is a ROMIO-specific hint. It should be ignored by other implementations, just as ROMIO would ignore IBM- or OMPIO-specific hints.

If you set that hint and then ask OMPIO "what hints did we set", it will not return romio_no_indep_rw, right? That could be a run-time test...
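
A sketch of that run-time test, under the assumption that an implementation echoes back only the hints it accepted via MPI_File_get_info (the file name and return convention are arbitrary):

```c
#include <mpi.h>

/* After opening with romio_no_indep_rw set, ask the implementation which
 * hints it actually kept; absence of the key suggests a non-ROMIO backend. */
int romio_hints_honored(MPI_Comm comm, const char *path)
{
    MPI_Info hints, used;
    MPI_Info_create(&hints);
    MPI_Info_set(hints, "romio_no_indep_rw", "true");

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, hints, &fh);
    MPI_File_get_info(fh, &used);

    char value[MPI_MAX_INFO_VAL + 1];
    int flag = 0;
    MPI_Info_get(used, "romio_no_indep_rw", MPI_MAX_INFO_VAL, value, &flag);

    MPI_Info_free(&used);
    MPI_Info_free(&hints);
    MPI_File_close(&fh);
    return flag;   /* nonzero: the hint was recognized and retained */
}
```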

@wkliao (Collaborator, Author) commented Aug 23, 2025

I checked the OpenMPI user guide. There is no hint with similar functionality.
Thus, we should expect a higher cost for MPI_File_open.

@carns left a comment

Looks great

@wkliao merged commit dd0c2e7 into darshan-hpc:main on Sep 15, 2025
17 checks passed