
Conversation

@wkliao (Collaborator) commented Aug 4, 2025

This PR addresses #1020

@wkliao requested review from carns and roblatham00 on August 4, 2025 20:21
@carns left a comment

A couple of notes to address, otherwise it looks good to me.

Kind of wild how long the static options list is now. At some point we should think about whether it's even worth maintaining static linking support, but we shouldn't worry about that yet.

@carns previously approved these changes Aug 5, 2025

@carns left a comment

Looks good to me

* when large-count feature is not available, test only the 28
  non-large-count MPI-IO APIs
@wkliao (Collaborator, Author) commented Aug 6, 2025

I added an exhaustive test for all 56 MPI file read and write APIs in
commit 7346629. The command "make check" runs the test.
At the moment, it only checks the field "MPIIO_BYTES_WRITTEN"
(the total amount written) from the Darshan log for the write cases and,
similarly, "MPIIO_BYTES_READ" for the read cases. If necessary, we can
expand this test to check other fields, e.g. file offsets and counts.

@roblatham00, maybe a test like this could also be considered for ROMIO,
if it does not already have one.
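
For readers unfamiliar with the API split: each classic MPI-IO call has an MPI 4.x large-count counterpart with a `_c` suffix, and builds without the large-count feature can only exercise the classic half. A hedged sketch of the kind of guard involved (illustrative only, not the actual test code in this PR):

```c
#include <mpi.h>

/* Illustrative only: call the large-count variant when the MPI library
 * advertises MPI 4.x, where the _c functions were introduced; otherwise
 * fall back to the classic int-count API. */
static int write_block(MPI_File fh, const void *buf, long long nbytes)
{
#if MPI_VERSION >= 4
    return MPI_File_write_at_all_c(fh, 0, buf, (MPI_Count)nbytes, MPI_BYTE,
                                   MPI_STATUS_IGNORE);
#else
    return MPI_File_write_at_all(fh, 0, buf, (int)nbytes, MPI_BYTE,
                                 MPI_STATUS_IGNORE);
#endif
}
```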

@carns commented Aug 6, 2025

This is great, @wkliao. We should see if it would be possible to add it to the GitHub CI actions.

@wkliao force-pushed the mpi_io_large_count branch 2 times, most recently from 60e17d9 to 14e2b45 on August 6, 2025 20:05
@github-actions bot added the CI continuous integration label on Aug 6, 2025
@wkliao (Collaborator, Author) commented Aug 7, 2025

It has been added to GitHub CI in 14e2b45
and looks like it ran fine.

@carns commented Aug 7, 2025

Something is a little weird in the CI setup. IIUC from browsing the GitHub test output, this is done in the "Install Darshan" step, which is shared by several actions (sorry, I may not have the terminology quite right). It looks like this "make check" test fails in some of the actions (like the LDMS one and the end-to-end pytest) but succeeds in the end-to-end regression (the error message is about a lack of "slots" in the GitHub environment). Will the latter action fail if the make check fails there?

I don't know how hard it would be, but maybe the "make check" part of that step should be split out from the Darshan installation so that it is easier to inspect separately?

Sorry to ask for more work here, but this test is really helpful; we should follow this example as best practice in the future.

@wkliao (Collaborator, Author) commented Aug 7, 2025

This kind of error message, i.e. about "slots", only happens when OpenMPI is used
to run more than 1 MPI process. My test runs 4.
I will find a fix soon.

@wkliao force-pushed the mpi_io_large_count branch 3 times, most recently from 8122dbc to 88ff81a on August 7, 2025 23:01
@wkliao (Collaborator, Author) commented Aug 8, 2025

I am now seeing the following errors and have no clue how to fix them.

Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

The actual error code returned from inflate() is Z_DATA_ERROR

ret = inflate(z_strmp, Z_NO_FLUSH);
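
For context, Z_DATA_ERROR from zlib's inflate() indicates that the compressed stream itself is invalid or corrupted, rather than a problem with how the stream object was set up. A minimal, self-contained sketch of the same call pattern (illustrative only, not darshan-util's actual code):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Decompress a zlib/gzip buffer and report Z_DATA_ERROR on corruption. */
int inflate_buffer(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t out_len)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    /* windowBits = 15 + 32 enables automatic zlib/gzip header detection */
    if (inflateInit2(&strm, 15 + 32) != Z_OK)
        return -1;

    strm.next_in   = (unsigned char *)in;
    strm.avail_in  = (uInt)in_len;
    strm.next_out  = out;
    strm.avail_out = (uInt)out_len;

    int ret = inflate(&strm, Z_NO_FLUSH);
    if (ret == Z_DATA_ERROR)
        fprintf(stderr, "compressed data is corrupted: %s\n",
                strm.msg ? strm.msg : "(no detail)");
    inflateEnd(&strm);
    return (ret == Z_STREAM_END || ret == Z_OK) ? 0 : -1;
}
```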

@wkliao (Collaborator, Author) commented Aug 8, 2025

FYI. This error appeared when running darshan-parser

wkliao added 2 commits August 9, 2025 16:42
The full support of large-count feature in MPICH starts from version 4.2.2
@wkliao force-pushed the mpi_io_large_count branch from 88ff81a to e6a305d on August 9, 2025 23:19
@wkliao (Collaborator, Author) commented Aug 11, 2025

I notice these error messages are also the ones shown in #1052.
Not sure whether it is related.

Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

After some digging, I found this error only happens when using OpenMPI
and only on GitHub Actions CI. I could not reproduce it on my local machine
running Red Hat or on compute nodes on cels.anl.gov running Ubuntu.
Failed runs all have the error code Z_DATA_ERROR, which means the
Darshan log file is corrupted.

In commit e6a305d, I added a yaml file to build both MPICH and OpenMPI
and use both to run "make check". Only OpenMPI failed, and the failure only
happened when running 4 MPI processes. It ran OK with 2.

The next step I tried was to set the MPI-IO hint cb_nodes to 1 when configuring
Darshan (Darshan's default sets cb_nodes to 4). In that case,
"make check" running the test program on 4 MPI processes did not fail.
This makes me think maybe Darshan has an issue with OpenMPI's MPI-IO ...?

@carns commented Aug 11, 2025

That's weird. Can you pull that darshan log out and attach it to a separate issue about unparseable log files being generated by openmpi so that we can look at that issue separately?

For the CI we can stick with MPICH for now until we resolve that issue.

#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e
A contributor commented:
'set -eu' would have caught the typo below

USERNAME_ENV=$USER
fi

DARSGAN_PARSER=../../darshan-util/darshan-parser
A contributor commented:
that's clearly a typo, right?

@wkliao (Collaborator, Author) commented Aug 12, 2025

One question about the default MPI-IO hints set at darshan-runtime configure time,
"romio_no_indep_rw=true;cb_nodes=4":

__DARSHAN_LOG_HINTS_DEFAULT="romio_no_indep_rw=true;cb_nodes=4"

Why does Darshan set these default hints? I expect the log files to be small after compression,
so setting hints should not matter in most cases.

It appears that OpenMPI (5.0.8 in my tests) does not take these hints when using its own
I/O component, OMPIO (now the default). OpenMPI's MPI-IO hints are listed in
https://docs.open-mpi.org/en/v5.0.6/man-openmpi/man3/MPI_File_open.3.html#hints
Its mpiexec also takes command-line options through --mca to select/set a component
and hints such as io_ompio_num_aggregators.

I suspect setting those ROMIO hints is the reason I am seeing the "make check" failure.
Consider removing them?
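
As an aside on the mechanics: a "key=value;key=value" string like the default above has to be split and fed into an MPI_Info before the log file is opened. A rough sketch of that translation, assuming this semicolon-separated format (it mirrors the format only, not Darshan's actual parsing code):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Turn "romio_no_indep_rw=true;cb_nodes=4" into MPI_Info key/value pairs. */
static MPI_Info hints_from_string(const char *hint_str)
{
    MPI_Info info;
    MPI_Info_create(&info);

    char *copy = strdup(hint_str);           /* strtok modifies its input */
    for (char *tok = strtok(copy, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        char *eq = strchr(tok, '=');
        if (eq == NULL)
            continue;                         /* skip malformed entries */
        *eq = '\0';
        MPI_Info_set(info, tok, eq + 1);      /* key = tok, value = eq + 1 */
    }
    free(copy);
    return info;
}
```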

@roblatham00 (Contributor):

Ah, it's because these log files are small that we set the hints. "no_indep_rw" is how we get "deferred open" (https://wordpress.cels.anl.gov/romio/2003/08/05/deferred-open/), so we're asking for only 4 processes to open and write the darshan log file.

OpenMPI must ignore hints it does not understand. Sounds like a bug in OpenMPI-IO. A small write from many processes should be easy for OpenMPI's MPI-IO implementation to handle.

@carns commented Aug 12, 2025

Ugh, OK. We should be able to create a standalone reproducer (not using Darshan, just a C program that sets hints and makes similar MPI_File* calls) in #1064. Once we have done so, we should open a bug report with OpenMPI for help.

In the meantime, is there a reliable/safe way to detect OpenMPI in Darshan and alter our default hints?
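
Not presuming what the final reproducer will look like, but a rough outline under the assumption that setting the two default hints and issuing one small rank-ordered collective write is enough to exercise the same code path (the file name and sizes below are made up):

```c
#include <mpi.h>
#include <string.h>

/* Standalone reproducer sketch: set Darshan's two default log hints and do
 * one small collective write from every rank, with no Darshan involved. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_no_indep_rw", "true");
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "reproducer.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    char buf[4096];
    memset(buf, 'a' + (rank % 26), sizeof(buf));

    /* each rank writes a small contiguous block at a rank-ordered offset */
    MPI_Offset off = (MPI_Offset)rank * (MPI_Offset)sizeof(buf);
    MPI_File_write_at_all(fh, off, buf, (int)sizeof(buf), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```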

@roblatham00 (Contributor):

Sounds like OMPIO doesn't know how to handle a "cb_nodes" larger than the number of processes. Should be an easy fix for them, but what a headache for us to deal with. Phil, the reproducer should be pretty simple: take ROMIO's coll_test and set cb_nodes to nprocs*2.

@carns commented Aug 12, 2025

> Sounds like OMPIO doesn't know how to handle a "cb_nodes" larger than the number of processes. Should be an easy fix for them, but what a headache for us to deal with. Phil, the reproducer should be pretty simple: take ROMIO's coll_test and set cb_nodes to nprocs*2.

My read of #1060 (comment) is that the problem happens with 4 procs and cb_nodes=4?

@wkliao (Collaborator, Author) commented Aug 12, 2025

I have been trying to reproduce this on a local machine, but could not.
The errors only happen on GitHub CI.

When I developed the test program, I simply set the number of MPI processes to 4.
Because it failed in GitHub CI, I changed it to 2; see line 18 in darshan-runtime/test/tst_runs.sh.
I will change it back to 4, so we can all see it fail in the CI report.

(I did add --oversubscribe to OpenMPI's mpiexec command line, so I can run more
than 2 processes.)

@wkliao (Collaborator, Author) commented Aug 12, 2025

Check the new CI run that tests OpenMPI:
https://github.com/darshan-hpc/darshan/actions/runs/16921849384/job/47949590879?pr=1060

When running with NP=4, it failed.
When running with NP=4 and export DARSHAN_LOGHINTS="", it ran fine.

@wkliao (Collaborator, Author) commented Aug 13, 2025

> Ugh, OK. We should be able to create a standalone reproducer (not using Darshan, just a C program that sets hints and makes similar MPI_File* calls) in #1064. Once we have done so, we should open a bug report with OpenMPI for help.

Note the MPI test program ran fine, i.e. it returned normally.
It was the Darshan log file that darshan-parser complained about.
Thus, I think creating a reproducer can be tricky.

I wonder if something goes wrong during MPI_Finalize, when Darshan is
writing the log file. For example, all MPI communicators duplicated/created
internally and used by OMPI's MPI-IO module might already have been freed,
causing the parallel writes to the log file to fail (just a wild guess).

@wkliao (Collaborator, Author) commented Aug 14, 2025

What is the data partitioning pattern when Darshan writes log data into the file?
Is it appending one process after another based on the MPI rank order?

@carns commented Aug 14, 2025

> What is the data partitioning pattern when Darshan writes log data into the file? Is it appending one process after another based on the MPI rank order?

Rank 0 will write the header/metadata on behalf of everyone. Then all ranks do a scan to determine their offsets and do a collective write (in rank order) of all records simultaneously. There could be quite a bit of variation in how much data each rank contributes, depending on what the workload looked like, how successful Darshan was at reducing shared records, and the gzip compression ratio at each rank.

We could take the test case that's failing in the GitHub action and instrument it to see what the exact sizes are, so that we can replicate it in a standalone reproducer. All of the I/O happens in darshan_core_shutdown(); there aren't that many actual I/O operations.

@carns commented Aug 14, 2025

> > What is the data partitioning pattern when Darshan writes log data into the file? Is it appending one process after another based on the MPI rank order?
>
> Rank 0 will write the header/metadata on behalf of everyone. Then all ranks do a scan to determine their offsets and do a collective write (in rank order) of all records simultaneously. There could be quite a bit of variation in how much data each rank contributes, depending on what the workload looked like, how successful Darshan was at reducing shared records, and the gzip compression ratio at each rank.
>
> We could take the test case that's failing in the GitHub action and instrument it to see what the exact sizes are, so that we can replicate it in a standalone reproducer. All of the I/O happens in darshan_core_shutdown(); there aren't that many actual I/O operations.

Oh actually, one correction. The scan/collective write is repeated for each module, so there are actually multiple rounds of collective writes.
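
A minimal sketch of the offset-scan plus collective-write pattern described above, with uneven per-rank record sizes. The fixed header size, buffer contents, and the single round shown here are illustrative assumptions, not Darshan's actual shutdown code (which repeats the pattern per module):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "scan_write.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    const MPI_Offset header_bytes = 1024;     /* hypothetical fixed-size header */
    if (rank == 0) {
        char header[1024];
        memset(header, 'H', sizeof(header));
        MPI_File_write_at(fh, 0, header, (int)sizeof(header), MPI_CHAR,
                          MPI_STATUS_IGNORE);
    }

    /* each rank holds a different amount of "record" data */
    int my_len = 512 * (rank + 1);
    char *buf = malloc(my_len);
    memset(buf, 'a' + (rank % 26), my_len);

    /* exclusive prefix sum of sizes gives each rank its starting offset */
    long long len = my_len, my_off = 0;
    MPI_Exscan(&len, &my_off, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) my_off = 0;                /* Exscan leaves rank 0 undefined */

    /* collective write: records land in rank order after the header */
    MPI_File_write_at_all(fh, header_bytes + (MPI_Offset)my_off, buf, my_len,
                          MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```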

@wkliao (Collaborator, Author) commented Aug 14, 2025

Like using Darshan to profile Darshan 😄

@wkliao (Collaborator, Author) commented Aug 14, 2025

Inside darshan-core.c, shouldn't PMPI_File_open be called instead of MPI_File_open?
Or does it not matter?

ret = MPI_File_open(core->mpi_comm, logfile_name,

@carns commented Aug 14, 2025

> Inside darshan-core.c, shouldn't PMPI_File_open be called instead of MPI_File_open? Or does it not matter?
>
> ret = MPI_File_open(core->mpi_comm, logfile_name,

Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls. We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.
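
To make the aliasing idea concrete, here is a generic, much-simplified illustration of dual-entry-point interception in an LD_PRELOADed library. It is not Darshan's actual wrapper code or macros, just a sketch of the technique being described:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <mpi.h>

typedef int (*file_open_fn)(MPI_Comm, const char *, int, MPI_Info, MPI_File *);

/* Wrapper that shadows MPI_File_open when this library is LD_PRELOADed. */
int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    static file_open_fn real_open = NULL;
    if (real_open == NULL)
        /* resolve the underlying MPI library's symbol, not our own */
        real_open = (file_open_fn)dlsym(RTLD_NEXT, "PMPI_File_open");

    /* ... a real tool would record instrumentation here, and skip it
     *     once its own shutdown/log-writing phase has begun ... */

    return real_open(comm, filename, amode, info, fh);
}

/* Alias so callers that invoke PMPI_File_open directly are intercepted too. */
int PMPI_File_open(MPI_Comm comm, const char *filename, int amode,
                   MPI_Info info, MPI_File *fh)
    __attribute__((alias("MPI_File_open")));
```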

@wkliao (Collaborator, Author) commented Aug 14, 2025

> Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls.

In file darshan-core.c, I am seeing only calls to "PMPI_File_xxx", except for "MPI_File_open".

> We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.

I thought Darshan intercepts only the "MPI_File_xxx" calls, not "PMPI_File_xxx", so that
LD_PRELOAD does not loop back into Darshan's interception when writing the log file.
Apparently, my understanding is not right. One question: how does Darshan prevent
profiling of those MPI-IO calls in darshan-core.c?

@carns commented Aug 18, 2025

> > Good catch. For Darshan itself I don't believe it matters; we don't use PMPI anymore for MPI-IO calls.
>
> In file darshan-core.c, I am seeing only calls to "PMPI_File_xxx", except for "MPI_File_open".
>
> > We intercept MPI functions the same way we do any other function, with the only variation being that we set up symbol aliases so that the same wrapper will intercept the function whether the caller used the MPI_* or PMPI_* convention (this is important because some library implementations may use the latter explicitly in bindings for other languages, so we have to catch the PMPI_* versions to get any instrumentation at all in those cases). Maybe for consistency the Darshan code should use all one or the other internally, though.
>
> I thought Darshan intercepts only the "MPI_File_xxx" calls, not "PMPI_File_xxx", so that LD_PRELOAD does not loop back into Darshan's interception when writing the log file. Apparently, my understanding is not right. One question: how does Darshan prevent profiling of those MPI-IO calls in darshan-core.c?

The wrappers probably trigger during shutdown too; it's just that the macros (MAP_OR_FAIL() etc.) within the wrapper identify that Darshan has been disabled and effectively do nothing except invoke the next symbol.

@wkliao (Collaborator, Author) commented Aug 18, 2025

A quick test using OpenMPI 5.0.8 shows that OpenMPI recognizes hint "cb_nodes",
but not "romio_no_indep_rw".

@wkliao (Collaborator, Author) commented Aug 18, 2025

To understand Darshan's data partitioning pattern when writing the
log file, I ran a few tests, and it appears to me that the pattern is
each rank "appending" one contiguous region, in the order of rank IDs.

In this case, an MPI collective write call may internally switch to
independent write routines in both OpenMPI and MPICH.

* check file system of log folder
* allow setting env variable NP, number of MPI processes
* run `darshan-config --all` to dump Darshan's configuration
@wkliao force-pushed the mpi_io_large_count branch 3 times, most recently from ad8f532 to beea162 on August 19, 2025 04:57
@wkliao force-pushed the mpi_io_large_count branch from beea162 to 47a5134 on August 22, 2025 18:16
@wkliao force-pushed the mpi_io_large_count branch from 47a5134 to 7212fb9 on August 22, 2025 18:55
@roblatham00 (Contributor):

> A quick test using OpenMPI 5.0.8 shows that OpenMPI recognizes hint "cb_nodes", but not "romio_no_indep_rw".

Yeah, as the name implies, it is a ROMIO-specific hint. It should be ignored by other implementations, just as ROMIO would ignore IBM- or OMPIO-specific hints.

If you set that hint and then ask OMPIO "what hints did we set", it will not return romio_no_indep_rw, right? That could be a run-time test...
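
A sketch of that run-time test, under the assumption that an implementation echoes back only the hints it accepted via MPI_File_get_info (the file name and return convention are arbitrary):

```c
#include <mpi.h>

/* After opening with romio_no_indep_rw set, ask the implementation which
 * hints it actually kept; absence of the key suggests a non-ROMIO backend. */
int romio_hints_honored(MPI_Comm comm, const char *path)
{
    MPI_Info hints, used;
    MPI_Info_create(&hints);
    MPI_Info_set(hints, "romio_no_indep_rw", "true");

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, hints, &fh);
    MPI_File_get_info(fh, &used);

    char value[MPI_MAX_INFO_VAL + 1];
    int flag = 0;
    MPI_Info_get(used, "romio_no_indep_rw", MPI_MAX_INFO_VAL, value, &flag);

    MPI_Info_free(&used);
    MPI_Info_free(&hints);
    MPI_File_close(&fh);
    return flag;   /* nonzero: the hint was recognized and retained */
}
```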

@wkliao (Collaborator, Author) commented Aug 23, 2025

I checked the OpenMPI user guide. There is no hint with similar functionality.
Thus, we should expect a higher cost for MPI_File_open.

@carns left a comment

Looks great

@wkliao merged commit dd0c2e7 into darshan-hpc:main on Sep 15, 2025
17 checks passed