@shanedsnyder shanedsnyder commented Oct 31, 2024

This PR adds new instrumentation of DAOS storage APIs and corresponding updates to our analysis tools to integrate this DAOS data. Specifically, 2 new Darshan modules are defined: DARSHAN_DFS_MOD for instrumenting usage of the DAOS file system (DFS) API and DARSHAN_DAOS_MOD for instrumenting native DAOS object APIs. More details on each module below.

DFS module:

  • For each DFS file, Darshan captures a fixed set of integer/FP counters (see full list in dfs-log-format.h) and the corresponding DAOS pool/container UUIDs.
  • DFS file record names are based on the full path in the DFS directory tree, similar to our other file-based modules.
  • DFS file record IDs are based off of the underlying DAOS OID, not the file name.
    • This approach was chosen because not all DFS file open routines take a file name as input (e.g., dfs_obj_global2local()), so not all processes have the file name available to generate a consistent record ID. Using the object OID lets all processes agree on the same record ID value.
    • One side effect of this approach worth mentioning is that, since Darshan records are based on underlying OIDs and not file names, deleting/recreating files will result in multiple Darshan records corresponding to the same file -- this behavior can be easily observed in benchmarks like IOR which delete/recreate the output file on each iteration. It will ultimately be the responsibility of analysis tools to aggregate file records in this case.
  • Asynchronous I/O capture is fully supported in Darshan instrumentation wrappers for the DFS interface.
  • The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.
    • Note that for applications using dfs_connect() (rather than dfs_mount()), Darshan has no way to obtain the pool and container UUIDs, and indicates that the combo is "UNKNOWN" in parsing tools.
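To illustrate the aggregation that analysis tools would need when a file is deleted and recreated, here is a minimal sketch in plain Python. The tuple layout, the `aggregate_by_name` helper, and the record IDs are hypothetical and are not part of Darshan or PyDarshan. Summing only makes sense for count-style counters such as `DFS_OPENS`; timestamp or min/max counters would need different handling.

```python
from collections import defaultdict


def aggregate_by_name(records):
    """Sum counter values across record IDs that share a file name.

    `records` is an iterable of (record_id, file_name, counter, value)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: defaultdict(int))
    record_ids = defaultdict(set)
    for record_id, file_name, counter, value in records:
        totals[file_name][counter] += value
        record_ids[file_name].add(record_id)
    return {
        name: {
            "record_ids": sorted(record_ids[name]),
            "counters": dict(totals[name]),
        }
        for name in totals
    }


# Hypothetical data: /testFile was deleted and recreated once, so it
# shows up under two different OID-derived record IDs.
records = [
    (111, "/testFile", "DFS_OPENS", 2),
    (222, "/testFile", "DFS_OPENS", 2),
    (333, "/other", "DFS_OPENS", 1),
]
summary = aggregate_by_name(records)
print(summary["/testFile"])
```

A real tool would likely also key on the pool/container UUID pair, since the same path can exist in more than one container.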

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DFS     -1      13156018442998895329    DFS_OPENS       2       /testFile       f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
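For anyone post-processing this text output, the following is a rough sketch of splitting one line into its eight columns and separating the mount pt column into pool and container UUIDs. The `ParsedLine` class and `parse_line` function are hypothetical helpers, not part of darshan-util. The sketch assumes whitespace-separated fields and file names without spaces, and it treats any mount pt value without a `:` (such as "UNKNOWN") as having no UUIDs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedLine:
    module: str
    rank: int
    record_id: int
    counter: str
    value: float
    name: str
    pool_uuid: Optional[str]
    cont_uuid: Optional[str]
    fs_type: str


def parse_line(line):
    """Split one counter line into fields (assumes no spaces in names)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    module, rank, record_id, counter, value, name, mount, fs_type = fields
    # The mount pt column holds "pool_uuid:cont_uuid"; a value without a
    # colon (e.g. "UNKNOWN") is reported as missing UUIDs.
    pool, sep, cont = mount.partition(":")
    if not sep:
        pool = cont = None
    return ParsedLine(module, int(rank), int(record_id), counter,
                      float(value), name, pool, cont, fs_type)


sample = "\t".join([
    "DFS", "-1", "13156018442998895329", "DFS_OPENS", "2", "/testFile",
    "f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9",
    "N/A",
])
parsed = parse_line(sample)
print(parsed.pool_uuid, parsed.cont_uuid)
```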

DAOS module:

  • For each DAOS object, Darshan captures a fixed set of integer/FP counters (see full list in daos-log-format.h), the corresponding DAOS pool/container UUIDs, and the full DAOS OID.
    • There are actually 3 distinct DAOS object APIs tracked in the Darshan DAOS module: object (DAOS_OBJ), array (DAOS_ARRAY), and KV (DAOS_KV).
  • DAOS object records have no name -- when printing these records in darshan-util programs, we just print the OID in string format (i.e., oid_hi.oid_lo, same approach as DAOS's own utilities)
    • Small changes were made to darshan-runtime and darshan-util libraries to allow for records that have no name associated.
  • DAOS object record IDs are based off of the underlying DAOS OID.
    • This makes it trivial to identify which DAOS object records correspond to which DFS file records, as they will have the same Darshan record identifier.
  • Asynchronous I/O capture is fully supported in Darshan instrumentation wrappers for DAOS interfaces.
  • The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DAOS    -1      13156018442998895329    DAOS_OBJ_OPENS  1       937047793718163273.416  f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
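Because DFS and DAOS records for the same object share a record ID, they can be correlated with a simple lookup. The sketch below scans parser-style text lines and reports record IDs seen under both modules. The helper names are hypothetical, and the sketch assumes whitespace-separated columns with the module in the first column and the record ID in the third, as in the example lines above.

```python
from collections import defaultdict


def modules_by_record_id(lines):
    """Map each record ID to the set of module names that reported it."""
    seen = defaultdict(set)
    for line in lines:
        # Skip blank lines and "#" header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        seen[int(fields[2])].add(fields[0])
    return seen


def shared_records(lines):
    """Return record IDs that appear in both the DFS and DAOS modules."""
    return sorted(
        record_id
        for record_id, modules in modules_by_record_id(lines).items()
        if {"DFS", "DAOS"} <= modules
    )


mount = ("f4996f65-9c9a-41c6-ac18-88059a11aeb1:"
         "b445df4d-0f29-462a-9c70-a80bf5a5a0f9")
lines = [
    "#<module>\t<rank>\t<record id>\t<counter>\t<value>\t<file name>\t<mount pt>\t<fs type>",
    f"DFS\t-1\t13156018442998895329\tDFS_OPENS\t2\t/testFile\t{mount}\tN/A",
    f"DAOS\t-1\t13156018442998895329\tDAOS_OBJ_OPENS\t1\t937047793718163273.416\t{mount}\tN/A",
]
print(shared_records(lines))
```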

Both DFS and DAOS modules integrate with the Darshan heatmap module to generate histograms of I/O activity on each process. Both DFS and DAOS modules have also fully implemented darshan-util and PyDarshan functionality, including support for generating PyDarshan summary reports detailing DFS/DAOS access patterns. PyDarshan tests have been updated to ensure expected behavior when parsing logs containing DFS/DAOS data.

There are a few outstanding items that are not addressed in this PR:

  • There is no DXT support for the DAOS modules yet. It seems like the right call to limit the scope of changes here and weigh that capability against other development priorities going forward.
  • DAOS data is integrated into most of the relevant sections in PyDarshan summary reports, but not in the "data access by category" plots. I created an issue to track this: ENH: add new DAOS module data to PyDarshan "data access by category" plots #1015

Replaces #739

@shanedsnyder shanedsnyder added this to the 3.4.7 milestone Oct 31, 2024
@shanedsnyder shanedsnyder reopened this Nov 8, 2024
@shanedsnyder shanedsnyder changed the title WIP: DAOS and DFS modules ENH: DAOS and DFS modules Nov 12, 2024
@shanedsnyder shanedsnyder changed the title ENH: DAOS and DFS modules [WIP] ENH: DAOS and DFS modules Nov 12, 2024
@shanedsnyder shanedsnyder force-pushed the snyder/dev-daos-module-3.4 branch from 7c0d9b1 to 00b0a20 Compare April 24, 2025 14:46
Shane Snyder and others added 23 commits May 7, 2025 23:02
* add CFFI shims needed to access DFS
record data at the Python level

* adjust `test_main_all_logs_repo_files()` to handle
the new `ior` `DFS` log file from Shane--it has a single
runtime heatmap for `STDIO`

* `test_module_table()` has been updated with a regression
case for Shane's new DFS log file

* add `test_dfs_daos_posix_match()` to ensure counter
equivalence between similar `ior..` runs with DAOS vs.
POSIX (NOTE: these actually don't look that similar yet--xfailed
for now..)
* adjust `test_dfs_daos_posix_match()` to handle
the two new POSIX/DAOS "mirror files" from Shane;
the `xfail` has been removed and it now passes

* there seems to be some reasonable agreement
between the logs, which is good; see the test
proper for data columns that do not match or
require special handling for DFS-POSIX equivalence
testing

* a few other test suite shims after Shane changed
the POSIX/DAOS mirror files
* add DFS support to I/O cost graph
in summary reports, with some light
unit testing
* add a DFS per-module stats section to the Python
summary report, and some initial tests
* simplify the "time" counter handling in
`test_dfs_daos_posix_match()` based on reviewer
feedback

* `DFS_SLOWEST_RANK` is ignored in the comparisons
in `test_dfs_daos_posix_match()` based on reviewer
feedback

* the comment about `STAT` counter differences in
`test_dfs_daos_posix_match` was removed, based on
reviewer feedback
The OID backing a DFS file can change if the file is deleted and
recreated.
@shanedsnyder shanedsnyder force-pushed the snyder/dev-daos-module-3.4 branch from 4fda11d to b3bf403 Compare May 7, 2025 23:03
@shanedsnyder shanedsnyder changed the title [WIP] ENH: DAOS and DFS modules ENH: DAOS and DFS modules May 8, 2025
@shanedsnyder shanedsnyder merged commit f2890c5 into main May 8, 2025
20 checks passed